psichomics is an interactive R package for integrative analyses of alternative splicing and gene expression based on The Cancer Genome Atlas (TCGA) (containing molecular data associated with 34 tumour types), the Genotype-Tissue Expression (GTEx) project (containing data for multiple normal human tissues), Sequence Read Archive and user-provided data. The data from GTEx, TCGA and select SRA projects include subject/sample-associated information and transcriptomic data, such as the quantification of RNA-Seq reads aligning to splice junctions (henceforth called junction quantification) and exons.

Installing and starting the program

Install psichomics by typing the following in an R console (the R environment is required):

install.packages("BiocManager")
BiocManager::install("psichomics")

After the installation, load psichomics by typing:

library(psichomics)

Available functions: quick reference

psichomics: Start the visual interface of psichomics
parseSplicingEvent: Parse splicing events
getSplicingEventFromGenes: Get alternative splicing events from genes
getGenesFromSplicingEvents: Get genes from alternative splicing events

Data retrieval

TCGA/Firebrowse

getDownloadsFolder: Get the user’s Downloads folder
isFirebrowseUp: Check if Firebrowse web API is online
getFirebrowseCohorts: Query the Firebrowse web API for TCGA cohorts
getFirebrowseDataTypes: Query the Firebrowse web API for TCGA data types
getFirebrowseDates: Query the Firebrowse web API for processing dates
loadFirebrowseData: Download and load TCGA data through the Firebrowse web API
parseTcgaSampleInfo: Parse sample information from TCGA samples

GTEx

getGtexTissues: Check available tissues from a file containing sample metadata
loadGtexData: Load GTEx data

SRA

recount::recount_abstract: Check available SRA projects from recount
loadSRAproject: Download and load SRA projects through the recount R package

Custom and/or local files

loadLocalFiles: Load local files from a given folder
prepareSRAmetadata: Prepare metadata from SRA (as obtained from Run Selector page)
prepareJunctionQuant: Prepare junction quantification files from splice-aware aligners (currently, psichomics supports the output of the splice-aware aligner STAR)
prepareGeneQuant: Prepare gene quantification files from splice-aware aligners (currently, psichomics supports the output of the splice-aware aligner STAR)

Gene expression pre-processing

plotRowStats: Plot statistics (median, variance, etc.) per gene
plotGeneExprPerSample: Plot distribution of gene expression per sample
filterGeneExpr: Filter genes based on their expression
normaliseGeneExpression: Normalise gene expression data
convertGeneIdentifiers: Convert between different gene identifiers

PSI quantification

getSplicingEventTypes: Get quantifiable splicing event types
listSplicingAnnotations: List available alternative splicing annotation files
loadAnnotation: Load an alternative splicing annotation file
quantifySplicing: Quantify alternative splicing
plotRowStats: Plot statistics (median, variance, etc.) per alternative splicing event
filterPSI: Filter alternative splicing quantification

Custom alternative splicing annotation preparation

prepareAnnotationFromEvents
parseMatsAnnotation: Parse splicing annotation from rMATS
parseMisoAnnotation: Parse splicing annotation from MISO
parseSuppaAnnotation: Parse splicing annotation from SUPPA
parseVastToolsAnnotation: Parse splicing annotation from VAST-TOOLS

Data Grouping

getGeneList: Get pre-created, literature-based lists of genes
createGroupByAttribute: Split elements into groups based on a given attribute
getSampleFromSubject: Get samples matching the given patients
getSubjectFromSample: Get patients matching the given samples
groupPerElem: Return a vector with one group for each element
testGroupIndependence: Test multiple contigency tables comprised by two groups (one reference group and another containing remaing elements) and provided groups
plotGroupIndependence: Plot -log10(p-values) of the results obtained after multiple group independence testing

Dimensionality reduction

Principal component analysis (PCA)

performPCA: Perform PCA
plotVariance: Render variance plot
calculateLoadingsContribution: Calculate the contribution of the variables and return values in a data frame
plotPCA: Plot PCA individuals (scores) and/or variable contributions

Independent component analysis (ICA)

performICA: Perform ICA
plotICA: Plot ICA scores

Survival analysis

getAttributesTime: Get time for given columns in a clinical dataset
assignValuePerSubject: Assign average samples values to their corresponding patients
labelBasedOnCutoff: Label groups based on a given cutoff
optimalSurvivalCutoff: Calculate optimal data cutoff that best separates survival curves
testSurvival: Test the survival difference between groups of patients
processSurvTerms: Process survival curves terms to calculate survival curves
survfit: Compute estimates of survival curves
survdiffTerms: Test differences between survival curves
plotSurvivalCurves: Plot survival curves
plotSurvivalPvaluesByCutoff: Plot p-values of survival difference between groups based on multiple cutoffs

Differential analyses

diffAnalyses: Perform statistical analyses (including differential splicing and gene expression)
plotDistribution: Plot distribution using a density plot

Gene expression and alternative splicing correlation

correlateGEandAS: Test for association between paired samples’ gene expression (for any genes of interest) and alternative splicing quantification
plotCorrelation: Scatter plots of the correlation results
as.table: Table of the correlation results

Annotation retrieval

queryEnsemblByEvent: Query Ensembl based on an alternative splicing event
queryEnsemblByGene: Query Ensembl based on a gene
ensemblToUniprot: Convert an Ensembl identifier to the respective UniProt identifier
plotProtein: Plot domains of a given protein
plotTranscripts: Plot transcripts of a given gene

Exploration of clinically-relevant, differentially spliced events in breast cancer

The following case study was adapted from psichomics’ original article:

Nuno Saraiva-Agostinho and Nuno L. Barbosa-Morais (2019). psichomics: graphical application for alternative splicing quantification and analysis. Nucleic Acids Research.

Breast cancer is the cancer type with the highest incidence and mortality in women (Torre et al., 2015) and multiple studies have suggested that transcriptome-wide analyses of alternative splicing changes in breast tumours are able to uncover tumour-specific biomarkers (Tsai et al., 2015; Danan-Gotthold et al., 2015; Anczuków et al., 2015). Given the relevance of early detection of breast cancer to patient survival, we can use psichomics to identify novel tumour stage-I-specific molecular signatures based on differentially spliced events.

Downloading and loading TCGA data

The quantification of each alternative splicing event is based on the proportion of junction reads that support the inclusion isoform, known as percent spliced-in or PSI (Wang et al., 2008).

To estimate this value for each splicing event, both alternative splicing annotation and junction quantification are required. While alternative splicing annotation is provided by the package, junction quantification may be retrieved from TCGA, GTEx, SRA or user-provided files.

Data is downloaded from Firebrowse, a service that hosts proccessed data from TCGA, as required to run the downstream analyses. Before downloading data, check the following options:

# Available tumour types
cohorts <- getFirebrowseCohorts()

# Available sample dates
date <- getFirebrowseDates()

# Available data types
dataTypes <- getFirebrowseDataTypes()

Note there is also the option for Gene expression (normalised by RSEM). However, we recommend to load the raw gene expression data instead, followed by filtering and normalisation as demonstrated afterwards.

After deciding on the options to use, download and load breast cancer data as follows:

# Set download folder
folder <- getDownloadsFolder()

# Download and load most recent junction quantification and clinical data from
# TCGA/Firebrowse for Breast Cancer
data <- loadFirebrowseData(folder=folder,
                           cohort="BRCA",
                           data=c("clinical", "junction_quantification",
                                  "RSEM_genes"),
                           date="2016-01-28")

# Select clinical and junction quantification dataset
clinical      <- data[[1]]$`Clinical data`
sampleInfo    <- data[[1]]$`Sample metadata`
junctionQuant <- data[[1]]$`Junction quantification (Illumina HiSeq)`
geneExpr      <- data[[1]]$`Gene expression`

Data is only downloaded if the files are not present in the given folder. In other words, if the files were already downloaded, the function will just load the files, so it is possible to reuse the code above just to load the requested files.

Windows limitations: If you are using Windows, note that the downloaded files have huge names that may be over Windows Maximum Path Length. A workaround would be to manually rename the downloaded files to have shorter names, move all downloaded files to a single folder and load such folder. Read how in section Load unspecified local files at the end of this document.

Filtering and normalising gene expression

As this package does not focuses on gene expression analysis, we suggest to read the RNA-seq section of limma’s user guide. Nevertheless, we present the following commands to quickly filter and normalise gene expression:

# Check genes where min. counts are available in at least N samples and filter
# out genes with mean expression and variance of 0
checkCounts <- rowSums(geneExpr >= 10) >= 10
filter <- rowMeans(geneExpr) > 0 & rowVars(geneExpr) > 0 & checkCounts
geneExprFiltered <- geneExpr[filter, ]

# Normalise gene expression and perform log2-transformation (after adding 0.5
# to avoid zeroes)
geneExprNorm <- normaliseGeneExpression(geneExprFiltered, log2transform=TRUE)

Quantifying alternative splicing

After loading the clinical and alternative splicing junction quantification data from TCGA, quantify alternative splicing by clicking the green panel Alternative splicing quantification.

As previously mentioned, alternative splicing is quantified from the previously loaded junction quantification and an alternative splicing annotation file. To check current annotation files available:

# Available alternative splicing annotation
annotList <- listSplicingAnnotations()
annotList

##                        Human hg19/GRCh37 (2017-10-20) 
## "annotationHub_alternativeSplicingEvents.hg19_V2.rda" 
##                        Human hg19/GRCh37 (2016-10-11) 
##    "annotationHub_alternativeSplicingEvents.hg19.rda" 
##                               Human hg38 (2018-04-30) 
## "annotationHub_alternativeSplicingEvents.hg38_V2.rda"

Custom splicing annotation: Additional alternative splicing annotations can be prepared for psichomics by parsing the annotation from programs like VAST-TOOLS, MISO, SUPPA and rMATS. Note that SUPPA and rMATS are able to create their splicing annotation based on transcript annotation. For more information, read this tutorial.

To quantify alternative splicing, first select the junction quantification, alternative splicing annotation and alternative splicing event type(s) of interest:

# Load Human (hg19/GRCh37 assembly) annotation
human <- listSplicingAnnotations()[[1]]
annotation <- loadAnnotation(human)

# Available alternative splicing event types (skipped exon, alternative 
# first/last exon, mutually exclusive exons, etc.)
getSplicingEventTypes()

##                                          Skipped exon 
##                                                  "SE" 
##                               Mutually exclusive exon 
##                                                 "MXE" 
##                            Alternative 5' splice site 
##                                                "A5SS" 
##                            Alternative 3' splice site 
##                                                "A3SS" 
##                                Alternative first exon 
##                                                 "AFE" 
##                                 Alternative last exon 
##                                                 "ALE" 
## Alternative first exon (exon-centred - less reliable) 
##                                            "AFE_exon" 
##  Alternative last exon (exon-centred - less reliable) 
##                                            "ALE_exon"

Afterwards, quantify alternative splicing using the previously defined parameters:

# Discard alternative splicing quantified using few reads
minReads <- 10 # default

psi <- quantifySplicing(annotation, junctionQuant, minReads=minReads)

# Check the identifier of the splicing events in the resulting table
events <- rownames(psi)
head(events)

## [1] "SE_3_+_13661331_13663275_13663415_13667945_FBLN2"   
## [2] "SE_3_+_57908750_57911572_57911661_57913023_SLMAP"   
## [3] "ALE_3_+_57908750_57911572_57913023_SLMAP"           
## [4] "SE_3_-_37136283_37133029_37132958_37125297_LRRFIP2" 
## [5] "SE_12_-_56558432_56558152_56558087_56557549_SMARCC2"
## [6] "AFE_4_+_56755098_56750094_56756389_EXOC1"

Note that the event identifier (for instance, SE_1_-_2125078_2124414_2124284_2121220_C1orf86) is composed of:

Event type (SE stands for skipped exon)
Chromosome (1)
Strand (-)
Relevant coordinates depending on event type (in this case, the first constitutive exon’s end, the alternative exon’ start and end and the second constitutive exon’s start)
Associated gene (C1orf86)

Warning: all examples shown in this case study are performed using a small, yet representative subset of the available data. Therefore, values shown here may correspond to those when performing the whole analysis.

Data grouping

Let us create groups based on available samples types (i.e. Metastatic, Primary solid Tumor and Solid Tissue Normal) and tumour stages. As tumour stages are divided by sub-stages, we will merge sub-stages so as to have only tumour samples from stages I, II, III and IV (stage X samples are discarded as they are uncharacterised tumour samples).

# Group by normal and tumour samples
types  <- createGroupByAttribute("Sample types", sampleInfo)
normal <- types$`Solid Tissue Normal`
tumour <- types$`Primary solid Tumor`

# Group by tumour stage (I, II, III or IV) or normal samples
stages <- createGroupByAttribute(
    "patient.stage_event.pathologic_stage_tumor_stage", clinical)
groups <- list()
for (i in c("i", "ii", "iii", "iv")) {
    stage <- Reduce(union,
           stages[grep(sprintf("stage %s[a|b|c]{0,1}$", i), names(stages))])
    # Include only tumour samples
    stageTumour <- names(getSubjectFromSample(tumour, stage))
    elem <- list(stageTumour)
    names(elem) <- paste("Tumour Stage", toupper(i))
    groups <- c(groups, elem)
}
groups <- c(groups, Normal=list(normal))

# Prepare group colours (for consistency across downstream analyses)
colours <- c("#6D1F95", "#FF152C", "#00C7BA", "#FF964F", "#00C65A")
names(colours) <- names(groups)
attr(groups, "Colour") <- colours

# Prepare normal versus tumour stage I samples
normalVSstage1Tumour <- groups[c("Tumour Stage I", "Normal")]
attr(normalVSstage1Tumour, "Colour") <- attr(groups, "Colour")

# Prepare normal versus tumour samples
normalVStumour <- list(Normal=normal, Tumour=tumour)
attr(normalVStumour, "Colour") <- c(Normal="#00C65A", Tumour="#EFE35C")

Principal component analysis (PCA)

PCA is a technique to reduce data dimensionality by identifying variable combinations (called principal components) that explain the variance in the data (Ringnér, 2008). Use the following commands to perform PCA:

# PCA of PSI between normal and tumour stage I samples
psi_stage1Norm    <- psi[ , unlist(normalVSstage1Tumour)]
pcaPSI_stage1Norm <- performPCA(t(psi_stage1Norm))

As PCA cannot be performed on data with missing values, missing values need to be either removed (thus discarding data from whole splicing events or genes) or impute them (i.e. attributing to missing values the median of the non-missing ones). Use the argument missingValues within function performPCA to select the number of missing values that are tolerable per event (i.e. if a splicing event or gene has less than N missing values, those missing values will be imputed; otherwise, the event is discarded from PCA).

# Explained variance across principal components
plotVariance(pcaPSI_stage1Norm)

# Score plot (clinical individuals)
plotPCA(pcaPSI_stage1Norm, groups=normalVSstage1Tumour)

# Loading plot (variable contributions)
plotPCA(pcaPSI_stage1Norm, loadings=TRUE, individuals=FALSE)

# Table of variable contributions (as used to plot PCA, also)
table <- calculateLoadingsContribution(pcaPSI_stage1Norm)
knitr::kable(head(table, 5))

	Rank	Gene	Event type	Chromosome	Strand	Event position	PC1 loading	PC2 loading	Contribution to PC1 (%)	Contribution to PC2 (%)	Contribution to PC1 and PC2 (%)
SE_3_+_13661331_13663275_13663415_13667945_FBLN2	1	FBLN2	Skipped exon	3	+	13661331, 13667945	0.1339504	-0.1403020	1.794271	1.9684643	1.814085
AFE_15_+_74466994_74466360_74467192_ISLR	2	ISLR	Alternative first exon	15	+	74466360, 74467192	0.1190302	-0.2101108	1.416820	4.4146553	1.757812
SE_3_+_57908750_57911572_57911661_57913023_SLMAP	3	SLMAP	Skipped exon	3	+	57908750, 57913023	0.1365527	-0.0591862	1.864663	0.3503006	1.692410
ALE_3_+_57908750_57911572_57913023_SLMAP	4	SLMAP	Alternative last exon	3	+	57908750, 57913023	0.1358264	-0.0608691	1.844880	0.3705053	1.677176
SE_3_-_37136283_37133029_37132958_37125297_LRRFIP2	5	LRRFIP2	Skipped exon	3	-	37125297, 37136283	0.1320250	-0.0141660	1.743061	0.0200676	1.547077

For performance reasons, the loading plot is able to exclusively render the top variables that most contribute to the select principal components by using the argument nLoadings within function plotPCA.

Hint: As most plots in psichomics, PCA plots can be zoomed-in by clicking-and-dragging within the plot (click Reset zoom to zoom-out). To toggle the visibility of the data series represented in the plot, click its respective name in the plot legend.

To perform PCA using alternative splicing quantification and gene expression data (both using all samples and only Tumour Stage I and Normal samples):

# PCA of PSI between all samples (coloured by tumour stage and normal samples)
pcaPSI_all <- performPCA(t(psi))
plotPCA(pcaPSI_all, groups=groups)
plotPCA(pcaPSI_all, loadings=TRUE, individuals=FALSE)

# PCA of gene expression between all samples (coloured by tumour stage and 
# normal samples)
pcaGE_all <- performPCA(t(geneExprNorm))
plotPCA(pcaGE_all, groups=groups)
plotPCA(pcaGE_all, loadings=TRUE, individuals=FALSE)

# PCA of gene expression between normal and tumour stage I samples
ge_stage1Norm    <- geneExprNorm[ , unlist(normalVSstage1Tumour)]
pcaGE_stage1Norm <- performPCA(t(ge_stage1Norm))
plotPCA(pcaGE_stage1Norm, groups=normalVSstage1Tumour)
plotPCA(pcaGE_stage1Norm, loadings=TRUE, individuals=FALSE)

NUMB exon 12 inclusion and correlation with QKI gene expression

One of the splicing events that most contribute the separation between tumour stage I and normal samples is NUMB exon 12 inclusion, whose protein is crucial for cell differentiation as a key regulator of the Notch pathway. The RNA-binding protein QKI has been shown to repress NUMB exon 12 inclusion in lung cancer cells by competing with core splicing factor SF1 for binding to the branch-point sequence, thereby repressing the Notch signalling pathway, which results in decreased cancer cell proliferation (Zong et al., 2014).

Differential inclusion of NUMB exon 12

Let’s check whether a significant difference in NUMB exon 12 inclusion between tumour and normal TCGA breast samples. To do so:

# Find the right event
ASevents <- rownames(psi)
(tmp     <- grep("NUMB", ASevents, value=TRUE))

## [1] "SE_14_-_73749067_73746132_73745989_73744001_NUMB"

NUMBskippedExon12 <- tmp[1]

# Plot its PSI distribution
plotDistribution(psi[NUMBskippedExon12, ], normalVStumour)

Consistent with the cited article, NUMB exon 12 inclusion is significantly increased in cancer.

Also of interest:

Hover each group in the plot to compare the respective number of samples, median and variance.
To zoom in a specific region, click-and-drag in the region of interest.
To hide or show groups, click on their name in the legend.

Correlation between NUMB exon 12 inclusion and QKI expression

To verify if NUMB exon 12 inclusion is correlated with QKI expression:

# Find the right gene
genes <- rownames(geneExprNorm)
(tmp  <- grep("QKI", genes, value=TRUE))

## [1] "QKI|9444"

QKI   <- tmp[1] # "QKI|9444"

# Plot its gene expression distribution
plotDistribution(geneExprNorm[QKI, ], normalVStumour, psi=FALSE)

plotCorrelation(correlateGEandAS(
    geneExprNorm, psi, QKI, NUMBskippedExon12, method="spearman"))

## $`SE_14_-_73749067_73746132_73745989_73744001_NUMB`
## $`SE_14_-_73749067_73746132_73745989_73744001_NUMB`$`QKI|9444`

According to the obtained results and also consistent with the previous article, the inclusion of the exon is negatively correlated with QKI expression.

Differential splicing analysis

To analyse alternative splicing between normal and tumour stage I samples:

diffSplicing <- diffAnalyses(psi, normalVSstage1Tumour)

# Filter based on |∆ Median PSI| > 0.1 and q-value < 0.01
deltaPSIthreshold <- abs(diffSplicing$`∆ Median`) > 0.1
pvalueThreshold   <- diffSplicing$`Wilcoxon p-value (BH adjusted)` < 0.01

# Plot results
library(ggplot2)
ggplot(diffSplicing, aes(`∆ Median`, 
                         -log10(`Wilcoxon p-value (BH adjusted)`))) +
    geom_point(data=diffSplicing[deltaPSIthreshold & pvalueThreshold, ],
               colour="orange", alpha=0.5, size=3) + 
    geom_point(data=diffSplicing[!deltaPSIthreshold | !pvalueThreshold, ],
               colour="gray", alpha=0.5, size=3) + 
    theme_light(16) +
    ylab("-log10(q-value)")

Performing multiple survival analysis

To study the impact of alternative splicing events on prognosis, Kaplan-Meier curves may be plotted for groups of patients separated by the optimal PSI cutoff for a given alternative splicing event that that maximises the significance of group differences in survival analysis (i.e. minimises the p-value of the log-rank tests of difference in survival between individuals whose samples have their PSI below and above that threshold).

Given the slow process of calculating the optimal splicing quantification cutoff for multiple events, it is recommended to perform this for a subset of differentially spliced events.

# Events already tested which have prognostic value
events <- c(
    "SE_9_+_6486925_6492303_6492401_6493826_UHRF2",
    "SE_4_-_87028376_87024397_87024339_87023185_MAPK10",
    "SE_2_+_152324660_152324988_152325065_152325155_RIF1",
    "SE_2_+_228205096_228217230_228217289_228220393_MFF",
    "MXE_15_+_63353138_63353397_63353472_63353912_63353987_63354414_TPM1",
    "SE_2_+_173362828_173366500_173366629_173368819_ITGA6",
    "SE_1_+_204957934_204971724_204971876_204978685_NFASC")

# Survival curves based on optimal PSI cutoff
library(survival)

# Assign alternative splicing quantification to patients based on their samples
samples <- colnames(psi)
match <- getSubjectFromSample(samples, clinical, sampleInfo=sampleInfo)

survPlots <- list()
for (event in events) {
    # Find optimal cutoff for the event
    eventPSI <- assignValuePerPatient(psi[event, ], match, clinical,
                                      samples=unlist(tumour))
    opt <- optimalSurvivalCutoff(clinical, eventPSI, censoring="right", 
                                 event="days_to_death", 
                                 timeStart="days_to_death")
    (optimalCutoff <- opt$par)    # Optimal exon inclusion level
    (optimalPvalue <- opt$value)  # Respective p-value
    
    label     <- labelBasedOnCutoff(eventPSI, round(optimalCutoff, 2), 
                                    label="PSI values")
    survTerms <- processSurvTerms(clinical, censoring="right",
                                  event="days_to_death", 
                                  timeStart="days_to_death",
                                  group=label, scale="years")
    surv <- survfit(survTerms)
    pvalue <- testSurvival(survTerms)
    plotSurvivalCurves(surv, pvalue=pvalue, mark=FALSE)
}

Differential gene expression

Detected alterations in alternative splicing may simply be a reflection of changes in gene expression levels. Therefore, to disentangle these two effects, differential expression analysis between tumour stage I and normal samples should also be performed. In order to do so:

# Prepare groups of samples to analyse and further filter unavailable samples in
# selected groups for gene expression
ge           <- geneExprNorm[ , unlist(normalVSstage1Tumour), drop=FALSE]
isFromGroup1 <- colnames(ge) %in% normalVSstage1Tumour[[1]]
design       <- cbind(1, ifelse(isFromGroup1, 0, 1))

# Fit a gene-wise linear model based on selected groups
library(limma)
fit <- lmFit(ge, design)

# Calculate moderated t-statistics and DE log-odds using limma::eBayes
ebayesFit <- eBayes(fit, trend=TRUE)

# Prepare data summary
pvalueAdjust <- "BH" # Benjamini-Hochberg p-value adjustment (FDR)
summary <- topTable(ebayesFit, number=nrow(fit), coef=2, sort.by="none",
                    adjust.method=pvalueAdjust, confint=TRUE)
names(summary) <- c("log2 Fold-Change", "CI (low)", "CI (high)", 
                    "Average expression", "moderated t-statistics", "p-value", 
                    paste0("p-value (", pvalueAdjust, " adjusted)"),
                    "B-statistics")
attr(summary, "groups") <- normalVSstage1Tumour

# Calculate basic statistics
stats <- diffAnalyses(ge, normalVSstage1Tumour, "basicStats", 
                      pvalueAdjust=NULL)

final <- cbind(stats, summary)

# Differential gene expression between breast tumour stage I and normal samples
library(ggplot2)
library(ggrepel)
cognateGenes <- unlist(parseSplicingEvent(events)$gene)
logFCthreshold  <- abs(final$`log2 Fold-Change`) > 1
pvalueThreshold <- final$`p-value (BH adjusted)` < 0.01

final$genes <- gsub("\\|.*$", "\\1", rownames(final))
ggplot(final, aes(`log2 Fold-Change`, 
                  -log10(`p-value (BH adjusted)`))) +
    geom_point(data=final[logFCthreshold & pvalueThreshold, ],
               colour="orange", alpha=0.5, size=3) + 
    geom_point(data=final[!logFCthreshold | !pvalueThreshold, ],
               colour="gray", alpha=0.5, size=3) + 
    geom_text_repel(data=final[cognateGenes, ], aes(label=genes),
                    box.padding=0.4, size=5) +
    theme_light(16) +
    ylab("-log10(q-value)")

UHRF2 exon 10 inclusion

One splicing event with prognostic value is the alternative splicing of UHRF2 exon 10. Cell-cycle regulator UHRF2 promotes cell proliferation and inhibits the expression of tumour suppressors in breast cancer (Wu et al., 2012).

Differential splicing analysis

Let’s check whether a significant difference in UHRF2 exon 10 inclusion between tumour stage I and normal samples. To do so:

# UHRF2 skipped exon 10's PSI values per tumour stage I and normal samples
UHRF2skippedExon10 <- events[1]
plotDistribution(psi[UHRF2skippedExon10, ], normalVSstage1Tumour)

Higher inclusion of UHRF2 exon 10 is associated with normal samples.

Survival analysis

To study the impact of alternative splicing events on prognosis, Kaplan-Meier curves may be plotted for groups of patients separated by a given PSI cutoff for a given alternative splicing event. The optimal PSI cutoff maximises the significance of group differences in survival analysis (i.e. minimises the p-value of the log-rank tests of difference in survival between individuals whose samples have a PSI below and above that threshold).

# Find optimal cutoff for the event
UHRF2skippedExon10 <- events[1]
eventPSI <- assignValuePerPatient(psi[UHRF2skippedExon10, ], match, clinical,
                                  samples=unlist(tumour))
opt <- optimalSurvivalCutoff(clinical, eventPSI, censoring="right", 
                             event="days_to_death", timeStart="days_to_death")
(optimalCutoff <- opt$par)    # Optimal exon inclusion level

## [1] 0.1436954

(optimalPvalue <- opt$value)  # Respective p-value

## [1] 0.0358

label     <- labelBasedOnCutoff(eventPSI, round(optimalCutoff, 2), 
                                label="PSI values")
survTerms <- processSurvTerms(clinical, censoring="right",
                              event="days_to_death", timeStart="days_to_death",
                              group=label, scale="years")
surv <- survfit(survTerms)
pvalue <- testSurvival(survTerms)
plotSurvivalCurves(surv, pvalue=pvalue, mark=FALSE)

As per the results, higher inclusion of UHRF2 exon 10 is associated with better prognosis.

Differential expression

To check whether alternative splicing changes are related with gene expression alterations, let us perform differential expression analysis on UHRF2:

plotDistribution(geneExprNorm["UHRF2", ], normalVSstage1Tumour, psi=FALSE)

It seems UHRF2 is differentially expressed between Tumour Stage I and Solid Tissue Normal. However, going back to exploratory differential gene expression, UHRF2 has a log2(fold-change) ≤ 1, low enough not to be biologically relevant. Following this criterium, the gene can thus be considered not to be differentially expressed between these conditions.

Survival analysis

To confirm if gene expression has an overall prognostic value, perform the following:

UHRF2ge <- assignValuePerPatient(geneExprNorm["UHRF2", ], match, clinical, 
                                 samples=unlist(tumour))

# Survival curves based on optimal gene expression cutoff
opt <- optimalSurvivalCutoff(clinical, UHRF2ge, censoring="right",
                             event="days_to_death", timeStart="days_to_death")
(optimalCutoff <- opt$par)    # Optimal exon inclusion level

## [1] 10.42619

(optimalPvalue <- opt$value)  # Respective p-value

## [1] 0.176

# Process again after rounding the cutoff
roundedCutoff <- round(optimalCutoff, 2)
label     <- labelBasedOnCutoff(UHRF2ge, roundedCutoff, label="Gene expression")
survTerms <- processSurvTerms(clinical, censoring="right",
                              event="days_to_death", timeStart="days_to_death",
                              group=label, scale="years")
surv   <- survfit(survTerms)
pvalue <- testSurvival(survTerms)
plotSurvivalCurves(surv, pvalue=pvalue, mark=FALSE)

There seems to be no significant difference in survival between patient groups stratified by UHRF2’s optimal gene expression cutoff in tumour samples (log-rank p-value > 0.05).

Literature support and external database information

If an event is differentially spliced and has an impact on patient survival, its association with the studied disease might be already described in the literature. To check so, go to Analyses > Gene, transcript and protein information where information regarding the associated gene (such as description and genomic position), transcripts and protein domain annotation are available.

The protein plot shows the UniProt matches for the selected transcript. Hover the protein’s rendered domains to obtain more information on them. More information about each protein can be retrieved by clicking the respective UniProt link.
Links to related research articles are also available. Click Show more articles to be directed to PubMed.
Multiple links to related external databases are available too:
- Human Protein Atlas (Cancer Atlas) allows to check the evidence of a gene at protein level for multiple cancer tissues.
- VastDB shows multi-species alternative splicing profiles for diverse tissues and cell types.
- UCSC Genome Browser may reveal protein domain disruptions caused by the alternative splicing event. To check so, activate the Pfam in UCSC Gene and UniProt tracks (in Genes and Gene Predictions) and check if any domains are annotated in the alternative and/or constitutive exons of the splicing event.

Interpretation

Higher inclusion of UHRF2 exon 10 is associated with normal samples and better prognosis, and potentially disrupts UHRF2’s SRA-YDG protein domain, related to the binding affinity to epigenetic marks. Hence, exon 10 inclusion may suppress UHRF2’s oncogenic role in breast cancer by impairing its activity through the induction of a truncated protein or a non-coding isoform. Moreover, this hypothesis is independent from gene expression changes, as UHRF2 is not differentially expressed between tumour stage I and normal samples (|log2(fold-change)| < 1) and there is no significant difference in survival between patient groups stratified by its expression in tumour samples (log-rank p-value > 0.05).

psichomics case study: command-line interface (CLI)

Nuno Saraiva-Agostinho

2019-10-07

Installing and starting the program

Available functions: quick reference

Data retrieval

Gene expression pre-processing

PSI quantification

Data Grouping

Dimensionality reduction

Survival analysis

Differential analyses

Gene expression and alternative splicing correlation

Annotation retrieval

Exploration of clinically-relevant, differentially spliced events in breast cancer

Downloading and loading TCGA data

Filtering and normalising gene expression

Quantifying alternative splicing

Data grouping

Principal component analysis (PCA)

NUMB exon 12 inclusion and correlation with QKI gene expression

Differential inclusion of NUMB exon 12

Correlation between NUMB exon 12 inclusion and QKI expression

Differential splicing analysis

Performing multiple survival analysis

Differential gene expression

UHRF2 exon 10 inclusion

Differential splicing analysis

Survival analysis

Differential expression

Survival analysis

Literature support and external database information

Interpretation

Loading data from other sources

Load GTEx files

Load SRA project data using recount

Load other SRA and user-provided local files

Feedback

References