psichomics is an interactive R package for integrative analyses of alternative splicing and gene expression based on The Cancer Genome Atlas (TCGA) (containing molecular data associated with 34 tumour types), the Genotype-Tissue Expression (GTEx) project (containing data for multiple normal human tissues), Sequence Read Archive (SRA) and user-provided data.
The following file formats are supported by psichomics. The links in the table redirect to instructions on how to load data from each source.
Source | Sample information | Subject information | Gene expression | Exon-exon junction quantification | Alternative splicing quantification |
---|---|---|---|---|---|
SRA Run Selector | Yes | | | | |
STAR | | | Yes | Yes | |
VAST-TOOLS | | | Yes | | Yes |
TCGA (via FireBrowse) | Yes | Yes | Yes | Yes | |
SRA (via recount) | Yes | Yes | Yes | Yes | |
GTEx | Yes | Yes | Yes | Yes | |
Other sources | Yes | Yes | Yes | Yes | Limited* |
* psichomics cannot fully parse alternative splicing events (e.g. it may not identify the cognate gene and coordinates) based on tables from these sources.
The SRA Run Selector contains sample metadata that can be downloaded for all or selected samples from an SRA project. To download sample information, click the Metadata button in the Download columns. The output file is usually named SraRunTable.txt.
To proceed loading the data, move the downloaded file to a new folder and follow the instructions in Load user-provided data into psichomics.
The following sections go through the steps required to load data starting from RNA-seq reads.
SRA is a repository of biological sequences that stores data from many published articles with the potential to answer pressing biological questions.
The latest versions of psichomics support automatic downloading of SRA data from recount, a resource of pre-processed data for thousands of SRA projects (including gene read counts, splice junction quantification and sample metadata). First, check whether your project of interest is available in recount: if it is, analysing gene expression and alternative splicing for your samples of interest is quicker.
Data from SRA can be downloaded using the fasterq-dump command from sra-tools. For instance, to retrieve samples from the SRP126561 project:
# List SRA samples
samples=(SRR6368612 SRR6368613 SRR6368614 SRR6368615 SRR6368616 SRR6368617)
# Download samples
for sample in "${samples[@]}"; do
    fasterq-dump --split-3 "${sample}"
done
--split-3 outputs one or two FASTQ files for single-end or paired-end sequencing, respectively (a third FASTQ file containing orphaned single-end reads from paired-end sequencing data may also be returned).
Sample-associated data is also available from the Run Selector page. Click RunInfo Table to download the whole metadata table for all samples (usually downloaded in a file named SraRunTable.txt).
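Before aligning, it can help to confirm that every run produced both mate files. The following is a minimal sketch, assuming paired-end runs written as <run>_1.fastq and <run>_2.fastq (the naming used by the alignment step later in this tutorial); the helper name missing_fastq is hypothetical and not part of sra-tools.

```shell
# Hypothetical helper: print every run accession that lacks a non-empty
# <run>_1.fastq or <run>_2.fastq in the given directory.
# Usage: missing_fastq DIR RUN [RUN...]
missing_fastq () {
    dir=$1
    shift
    for run in "$@"; do
        if [ ! -s "${dir}/${run}_1.fastq" ] || [ ! -s "${dir}/${run}_2.fastq" ]; then
            echo "${run}"
        fi
    done
}

# Check the six runs of SRP126561 in the current directory
missing_fastq . SRR6368612 SRR6368613 SRR6368614 \
                SRR6368615 SRR6368616 SRR6368617
```

An empty output means that all runs have both files. In bash, the samples array defined above can be passed directly: missing_fastq . "${samples[@]}".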
The quantification of each alternative splicing event is based on the proportion of junction reads that support the inclusion isoform, known as percent spliced-in or PSI (Wang et al., 2008).
To estimate this value for each splicing event, both alternative splicing annotation and quantification of RNA-seq reads aligning to splice junctions (junction quantification) are required. While alternative splicing annotation is provided by the package, junction quantification needs to be prepared from user-provided data by aligning the RNA-seq reads from FASTQ files to a reference genome. As junction reads are required to quantify alternative splicing, a splice-aware aligner must be used.
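As a worked illustration of the definition above, the sketch below computes PSI as inclusion reads divided by the sum of inclusion and exclusion reads. The read counts are made up and the helper name psi is hypothetical; how psichomics aggregates junction reads for each type of splicing event is not covered here.

```shell
# Hypothetical helper: PSI = inclusion / (inclusion + exclusion)
# Usage: psi INCLUSION_READS EXCLUSION_READS
psi () {
    awk -v inc="$1" -v exc="$2" 'BEGIN {
        total = inc + exc
        if (total == 0) { print "NA"; exit }   # no junction reads: undefined
        printf "%.2f\n", inc / total
    }'
}

# Made-up counts for a single event in a single sample
inclusion=30   # junction reads supporting the inclusion isoform
exclusion=10   # junction reads supporting the exclusion isoform
psi "$inclusion" "$exclusion"   # prints 0.75
```

Events without any supporting junction reads have no defined proportion, which is why the sketch returns NA in that case.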
psichomics currently supports STAR output.
Before aligning FASTQ samples against a reference genome, an index needs to be prepared.
Start by downloading a FASTA file of the whole genome and a GTF file with annotated transcripts. The command below uses human FASTA and GTF files (hg19 assembly).
mkdir hg19_STAR
# Generate the genome index in hg19_STAR (output) from the genome FASTA
# file(s) and the junction GTF annotation, running in parallel with 4 threads
STAR --runMode genomeGenerate \
     --genomeDir hg19_STAR \
     --genomeFastaFiles /path/to/hg19.fa \
     --sjdbGTFfile /path/to/hg19.gtf \
     --runThreadN 4
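STAR expects the chromosome names in the GTF annotation to match those in the genome FASTA file (for instance, chr1 in both files rather than chr1 in one and 1 in the other). The following is a minimal sketch of a check to run before generating the index; the helper name and the toy files are hypothetical.

```shell
# Hypothetical helper: list sequence names used in the GTF annotation that are
# absent from the genome FASTA file (an empty output means all names match).
# Usage: gtf_names_missing_in_fasta GENOME.fa ANNOTATION.gtf
gtf_names_missing_in_fasta () {
    awk '
        NR == FNR {                       # first file: genome FASTA
            if (/^>/) seen[substr($1, 2)] = 1   # name up to first whitespace
            next
        }
        NF == 0 || /^#/ { next }          # skip empty and comment lines
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}

# Toy files: the annotation names one sequence "2" instead of "chr2"
workdir=$(mktemp -d)
printf '>chr1 test\nACGT\n>chr2\nGGCC\n' > "$workdir/genome.fa"
printf '#!comment\n' > "$workdir/annotation.gtf"
printf 'chr1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1";\n' >> "$workdir/annotation.gtf"
printf '2\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g2";\n' >> "$workdir/annotation.gtf"
gtf_names_missing_in_fasta "$workdir/genome.fa" "$workdir/annotation.gtf"   # prints 2
```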
After the genome index is generated, the sequences in the FASTQ files need to be aligned against the annotated genes and splice junctions from the previously prepared reference. The following commands make STAR output both gene and junction read counts into files ending in ReadsPerGene.out.tab and SJ.out.tab, respectively.
align () {
    echo "Aligning ${1} using STAR..."
    # Align the paired FASTQ files against the hg19_STAR genome index using
    # 16 threads, return gene read counts and prefix output files with the
    # sample name; for gzip-compressed FASTQ files, add --readFilesCommand zcat
    STAR --readFilesIn "${1}_1.fastq" "${1}_2.fastq" \
         --runThreadN 16 \
         --genomeDir hg19_STAR \
         --quantMode GeneCounts \
         --outFileNamePrefix "${1}"
}
# Align every sample in the array
for each in "${samples[@]}"; do
    align "${each}"
done
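To inspect the junction read counts before processing them in R, recall that, according to the STAR manual, SJ.out.tab is a tab-delimited file whose columns are chromosome, first intron base, last intron base, strand (0: undefined, 1: +, 2: -), intron motif, annotation flag, uniquely mapping reads, multi-mapping reads and maximum overhang. The following sketch prints one identifier and the uniquely mapping read count per junction. The helper name and the toy values are hypothetical, and no coordinate adjustment is attempted; preparing the files for psichomics is still done in R.

```shell
# Hypothetical helper: print "<chr>:<start>-<end> <strand><TAB><unique reads>"
# for every junction in a STAR SJ.out.tab file.
# Usage: sj_summary SAMPLE_SJ.out.tab
sj_summary () {
    awk 'BEGIN { FS = "\t"; OFS = "\t" }
    {
        strand = "*"                      # 0: undefined strand
        if ($4 == 1) strand = "+"
        else if ($4 == 2) strand = "-"
        print $1 ":" $2 "-" $3 " " strand, $7
    }' "$1"
}

# Toy SJ.out.tab with made-up values
sj=$(mktemp)
printf 'chr10\t18748\t21822\t1\t1\t1\t8\t0\t35\n'  > "$sj"
printf 'chr10\t24257\t25325\t2\t1\t0\t83\t2\t40\n' >> "$sj"
sj_summary "$sj"
```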
To process the resulting data files, type in R:
# Change working directory to where the STAR output is
setwd("/path/to/aligned/output/")
library(psichomics)
prepareGeneQuant(
"SRR6368612ReadsPerGene.out.tab", "SRR6368613ReadsPerGene.out.tab",
"SRR6368614ReadsPerGene.out.tab", "SRR6368615ReadsPerGene.out.tab",
"SRR6368616ReadsPerGene.out.tab", "SRR6368617ReadsPerGene.out.tab")
prepareJunctionQuant("SRR6368612SJ.out.tab", "SRR6368613SJ.out.tab",
"SRR6368614SJ.out.tab", "SRR6368615SJ.out.tab",
"SRR6368616SJ.out.tab", "SRR6368617SJ.out.tab")
To load the data, move the files (including the SRA metadata) to a new folder and follow the instructions in Load user-provided data into psichomics.
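Gathering the files can also be done from the command line. The sketch below copies files rather than moving them so that the originals stay in place. The helper name, the destination folder and the file names are all assumptions: replace them with the files actually written by prepareGeneQuant() and prepareJunctionQuant() and with your SRA metadata file.

```shell
# Hypothetical helper: copy the given files into a destination folder,
# creating it if needed and warning about any file that does not exist.
# Usage: collect_inputs DEST FILE [FILE...]
collect_inputs () {
    dest=$1
    shift
    mkdir -p "${dest}" || return 1
    for f in "$@"; do
        if [ -f "${f}" ]; then
            cp "${f}" "${dest}/"
        else
            echo "Missing file: ${f}" >&2
        fi
    done
}

# Assumed file names: adjust them to your own files
collect_inputs psichomics_input \
    psichomics_gene_counts.txt psichomics_junctions.txt SraRunTable.txt
```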
psichomics supports loading inclusion levels and gene expression tables from VAST-TOOLS (the tables available after running vast-tools combine). Note:
To obtain gene expression tables, run vast-tools align with argument --expr; to include gene read counts in addition to cRPKMs, run vast-tools combine with argument -C (in case of doubt, always calculate both cRPKMs and gene read counts).
Any sample and/or subject information may also be useful to load. Unless the sample metadata comes from SRA Run Selector, please ensure that the table is recognised by psichomics: read Prepare generic data.
To load the data, move all files to a new folder (VAST-TOOLS alternative splicing quantification and gene expression tables and sample/subject-associated information).
Follow the instructions in Load user-provided data into psichomics to load the files in the visual interface. Otherwise, use function loadLocalFiles() with the folder path as an argument:
library(psichomics)
data <- loadLocalFiles("/path/to/psichomics/input")
names(data)
names(data[[1]])
junctionQuant <- data[[1]]$`Junction quantification`
sampleInfo <- data[[1]]$`Sample metadata`
# Both gene read counts and cRPKMs are loaded as separate data frames
geneReadCounts <- data[[1]]$`Gene expression (read counts)`
cRPKM <- data[[1]]$`Gene expression (cRPKM)`
FireBrowse contains TCGA data for multiple tumour types, which can be automatically downloaded and then loaded using psichomics.
Alternatively, manually downloaded files from FireBrowse can be moved to a folder and then loaded in psichomics by following the instructions in Load user-provided data into psichomics.
GTEx contains data for multiple normal tissues. GTEx data can be automatically downloaded and then loaded using psichomics.
Alternatively, manually downloaded files from GTEx can be moved to a folder and then loaded in psichomics by following the instructions in Load user-provided data into psichomics.
psichomics supports importing generic data from any source as long as the tables are prepared as detailed below.
Please make sure that sample and subject identifiers are exactly the same between all datasets.
If you are working with sample metadata from SRA Run Selector, see how to prepare SRA Run Selector data.
Sample ID
Subject ID (subject identifiers must be the same as the ones used in subject information)
Sample ID | Type | Tissue | Subject ID |
---|---|---|---|
SMP-01 | Tumour | Lung | SUBJ-03 |
SMP-02 | Normal | Blood | SUBJ-12 |
SMP-03 | Normal | Blood | SUBJ-25 |
Subject ID
Subject ID | Age | Gender | Race |
---|---|---|---|
SUBJ-01 | 34 | Female | Black |
SUBJ-02 | 22 | Male | Black |
SUBJ-03 | 58 | Female | Asian |
Gene ID
Gene ID | SMP-18 | SMP-03 | SMP-54 |
---|---|---|---|
AMP1 | 24 | 10 | 43 |
BRCA1 | 38 | 46 | 32 |
BRCA2 | 43 | 65 | 21 |
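Since sample and subject identifiers must match exactly across datasets, a quick check before loading can save time. The following is a minimal sketch, assuming tab-delimited files with identifiers in the first column and sample identifiers as the remaining column names of quantification tables; the helper and file names are hypothetical. It uses the example tables from this section, whose identifiers deliberately do not all overlap.

```shell
# Hypothetical helper: print subject IDs used in the sample information table
# that are absent from the first column of the subject information table.
# Usage: unmatched_subjects SAMPLES.txt SUBJECTS.txt
unmatched_subjects () {
    awk -F '\t' '
        NR == FNR { if (FNR > 1) known[$1] = 1; next }   # subject table
        FNR == 1  { for (i = 1; i <= NF; i++) if ($i == "Subject ID") col = i; next }
        col && !($col in known) { print $col }
    ' "$2" "$1"
}

# Hypothetical helper: print sample IDs from the header of a quantification
# table that are absent from the first column of the sample information table.
# Usage: unmatched_samples SAMPLES.txt QUANTIFICATION.txt
unmatched_samples () {
    awk -F '\t' '
        NR == FNR { if (FNR > 1) known[$1] = 1; next }   # sample table
        {
            for (i = 2; i <= NF; i++) if (!($i in known)) print $i
            exit                                         # only the header is needed
        }
    ' "$1" "$2"
}

# Example tables from this section, written as tab-delimited files
dir=$(mktemp -d)
printf 'Sample ID\tType\tTissue\tSubject ID\n'  > "$dir/samples.txt"
printf 'SMP-01\tTumour\tLung\tSUBJ-03\n'       >> "$dir/samples.txt"
printf 'SMP-02\tNormal\tBlood\tSUBJ-12\n'      >> "$dir/samples.txt"
printf 'SMP-03\tNormal\tBlood\tSUBJ-25\n'      >> "$dir/samples.txt"
printf 'Subject ID\tAge\tGender\tRace\n'        > "$dir/subjects.txt"
printf 'SUBJ-01\t34\tFemale\tBlack\n'          >> "$dir/subjects.txt"
printf 'SUBJ-02\t22\tMale\tBlack\n'            >> "$dir/subjects.txt"
printf 'SUBJ-03\t58\tFemale\tAsian\n'          >> "$dir/subjects.txt"
printf 'Gene ID\tSMP-18\tSMP-03\tSMP-54\n'      > "$dir/genes.txt"
printf 'AMP1\t24\t10\t43\n'                    >> "$dir/genes.txt"

unmatched_subjects "$dir/samples.txt" "$dir/subjects.txt"   # SUBJ-12 and SUBJ-25
unmatched_samples  "$dir/samples.txt" "$dir/genes.txt"      # SMP-18 and SMP-54
```

Any identifier printed by either helper would need to be corrected or added to the corresponding table for the datasets to be matched.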
Junction ID
Examples of junction identifiers: 10_18748_21822, chromosome 10 (18748 to 21822) and chr10:18748-21822
The strand may be indicated with + or - at the end of the junction identifier: 10:3213:9402:+ or chr10:3213-9402 -
Junctions whose identifiers contain alt, random or Un (i.e. alternative sequences) are discarded
Junction ID | SMP-18 | SMP-03 |
---|---|---|
10:6752-7393 | 4 | 0 |
10:18748-21822 | 8 | 46 |
10:24257-25325 | 83 | 65 |
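To preview which rows would be dropped because they refer to alternative sequences, the sketch below filters a tab-delimited junction table by identifier. It assumes a plain case-sensitive substring match on alt, random and Un, which may differ from how psichomics matches them internally; the helper name and the toy rows are hypothetical.

```shell
# Hypothetical helper: keep the header and drop rows whose junction identifier
# contains "alt", "random" or "Un" (case-sensitive substring match).
# Usage: drop_alternative_sequences JUNCTIONS.txt
drop_alternative_sequences () {
    awk -F '\t' 'NR == 1 || $1 !~ /alt|random|Un/' "$1"
}

# Toy junction table with made-up counts
table=$(mktemp)
printf 'Junction ID\tSMP-18\tSMP-03\n'        > "$table"
printf 'chr10:18748-21822\t8\t46\n'          >> "$table"
printf 'chrUn_gl000220:100-200\t5\t2\n'      >> "$table"
printf 'chr1_gl000191_random:50-90\t1\t0\n'  >> "$table"
drop_alternative_sequences "$table"
```

Only the header and the chr10 junction remain in the output.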
Note that psichomics cannot currently parse alternative splicing events (e.g. identify the cognate gene and coordinates) from generic, user-provided tables.
AS Event ID
AS Event ID | SMP-18 | SMP-03 |
---|---|---|
someASevent001 | 0.71 | 0.30 |
anotherASevent653 | 0.63 | 0.37 |
yetAnother097 | 0.38 | 0.62 |
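Because inclusion levels are proportions, a simple range check can catch tables that were exported as percentages or contain stray text. The following is a minimal sketch, assuming a tab-delimited table with values between 0 and 1 (as in the example above) and NA for missing values; the helper name and the extra invalid row are hypothetical.

```shell
# Hypothetical helper: report cells that are neither NA nor a number in [0, 1]
# Usage: check_psi_range AS_QUANTIFICATION.txt
check_psi_range () {
    awk -F '\t' 'NR > 1 {
        for (i = 2; i <= NF; i++) {
            if ($i == "NA") continue
            if ($i !~ /^[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/ || $i + 0 > 1)
                printf "%s, column %d: %s\n", $1, i, $i
        }
    }' "$1"
}

# Example table from this section plus one made-up invalid row
psi_table=$(mktemp)
printf 'AS Event ID\tSMP-18\tSMP-03\n'    > "$psi_table"
printf 'someASevent001\t0.71\t0.30\n'    >> "$psi_table"
printf 'anotherASevent653\t0.63\t0.37\n' >> "$psi_table"
printf 'badEvent042\t1.20\t0.30\n'       >> "$psi_table"
check_psi_range "$psi_table"   # prints: badEvent042, column 2: 1.20
```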
To load the data, move the files to a new folder and follow the instructions in Load user-provided data into psichomics.
Start psichomics with the following commands in an R console or RStudio:
library(psichomics)
psichomics()
Then, click Load user files. Click the Folder input tab and select the appropriate folder. Finally, click Load files to automatically scan and load all supported files from that folder.
Use function loadLocalFiles() with the folder path as an argument:
library(psichomics)
data <- loadLocalFiles("/path/to/psichomics/input")
All feedback on the program, documentation and associated material (including this tutorial) is welcome. Please send any comments and questions to:
Nuno Saraiva-Agostinho
Disease Transcriptomics Lab, Instituto de Medicina Molecular (Portugal)
Wang,E.T. et al. (2008) Alternative isoform regulation in human tissue transcriptomes. Nature, 456, 470–476.