cTRAP
is an R package designed to compare differential gene expression results with those from known cellular perturbations (such as gene knock-down, overexpression or small molecules) derived from the Connectivity Map (Subramanian et al., Cell 2017). Such analyses allow not only to infer the molecular causes of the observed difference in gene expression but also to identify small molecules that could drive or revert specific transcriptomic alterations.
To illustrate the package functionalities, we will use an example based on a gene knock-down dataset from the ENCODE project for which there is available RNA-seq data. After performing differential expression analyses to the macthed-control sample, we will compare the respective transcriptomic changes with the ones caused by all Connectivity Map’s gene knock-down perturbations to identify which ones have similar or inverse transcriptomic changes to the observed ones. As a positive control, we expect to find the knock-down of the gene depleted in the ENCODE experiment as one of the most similar transcriptomic perturbations.
To load the cTRAP
package into your R environment type:
## Registered S3 method overwritten by 'R.oo':
## method from
## throw.default R.methodsS3
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
## Registered S3 method overwritten by 'sets':
## method from
## print.element ggplot2
In this example, we will use the EIF4G1 shRNA knockdown followed by RNA-seq experiment in HepG2 cell line from the ENCODE project as the dataset of interest. The RNA-seq processed data (gene quantifications from RSEM method) for the EIF4G1 knock-down and respective controls (two replicates each) can be downloaded by typing:
gene <- "EIF4G1"
cellLine <- "HepG2"
ENCODEmetadata <- downloadENCODEknockdownMetadata(cellLine, gene)
table(ENCODEmetadata$`Experiment target`)
length(unique(ENCODEmetadata$`Experiment target`))
ENCODEsamples <- downloadENCODEsamples(ENCODEmetadata)
counts <- prepareENCODEgeneExpression(ENCODEsamples)
Gene expression data (read counts) were quantile-normalized using voom
and differential expression analysis was performed using the limma
R package.
# Remove low coverage (at least 10 counts shared across two samples)
minReads <- 10
minSamples <- 2
filter <- rowSums(counts[ , -c(1, 2)] >= minReads) >= minSamples
counts <- counts[filter, ]
# Convert ENSEMBL identifier to gene symbol
library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- sapply(strsplit(counts$gene_id, "\\."), `[`, 1)
geneConversion <- getBM(filters="ensembl_gene_id", values=genes, mart=mart,
attributes=c("ensembl_gene_id", "hgnc_symbol"))
counts$gene_id <- geneConversion$hgnc_symbol[
match(genes, geneConversion$ensembl_gene_id)]
# Perform differential gene expression analysis
diffExpr <- performDifferentialExpression(counts)
As our metric of differential expression after EIF4G1 shRNA knock-down, we will use the respective t-statistic.
We will compare our differential gene expression metric with all Connectivity Map’s gene knock-down perturbations in the same cell line (HepG2). Note that this comparison can also be done to perturbations in a different cell line or in all cell lines, using in the latter the average result across cell lines. Differential gene expression data for each Connectivity Map’s perturbation are available in z-score normalised values (see Subramanian et al., Cell 2017 for more details).
# Check available conditions for L1000 perturbations
l1000metadata <- downloadL1000data("l1000metadata.txt", "metadata")
## Loading L1000 metadata...
## $`Perturbation type`
## [1] "Compound"
## [2] "Peptides and other biological agents (e.g. cytokine)"
## [3] "shRNA for loss of function (LoF) of gene"
## [4] "Consensus signature from shRNAs targeting the same gene"
## [5] "cDNA for overexpression of wild-type gene"
## [6] "cDNA for overexpression of mutated gene"
##
## $`Cell line`
## [1] "CD34" "HL60" "PC3" "U937" "MCF7" "A375"
## [7] "HEK293T" "A549" "ASC" "HA1E" "HCC515" "HEKTE"
## [13] "HEPG2" "HT29" "NCIH716" "NPC" "SHSY5Y" "SKL"
## [19] "SW480" "VCAP" "BT20" "FIBRNPC" "HS578T" "MCF10A"
## [25] "MCH58" "MDAMB231" "NEU" "NKDBA" "NOMO1" "PHH"
## [31] "SKBR3" "SKB" "SKM1" "THP1" "A673" "AGS"
## [37] "CL34" "CORL23" "COV644" "DV90" "EFO27" "H1299"
## [43] "HCC15" "HCT116" "HEC108" "HT115" "JHUEM2" "LOVO"
## [49] "MDST8" "NCIH1694" "NCIH1836" "NCIH2073" "NCIH508" "NCIH596"
## [55] "OV7" "PL21" "RKO" "RMGI" "RMUGS" "SKLU1"
## [61] "SKMEL1" "SKMEL28" "SNGM" "SNU1040" "SNUC4" "SNUC5"
## [67] "SW620" "SW948" "T3M10" "TYKNU" "WSUDLCL2" "HUH7"
## [73] "HS27A" "JURKAT" "U266" "U2OS"
##
## $Dosage
## [1] "0.1 %" "500 nM" "1 µM"
## [4] "10 µM" "3 µM" "5 µM"
## [7] "100 nM" NA "1 µL"
## [10] "6 µL" "1.5 µL" "2 µL"
## [13] "10 µL" "5 µL" "60 µM"
## [16] "100 µM" "40 µM" "30 µM"
## [19] "20 µM" "80 µM" "90 µM"
## [22] "70 µM" "50 µM" "10 nM"
## [25] "0.03 ng/mL" "20 ng/mL" "500 ng/mL"
## [28] "10000 ng/mL" "50 ng/mL" "200 ng/mL"
## [31] "2000 ng/mL" "5 ng/mL" "100000 ng/mL"
## [34] "1000 ng/mL" "30 ng/mL" "100 ng/mL"
## [37] "10 ng/mL" "15 ng/mL" "2 ng/mL"
## [40] "800 ng/mL" "400 ng/mL" "25 ng/mL"
## [43] "80 ng/mL" "1 ng/mL" "40 ng/mL"
## [46] "0.3 ng/mL" "50000 ng/mL" "150 ng/mL"
## [49] "0 ng/mL" "0.2 ng/mL" "0.15 ng/mL"
## [52] "200000 ng/mL" "3000 ng/mL" "0.5 ng/mL"
## [55] "0.1 ng/mL" "250 ng/mL" "8300 ng/mL"
## [58] "5000 ng/mL" "1.65 ng/mL" "0.01 ng/mL"
## [61] "16 ng/mL" "2500 ng/mL" "45 ng/mL"
## [64] "20 µL" "1 nM" "-666 -666"
## [67] "200 ng" "150 ng" "-666 -666|-666 -666"
## [70] "300 ng" "300 ng|300 ng" "3 µL"
## [73] "1 ng/µL" "100 ng/µL" "0.1 ng/µL"
## [76] "10 ng/µL" "3 ng/µL" "300 ng/µL"
## [79] "4 µL" "0.04 µM" "0.12 µM"
## [82] "0.37 µM" "1.11 µM" "3.33 µM"
## [85] "0.41 µM" "1.23 µM" "11.11 µM"
## [88] "3.7 µM" "33.33 µM"
##
## $`Time points`
## [1] "24 h" "6 h" "96 h" "144 h" "120 h" "168 h" "48 h" "1 h"
## [9] "3 h" "2 h" "4 h" "72 h"
# Code for loading CMap gene KD HepG2 data
l1000metadataKnockdown <- filterL1000metadata(
l1000metadata, cellLine="HepG2",
perturbationType="Consensus signature from shRNAs targeting the same gene")
l1000zscores <- downloadL1000data("l1000zscores.gctx", "zscores",
l1000metadataKnockdown$sig_id)
l1000geneInfo <- downloadL1000data("l1000geneInfo.txt", "geneInfo")
l1000perturbationsKnockdown <- loadL1000perturbations(
l1000metadataKnockdown, l1000zscores, l1000geneInfo)
If the interest is in small molecules, one can download the differential gene expression z-scores for each small molecule perturbation in HepG2 (given a specific concentration and time point of RNA extraction) through the following command:
l1000metadataSmallMolecules <- filterL1000metadata(
l1000metadata, cellLine="HepG2", timepoint="24 h",
dosage="5 \U00B5M", # \U00B5 is the unicode code for the micro symbol
perturbationType="Compound")
l1000zscores <- downloadL1000data("l1000zscores.gctx", "zscores",
l1000metadataSmallMolecules$sig_id)
l1000geneInfo <- downloadL1000data("l1000geneInfo.txt")
l1000perturbationsSmallMolecules <- loadL1000perturbations(
l1000metadataSmallMolecules, l1000zscores, l1000geneInfo,
sanitizeCompoundNames=TRUE)
This is the main function of the package. Here we compare the statistic for differential expression (the t-statistic, in this case) with the z-scores for the perturbations we have loaded using the three different methods available (Spearman’s correlation, Pearson’s correlation and Gene Set Enrichment Analysis (GSEA)). For GSEA, the default option is to use the top and bottom 150 genes (ranked by the user’s t-statistic) as gene sets, but this can be changed by the user.
For small molecules:
compareSmallMolecule <- list()
# Compare against L1000 using Spearman correlation
compareSmallMolecule$spearman <- compareAgainstL1000(
diffExprStat, l1000perturbationsSmallMolecules, cellLine, method="spearman")
## Comparing with cell line HepG2
# Compare against L1000 using Pearson correlation
compareSmallMolecule$pearson <- compareAgainstL1000(
diffExprStat, l1000perturbationsSmallMolecules, cellLine, method="pearson")
## Comparing with cell line HepG2
# Compare against L1000 using gene set enrichment analysis (GSEA) with the top
# and bottom 150 genes
compareSmallMolecule$gsea <- compareAgainstL1000(
diffExprStat, l1000perturbationsSmallMolecules, cellLine, method="gsea",
geneSize=150)
## Performing GSA using perturbation signatures...
For each method used, this function will return a table with the results of the comparison with each perturbation tested, including the test statistics (depending on the method used: Spearman’s correlation coefficient, Pearson’s correlation coefficient or GSEA Enrichment Score), the respective p-value and the Benjamini-Hochberg-adjusted p-value.
# Order knockdown perturbations according to similarity
compareKnockdown$spearman_ordered <- compareKnockdown$spearman[
order(compareKnockdown$spearman$HepG2_spearman_coef, decreasing=TRUE)]
compareKnockdown$pearson_ordered <- compareKnockdown$pearson[
order(compareKnockdown$pearson$HepG2_pearson_coef, decreasing=TRUE)]
compareKnockdown$gsea_ordered <- compareKnockdown$gsea[
order(compareKnockdown$gsea$HepG2_WTCS, decreasing=TRUE)]
# Most positively associated perturbations (note that EIF4G1 knockdown is the
# 6th, 1st and 2nd most positively associated perturbation based on Spearman
# correlation, Pearson correlation and GSEA, respectively)
head(compareKnockdown$spearman_ordered)
## genes HepG2_spearman_coef HepG2_spearman_pvalue
## 1: CGS001_HEPG2_96H:MECP2:1.5 0.1930335 1.506067e-74
## 2: CGS001_HEPG2_96H:KIAA0196:1.5 0.1894301 8.237226e-72
## 3: CGS001_HEPG2_96H:SQRDL:1.5 0.1883210 5.589838e-71
## 4: CGS001_HEPG2_96H:PPIH:1.5 0.1863954 1.509735e-69
## 5: CGS001_HEPG2_96H:STAT1:1.5 0.1833382 2.626628e-67
## 6: CGS001_HEPG2_96H:COPS5:1.5 0.1775880 3.364592e-63
## HepG2_spearman_qvalue Average_spearman_coef
## 1: 1.305258e-73 0.1930335
## 2: 5.354197e-71 0.1894301
## 3: 2.906716e-70 0.1883210
## 4: 6.542184e-69 0.1863954
## 5: 9.756048e-67 0.1833382
## 6: 1.093493e-62 0.1775880
## genes HepG2_pearson_coef HepG2_pearson_pvalue
## 1: CGS001_HEPG2_96H:EIF4G1:1.5 0.1915142 2.182595e-73
## 2: CGS001_HEPG2_96H:STAT1:1.5 0.1894661 7.739198e-72
## 3: CGS001_HEPG2_96H:MEST:1.5 0.1858475 3.830708e-69
## 4: CGS001_HEPG2_96H:KIAA0196:1.5 0.1847352 2.514633e-68
## 5: CGS001_HEPG2_96H:COPS5:1.5 0.1794790 1.553956e-64
## 6: CGS001_HEPG2_96H:KIF20A:1.5 0.1786967 5.568601e-64
## HepG2_pearson_qvalue Average_pearson_coef
## 1: 1.418686e-72 0.1915142
## 2: 4.024383e-71 0.1894661
## 3: 1.659973e-68 0.1858475
## 4: 9.340065e-68 0.1847352
## 5: 5.050356e-64 0.1794790
## 6: 1.608707e-63 0.1786967
## genes HepG2_WTCS Average_WTCS
## 1: CGS001_HEPG2_96H:SKIV2L:1.5 0.5070024 0.5070024
## 2: CGS001_HEPG2_96H:EIF4G1:1.5 0.4861485 0.4861485
## 3: CGS001_HEPG2_96H:GTPBP8:1.5 0.4813972 0.4813972
## 4: CGS001_HEPG2_96H:HFE:1.5 0.4813046 0.4813046
## 5: CGS001_HEPG2_96H:UBAP2L:1.5 0.4722054 0.4722054
## 6: CGS001_HEPG2_96H:TMEM5:1.5 0.4721782 0.4721782
## genes HepG2_spearman_coef HepG2_spearman_pvalue
## 1: CGS001_HEPG2_96H:MAF:1.5 -0.1581847 2.351511e-50
## 2: CGS001_HEPG2_96H:ZBTB24:1.5 -0.1632375 1.500986e-53
## 3: CGS001_HEPG2_96H:EHMT2:1.5 -0.1640525 4.479303e-54
## 4: CGS001_HEPG2_96H:NDUFB6:1.5 -0.1677880 1.618399e-56
## 5: CGS001_HEPG2_96H:CDCA8:1.5 -0.1941862 1.951544e-75
## 6: CGS001_HEPG2_96H:EYA1:1.5 -0.1993531 1.747554e-79
## HepG2_spearman_qvalue Average_spearman_coef
## 1: 4.075952e-50 -0.1581847
## 2: 2.787546e-53 -0.1632375
## 3: 8.958605e-54 -0.1640525
## 4: 3.825307e-56 -0.1677880
## 5: 2.537007e-74 -0.1941862
## 6: 4.543641e-78 -0.1993531
## genes HepG2_pearson_coef HepG2_pearson_pvalue
## 1: CGS001_HEPG2_96H:NDUFB6:1.5 -0.1646057 1.964288e-54
## 2: CGS001_HEPG2_96H:MAF:1.5 -0.1717844 3.406704e-59
## 3: CGS001_HEPG2_96H:SULT1A2:1.5 -0.1742848 6.666242e-61
## 4: CGS001_HEPG2_96H:CDCA8:1.5 -0.1926944 2.740646e-74
## 5: CGS001_HEPG2_96H:ZBTB24:1.5 -0.1948077 6.448588e-76
## 6: CGS001_HEPG2_96H:EYA1:1.5 -0.1952303 3.030978e-76
## HepG2_pearson_qvalue Average_pearson_coef
## 1: 3.004205e-54 -0.1646057
## 2: 6.326737e-59 -0.1717844
## 3: 1.444352e-60 -0.1742848
## 4: 2.375227e-73 -0.1926944
## 5: 8.383165e-75 -0.1948077
## 6: 7.880542e-75 -0.1952303
## genes HepG2_WTCS Average_WTCS
## 1: CGS001_HEPG2_96H:PLOD2:1.5 -0.4130739 -0.4130739
## 2: CGS001_HEPG2_96H:SHB:1.5 -0.4131576 -0.4131576
## 3: CGS001_HEPG2_96H:GNAI2:1.5 -0.4166373 -0.4166373
## 4: CGS001_HEPG2_96H:SIAH2:1.5 -0.4224346 -0.4224346
## 5: CGS001_HEPG2_96H:DHX16:1.5 -0.4379221 -0.4379221
## 6: CGS001_HEPG2_96H:CDCA8:1.5 -0.4739211 -0.4739211
For small molecules:
# Order small molecule perturbations according to similarity
compareSmallMolecule$spearman_ordered <- compareSmallMolecule$spearman[
order(compareSmallMolecule$spearman$HepG2_spearman_coef, decreasing=TRUE)]
compareSmallMolecule$pearson_ordered <- compareSmallMolecule$pearson[
order(compareSmallMolecule$pearson$HepG2_pearson_coef, decreasing=TRUE)]
compareSmallMolecule$gsea_ordered <- compareSmallMolecule$gsea[
order(compareSmallMolecule$gsea$HepG2_WTCS, decreasing=TRUE)]
# Most positively associated perturbations
head(compareSmallMolecule$spearman_ordered)
## genes HepG2_spearman_coef HepG2_spearman_pvalue
## 1: BRD-A14014306 0.225335666 1.361508e-101
## 2: strophanthidin 0.085133810 1.294659e-15
## 3: BRD-K41172353 0.005161915 6.284641e-01
## 4: caffeic-acid-phenethyl-ester -0.037174082 4.903648e-04
## 5: PD-98059 -0.050201852 2.489663e-06
## 6: YC-1 -0.077832390 2.727173e-13
## HepG2_spearman_qvalue Average_spearman_coef
## 1: 1.497659e-100 0.225335666
## 2: 2.373541e-15 0.085133810
## 3: 6.284641e-01 0.005161915
## 4: 5.394013e-04 -0.037174082
## 5: 3.042921e-06 -0.050201852
## 6: 3.749863e-13 -0.077832390
## genes HepG2_pearson_coef HepG2_pearson_pvalue
## 1: BRD-A14014306 0.22640668 1.442316e-102
## 2: strophanthidin 0.06834399 1.415829e-10
## 3: BRD-K41172353 -0.01267514 2.347404e-01
## 4: PD-98059 -0.02551843 1.673267e-02
## 5: caffeic-acid-phenethyl-ester -0.04580625 1.737439e-05
## 6: YC-1 -0.06254059 4.400100e-09
## HepG2_pearson_qvalue Average_pearson_coef
## 1: 1.586548e-101 0.22640668
## 2: 2.224875e-10 0.06834399
## 3: 2.347404e-01 -0.01267514
## 4: 1.840594e-02 -0.02551843
## 5: 2.123537e-05 -0.04580625
## 6: 6.050137e-09 -0.06254059
## genes HepG2_WTCS Average_WTCS
## 1: BRD-A14014306 0.4932836 0.4932836
## 2: PD-98059 0.2304579 0.2304579
## 3: strophanthidin 0.2179143 0.2179143
## 4: BRD-K31030218 0.0000000 0.0000000
## 5: BRD-K41172353 0.0000000 0.0000000
## 6: BRD-K84389640 0.0000000 0.0000000
## genes HepG2_spearman_coef HepG2_spearman_pvalue
## 1: YC-1 -0.07783239 2.727173e-13
## 2: BRD-K84389640 -0.07833181 1.919930e-13
## 3: BRD-K77508012 -0.09420887 8.674931e-19
## 4: BRD-K94818765 -0.11341805 1.461350e-26
## 5: BRD-K31030218 -0.12553145 3.293334e-32
## 6: BRD-A65142661 -0.16468634 1.741438e-54
## HepG2_spearman_qvalue Average_spearman_coef
## 1: 3.749863e-13 -0.07783239
## 2: 3.017033e-13 -0.07833181
## 3: 1.908485e-18 -0.09420887
## 4: 4.018714e-26 -0.11341805
## 5: 1.207556e-31 -0.12553145
## 6: 9.577908e-54 -0.16468634
## genes HepG2_pearson_coef HepG2_pearson_pvalue
## 1: YC-1 -0.06254059 4.400100e-09
## 2: BRD-K84389640 -0.08264025 8.485519e-15
## 3: BRD-K77508012 -0.08283071 7.364690e-15
## 4: BRD-K94818765 -0.11053910 2.640853e-25
## 5: BRD-K31030218 -0.12054061 8.215231e-30
## 6: BRD-A65142661 -0.14797060 3.242185e-44
## HepG2_pearson_qvalue Average_pearson_coef
## 1: 6.050137e-09 -0.06254059
## 2: 1.555679e-14 -0.08264025
## 3: 1.555679e-14 -0.08283071
## 4: 7.262345e-25 -0.11053910
## 5: 3.012251e-29 -0.12054061
## 6: 1.783202e-43 -0.14797060
## genes HepG2_WTCS Average_WTCS
## 1: BRD-K84389640 0.0000000 0.0000000
## 2: YC-1 0.0000000 0.0000000
## 3: BRD-K77508012 -0.2442628 -0.2442628
## 4: BRD-K94818765 -0.2621164 -0.2621164
## 5: caffeic-acid-phenethyl-ester -0.2821337 -0.2821337
## 6: BRD-A65142661 -0.3381341 -0.3381341
To analyse the relationship between the user-provided differential expression profile with that associated with a specific perturbation, we provide the option to create a scatter plot (for Spearman and Pearson analyses) or a GSEA plot.
Below you can see how to plot the relationship between the ENCODE’s EIF4G1 shRNA knockdown with the perturbations of interest.
# Plot relationship with EIF4G1 knockdown
EIF4G1knockdown <- grep("EIF4G1", compareKnockdown$gsea_ordered$genes,
value=TRUE)
plotL1000comparison(compareKnockdown$spearman, EIF4G1knockdown)
plotL1000comparison(compareKnockdown$pearson, EIF4G1knockdown)
# Plot relationship with most negatively associated perturbation
plotL1000comparison(compareKnockdown$spearman,
tail(compareKnockdown$spearman_ordered, 1)$genes)
plotL1000comparison(compareKnockdown$pearson,
tail(compareKnockdown$pearson_ordered, 1)$genes)
plotL1000comparison(compareKnockdown$gsea,
tail(compareKnockdown$gsea_ordered, 1)$genes,
topGenes=TRUE)
plotL1000comparison(compareKnockdown$gsea,
tail(compareKnockdown$gsea_ordered, 1)$genes,
topGenes=FALSE)
For small molecules:
# Plot relationship with most positively associated perturbation
plotL1000comparison(compareSmallMolecule$spearman,
head(compareSmallMolecule$spearman_ordered, 1)$genes)
plotL1000comparison(compareSmallMolecule$pearson,
head(compareSmallMolecule$pearson_ordered, 1)$genes)
All feedback on the program, documentation and associated material (including this tutorial) is welcome. Please send any suggestions and comments to:
Nuno Saraiva-Agostinho ([email protected])
Bernardo P. de Almeida ([email protected])
Disease Transcriptomics Lab, Instituto de Medicina Molecular (Portugal)