ABAEnrichment: Gene Expression Enrichment in Human Brain Regions

Package: ABAEnrichment
Author: Steffi Grote
Date: October 14, 2016

Overview

ABAEnrichment is an R package for the enrichment analysis of user defined candidate genes (in the set of expressed protein coding genes) in different human brain regions. The package integrates the expression of the candidate gene set (averaged across donors) and the structural information of the brain using an ontology, both provided by the Allen Brain Atlas project [1-4]. The statistical analysis is performed by the core function aba_enrich which interfaces the ontology enrichment software FUNC [5]. Additional functions provided in this package are get_expression, plot_expression, get_name, get_sampled_substructures and get_superstructures supporting the exploration and visualization of the expression data.

The enrichment analysis for candidate genes is performed by using either the hypergeometric or the Wilcoxon rank test implemented in the ontology enrichment software FUNC [5]. The hypergeometric test evaluates the enrichment of expressed candidate genes compared to a set of background genes for each brain region. The background genes can be defined explicitly like the candidate genes or, by default, consist of all expressed genes from the dataset outside the candidate genes. Using the Wilcoxon rank test, scores assigned to candidate genes are tested for an enrichment in brain regions. The boundary between 'expressed' and 'not expressed' is defined by different expression quantiles (e.g. the lowest 40% of gene expression are 'not expressed' and the upper 60% are 'expressed' for a quantile of 0.4). These cutoffs are set with the parameter cutoff_quantiles and an analysis is run for every cutoff separately.

To account for multiple testing FUNC uses random permutations of scores assigned to genes for the Wilcoxon rank test. For the hypergeometric test candidate and background genes are permuted (see Schematic 1 below). This is also the default behaviour in ABAEnrichment. In addition, ABAEnrichment provides the option to condition the chance of a background gene to be selected as a random candidate gene on the length of the background gene (option gene_len).
Furthermore, whole genomic regions can be provided as input instead of defining genes explicitly. ABAEnrichment then tests the expressed genes in the candidate regions for enrichment in brain regions compared to the expressed genes in the background regions. The randomsets then also consist of randomly chosen candidate regions inside the background regions, either as a whole block in one background region (default), or on the same chromosome allowing to overlap multiple background regions on that chromosome (option circ_chrom, see Schematic 2 below).

The package incorporates three different brain expression datasets: first, microarray data from adult individuals, second, RNA-seq data from individuals of five different developmental stages (prenatal, infant, child, adolescent, adult) and third, a developmental effect score measuring the age effect on expression for given genes. In the latter case the data are not divided into 'expressed' and 'not expressed', but into 'developmental effect score above cutoff' or not. However, for simplicity we only refer to 'expression' in that documentation. For details on the datasets see the ABAData vignette.

Overview of the functions included in ABAEnrichment:

function	description
`aba_enrich`	core function for performing enrichment analyses given a candidate gene set.
`get_expression`	returns expression data or developmental effect scores for a given set of genes and brain structures.
`plot_expression`	plots a heatmap with expression data or developmental effect scores for a given set of genes and brain structures.
`get_name`	returns the full name of a brain region given a structure id.
`get_sampled_substructures`	returns the substructures of a given brain region that have expression data available.
`get_superstructures`	returns the superstructures of a given brain region.

Examples

Test gene expression enrichment with hypergeometric test

For a random set of 13 genes a test for expression enrichment in human brain regions for different developmental stages is performed. A binary vector with '1' for a candidate gene and '0' for a background gene and names as gene identifiers (Entrez-ID, Ensembl-ID or HGNC-symbol) needs to be defined. In this example no background genes are defined, in which case all remaining protein coding genes of the dataset are used as background.

## load ABAEnrichment package
require(ABAEnrichment)
## create input vector with candidate genes 
genes = rep(1,13)
names(genes) = c('NCAPG', 'APOL4', 'NGFR', 'NXPH4', 'C21orf59', 'CACNG2', 'AGTR1', 'ANO1', 'BTBD3', 'MTUS1', 'CALB1', 'GYG1', 'PAX2')
genes

##    NCAPG    APOL4     NGFR    NXPH4 C21orf59   CACNG2    AGTR1     ANO1    BTBD3    MTUS1    CALB1     GYG1 
##        1        1        1        1        1        1        1        1        1        1        1        1 
##     PAX2 
##        1

In order to test the 13 random genes for enrichment in brain regions at different developmental stages using aba_enrich the following parameters have to be defined: the vector 'genes' and the dataset '5_stages'. Additionally, in this example two optional parameters are set: the cutoff_quantiles 0.5, 0.7 and 0.9 for an example of using the 50%, 70% and 90% expression quantile across all genes as the boundary between 'expressed' and 'not expressed' genes and n_randsets 100 to use 100 random permutations to calculate the FWER. cutoff_quantiles and n_randsets have default values seq(0.1,0.9,0.1) and '1000' respectively. The values in the examples are reduced to lower the computation time.

## run enrichment analysis
res = aba_enrich(genes,dataset='5_stages', cutoff_quantiles=c(0.5,0.7,0.9), n_randsets=100)

The function aba_enrich returns a list, the first element of which contains the results of the statistical analysis for each brain region and age category, since for the '5_stages' dataset the analyses are performed independently for each developmental stage:

## extract first element from the output list, which contains the statistics
fwers = res[[1]]
## see results for the brain regions with highest enrichment for children (3-11 yrs, age_category 3)
head(fwers[fwers[,1]==3,])

##    age_category structure_id                                     structure times_FWER_under_0.05 mean_FWER
## 55            3  Allen:10657                         CBC_cerebellar cortex                     0 0.5000000
## 56            3  Allen:10361                        AMY_amygdaloid complex                     0 0.9500000
## 57            3  Allen:10163    M1C_primary motor cortex (area M1, area 4)                     0 0.9566667
## 58            3  Allen:10225 IPC_posteroventral (inferior) parietal cortex                     0 0.9733333
## 59            3  Allen:10173            DFC_dorsolateral prefrontal cortex                     0 0.9833333
## 60            3  Allen:10161                         FCx_frontal neocortex                     0 0.9866667
##    min_FWER                                       equivalent_structures          FWERs
## 55     0.36 Allen:10657;Allen:10656;Allen:10655;Allen:10654;Allen:10653 0.66;0.48;0.36
## 56     0.85                                                 Allen:10361       0.85;1;1
## 57     0.87                                     Allen:10163;Allen:10162       0.87;1;1
## 58     0.92                                     Allen:10225;Allen:10214       0.92;1;1
## 59     0.96                                                 Allen:10173    0.96;1;0.99
## 60     0.96                                                 Allen:10161       0.96;1;1

The rows in the output data frame are ordered by age_category, times_FWER_under_0.05, mean_FWER and min_FWER; with min_FWER for example denoting the minimum FWER for expression enrichment of the candidate genes in this brain region across all expression cutoffs. The column FWERs lists the individual FWERs for each cutoff. The column equivalent_structures lists structures with identical expression data due to lack of independent expression measurements in all regions. Nodes (brain regions) in the ontology inherit data from their children (substructures), and in the case of only one child node with expression data, the parent node inherits the childs' data leading to identical enrichment statistics .

In addition to the statistics, the list that is returned from aba_enrich also contains the input genes for which expression data is available and for each age category the gene expression values that correspond to the requested cutoff_quantiles:

res[2:3]

## $genes
##    NCAPG     NGFR    NXPH4 C21orf59   CACNG2    AGTR1     ANO1    BTBD3    MTUS1    CALB1     GYG1     PAX2 
##        1        1        1        1        1        1        1        1        1        1        1        1 
## 
## $cutoffs
##     age_category_1 age_category_2 age_category_3 age_category_4 age_category_5
## 50%       3.144481       2.854802       2.716617       2.776235       2.862117
## 70%       7.823920       7.017616       6.897414       6.842193       7.118609
## 90%      23.768641      22.478328      23.124388      21.625395      22.680811

Choose random candidate regions dependent on gene length

The default behaviour of aba_enrich is to permute candidate and background regions randomly to compute the FWER. With the option gene_len=TRUE, random selection of background genes as candidate genes is dependent on the gene length, i.e. a gene twice as long as another gene also is twice as likely selected as a candidate gene in a randomset.

## run enrichment analysis, with randomsets dependent on gene length
aba_enrich(genes, gene_len = TRUE)

Test gene expression enrichment for genomic regions

Instead of defining candindate and background genes explicitly in the genes input vector, it is also possible to define entire chromosomal regions as candidate and background regions. The expression enrichment is then tested for all protein coding genes located in or overlapping the candidate region on the plus or the minus strand. The gene coordinates used to identify those genes were obtained from [http://grch37.ensembl.org/biomart/martview/]. For the random permutations used to compute the FWER, blocks as long as candidate regions are chosen from the background regions and genes contained in these blocks are considered candidate genes (Schematic 2).
To define chromosomal regions in the input vector, the names of the 1/0 vector have to be of the form chr:start-stop, where start always has to be smaller than stop. Note that this option requires the input of background regions. If multiple candidate regions are provided, in the randomsets they are placed randomly, but non-overlapping into the background regions. The output of aba_enrich is identical to the one that is produced for single genes. The second element of the output list contains the candidate and background genes located in the user defined regions:

## create input vector with a candidate region on chromosome 8
## and background regions on chromosome 7, 8 and 9
genes = c(1, rep(0,6))
names(genes) = c('8:82000000-83000000', '7:1300000-56800000', '7:74900000-148700000','8:7400000-44300000', '8:47600000-146300000', '9:0-39200000', '9:69700000-140200000')
genes

##  8:82000000-83000000   7:1300000-56800000 7:74900000-148700000   8:7400000-44300000 8:47600000-146300000 
##                    1                    0                    0                    0                    0 
##         9:0-39200000 9:69700000-140200000 
##                    0                    0

## run enrichment analysis for the adult human brain
res = aba_enrich(genes, dataset='adult', cutoff_quantiles=c(0.5,0.7,0.9), n_randsets=100)

## look at the results from the enrichment analysis
fwers = res[[1]]
head(fwers)

##   age_category structure_id                                  structure times_FWER_under_0.05 mean_FWER
## 1            5   Allen:9150                   LC_locus ceruleus, Right                     0 0.1700000
## 2            5   Allen:9077 SNC_Substantia Nigra, pars compacta, Right                     0 0.2166667
## 3            5   Allen:9149                    LC_locus ceruleus, Left                     0 0.2233333
## 4            5   Allen:4675        LM_Lateral Mammillary Nucleus, Left                     0 0.2233333
## 5            5   Allen:9148                          LC_locus ceruleus                     0 0.2266667
## 6            5   Allen:9138                  6_abducens nucleus, Right                     0 0.2300000
##   min_FWER equivalent_structures          FWERs
## 1     0.11            Allen:9150 0.17;0.11;0.23
## 2     0.14            Allen:9077 0.24;0.14;0.27
## 3     0.13            Allen:9149 0.27;0.13;0.27
## 4     0.14            Allen:4675 0.26;0.14;0.27
## 5     0.13            Allen:9148 0.28;0.13;0.27
## 6     0.14            Allen:9138 0.28;0.14;0.27

## see which genes are located in the candidate regions
input_genes = res[[2]]
candidate_genes = input_genes[input_genes==1]
candidate_genes

##    PAG1   FABP5    PMP2   FABP9   FABP4  FABP12   IMPA1 SLC10A5  ZFAND1  CHMP4C   SNX16 
##       1       1       1       1       1       1       1       1       1       1       1

An alternative method to choose random blocks from the background regions can be used with the option circ_chrom=TRUE. Every candidate region is then compared to background regions on the same chromosome (Schematic 2). And in contrast to the default circ_chrom=FALSE, randomly chosen blocks do not have to be located inside a single background region, but are allowed to overlap multiple background regions. This means that a randomly chose block can consist of the end of the last background region and the beginning of the first background region on a given chromosome.

## create input vector with a candidate region on chromosome 8
## and background regions on chromosome 8
genes = c(1,0,0)
names(genes) = c('8:82000000-82700000', '8:7400000-44300000', '8:47600000-146300000')
genes

##  8:82000000-82700000   8:7400000-44300000 8:47600000-146300000 
##                    1                    0                    0

## run enrichment analysis for the 5 developmental stages using the "circ_chrom" option
res = aba_enrich(genes, dataset='5_stages', cutoff_quantiles=c(0.5,0.7,0.9), n_randsets=100, circ_chrom=TRUE)

## look at the results from the enrichment analysis for children (age_category 3)
fwers = res[[1]]
head(fwers[fwers$age_category==3,])

##    age_category structure_id                                                   structure
## 55            3  Allen:10252       ITC_inferolateral temporal cortex (area TEv, area 20)
## 56            3  Allen:10173                          DFC_dorsolateral prefrontal cortex
## 57            3  Allen:10333                                                STR_striatum
## 58            3  Allen:10235                                      TCx_temporal neocortex
## 59            3  Allen:10160                                   NCx_neocortex (isocortex)
## 60            3  Allen:10278 MFC_anterior (rostral) cingulate (medial prefrontal) cortex
##    times_FWER_under_0.05 mean_FWER min_FWER   equivalent_structures          FWERs
## 55                     0 0.3600000     0.18 Allen:13324;Allen:10252 0.64;0.18;0.26
## 56                     0 0.3800000     0.16             Allen:10173 0.54;0.44;0.16
## 57                     0 0.3866667     0.22 Allen:10333;Allen:10332  0.5;0.44;0.22
## 58                     0 0.3866667     0.24             Allen:10235 0.65;0.24;0.27
## 59                     0 0.4033333     0.27             Allen:10160 0.65;0.29;0.27
## 60                     0 0.4366667     0.22 Allen:10278;Allen:10277  0.6;0.49;0.22

Explore expression data

Explore data from the last expression analysis

The function get_expression enables the output of gene and brain region specific expression data averaged across donors. By only setting the parameter structure_ids defining the brain regions, the gene_ids and dataset are automatically set to the genes and dataset used in the last enrichment analysis. In comparison to defining genes and brain regions explicitly this saves some time since some pre compuations on the original dataset, e.g. aggregation of expression per gene do not have to be redone. By setting the parameter background to TRUE, the gene expression data for both candidate genes and background genes is returned. For the '5_stages' dataset the output of get_expression is a list with a data.frame for each developmental stages, where the first element corresponds to the first stage and so on:

## get expression data (list) for the first 5 brain structures of the enrichment analysis output
expr = get_expression(fwers[1:5,"structure_id"], background=FALSE)
## look at the structure of the expression data
str(expr)

## List of 5
##  $ age_category_1: num [1:15, 1:10] 0.0793 0.0413 0.0237 0.0262 0.0146 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:15] "Allen:10398" "Allen:10173" "Allen:10185" "Allen:10194" ...
##   .. ..$ : chr [1:10] "CHMP4C" "FABP12" "FABP4" "FABP5" ...
##  $ age_category_2: num [1:15, 1:10] 0.189 0.216 0.307 0.15 0.167 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:15] "Allen:10398" "Allen:10173" "Allen:10185" "Allen:10194" ...
##   .. ..$ : chr [1:10] "CHMP4C" "FABP12" "FABP4" "FABP5" ...
##  $ age_category_3: num [1:15, 1:10] 0.198 0.167 0.291 0.122 0.357 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:15] "Allen:10398" "Allen:10173" "Allen:10185" "Allen:10194" ...
##   .. ..$ : chr [1:10] "CHMP4C" "FABP12" "FABP4" "FABP5" ...
##  $ age_category_4: num [1:15, 1:10] 0.4184 0.1333 0.2298 0.0749 0.3778 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:15] "Allen:10398" "Allen:10173" "Allen:10185" "Allen:10194" ...
##   .. ..$ : chr [1:10] "CHMP4C" "FABP12" "FABP4" "FABP5" ...
##  $ age_category_5: num [1:15, 1:10] 0.419 0.182 0.14 0.164 0.312 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:15] "Allen:10398" "Allen:10173" "Allen:10185" "Allen:10194" ...
##   .. ..$ : chr [1:10] "CHMP4C" "FABP12" "FABP4" "FABP5" ...

## see expression for children (age_category 3 -> 3rd element of expression list)
head(expr[[3]])

##                CHMP4C     FABP12      FABP4     FABP5 FABP9     IMPA1     PAG1     PMP2   SLC10A5   ZFAND1
## Allen:10398 0.1982083 0.00000000 0.08022633 1.0350543     0 10.898816 1.136328 82.25636 0.3301700 4.893033
## Allen:10173 0.1674704 0.00000000 0.10582660 0.5633024     0  7.314495 1.203258 36.48997 0.3180618 4.082521
## Allen:10185 0.2907800 0.03865167 0.20189667 0.9398860     0  9.215973 1.015788 53.95269 0.4798123 4.970340
## Allen:10194 0.1220463 0.06602800 0.11968367 0.7609747     0  7.227753 1.402974 64.43418 0.2882860 4.007515
## Allen:10163 0.3574060 0.00000000 0.08532375 0.8578052     0  8.014620 1.572593 76.04565 0.3165335 4.905598
## Allen:10209 0.4306487 0.00000000 0.12425200 0.6276387     0 11.279969 1.190435 66.25536 0.4107553 6.319595

The function plot_expression enables the visualization of expression data. The input parameters of plot_expression are identical to those of get_expression except for the additional arguments dendro and age_category which determine whether or not a dendrogram should be added to the heatplot and the developmental stage to be plotted, respectively.

## plot expression in 5 structures with dendrogram for the prenatal stage (age_category 1)
plot_expression(fwers[1:5,"structure_id"], age_category=1)

plot of chunk unnamed-chunk-13 The colored side bar in the plot without dendrogram colors candidate genes (red) and background genes (black) independently. In this case only candidate gene expression was plotted (with the default option background=FALSE):

## plot expression for the 10 highest scoring brain structures in age category 3 (children, 3-11 yrs) without dendrogram
plot_expression(fwers[fwers$age_category==3, 'structure_id'][1:10], dendro=FALSE, age_category=3)

plot of chunk unnamed-chunk-14

Explore data independently from enrichment analysis

Both get_expression and plot_expression can also be used independently from a previous statistical analysis with aba_enrich. In this case the gene_ids and the dataset must be specified. gene_ids again can be Entrez-ID, Ensembl-ID or HGNC-symbol. For example: obtaining and visualizing the expression of a specific set of genes in the precentral gyrus, which corresponds to the structure id 'Allen:4010', is accomplished by:

## get expression data for six genes in the brain structure 'Allen:4010' 
get_expression(structure_ids=c('Allen:4010'), gene_ids=c(324,8312,673,1029,64764,1499), dataset='adult')

##                 324     8312      673     1029    64764     1499
## Allen:4012 7.014323 4.967949 7.350482 6.597084 6.603828 8.722930
## Allen:4013 7.045578 5.003370 7.298783 6.527379 6.691717 8.584229
## Allen:4014 6.900812 4.874076 7.302336 6.517605 6.368069 8.348260
## Allen:4015 7.016911 4.918875 7.443680 6.726566 6.964441 8.463246
## Allen:4017 6.881426 5.075758 7.234570 6.228161 6.573122 8.397123
## Allen:4018 6.999388 5.113792 7.508438 6.303073 6.692209 8.374960
## Allen:4019 6.929731 4.954951 7.141389 6.392160 6.690830 8.408290
## Allen:4020 6.949528 5.088940 7.308328 6.216858 6.514342 8.446041

## plot this expression 
plot_expression(structure_ids=c('Allen:4010'), gene_ids=c(324,8312,673,1029,64764,1499), dataset='adult')

plot of chunk unnamed-chunk-15

In this example the structure 'Allen:4010' does not directly have independet expression data available in the data set, but some of its substructures do have. Plotting or requesting expression data for 'Allen:4010' automatically obtains the data for all its substructures. The function get_sampled_substructures returns all the substructures for which expression data is available. The function get_name is useful to see the name of a brain region that corresponds to a structure id:

## get ids of the substructres of 'Allen:4010' that have expression data annotated
subs = get_sampled_substructures('Allen:4010')
## get the full name of those substructures
get_name(subs)

##                                                         Allen:4012 
##    "PrG-prc_precentral gyrus, Left, bank of the precentral sulcus" 
##                                                         Allen:4013 
##  "PrG-sl_precentral gyrus, Left, superior lateral aspect of gyrus" 
##                                                         Allen:4014 
##  "PrG-il_precentral gyrus, Left, inferior lateral aspect of gyrus" 
##                                                         Allen:4015 
##        "PrG-cs_precentral gyrus, Left, bank of the central sulcus" 
##                                                         Allen:4017 
##   "PrG-prc_precentral gyrus, Right, bank of the precentral sulcus" 
##                                                         Allen:4018 
## "PrG-sl_precentral gyrus, Right, superior lateral aspect of gyrus" 
##                                                         Allen:4019 
## "PrG-il_precentral gyrus, Right, inferior lateral aspect of gyrus" 
##                                                         Allen:4020 
##       "PrG-cs_precentral gyrus, Right, bank of the central sulcus"

Besides the function get_sampled_substructures the package also contains a function get_superstructures which returns all superstructures of the requested structure and the structure itself. The output is ordered according to the hierarchy of brain structures beginning with the top structure:

## get ids of the superstructures of 'Allen:4010'
supers = get_superstructures('Allen:4010')
## get the full name of those superstructures
get_name(supers)

##             Allen:4005             Allen:4006             Allen:4007             Allen:4008 
##             "Br_Brain"       "GM_Grey Matter"    "Tel_Telencephalon"   "Cx_Cerebral Cortex" 
##             Allen:4009             Allen:4010 
##      "FL_Frontal Lobe" "PrG_precentral gyrus"

Test gene expression enrichment with Wilcoxon rank test

The Wilcoxon rank test requires a score for given sets of genes. It then evaluates the ranks of the genes (based on their score) that are expressed in a given brain structure compared to all the candidate genes that are expressed anywhere in the brain. Again, the genes vector defines the scores, ranking the genes on specific features, while the names of the vector are the corresponding gene identifiers:

## create input vector with random scores associated with the candidate genes
genes = sample(1:50, 12)
names(genes) = c('ENSG00000168036', 'ENSG00000157764', 'ENSG00000163041', 'ENSG00000182158', 'ENSG00000147889', 'ENSG00000103126', 'ENSG00000182054', 'ENSG00000184634', 'ENSG00000134982', 'ENSG00000138413', 'ENSG00000133703', 'ENSG00000132475')
genes

## ENSG00000168036 ENSG00000157764 ENSG00000163041 ENSG00000182158 ENSG00000147889 ENSG00000103126 
##               8              27              49              36              15               9 
## ENSG00000182054 ENSG00000184634 ENSG00000134982 ENSG00000138413 ENSG00000133703 ENSG00000132475 
##               1              38              45              20              43              23

## run enrichment analysis
res = aba_enrich(genes, dataset='adult', test='wilcoxon', cutoff_quantiles=c(0.2,0.5,0.8), n_randsets=100)

## see the 5 brain regions with the lowest FWERs
res[[1]][1:5,]

##   age_category structure_id           structure times_FWER_under_0.05 mean_FWER min_FWER
## 1            5   Allen:4255 CA2_CA2 field, Left                     0 0.5633333     0.47
## 2            5   Allen:4257 CA4_CA4 field, Left                     0 0.6166667     0.47
## 3            5   Allen:4256 CA3_CA3 field, Left                     0 0.6166667     0.47
## 4            5  Allen:12893       CA2_CA2 field                     0 0.6166667     0.47
## 5            5  Allen:12894       CA3_CA3 field                     0 0.6166667     0.47
##   equivalent_structures         FWERs
## 1            Allen:4255 0.7;0.52;0.47
## 2            Allen:4257 0.7;0.68;0.47
## 3            Allen:4256 0.7;0.68;0.47
## 4           Allen:12893 0.7;0.68;0.47
## 5           Allen:12894 0.7;0.68;0.47

Plotting the expression for the output brain regions with the option dendro=FALSE results in a side bar with genes and its scores that were used for the initial enrichment test:

## plot expression of the 5 brain strcutures with the lowest FWERs, with genes ordered by user defined score from Wilcoxon rank test
plot_expression(res[[1]][1:5,"structure_id"], dendro=FALSE)

plot of chunk unnamed-chunk-21

Test developmental effect enrichment with Wilcoxon rank test

In the previous examples genes got annotated to brain regions based on their expression. Besides the two gene expression datasets 'adult' and '5_stages', the dataset 'dev_effect' can be used, which provides scores for an age effect for genes based on their expression change during development. Using this dataset the same analyses as above are performed, except that a gene is annotated to a brain region when its developmental effect score in that region is a above the cutoff_quantiles.
To test the same genes and scores as used above for enrichment of high scoring genes in the set of all genes with a developmental effect score above cutoff, the dataset parameter has to be set to 'dev_effect':

## run enrichment analysis with developmental effect score
res = aba_enrich(genes, dataset='dev_effect', test='wilcoxon', cutoff_quantiles=c(0.2,0.5,0.8), n_randsets=100)

The output of the developmental effect enrichment analysis is equal to that of the expression enrichment analysis:

## see the 5 brain regions with the lowest FWERs
res[[1]][1:5,]

##   age_category structure_id                                               structure times_FWER_under_0.05
## 1            0  Allen:10163              M1C_primary motor cortex (area M1, area 4)                     0
## 2            0  Allen:10657                                   CBC_cerebellar cortex                     0
## 3            0  Allen:10209 S1C_primary somatosensory cortex (area S1, areas 3,1,2)                     0
## 4            0  Allen:10252   ITC_inferolateral temporal cortex (area TEv, area 20)                     0
## 5            0  Allen:10236                      A1C_primary auditory cortex (core)                     0
##   mean_FWER min_FWER                                       equivalent_structures          FWERs
## 1 0.3900000     0.20                                     Allen:10163;Allen:10162  0.24;0.73;0.2
## 2 0.6300000     0.20 Allen:10657;Allen:10656;Allen:10655;Allen:10654;Allen:10653   0.89;0.8;0.2
## 3 0.7033333     0.20                                                 Allen:10209  0.95;0.96;0.2
## 4 0.7033333     0.20                                     Allen:13324;Allen:10252  0.95;0.96;0.2
## 5 0.7800000     0.46                                                 Allen:10236 0.92;0.96;0.46

Again, the developmental effect scores can be retrieved with the functions get_expression and plotted with plot_expression:

## plot developmental effect score of the 5 brain strcutures with the lowest FWERs
plot_expression(res[[1]][1:5,"structure_id"])

plot of chunk unnamed-chunk-24

Schematics

Schematic 1: Hypergeometric test and the FWER calculation

$FWER calculation$

The FWER for the Wilcoxon rank test is computed in a similar way, with random permutations of the scores that are assigned to genes.

Schematic 2: `circ_chrom` option for genomic regions input

genomic regions

To use genomic regions as input, the names of the genes input vector have to be of the form chr:start-stop. The option circ_chrom defines how candidate regions are randomly moved inside the background regions for computing the FWER. When circ_chrom = FALSE (default), candidate regions can be moved to any background region on any chromosome, but are not allowed to overlap multiple background regions. When circ_chrom = TRUE, candidate regions are only moved on the same chromosome and are allowed to overlap multiple background regions. The chromosome is “circularized”, that is a randomly placed candidate region may start at the end of the chromosome and continue on the the beginning.

Session Info

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ABAEnrichment_1.4.0 BiocStyle_2.2.0    
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5       gplots_3.0.1       formatR_1.4        tools_3.3.1        ABAData_1.3.3     
##  [6] Rcpp_0.12.7        KernSmooth_2.23-15 stringi_1.1.2      gdata_2.17.0       knitr_1.14        
## [11] caTools_1.17.1     stringr_1.1.0      bitops_1.0-6       gtools_3.5.0       evaluate_0.10

References

[1] Hawrylycz, M.J. et al. (2012) An anatomically comprehensive atlas of the adult human brain transcriptome, Nature 489: 391-399. [doi:10.1038/nature11405]

[2] Miller, J.A. et al. (2014) Transcriptional landscape of the prenatal human brain, Nature 508: 199-206. [doi:10.1038/nature13185]

[3] Allen Institute for Brain Science. Allen Human Brain Atlas (Internet). Available from: [http://human.brain-map.org/]

[4] Allen Institute for Brain Science. BrainSpan Atlas of the Developing Human Brain (Internet). Available from: [http://brainspan.org/]

[5] Pruefer, K. et al. (2007) FUNC: A package for detecting significant associations between gene sets and ontological annotations, BMC Bioinformatics 8: 41. [doi:10.1186/1471-2105-8-41]