1 Introduction

The conumee package provides tools for performing copy-number variation (CNV) analysis using Illumina 450k methylation arrays. Although the primary purpose of 450k arrays is the detection of genome-wide DNA methylation levels [1], it can additionally be used to extract useful information about copy-number alterations, e.g. in clinical cancer samples. The approach was initially described in Sturm et al., 2012 [2]. Other tools following a similar strategy include ChAMP and CopyNumber450k [3, 4].

CNV analysis is performed using a two-step approach. First, the combined intensity values of both ‘methylated’ and ‘unmethylated’ channel of each CpG are normalized using a set of normal controls (i.e. with a flat genome not showing any copy-number alterations). This step is required for correcting for probe and sample bias (e.g. caused by GC-content, type I/II differences or technical variability). Secondly, neighboring probes are combined in a hybrid approach, resulting in bins of a minimum size and a minimum number of probes. This step is required to reduce remaining technical variability and enable meaningful segmentation results.

The conumee package has been written for seamless integration with frequently-used Bioconductor packages. It is recommended that 450k data is loaded using the minfi package [5]. Segmentation is performed using the circular binary segmentation (CBS) algorithm implemented in the DNAcopy package [6]. Processing time is typically less than a minute per sample.

Finally, the conumee package also provides a set of plotting methods to create publication-grade CNV plots of the whole genome, selected chromosomes or previously defined regions of interest. Writing of text-based output files for visualization in other tools (e.g. the IGV browser) or downstream processing (e.g. using GISTIC) is also supported.

2 Load data

The recommended input format are Mset objects generated from raw IDAT files using the minfi package. Depending on the analysis workflow, these objects might be already available. A good example data-set for testing the conumee package can be downloaded from TCGA (971.6 MB):

https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/brca/cgcc/jhu-usc.edu/humanmethylation450/methylation/jhu-usc.edu_BRCA.HumanMethylation450.Level_1.8.8.0.tar.gz

This data-set comprises of 42 primary breast cancer samples, 2 breast cancer cell lines and 16 control samples. Make sure to unpack the archive. For the purpose of this vignette, only two illustrative examples are downloaded using the read.450k.url method.

library("minfi")
library("conumee")
#RGsetTCGA <- read.450k.exp(base = "~/conumee_analysis/jhu-usc.edu_BRCA.HumanMethylation450.Level_1.8.8.0")  # adjust path
RGsetTCGA <- read.450k.url()  # use default parameters for vignette examples
MsetTCGA <- preprocessIllumina(RGsetTCGA)
MsetTCGA
## MethylSet (storageMode: lockedEnvironment)
## assayData: 485512 features, 2 samples 
##   element names: Meth, Unmeth 
## An object of class 'AnnotatedDataFrame': none
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## Preprocessing
##   Method: Illumina, bg.correct = TRUE, normalize = controls, reference = 1
##   minfi version: 1.18.2
##   Manifest version: 0.4.0

Alternatively, you can use the minfiData example data-set, comprising of 3 cancer samples and 3 normal controls.

library("minfiData")
data(MsetEx)
MsetEx
## MethylSet (storageMode: lockedEnvironment)
## assayData: 485512 features, 6 samples 
##   element names: Meth, Unmeth 
## An object of class 'AnnotatedDataFrame'
##   sampleNames: 5723646052_R02C02 5723646052_R04C01 ...
##     5723646053_R06C02 (6 total)
##   varLabels: Sample_Name Sample_Well ... filenames (13 total)
##   varMetadata: labelDescription
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## Preprocessing
##   Method: Raw (no normalization or bg correction)
##   minfi version: 1.7.7
##   Manifest version: 0.4.0

The CopyNumber450k package provides a large data-set of 52 control samples which can be used for normalization.

library("CopyNumber450kData")
data(RGcontrolSetEx)
MsetControls <- preprocessIllumina(RGcontrolSetEx)

If raw IDAT files are unavailable, data can also be supplied by importing text-based input files, e.g. as generated by GenomeStudio or downloaded from the GEO repository. More details are given later.

3 Create annotation object

To begin with the CNV analysis, an annotation object, generated by the CNV.create_anno function, is required. This object holds information which only needs to be generated once, irrespective of the number of samples that are processed.

Arguments bin_minprobes and bin_minsize define the minimum number of probes per bin and the minimum bin size (default values that were optimized for 450k data are 15 and 50000, respectively). Argument exclude_regions defines regions to be excluded (e.g. polymorphic regions, an example is given in data(exclude_regions)). Please see ?CNV.create_anno for more details.

Argument detail_regions defines regions to be examined in detail (e.g. dedicated detail plots or text output, see below). For example, detail regions can contain known oncogenes or tumor suppressor genes. These regions should either be supplied as a path to a BED file or as a GRanges object (an example is given in data(detail_regions)). The start and stop coordinates indicate the regions in which probes are analyzed in detail. The plotting range of detail plots are defined in a second set of start and stop coordinates in the GRanges object values (or the 7^th and 8^th column in the BED file).

data(exclude_regions)
data(detail_regions)
head(detail_regions, n = 2)
## GRanges object with 2 ranges and 5 metadata columns:
##       seqnames               ranges strand |        name
##          <Rle>            <IRanges>  <Rle> | <character>
##   [1]    chr11 [69453874, 69469242]      + |       CCND1
##   [2]    chr19 [30300902, 30315215]      + |       CCNE1
##                      thick     score probes_gene probes_promoter
##                  <IRanges> <integer>   <integer>       <integer>
##   [1] [68961558, 69961557]        70          38              32
##   [2] [29808059, 30808058]        16          10               6
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths

anno <- CNV.create_anno(exclude_regions = exclude_regions, detail_regions = detail_regions)
## using genome annotations from UCSC
## getting 450k annotations
##  - 470870 probes used
## importing regions to exclude from analysis
## importing regions for detailed analysis
## creating bins
##  - 53891 bins created
## merging bins
##  - 15820 bins remaining

anno
## CNV annotation object
##    created  : Sun May 15 22:25:48 2016
##   @genome   : 22 chromosomes
##   @gap      : 313 regions
##   @probes   : 470870 probes
##   @exclude  : 3 regions (overlapping 570 probes)
##   @detail   : 20 regions (overlapping 768 probes)
##   @bins     : 15820 bins (min/avg/max size: 50/168.5/5000kb, probes: 15/29.7/478)

4 Combine intensity values

Intensity values of the ‘methylated’ and ‘unmethylated’ channel are combined using the CNV.load function. Input can be either an Mset object (recommended, see above), or a data.frame or matrix object containing intensities generated by GenomeStudio or downloaded from GEO (imported using e.g. read.table, should work without reformatting for most tables, please refer to ?CNV.load for more details).

data(tcgaBRCA.sentrix2name)  # TCGA sample IDs are supplied with the conumee package
sampleNames(MsetTCGA) <- tcgaBRCA.sentrix2name[sampleNames(MsetTCGA)]
tcga.data <- CNV.load(MsetTCGA)
tcga.controls <- grep("-11A-", names(tcga.data))
names(tcga.data)
## [1] "TCGA-AR-A1AU-01A-11D-A12R-05" "TCGA-AR-A1AY-01A-21D-A12R-05"
tcga.data
## CNV data object
##    created   : 
##   @intensity : available (2 samples, 485512 probes)

minfi.data <- CNV.load(MsetEx)
minfi.controls <- pData(MsetEx)$status == "normal"

controls.data <- CNV.load(MsetControls)

5 Perform CNV analysis

The main CNV analysis is separated into four parts:

First, CNV.fit is used to normalize a single query sample to a set of control samples by multiple linear regression. For best results control samples of matched normal tissues that are profiled within the same experiment are used (which are likely to have the same technical bias). Essentially this regression analysis yields the linear combination of controls samples that most closely fits the intensities of the query sample. Subsequently, the log2-ratio of probe intensities of the query sample versus the combination of control samples are calculated and used for further analysis. More details are given in the publication.
Secondly, CNV.bin is used to combine probes within predefined genomic bins. Bins are previously generated using CNV.create_anno. Intensity values are shifted to minimize the median absolute deviation from all bins to zero to determine the copy-number neutral state.
Thirdly, CNV.detail is used to analyze detail regions in detail. This step is optional, but required if detail regions should be outputted in plots and text files. Detail regions are defined using CNV.create_anno.
Finally, CNV.segment is used to segment the genome into regions of the same copy-number state. This function is a wrapper of the CNA, segment, segments.summary and segments.p functions of the DNAcopy package. Default parameters were optimized for 450k data, but can be changed. See ?CNV.segment for more details.

x <- CNV.fit(minfi.data["GroupB_1"], minfi.data[minfi.controls], anno)
y <- CNV.fit(tcga.data["TCGA-AR-A1AU-01A-11D-A12R-05"], controls.data, anno)  # use TCGA control samples for better results
z <- CNV.fit(tcga.data["TCGA-AR-A1AY-01A-21D-A12R-05"], controls.data, anno)

x <- CNV.bin(x)
x <- CNV.detail(x)
x <- CNV.segment(x)

y <- CNV.segment(CNV.detail(CNV.bin(y)))
z <- CNV.segment(CNV.detail(CNV.bin(z)))

x
## CNV analysis object
##    created   : Sun May 15 22:26:36 2016
##   @name      : GroupB_1
##   @anno      : 22 chromosomes, 470870 probes, 15820 bins
##   @fit       : available (noise: 0.237)
##   @bin       : available (shift: -0.018)
##   @detail    : available (20 regions)
##   @seg       : available (48 segments)

6 Output plots and text files

The conumee package supports two types of plots:

The CNV.genomeplot method produces plots of the complete genome or of one or multiple chromosomes. Intensity values of each bin are plotted in colored dots. Segments are shown as blue lines. See ?CNV.genomeplot for more details.
The CNV.detailplot methods produces plots of individual detail regions, as defined in CNV.create_anno. Intensity values of individual probes are plotted in colored crosses. Bins are shown as blue lines. CNV.detailplot_wrap is a wrapper function that produces a single plot of all detail regions.

Text output is generated using the CNV.write method. Parameter what specifies if “probes”, “bins”, “detail” or “segments” should be returned. If parameter file is specified, the output is written into a file, otherwise a data.frame is returned. See ?CNV.write for more details.

#pdf("~/conumee_analysis/CNVplots.pdf", height = 9, width = 18)
CNV.genomeplot(x)

CNV.genomeplot(y)
CNV.genomeplot(y, chr = "chr6")

CNV.genomeplot(z)
CNV.genomeplot(z, chr = "chr10")
CNV.detailplot(z, name = "PTEN")
CNV.detailplot_wrap(z)
#dev.off()

head(CNV.write(y, what = "segments"), n = 5)
##                             ID chrom loc.start   loc.end num.mark    bstat
## 1 TCGA-AR-A1AU-01A-11D-A12R-05  chr1    635684  89150000      728 55.16184
## 2 TCGA-AR-A1AU-01A-11D-A12R-05  chr1  89400000 118475000      161 44.01507
## 3 TCGA-AR-A1AU-01A-11D-A12R-05  chr1 119000000 249195311      704       NA
## 4 TCGA-AR-A1AU-01A-11D-A12R-05 chr10    105000 135462374      840       NA
## 5 TCGA-AR-A1AU-01A-11D-A12R-05 chr11    130000  50516927      326 39.93951
##   pval seg.mean seg.median
## 1    0    0.008      0.006
## 2    0   -0.252     -0.248
## 3   NA    0.103      0.112
## 4   NA   -0.007     -0.007
## 5    0    0.150      0.143

#CNV.write(y, what = "segments", file = "~/conumee_analysis/TCGA-AR-A1AU.CNVsegments.seg")  # adjust path
#CNV.write(y, what = "bins", file = "~/conumee_analysis/TCGA-AR-A1AU.CNVbins.igv")
#CNV.write(y, what = "detail", file = "~/conumee_analysis/TCGA-AR-A1AU.CNVdetail.txt")
#CNV.write(y, what = "probes", file = "~/conumee_analysis/TCGA-AR-A1AU.CNVprobes.igv")

7 Contact & citation

For bug-reports, comments and feature requests please write to [email protected].

When using conumee in your work, please cite as:

citation("conumee")
## 
##   Hovestadt V, Zapatka M (2015). "conumee: Enhanced copy-number
##   variation analysis using Illumina 450k methylation arrays.", R
##   package version 0.99.4. <URL:
##   http://www.bioconductor.org/packages/release/bioc/html/conumee.html>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {conumee: Enhanced copy-number variation analysis using Illumina 450k methylation arrays},
##     author = {Volker Hovestadt and Marc Zapatka},
##     address = {Division of Molecular Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany},
##     note = {R package version 0.99.4},
##     url = {http://www.bioconductor.org/packages/release/bioc/html/conumee.html},
##   }

8 Session info

sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] CopyNumber450kData_1.8.0                          
##  [2] minfiData_0.14.0                                  
##  [3] conumee_1.4.2                                     
##  [4] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.2.1
##  [5] IlluminaHumanMethylation450kmanifest_0.4.0        
##  [6] minfi_1.18.2                                      
##  [7] bumphunter_1.12.0                                 
##  [8] locfit_1.5-9.1                                    
##  [9] iterators_1.0.8                                   
## [10] foreach_1.4.3                                     
## [11] Biostrings_2.40.0                                 
## [12] XVector_0.12.0                                    
## [13] SummarizedExperiment_1.2.2                        
## [14] GenomicRanges_1.24.0                              
## [15] GenomeInfoDb_1.8.2                                
## [16] IRanges_2.6.0                                     
## [17] S4Vectors_0.10.0                                  
## [18] lattice_0.20-33                                   
## [19] Biobase_2.32.0                                    
## [20] BiocGenerics_0.18.0                               
## [21] BiocStyle_2.0.2                                   
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.1.0              nor1mix_1.2-1          
##  [3] splines_3.3.0           doRNG_1.6              
##  [5] Rsamtools_1.24.0        yaml_2.1.13            
##  [7] RSQLite_1.0.0           limma_3.28.4           
##  [9] quadprog_1.5-5          chron_2.3-47           
## [11] digest_0.6.9            RColorBrewer_1.1-2     
## [13] htmltools_0.3.5         preprocessCore_1.34.0  
## [15] Matrix_1.2-6            plyr_1.8.3             
## [17] GEOquery_2.38.4         siggenes_1.46.0        
## [19] XML_3.98-1.4            biomaRt_2.28.0         
## [21] genefilter_1.54.2       zlibbioc_1.18.0        
## [23] xtable_1.8-2            BiocParallel_1.6.2     
## [25] openssl_0.9.3           annotate_1.50.0        
## [27] beanplot_1.2            pkgmaker_0.22          
## [29] GenomicFeatures_1.24.2  survival_2.39-4        
## [31] magrittr_1.5            mclust_5.2             
## [33] evaluate_0.9            nlme_3.1-128           
## [35] MASS_7.3-45             tools_3.3.0            
## [37] registry_0.3            data.table_1.9.6       
## [39] formatR_1.4             matrixStats_0.50.2     
## [41] stringr_1.0.0           rngtools_1.2.4         
## [43] AnnotationDbi_1.34.2    base64_2.0             
## [45] grid_3.3.0              RCurl_1.95-4.8         
## [47] bitops_1.0-6            rmarkdown_0.9.6        
## [49] DNAcopy_1.46.0          codetools_0.2-14       
## [51] multtest_2.28.0         DBI_0.4-1              
## [53] reshape_0.8.5           R6_2.1.2               
## [55] illuminaio_0.14.0       GenomicAlignments_1.8.0
## [57] knitr_1.13              rtracklayer_1.32.0     
## [59] stringi_1.0-1           Rcpp_0.12.5

References

1. Bibikova M, Barnes B, Tsan C, Ho V, Klotzle B, Le JM, Delano D, Zhang L, Schroth GP, Gunderson KL, Fan JB, Shen R: High density DNA methylation array with single CpG site resolution. Genomics 2011, 98:288–295.

2. Sturm D, Witt H, Hovestadt V, Khuong-Quang D-A, Jones DT, Konermann C, Pfaff E, Toenjes M, Sill M, Bender S, Kool M, Zapatka M, Becker N, Zucknick M, Hielscher T, Liu XY, Fontebasso AM, Ryzhova M, Albrecht S, Jacob K, Wolter M, Ebinger M, Schuhmann MU, Meter T van, Fruehwald MC, Hauch H, Pekrun A, Radlwimmer B, Niehues T, Komorowski G von, et al.: Hotspot mutations in H3F3A and IDH1 define distinct epigenetic and biological subgroups of glioblastoma. Cancer Cell 2012, 22:425–437.

3. Feber A, Guilhamon P, Lechner M, Fenton T, Wilson GA, Thirlwell C, Morris TJ, Flanagan AM, Teschendorff AE, Kelly JD, Beck S: Using high-density DNA methylation arrays to profile copy number alterations. Genome Biology 2014, 15.

4. Papillon-Cavanagh S, Fortin J, De Jay N, Majewski J: CopyNumber450k: An R package for CNV inference using illumina 450k DNA methylation assay. Submitted.

5. Aryee MJ, Jaffe AE, Hector C, Christine L, Feinberg AP, Hansen KD, Irizarry RA: Minfi: A flexible and comprehensive bioconductor package for the analysis of infinium DNA methylation microarrays. Bioinformatics 2014, 30:1363–1369.

6. Olshen AB, Venkatraman E, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004, 5:557–572.

The conumee vignette

Volker Hovestadt, Marc Zapatka (Division of Molecular Genetics, German Cancer Research Center, Heidelberg, Germany)

2016-05-15

Contents