Package: AnnotationHub
Authors: Bioconductor Package Maintainer [cre],
Martin Morgan [aut],
Marc Carlson [ctb],
Dan Tenenbaum [ctb],
Sonali Arora [ctb],
Valerie Oberchain [ctb],
Kayla Morrell [ctb],
Lori Shepherd [aut]
Modified: Mon March 18 2024
Compiled: Tue Apr 30 22:31:16 2024
Bioconductor offers pre-built org.*
annotation packages for model
organisms, with their use described in the
OrgDb
section of the Annotation work flow. Here we discover available OrgDb
objects for less-model organisms
library(AnnotationHub)
ah <- AnnotationHub()
query(ah, "OrgDb")
## AnnotationHub with 1988 records
## # snapshotDate(): 2024-04-29
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, NCBI
## # $species: Escherichia coli, Coffea arabica, greater Indian_fruit_bat, Zootoca vivipara, Zooter...
## # $rdataclass: OrgDb, TxDb
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## # rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH111588"]]'
##
## title
## AH111588 | OrgDb Sqlite file for Coffea arabica
## AH114195 | org.Triticum_aestivum.eg.sqlite
## AH114196 | org.Triticum_aestivum_subsp._aestivum.eg.sqlite
## AH114197 | org.Triticum_vulgare.eg.sqlite
## AH114198 | org.Brassica_napus.eg.sqlite
## ... ...
## AH116714 | org.Mmu.eg.db.sqlite
## AH116715 | org.Ce.eg.db.sqlite
## AH116716 | org.Xl.eg.db.sqlite
## AH116717 | org.Sc.sgd.db.sqlite
## AH116718 | org.Dr.eg.db.sqlite
orgdb <- query(ah, c("OrgDb", "[email protected]"))[[1]]
## loading from cache
The object returned by AnnotationHub is directly usable with the
select()
interface, e.g., to discover the available keytypes for
querying the object, the columns that these keytypes can map to, and
finally selecting the SYMBOL and GENENAME corresponding to the first 6
ENTREZIDs
library(AnnotationDbi)
AnnotationDbi::keytypes(orgdb)
## [1] "ACCNUM" "ALIAS" "ENTREZID" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [7] "GID" "GO" "GOALL" "ONTOLOGY" "ONTOLOGYALL" "PMID"
## [13] "REFSEQ" "SYMBOL" "UNIGENE"
AnnotationDbi::columns(orgdb)
## [1] "ACCNUM" "ALIAS" "CHR" "ENTREZID" "EVIDENCE" "EVIDENCEALL"
## [7] "GENENAME" "GID" "GO" "GOALL" "ONTOLOGY" "ONTOLOGYALL"
## [13] "PMID" "REFSEQ" "SYMBOL" "UNIGENE"
egid <- head(keys(orgdb, "ENTREZID"))
AnnotationDbi::select(orgdb, egid, c("SYMBOL", "GENENAME"), "ENTREZID")
## 'select()' returned 1:1 mapping between keys and columns
## ENTREZID SYMBOL GENENAME
## 1 100037518 LOC100037518 chitinase CLP
## 2 100037521 LOC100037521 soluble starch synthase 2-3, chloroplastic/amyloplastic
## 3 100037522 LOC100037522 ubiquinol oxidase 1a, mitochondrial
## 4 100037523 LOC100037523 glutathione S-transferase 1
## 5 100037524 LOC100037524 sucrose-phosphatase 2
## 6 100037525 LOC100037525 cyclophilin
All Roadmap Epigenomics files are hosted here. If one had to download these files on their own, one would navigate through the web interface to find useful files, then use something like the following R code.
url <- "http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E001-H3K4me1.broadPeak.gz"
filename <- basename(url)
download.file(url, destfile=filename)
if (file.exists(filename))
data <- import(filename, format="bed")
This would have to be repeated for all files, and the onus would lie on the user to identify, download, import, and manage the local disk location of these files.
AnnotationHub reduces this task to just a few lines of R code
library(AnnotationHub)
ah = AnnotationHub()
epiFiles <- query(ah, "EpigenomeRoadMap")
A look at the value returned by epiFiles
shows us that
18250 roadmap resources are available via
AnnotationHub. Additional information about
the files is also available, e.g., where the files came from
(dataprovider), genome, species, sourceurl, sourcetypes.
epiFiles
## AnnotationHub with 18250 records
## # snapshotDate(): 2024-04-29
## # $dataprovider: BroadInstitute, NA
## # $species: Homo sapiens
## # $rdataclass: BigWigFile, GRanges, data.frame
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## # rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH28856"]]'
##
## title
## AH28856 | E001-H3K4me1.broadPeak.gz
## AH28857 | E001-H3K4me3.broadPeak.gz
## AH28858 | E001-H3K9ac.broadPeak.gz
## AH28859 | E001-H3K9me3.broadPeak.gz
## AH28860 | E001-H3K27me3.broadPeak.gz
## ... ...
## AH49542 | E061_mCRF_FractionalMethylation.bigwig
## AH49543 | E081_mCRF_FractionalMethylation.bigwig
## AH49544 | E082_mCRF_FractionalMethylation.bigwig
## AH116724 | TENET_consensus_enhancer_regions
## AH116726 | TENET_consensus_promoter_regions
A good sanity check to ensure that we have files only from the Roadmap Epigenomics project is to check that all the files in the returned smaller hub object come from Homo sapiens and the hg19, hg38 genome
unique(epiFiles$species)
## [1] "Homo sapiens"
unique(epiFiles$genome)
## [1] "hg19" "hg38"
Broadly, one can get an idea of the different files from this project looking at the sourcetype
table(epiFiles$sourcetype)
##
## BED BigWig GTF Multiple tab Zip
## 8298 9932 3 2 1 14
To get a more descriptive idea of these different files one can use:
sort(table(epiFiles$description), decreasing=TRUE)
##
## Bigwig File containing -log10(p-value) signal tracks from EpigenomeRoadMap Project
## 6881
## Bigwig File containing fold enrichment signal tracks from EpigenomeRoadMap Project
## 2947
## Narrow ChIP-seq peaks for consolidated epigenomes from EpigenomeRoadMap Project
## 2894
## Broad ChIP-seq peaks for consolidated epigenomes from EpigenomeRoadMap Project
## 2534
## Gapped ChIP-seq peaks for consolidated epigenomes from EpigenomeRoadMap Project
## 2534
## Narrow DNasePeaks for consolidated epigenomes from EpigenomeRoadMap Project
## 131
## 15 state chromatin segmentations from EpigenomeRoadMap Project
## 127
## Broad domains on enrichment for DNase-seq for consolidated epigenomes from EpigenomeRoadMap Project
## 78
## RRBS fractional methylation calls from EpigenomeRoadMap Project
## 51
## Whole genome bisulphite fractional methylation calls from EpigenomeRoadMap Project
## 37
## MeDIP/MRE(mCRF) fractional methylation calls from EpigenomeRoadMap Project
## 16
## GencodeV10 gene/transcript coordinates and annotations corresponding to hg19 version of the human genome
## 3
## RNA-seq read count matrix for intronic protein-coding RNA elements
## 2
## RNA-seq read counts matrix for ribosomal gene exons
## 2
## RPKM expression matrix for ribosomal gene exons
## 2
## A composite GRanges object containing regions of putative enhancer elements from a variety of sources, primarily for use in the TENET Biocondcutor package. This dataset is composed of regions of strong enhancers as annotated by the EpigenomeRoadMap ChromHMM expanded 18-state model based on 98 reference epigenomes, lifted over to the hg38 genome, as well as regions of human permissive enhancers identified by the FANTOM5 project. For additional information on component datasets, see the manifest file hosted at https://github.com/rhielab/TENET.AnnotationHub/data-raw/TENET_consensus_datasets_manifest.tsv
## 1
## A composite GRanges object containing regions of putative promoter elements from a variety of sources, primarily for use in the TENET Biocondcutor package. This dataset is composed of regions flanking transciption start sites as annotated by the EpigenomeRoadMap ChromHMM expanded 18-state model based on 98 reference epigenomes, lifted over to the hg38 genome. For additional information on component datasets, see the manifest file hosted at https://github.com/rhielab/TENET.AnnotationHub/data-raw/TENET_consensus_datasets_manifest.tsv
## 1
## Metadata for EpigenomeRoadMap Project
## 1
## RNA-seq read counts matrix for non-coding RNAs
## 1
## RNA-seq read counts matrix for protein coding exons
## 1
## RNA-seq read counts matrix for protein coding genes
## 1
## RNA-seq read counts matrix for ribosomal genes
## 1
## RPKM expression matrix for non-coding RNAs
## 1
## RPKM expression matrix for protein coding exons
## 1
## RPKM expression matrix for protein coding genes
## 1
## RPKM expression matrix for ribosomal RNAs
## 1
The ‘metadata’ provided by the Roadmap Epigenomics Project is also available. Note that the information displayed about a hub with a single resource is quite different from the information displayed when the hub references more than one resource.
metadata.tab <- query(ah , c("EpigenomeRoadMap", "Metadata"))
metadata.tab
## AnnotationHub with 1 record
## # snapshotDate(): 2024-04-29
## # names(): AH41830
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: data.frame
## # $rdatadateadded: 2015-05-11
## # $title: EID_metadata.tab
## # $description: Metadata for EpigenomeRoadMap Project
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: tab
## # $sourceurl: http://egg2.wustl.edu/roadmap/data/byFileType/metadata/EID_metadata.tab
## # $sourcesize: 18035
## # $tags: c("EpigenomeRoadMap", "Metadata")
## # retrieve record with 'object[["AH41830"]]'
So far we have been exploring information about resources, without
downloading the resource to a local cache and importing it into R.
One can retrieve the resource using [[
as indicated at the
end of the show method
## loading from cache
metadata.tab <- ah[["AH41830"]]
## loading from cache
The metadata.tab file is returned as a data.frame. The first 6 rows of the first 5 columns are shown here:
metadata.tab[1:6, 1:5]
## EID GROUP COLOR MNEMONIC STD_NAME
## 1 E001 ESC #924965 ESC.I3 ES-I3 Cells
## 2 E002 ESC #924965 ESC.WA7 ES-WA7 Cells
## 3 E003 ESC #924965 ESC.H1 H1 Cells
## 4 E004 ES-deriv #4178AE ESDR.H1.BMP4.MESO H1 BMP4 Derived Mesendoderm Cultured Cells
## 5 E005 ES-deriv #4178AE ESDR.H1.BMP4.TROP H1 BMP4 Derived Trophoblast Cultured Cells
## 6 E006 ES-deriv #4178AE ESDR.H1.MSC H1 Derived Mesenchymal Stem Cells
One can keep constructing different queries using multiple arguments to trim down these 18250 to get the files one wants. For example, to get the ChIP-Seq files for consolidated epigenomes, one could use
bpChipEpi <- query(ah , c("EpigenomeRoadMap", "broadPeak", "chip", "consolidated"))
To get all the bigWig signal files, one can query the hub using
allBigWigFiles <- query(ah, c("EpigenomeRoadMap", "BigWig"))
To access the 15 state chromatin segmentations, one can use
seg <- query(ah, c("EpigenomeRoadMap", "segmentations"))
If one is interested in getting all the files related to one sample
E126 <- query(ah , c("EpigenomeRoadMap", "E126", "H3K4ME2"))
E126
## AnnotationHub with 6 records
## # snapshotDate(): 2024-04-29
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: GRanges, BigWigFile
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## # rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH29817"]]'
##
## title
## AH29817 | E126-H3K4me2.broadPeak.gz
## AH30868 | E126-H3K4me2.narrowPeak.gz
## AH31801 | E126-H3K4me2.gappedPeak.gz
## AH32990 | E126-H3K4me2.fc.signal.bigwig
## AH34022 | E126-H3K4me2.pval.signal.bigwig
## AH40177 | E126-H3K4me2.imputed.pval.signal.bigwig
Hub resources can also be selected using $
, subset()
, and
BiocHubsShiny()
; see the main
AnnotationHub vignette for additional detail.
Hub resources are imported as the appropriate Bioconductor object for use in further analysis. For example, peak files are returned as GRanges objects.
## loading from cache
## require("rtracklayer")
##
## Attaching package: 'BiocIO'
## The following object is masked from 'package:rtracklayer':
##
## FileForFormat
peaks <- E126[['AH29817']]
## loading from cache
seqinfo(peaks)
## Seqinfo object with 298 sequences (2 circular) from hg19 genome:
## seqnames seqlengths isCircular genome
## chr1 249250621 FALSE hg19
## chr2 243199373 FALSE hg19
## chr3 198022430 FALSE hg19
## chr4 191154276 FALSE hg19
## chr5 180915260 FALSE hg19
## ... ... ... ...
## chrUn_gl000245 36651 FALSE hg19
## chrUn_gl000246 38154 FALSE hg19
## chrUn_gl000247 36422 FALSE hg19
## chrUn_gl000248 39786 FALSE hg19
## chrUn_gl000249 38502 FALSE hg19
BigWig files are returned as BigWigFile objects. A BigWigFile is a
reference to a file on disk; the data in the file can be read in using
rtracklayer::import()
, perhaps querying these large files for
particular genomic regions of interest as described on the help page
?import.bw
.
Each record inside AnnotationHub is associated with a unique identifier. Most GRanges objects returned by AnnotationHub contain the unique AnnotationHub identifier of the resource from which the GRanges is derived. This can come handy when working with the GRanges object for a while, and additional information about the object (e.g., the name of the file in the cache, or the original sourceurl for the data underlying the resource) that is being worked with.
metadata(peaks)
## $AnnotationHubName
## [1] "AH29817"
##
## $`File Name`
## [1] "E126-H3K4me2.broadPeak.gz"
##
## $`Data Source`
## [1] "http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E126-H3K4me2.broadPeak.gz"
##
## $Provider
## [1] "BroadInstitute"
##
## $Organism
## [1] "Homo sapiens"
##
## $`Taxonomy ID`
## [1] 9606
ah[metadata(peaks)$AnnotationHubName]$sourceurl
## [1] "http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E126-H3K4me2.broadPeak.gz"
Bioconductor represents gene models using ‘transcript’
databases. These are available via packages such as
TxDb.Hsapiens.UCSC.hg38.knownGene
or can be constructed using functions such as
txdbmaker::makeTxDbFromBiomart()
.
AnnotationHub provides an easy way to work with gene models published by Ensembl. Let’s see what Ensembl’s Release-94 has in terms of data for pufferfish, Takifugu rubripes.
query(ah, c("Takifugu", "release-94"))
## AnnotationHub with 7 records
## # snapshotDate(): 2024-04-29
## # $dataprovider: Ensembl
## # $species: Takifugu rubripes
## # $rdataclass: TwoBitFile, GRanges
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## # rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH64856"]]'
##
## title
## AH64856 | Takifugu_rubripes.FUGU5.94.abinitio.gtf
## AH64857 | Takifugu_rubripes.FUGU5.94.chr.gtf
## AH64858 | Takifugu_rubripes.FUGU5.94.gtf
## AH66114 | Takifugu_rubripes.FUGU5.cdna.all.2bit
## AH66115 | Takifugu_rubripes.FUGU5.dna_rm.toplevel.2bit
## AH66116 | Takifugu_rubripes.FUGU5.dna_sm.toplevel.2bit
## AH66117 | Takifugu_rubripes.FUGU5.ncrna.2bit
We see that there is a GTF file descrbing gene models, as well as various DNA sequences. Let’s retrieve the GTF and top-level DNA sequence files. The GTF file is imported as a GRanges instance, the DNA sequence as a twobit file.
gtf <- ah[["AH64858"]]
## loading from cache
## Importing File into R ..
dna <- ah[["AH66116"]]
## loading from cache
head(gtf, 3)
## GRanges object with 3 ranges and 22 metadata columns:
## seqnames ranges strand | source type score phase gene_id
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer> <character>
## [1] 1 217531-252954 + | ensembl gene NA <NA> ENSTRUG00000009922
## [2] 1 217531-252954 + | ensembl transcript NA <NA> ENSTRUG00000009922
## [3] 1 217531-217702 + | ensembl exon NA <NA> ENSTRUG00000009922
## gene_version gene_name gene_source gene_biotype transcript_id transcript_version
## <character> <character> <character> <character> <character> <character>
## [1] 2 sdk2b ensembl protein_coding <NA> <NA>
## [2] 2 sdk2b ensembl protein_coding ENSTRUT00000025027 2
## [3] 2 sdk2b ensembl protein_coding ENSTRUT00000025027 2
## transcript_name transcript_source transcript_biotype exon_number exon_id
## <character> <character> <character> <character> <character>
## [1] <NA> <NA> <NA> <NA> <NA>
## [2] sdk2b-201 ensembl protein_coding <NA> <NA>
## [3] sdk2b-201 ensembl protein_coding 1 ENSTRUE00000325931
## exon_version protein_id protein_version projection_parent_gene projection_parent_transcript
## <character> <character> <character> <character> <character>
## [1] <NA> <NA> <NA> <NA> <NA>
## [2] <NA> <NA> <NA> <NA> <NA>
## [3] 1 <NA> <NA> <NA> <NA>
## tag
## <character>
## [1] <NA>
## [2] <NA>
## [3] <NA>
## -------
## seqinfo: 1627 sequences (1 circular) from FUGU5 genome; no seqlengths
dna
## TwoBitFile object
## resource: /home/biocbuild/.cache/R/AnnotationHub/2342347417badb_72862
head(seqlevels(dna))
## [1] "1" "2" "3" "4" "5" "6"
Let’s identify the 25 longest DNA sequences, and keep just the annotations on these scaffolds.
keep <- names(tail(sort(seqlengths(dna)), 25))
gtf_subset <- gtf[seqnames(gtf) %in% keep]
It is trivial to make a TxDb instance of this subset (or of the entire gtf)
library(txdbmaker) # for makeTxDbFromGRanges
txdb <- makeTxDbFromGRanges(gtf_subset)
## Warning in .get_cds_IDX(mcols0$type, mcols0$phase): The "phase" metadata column contains non-NA values for features of type stop_codon. This
## information was ignored.
and to use that in conjunction with the DNA sequences, e.g., to find exon sequences of all annotated genes.
library(Rsamtools) # for getSeq,FaFile-method
exons <- exons(txdb)
length(exons)
## [1] 178769
getSeq(dna, exons)
## DNAStringSet object of length 178769:
## width seq
## [1] 172 CGATACGGCGCGCTCCGTTTGCCTCCGCCCCCCCCGTGGCG...GCGTTTCTGGGCCCCGCCCCCCTCGCCTCCCTCCGTGGCAG
## [2] 28 TTGGGATTATTCTCACACGCTGATCGGT
## [3] 160 ACGACGTGCCCCCCTACTTCAAGACGGAGCCGGCCCGGAGC...CACAACAACACGGAGCTGACGCGCTTCTCGCTGGAGTACAG
## [4] 107 GTACGTGATCCCGTCTTTGGACCGCTCCCACGCCGGATTCT...GGGCGCCCTGCTGCAGAGACGCACCGAAGTCCAGGTGGTCT
## [5] 148 TTATGGGAAGCTTCGAGGAGGGCGAGCGAGCCCAGTCCGTC...TGGTACCGGGATGGACGCAAGATTCCCCCGAGCAGCCGCAT
## ... ... ...
## [178765] 54 ATGCCCTCAATTACACTACCGCAGAAGGAGAACGCTCTCTTCAAAAGAATATTG
## [178766] 863 CTCTTGGTGAGGGGAAGGATGAATTTATCCGATGTCCAGTG...GTGATATAAGTTTTAGGGAAGAGCCCCATAGGCTGATGTAG
## [178767] 270 TTTGTGCAATGGGTGGCACCAGCAGCACCAGCAGGTTGTTT...CCCGTCTATCCGGATCATGCAGTGGAACATACTGGCACAAG
## [178768] 982 CAGTTGTACAGAAATCGTTGGAGCAGACCTGGAGGCTGTTG...CCCGTCTATCCGGATCATGCAGTGGAACATACTGGCACAAG
## [178769] 627 GGGGGAGATTCCGATGGTGGTATATTTAAAAAGTTGAAACT...GCCAAAGTGTTCCAGTTCCACCCATCGTGGCGGCCCGCCAG
There is a one-to-one mapping between the genomic ranges contained in
exons
and the DNA sequences returned by getSeq()
.
Some difficulties arise when working with this partly assembled genome that require more advanced GenomicRanges skills, see the GenomicRanges vignettes, especially “GenomicRanges HOWTOs” and “An Introduction to GenomicRanges”.
Suppose we wanted to lift features from one genome build to another, e.g., because annotations were generated for hg19 but our experimental analysis used hg18. We know that UCSC provides ‘liftover’ files for mapping between genome builds.
In this example, we will take our broad Peak GRanges from E126 which comes from the ‘hg19’ genome, and lift over these features to their ‘hg38’ coordinates.
chainfiles <- query(ah , c("hg38", "hg19", "chainfile"))
chainfiles
## AnnotationHub with 4 records
## # snapshotDate(): 2024-04-29
## # $dataprovider: UCSC, NCBI
## # $species: Homo sapiens
## # $rdataclass: ChainFile
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## # rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH14108"]]'
##
## title
## AH14108 | hg38ToHg19.over.chain.gz
## AH14150 | hg19ToHg38.over.chain.gz
## AH78915 | Chain file for Homo sapiens rRNA hg19 to hg38
## AH78916 | Chain file for Homo sapiens rRNA hg38 to hg19
We are interested in the file that lifts over features from hg19 to hg38 so lets download that using
## loading from cache
chain <- chainfiles[['AH14150']]
## loading from cache
chain
## Chain of length 25
## names(25): chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 ... chr18 chr19 chr20 chr21 chr22 chrX chrY chrM
Perform the liftOver operation using rtracklayer::liftOver()
:
library(rtracklayer)
gr38 <- liftOver(peaks, chain)
This returns a GRangeslist; update the genome of the result to get the final result
genome(gr38) <- "hg38"
gr38
## GRangesList object of length 153266:
## [[1]]
## GRanges object with 1 range and 5 metadata columns:
## seqnames ranges strand | name score signalValue pValue qValue
## <Rle> <IRanges> <Rle> | <character> <numeric> <numeric> <numeric> <numeric>
## [1] chr1 28667912-28670147 * | Rank_1 189 10.5585 22.0132 18.9991
## -------
## seqinfo: 23 sequences from hg38 genome; no seqlengths
##
## [[2]]
## GRanges object with 1 range and 5 metadata columns:
## seqnames ranges strand | name score signalValue pValue qValue
## <Rle> <IRanges> <Rle> | <character> <numeric> <numeric> <numeric> <numeric>
## [1] chr4 54090990-54092984 * | Rank_2 188 8.11483 21.8044 18.8066
## -------
## seqinfo: 23 sequences from hg38 genome; no seqlengths
##
## [[3]]
## GRanges object with 1 range and 5 metadata columns:
## seqnames ranges strand | name score signalValue pValue qValue
## <Rle> <IRanges> <Rle> | <character> <numeric> <numeric> <numeric> <numeric>
## [1] chr14 75293392-75296621 * | Rank_3 180 8.89834 20.9771 18.0282
## -------
## seqinfo: 23 sequences from hg38 genome; no seqlengths
##
## ...
## <153263 more elements>
One may also be interested in working with common germline variants with evidence of medical interest. This information is available at NCBI.
Query the dbDNP files in the hub:
This returns a VcfFile which can be read in using r Biocpkg("VariantAnnotation")
; because VCF files can be large, readVcf()
supports several strategies for importing only relevant parts of the file
(e.g., particular genomic locations, particular features of the variants), see
?readVcf
for additional information.
variants <- readVcf(vcf, genome="hg19")
variants
## class: CollapsedVCF
## dim: 111138 0
## rowRanges(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 58 columns: RS, RSPOS, RV, VP, GENEINFO, dbSNPBuildID, SAO, SSR, WGT, VC, PM, T...
## info(header(vcf)):
## Number Type Description
## RS 1 Integer dbSNP ID (i.e. rs number)
## RSPOS 1 Integer Chr position reported in dbSNP
## RV 0 Flag RS orientation is reversed
## VP 1 String Variation Property. Documentation is at ftp://ftp.ncbi.nlm.nih.g...
## GENEINFO 1 String Pairs each of gene symbol:gene id. The gene symbol and id are de...
## dbSNPBuildID 1 Integer First dbSNP Build for RS
## SAO 1 Integer Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic...
## SSR 1 Integer Variant Suspect Reason Codes (may be more than one value added to...
## WGT 1 Integer Weight, 00 - unmapped, 1 - weight 1, 2 - weight 2, 3 - weight 3 o...
## VC 1 String Variation Class
## PM 0 Flag Variant is Precious(Clinical,Pubmed Cited)
## TPA 0 Flag Provisional Third Party Annotation(TPA) (currently rs from PHARMG...
## PMC 0 Flag Links exist to PubMed Central article
## S3D 0 Flag Has 3D structure - SNP3D table
## SLO 0 Flag Has SubmitterLinkOut - From SNP->SubSNP->Batch.link_out
## NSF 0 Flag Has non-synonymous frameshift A coding region variation where one...
## NSM 0 Flag Has non-synonymous missense A coding region variation where one a...
## NSN 0 Flag Has non-synonymous nonsense A coding region variation where one a...
## REF 0 Flag Has reference A coding region variation where one allele in the s...
## SYN 0 Flag Has synonymous A coding region variation where one allele in the ...
## U3 0 Flag In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53
## U5 0 Flag In 5' UTR Location is in an untranslated region (UTR). FxnCode = 55
## ASS 0 Flag In acceptor splice site FxnCode = 73
## DSS 0 Flag In donor splice-site FxnCode = 75
## INT 0 Flag In Intron FxnCode = 6
## R3 0 Flag In 3' gene region FxnCode = 13
## R5 0 Flag In 5' gene region FxnCode = 15
## OTH 0 Flag Has other variant with exactly the same set of mapped positions o...
## CFL 0 Flag Has Assembly conflict. This is for weight 1 and 2 variant that ma...
## ASP 0 Flag Is Assembly specific. This is set if the variant only maps to one...
## MUT 0 Flag Is mutation (journal citation, explicit fact): a low frequency va...
## VLD 0 Flag Is Validated. This bit is set if the variant has 2+ minor allele...
## G5A 0 Flag >5% minor allele frequency in each and all populations
## G5 0 Flag >5% minor allele frequency in 1+ populations
## HD 0 Flag Marker is on high density genotyping kit (50K density or greater)...
## GNO 0 Flag Genotypes available. The variant has individual genotype (in SubI...
## KGPhase1 0 Flag 1000 Genome phase 1 (incl. June Interim phase 1)
## KGPhase3 0 Flag 1000 Genome phase 3
## CDA 0 Flag Variation is interrogated in a clinical diagnostic assay
## LSD 0 Flag Submitted from a locus-specific database
## MTP 0 Flag Microattribution/third-party annotation(TPA:GWAS,PAGE)
## OM 0 Flag Has OMIM/OMIA
## NOC 0 Flag Contig allele not present in variant allele list. The reference s...
## WTD 0 Flag Is Withdrawn by submitter If one member ss is withdrawn by submit...
## NOV 0 Flag Rs cluster has non-overlapping allele sets. True when rs set has ...
## CAF . String An ordered, comma delimited list of allele frequencies based on 1...
## COMMON 1 Integer RS is a common SNP. A common SNP is one that has at least one 10...
## CLNHGVS . String Variant names from HGVS. The order of these variants correspon...
## CLNALLE . Integer Variant alleles from REF or ALT columns. 0 is REF, 1 is the firs...
## CLNSRC . String Variant Clinical Chanels
## CLNORIGIN . String Allele Origin. One or more of the following values may be added: ...
## CLNSRCID . String Variant Clinical Channel IDs
## CLNSIG . String Variant Clinical Significance, 0 - Uncertain significance, 1 - no...
## CLNDSDB . String Variant disease database name
## CLNDSDBID . String Variant disease database ID
## CLNDBN . String Variant disease name
## CLNREVSTAT . String no_assertion - No assertion provided, no_criteria - No assertion ...
## CLNACC . String Variant Accession and Versions
## geno(vcf):
## List of length 0:
rowRanges()
returns information from the CHROM, POS and ID fields of the VCF
file, represented as a GRanges instance
rowRanges(variants)
## GRanges object with 111138 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID REF ALT QUAL
## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet> <DNAStringSetList> <numeric>
## rs786201005 1 1014143 * | NA C T NA
## rs672601345 1 1014316 * | NA C CG NA
## rs672601312 1 1014359 * | NA G T NA
## rs115173026 1 1020217 * | NA G T NA
## rs201073369 1 1020239 * | NA G C NA
## ... ... ... ... . ... ... ... ...
## rs527236200 MT 15943 * | NA T C NA
## rs118203890 MT 15950 * | NA G A NA
## rs199474700 MT 15965 * | NA A G NA
## rs199474701 MT 15967 * | NA G A NA
## rs199474699 MT 15990 * | NA C T NA
## FILTER
## <character>
## rs786201005 .
## rs672601345 .
## rs672601312 .
## rs115173026 .
## rs201073369 .
## ... ...
## rs527236200 .
## rs118203890 .
## rs199474700 .
## rs199474701 .
## rs199474699 .
## -------
## seqinfo: 25 sequences from hg19 genome; no seqlengths
Note that the broadPeaks files follow the UCSC chromosome naming convention, and the vcf data follows the NCBI style of chromosome naming convention. To bring these ranges in the same chromosome naming convention (ie UCSC), we would use
seqlevelsStyle(variants) <-seqlevelsStyle(peaks)
And then finally to find which variants overlap these broadPeaks we would use:
overlap <- findOverlaps(variants, peaks)
## Warning in .merge_two_Seqinfo_objects(x, y): The 2 combined objects have no sequence levels in common. (Use
## suppressWarnings() to suppress this warning.)
overlap
## Hits object with 0 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## -------
## queryLength: 111138 / subjectLength: 153266
Some insight into how these results can be interpretted comes from looking a particular peak, e.g., the 3852nd peak
idx <- subjectHits(overlap) == 3852
overlap[idx]
## Hits object with 0 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## -------
## queryLength: 111138 / subjectLength: 153266
There are three variants overlapping this peak; the coordinates of the peak and the overlapping variants are
peaks[3852]
## GRanges object with 1 range and 5 metadata columns:
## seqnames ranges strand | name score signalValue pValue qValue
## <Rle> <IRanges> <Rle> | <character> <numeric> <numeric> <numeric> <numeric>
## [1] chr22 50622494-50626143 * | Rank_3852 79 6.06768 10.1894 7.99818
## -------
## seqinfo: 298 sequences (2 circular) from hg19 genome
rowRanges(variants)[queryHits(overlap[idx])]
## GRanges object with 0 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID REF ALT QUAL FILTER
## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet> <DNAStringSetList> <numeric> <character>
## -------
## seqinfo: 25 sequences from hg19 genome; no seqlengths
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.72.0
## [3] BiocIO_1.14.0 rtracklayer_1.64.0
## [5] VariantAnnotation_1.50.0 SummarizedExperiment_1.34.0
## [7] MatrixGenerics_1.16.0 matrixStats_1.3.0
## [9] Rsamtools_2.20.0 Biostrings_2.72.0
## [11] XVector_0.44.0 txdbmaker_1.0.0
## [13] GenomicFeatures_1.56.0 AnnotationDbi_1.66.0
## [15] Biobase_2.64.0 GenomicRanges_1.56.0
## [17] GenomeInfoDb_1.40.0 IRanges_2.38.0
## [19] S4Vectors_0.42.0 AnnotationHub_3.12.0
## [21] BiocFileCache_2.12.0 dbplyr_2.5.0
## [23] BiocGenerics_0.50.0 BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4 blob_1.2.4
## [4] filelock_1.0.3 bitops_1.0-7 fastmap_1.1.1
## [7] RCurl_1.98-1.14 GenomicAlignments_1.40.0 XML_3.99-0.16.1
## [10] digest_0.6.35 mime_0.12 lifecycle_1.0.4
## [13] KEGGREST_1.44.0 RSQLite_2.3.6 magrittr_2.0.3
## [16] compiler_4.4.0 progress_1.2.3 rlang_1.1.3
## [19] sass_0.4.9 tools_4.4.0 utf8_1.2.4
## [22] yaml_2.3.8 knitr_1.46 prettyunits_1.2.0
## [25] S4Arrays_1.4.0 bit_4.0.5 curl_5.2.1
## [28] DelayedArray_0.30.0 xml2_1.3.6 abind_1.4-5
## [31] BiocParallel_1.38.0 withr_3.0.0 purrr_1.0.2
## [34] grid_4.4.0 fansi_1.0.6 biomaRt_2.60.0
## [37] cli_3.6.2 rmarkdown_2.26 crayon_1.5.2
## [40] generics_0.1.3 httr_1.4.7 rjson_0.2.21
## [43] DBI_1.2.2 cachem_1.0.8 stringr_1.5.1
## [46] zlibbioc_1.50.0 parallel_4.4.0 BiocManager_1.30.22
## [49] restfulr_0.0.15 vctrs_0.6.5 Matrix_1.7-0
## [52] jsonlite_1.8.8 bookdown_0.39 hms_1.1.3
## [55] bit64_4.0.5 jquerylib_0.1.4 glue_1.7.0
## [58] codetools_0.2-20 stringi_1.8.3 BiocVersion_3.19.1
## [61] UCSC.utils_1.0.0 tibble_3.2.1 pillar_1.9.0
## [64] rappdirs_0.3.3 htmltools_0.5.8.1 GenomeInfoDbData_1.2.12
## [67] httr2_1.0.1 R6_2.5.1 evaluate_0.23
## [70] lattice_0.22-6 png_0.1-8 memoise_2.0.1
## [73] bslib_0.7.0 SparseArray_1.4.0 xfun_0.43
## [76] pkgconfig_2.0.3