Import ontology files

Zuguang Gu

2024-02-06

The .obo format

There are several formats for ontology data. The most compact and readable one is the .obo format, which was initially developed by the GO consortium. Many ontologies in .obo format are available from the OBO Foundry or BioPortal. A description of the .obo format is available at https://owlcollab.github.io/oboformat/doc/GO.format.obo-1_4.html.
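To give a feel for the format, here is a minimal sketch in base R that reads a toy .obo document and collects the child-to-parent links from its [Term] stanzas. The toy terms and the helper parse_obo_terms() are made up for illustration; this is not how import_obo() is implemented and it ignores most of the tags the format allows.

```r
# A toy .obo document (hypothetical terms, for illustration only).
obo_text <- c(
  "format-version: 1.2",
  "ontology: toy",
  "",
  "[Term]",
  "id: TOY:0001",
  "name: root term",
  "",
  "[Term]",
  "id: TOY:0002",
  "name: child term",
  "is_a: TOY:0001 ! root term",
  "",
  "[Term]",
  "id: TOY:0003",
  "name: part term",
  "relationship: part_of TOY:0002 ! child term"
)

# Hypothetical helper: collect child -> parent links from [Term] stanzas.
parse_obo_terms <- function(lines) {
  # Assign each line to a stanza; stanza 0 is the file header.
  stanza <- cumsum(grepl("^\\[.*\\]$", lines))
  links <- list()
  for (i in setdiff(unique(stanza), 0)) {
    block <- lines[stanza == i]
    if (block[1] != "[Term]") next
    id <- sub("^id: ", "", grep("^id: ", block, value = TRUE)[1])
    # is_a tags look like "is_a: PARENT ! comment".
    for (x in grep("^is_a: ", block, value = TRUE)) {
      parent <- sub("^is_a: ([^ ]+).*$", "\\1", x)
      links[[length(links) + 1]] <-
        data.frame(child = id, parent = parent, relation = "is_a")
    }
    # relationship tags look like "relationship: TYPE PARENT ! comment".
    for (x in grep("^relationship: ", block, value = TRUE)) {
      parts <- strsplit(sub("^relationship: ", "", x), " +")[[1]]
      links[[length(links) + 1]] <-
        data.frame(child = id, parent = parts[2], relation = parts[1])
    }
  }
  do.call(rbind, links)
}

links <- parse_obo_terms(obo_text)
links
```

Each row of links is one edge of the ontology graph, labelled with its relation type.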

In the simona package, the function import_obo() imports an .obo file into an ontology_DAG object. The input is a path on the local computer or a URL. The following example uses the Plant Ontology.

The link to po.obo can be found on the Plant Ontology web page. You can download the file or provide the link directly as a URL.

library(simona)
dag1 = import_obo("https://raw.githubusercontent.com/Planteome/plant-ontology/master/po.obo")
dag1
## An ontology_DAG object:
##   Source: po, releases/2023-07-13 
##   1656 terms / 1776 relations
##   Root: ~~all~~ 
##   Terms: PO:0000001, PO:0000002, PO:0000003, PO:0000004, ...
##   Max depth: 11 
##   Avg number of parents: 1.07
##   Avg number of children: 1.06
##   Aspect ratio: 39:1 (based on the longest distance from root)
##                 38.2:1 (based on the shortest distance from root)
##   Relations: is_a
## 
## With the following columns in the metadata data frame:
##   id, short_id, name, namespace, definition

There are also several metadata columns attached to the object, such as the name and the full definition of each term in the ontology.

head(mcols(dag1))
##                    id   short_id                     name     namespace
## PO:0000001 PO:0000001 PO:0000001      plant embryo proper plant_anatomy
## PO:0000002 PO:0000002 PO:0000002              anther wall plant_anatomy
## PO:0000003 PO:0000003 PO:0000003              whole plant plant_anatomy
## PO:0000004 PO:0000004 PO:0000004 in vitro plant structure plant_anatomy
## PO:0000005 PO:0000005 PO:0000005      cultured plant cell plant_anatomy
## PO:0000006 PO:0000006 PO:0000006         plant protoplast plant_anatomy
##                                                                                                                                                                                                  definition
## PO:0000001 An embryonic plant structure (PO:0025099) that is the body of a developing plant embryo (PO:0009009) attached to the maternal tissue in an plant ovule (PO:0020003) by a suspensor (PO:0020108).
## PO:0000002                                                                                                                      A microsporangium wall (PO:0025307) that is part of an anther (PO:0009066).
## PO:0000003                                                                                                                                        A plant structure (PO:0005679) which is a whole organism.
## PO:0000004                                                                                                                             A plant structure (PO:0009011) that is grown or maintained in vitro.
## PO:0000005                                                                                                                                  A plant cell (PO:0009002) that is grown or maintained in vitro.
## PO:0000006                                                                                                                    A cultured plant cell from which the entire plant cell wall has been removed.

Note that rows in mcols(dag1) correspond to terms in dag_all_terms(dag1).

The is_a relation between classes is always saved in the DAG object (it is specified in the is_a tag in the .obo file). Additional relation types can also be selected (they are specified in the relationship tag). By default only the relation type part_of is used. You can check the other values associated with the relationship tag and the [Typedef] sections in the .obo file to select proper additional relation types. Just make sure that the selected relation types are transitive and are not inverse relations (e.g. you cannot select has_part, which is the inverse of part_of).
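One way to do this check is to scan the .obo file directly. The sketch below, in base R, counts the relation types used in relationship tags and lists the is_transitive and inverse_of tags of each [Typedef] stanza. The toy file content and the helper get_tag() are hypothetical; for a real ontology you would replace obo_text with readLines() on the downloaded file.

```r
# Toy .obo content (hypothetical, for illustration only).
obo_text <- c(
  "[Term]", "id: TOY:0002", "is_a: TOY:0001",
  "relationship: part_of TOY:0001",
  "",
  "[Term]", "id: TOY:0003",
  "relationship: part_of TOY:0002",
  "relationship: has_part TOY:0004",
  "",
  "[Typedef]", "id: part_of", "name: part of", "is_transitive: true",
  "",
  "[Typedef]", "id: has_part", "name: has part", "inverse_of: part_of"
)

# Relation types actually used in relationship tags, with counts.
rel_lines <- grep("^relationship: ", obo_text, value = TRUE)
used <- table(sub("^relationship: ([^ ]+) .*$", "\\1", rel_lines))
used

# Split the file into stanzas and keep only the [Typedef] ones.
stanza <- cumsum(grepl("^\\[.*\\]$", obo_text))
typedefs <- Filter(function(b) b[1] == "[Typedef]", split(obo_text, stanza))

# Hypothetical helper: value of the first occurrence of a tag, or NA.
get_tag <- function(block, tag) {
  prefix <- paste0("^", tag, ": ")
  hit <- grep(prefix, block, value = TRUE)
  if (length(hit) == 0) NA_character_ else sub(prefix, "", hit[1])
}

info <- data.frame(
  id = vapply(typedefs, get_tag, character(1), tag = "id"),
  is_transitive = vapply(typedefs, get_tag, character(1), tag = "is_transitive"),
  inverse_of = vapply(typedefs, get_tag, character(1), tag = "inverse_of"),
  row.names = NULL
)
info
```

In this toy file, part_of is marked transitive and is a reasonable candidate, while has_part is declared as the inverse of part_of and should not be selected.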

Relation types can also form a DAG. In import_obo(), if a parent relation type is selected, all its offspring types are automatically selected. For example, in GO, besides the relations is_a and part_of, there are also regulates, positively_regulates and negatively_regulates, where the latter two are child relations of regulates. So if regulates is selected as an additional relation type, the other two are automatically selected.

The DAG of relation types is automatically recognized from the ontology file and saved in the object.

import_obo("file_for_go.obo", relation_type = c("part_of", "regulates"))
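The automatic selection of offspring relation types can be pictured with a small base R sketch. The rel_parent vector and the helper expand_relation_types() are hypothetical and only mirror the GO example above; simona does this internally, and a named vector like this one can hold only one parent per relation type.

```r
# Hypothetical child -> parent links between relation types,
# mirroring the GO example in the text.
rel_parent <- c(
  positively_regulates = "regulates",
  negatively_regulates = "regulates"
)

# Hypothetical helper: add all offspring types of the selected relation types.
expand_relation_types <- function(selected, rel_parent) {
  repeat {
    # Relation types whose parent is already selected.
    children <- names(rel_parent)[rel_parent %in% selected]
    new <- setdiff(children, selected)
    if (length(new) == 0) break
    selected <- c(selected, new)
  }
  selected
}

expanded <- expand_relation_types(c("part_of", "regulates"), rel_parent)
expanded
```

Selecting only part_of leaves the set unchanged, because part_of has no child relation types in this toy hierarchy.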

Finally, all spaces in the values of relation_type are converted to underscores, so specifying "part of" is the same as specifying "part_of".
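The effect of this conversion can be reproduced in base R. The helper normalize_relation_type() below is a hypothetical stand-in, not a simona function.

```r
# Hypothetical helper: replace every space with an underscore.
normalize_relation_type <- function(x) gsub(" ", "_", x, fixed = TRUE)

normalized <- normalize_relation_type(c("part of", "part_of", "positively regulates"))
normalized
```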

Other ontology formats

For ontologies in other formats, simona uses the external tool ROBOT to convert them to the .obo format and then internally calls import_obo() to import them. ROBOT already does a thorough job of converting between ontology formats. The file robot.jar is needed and can be downloaded from https://github.com/ontodev/robot/releases. Since ROBOT is a Java tool, Java must already be available on your machine.
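Before calling ROBOT through simona, it can help to confirm that both prerequisites are in place. The helper check_robot_setup() below is a hypothetical convenience written in base R, and the jar path is only the example location used in this document.

```r
# Hypothetical helper: report whether Java and robot.jar can be found.
check_robot_setup <- function(robot_jar) {
  # Sys.which() returns "" when the executable is not on the PATH.
  java <- Sys.which("java")
  c(
    java_found = unname(nzchar(java)),
    jar_found = file.exists(path.expand(robot_jar))
  )
}

check_robot_setup("~/Downloads/robot.jar")
```

If either value is FALSE, install Java or download robot.jar before continuing.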

The file po.owl can also be found from the Plant Ontology web page.

dag2 = import_ontology("https://raw.githubusercontent.com/Planteome/plant-ontology/master/po.owl", 
    robot_jar = "~/Downloads/robot.jar")
dag2
## An ontology_DAG object:
##   Source: po, releases/2021-08-13
##   1654 terms / 2510 relations
##   Root: _all_
##   Terms: PO:0000001, PO:0000002, PO:0000003, PO:0000004, ...
##   Max depth: 13
##   Aspect ratio: 24.85:1 (based on the longest distance to root)
##                 39.6:1 (based on the shortest distance to root)
##   Relations: is_a, part_of
## 
## With the following columns in the metadata data frame:
##   id, short_id, name, namespace, definition

More conveniently, the path of robot.jar can be set as a global option:

simona_opt$robot_jar = "~/Downloads/robot.jar"
import_ontology("https://raw.githubusercontent.com/Planteome/plant-ontology/master/po.owl")

ROBOT supports several ontology formats, and the format of the input file is identified automatically from its contents.

The .owl format

For some very large ontologies, ROBOT requires a large amount of memory to convert them to the .obo format. If the ontology is in the .owl format (in the RDF/XML serialization), the function import_owl() can be used instead. import_owl() parses the .owl file directly and returns an ontology_DAG object. Because import_owl() is written from scratch, it is recommended only when import_ontology() does not work.

dag3 = import_owl("https://raw.githubusercontent.com/Planteome/plant-ontology/master/po.owl")
dag3
## An ontology_DAG object:
##   Source: Plant Ontology, http://purl.obolibrary.org/obo/po/releases/2023-07-13/po.owl 
##   1656 terms / 1776 relations
##   Root: ~~all~~ 
##   Terms: PO:0000001, PO:0000002, PO:0000003, PO:0000004, ...
##   Max depth: 11 
##   Avg number of parents: 1.07
##   Avg number of children: 1.06
##   Aspect ratio: 39:1 (based on the longest distance from root)
##                 38.2:1 (based on the shortest distance from root)
##   Relations: is_a
## 
## With the following columns in the metadata data frame:
##   id, short_id, name, namespace, definition
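The recommendation to use import_owl() only when import_ontology() does not work can be written as a small fallback pattern. The helper import_with_fallback() is hypothetical, and the sketch assumes that a failed conversion surfaces as an R error, which may not hold for every way ROBOT can fail. The demonstration uses stand-in functions so that it runs without the package, robot.jar or network access.

```r
# Hypothetical helper: try a primary importer first, then a fallback.
import_with_fallback <- function(path, primary, fallback) {
  tryCatch(
    primary(path),
    error = function(e) {
      message("primary importer failed: ", conditionMessage(e))
      fallback(path)
    }
  )
}

# Intended use with simona (not run here):
# library(simona)
# dag <- import_with_fallback(owl_url, import_ontology, import_owl)

# Demonstration with stand-in importers.
demo <- import_with_fallback(
  "toy.owl",
  primary = function(p) stop("conversion failed"),
  fallback = function(p) paste("parsed", p)
)
demo
```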

The .ttl format

Similarly, some ontologies may only be provided as large .ttl files (the Turtle format). simona provides the function import_ttl(), which can read .ttl files that declare their terms as owl:Class objects. The internal parsing script is written in Perl, so make sure Perl is installed on your machine.

# https://bioportal.bioontology.org/ontologies/MSTDE
dag4 = import_ttl("https://jokergoo.github.io/simona/MSTDE.ttl")
dag4
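Both requirements can be checked quickly in base R before calling import_ttl(). The sketch below writes a toy Turtle file (hypothetical content under an example.org prefix) to a temporary location, counts the lines that mention owl:Class, and tests whether a perl executable is on the PATH. It is only a rough pre-check and does not parse Turtle properly.

```r
# Toy Turtle content (hypothetical, for illustration only).
ttl_text <- c(
  "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
  "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
  "@prefix ex: <http://example.org/toy#> .",
  "",
  "ex:A a owl:Class .",
  "ex:B a owl:Class ;",
  "    rdfs:subClassOf ex:A ."
)
ttl_file <- tempfile(fileext = ".ttl")
writeLines(ttl_text, ttl_file)

# Number of lines that mention owl:Class (a rough proxy for declared classes).
n_class_lines <- sum(grepl("owl:Class", readLines(ttl_file), fixed = TRUE))
n_class_lines

# TRUE if a perl executable is found on the PATH.
perl_found <- unname(nzchar(Sys.which("perl")))
perl_found
```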

Session info

sessionInfo()
## R version 4.3.2 Patched (2023-11-13 r85521)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] org.Hs.eg.db_3.18.0  AnnotationDbi_1.64.1 IRanges_2.36.0      
## [4] S4Vectors_0.40.2     Biobase_2.62.0       BiocGenerics_0.48.1 
## [7] igraph_2.0.1.1       simona_1.0.10        knitr_1.45          
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.42.0         circlize_0.4.15         shape_1.4.6            
##  [4] rjson_0.2.21            xfun_0.41               bslib_0.6.1            
##  [7] GlobalOptions_0.1.2     bitops_1.0-7            vctrs_0.6.5            
## [10] tools_4.3.2             curl_5.2.0              parallel_4.3.2         
## [13] Polychrome_1.5.1        RSQLite_2.3.5           highr_0.10             
## [16] cluster_2.1.6           blob_1.2.4              pkgconfig_2.0.3        
## [19] RColorBrewer_1.1-3      scatterplot3d_0.3-44    GenomeInfoDbData_1.2.11
## [22] lifecycle_1.0.4         compiler_4.3.2          Biostrings_2.70.2      
## [25] codetools_0.2-19        ComplexHeatmap_2.18.0   clue_0.3-65            
## [28] GenomeInfoDb_1.38.5     httpuv_1.6.14           htmltools_0.5.7        
## [31] sass_0.4.8              RCurl_1.98-1.14         yaml_2.3.8             
## [34] later_1.3.2             crayon_1.5.2            jquerylib_0.1.4        
## [37] GO.db_3.18.0            ellipsis_0.3.2          cachem_1.0.8           
## [40] iterators_1.0.14        foreach_1.5.2           mime_0.12              
## [43] digest_0.6.34           fastmap_1.1.1           grid_4.3.2             
## [46] colorspace_2.1-0        cli_3.6.2               magrittr_2.0.3         
## [49] promises_1.2.1          bit64_4.0.5             rmarkdown_2.25         
## [52] XVector_0.42.0          httr_1.4.7              matrixStats_1.2.0      
## [55] bit_4.0.5               png_0.1-8               GetoptLong_1.0.5       
## [58] memoise_2.0.1           shiny_1.8.0             evaluate_0.23          
## [61] doParallel_1.0.17       rlang_1.1.3             Rcpp_1.0.12            
## [64] xtable_1.8-4            glue_1.7.0              DBI_1.2.1              
## [67] xml2_1.3.6              jsonlite_1.8.8          R6_2.5.1               
## [70] zlibbioc_1.48.0