The Rhisat2 package provides an interface to the HISAT2 short-read aligner software (Kim, Langmead, and Salzberg 2015).

1 Installation

Use the BiocManager package to download and install the package from Bioconductor as follows:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Rhisat2")

If required, the latest development version of the package can also be installed from GitHub.

BiocManager::install("fmicompbio/Rhisat2")

Once the package is installed, load it into your R session:

library(Rhisat2)

2 Building a genome index

To build an index for the alignment, use the hisat2_build function. You need to provide one or more fasta file with reference sequences, as well as an output directory where the index will be stored, and a “prefix” (that will determine the name of the index files in the output directory). Any additional arguments to the hisat2-build binary can also be supplied to the function (see hisat2_build_usage() or https://ccb.jhu.edu/software/hisat2/manual.shtml#the-hisat2-build-indexer for an overview of the available options).

The package contains three example reference sequences, corresponding to short pieces of three different chromosomes. We show how to create an index (named myindex) based on these sequences.

list.files(system.file("extdata/refs", package="Rhisat2"), pattern="\\.fa$")
#> [1] "chr1.fa" "chr2.fa" "chr3.fa"
refs <- list.files(system.file("extdata/refs", package="Rhisat2"), 
                   full.names=TRUE, pattern="\\.fa$")
td <- tempdir()
hisat2_build(references=refs, outdir=td, prefix="myindex", 
             force=TRUE, strict=TRUE, execute=TRUE)

By instead setting execute=FALSE in the command above, hisat2_build() will return the index building shell command for inspection, without executing it.

print(hisat2_build(references=refs, outdir=td, prefix="myindex",
                   force=TRUE, strict=TRUE, execute=FALSE))
#> [1] "'/tmp/Rtmpv5m6vC/Rinst6fe4218cf25a/Rhisat2/hisat2-build' '/tmp/Rtmpv5m6vC/Rinst6fe4218cf25a/Rhisat2/extdata/refs/chr1.fa','/tmp/Rtmpv5m6vC/Rinst6fe4218cf25a/Rhisat2/extdata/refs/chr2.fa','/tmp/Rtmpv5m6vC/Rinst6fe4218cf25a/Rhisat2/extdata/refs/chr3.fa' '/tmp/RtmpOWnjsc/myindex'"

3 Aligning reads to the genome index

After creating the index, reads can be aligned using the hisat2 wrapper function. Most commonly, the reads will be provided in fastq files (one file for single-end reads, two files for paired-end reads). The names of these files can be provided directly to the hisat2 function, either as a vector (for single-end reads) or as a list of length 2 (for paired-end reads, each list element corresponds to one mate). You also need to provide the path to the index generated by hisat2_build (the output directory combined with the prefix), and the output file name where the alignments should be written.

list.files(system.file("extdata/reads", package="Rhisat2"), 
           pattern="\\.fastq$")
#> [1] "reads1.fastq" "reads2.fastq"
reads <- list.files(system.file("extdata/reads", package="Rhisat2"),
                    pattern="\\.fastq$", full.names=TRUE)
hisat2(sequences=as.list(reads), index=file.path(td, "myindex"), 
       type="paired", outfile=file.path(td, "out.sam"), 
       force=TRUE, strict=TRUE, execute=TRUE)

As for hisat2_build(), any additional arguments to the hisat2 binary can be provided to the hisat2() function (see hisat2_usage() or https://ccb.jhu.edu/software/hisat2/manual.shtml#running-hisat2 for an overview of the available options).

With HISAT2, you can provide a file with known splice sites at the alignment step, which can help in finding the correct alignments across known splice junctions. A text file in the required format can be generated using the extract_splice_sites() function, starting from an annotation file in gtf or gff3 format, or from a GRanges or TxDb object.

spsfile <- tempfile()
gtf <- system.file("extdata/refs/genes.gtf", package="Rhisat2")
extract_splice_sites(features=gtf, outfile=spsfile)
#> Import genomic features from the file as a GRanges object ... OK
#> Prepare the 'metadata' data frame ... OK
#> Make the TxDb object ... OK
hisat2(sequences=as.list(reads), index=file.path(td, "myindex"),
       type="paired", outfile=file.path(td, "out_sps.sam"),
       `known-splicesite-infile`=spsfile, 
       force=TRUE, strict=TRUE, execute=TRUE)

4 Miscellaneous helper functions

To get the version of HISAT2:

hisat2_version()
#> [1] "/tmp/Rtmpv5m6vC/Rinst6fe4218cf25a/Rhisat2/hisat2 version 2.1.0"         
#> [2] "64-bit"                                                                 
#> [3] "Built on malbec2"                                                       
#> [4] "Mon Apr 27 23:53:45 EDT 2020"                                           
#> [5] "Compiler: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) "             
#> [6] "Options: -O3 -m64 -msse2 -funroll-loops -g3 -DPOPCNT_CAPABILITY"        
#> [7] "Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}"

To see the usage of hisat2_build():

hisat2_build_usage()
#>  [1] "HISAT2 version 2.1.0 by Daehwan Kim ([email protected], http://www.ccb.jhu.edu/people/infphilo)"
#>  [2] "Usage: hisat2-build [options]* <reference_in> <ht2_index_base>"                                  
#>  [3] "    reference_in            comma-separated list of files with ref sequences"                    
#>  [4] "    hisat2_index_base       write ht2 data to files with this dir/basename"                      
#>  [5] "Options:"                                                                                        
#>  [6] "    -c                      reference sequences given on cmd line (as"                           
#>  [7] "                            <reference_in>)"                                                     
#>  [8] "    -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting"                    
#>  [9] "    -p                      number of threads"                                                   
#> [10] "    --bmax <int>            max bucket sz for blockwise suffix-array builder"                    
#> [11] "    --bmaxdivn <int>        max bucket sz as divisor of ref len (default: 4)"                    
#> [12] "    --dcv <int>             diff-cover period for blockwise (default: 1024)"                     
#> [13] "    --nodc                  disable diff-cover (algorithm becomes quadratic)"                    
#> [14] "    -r/--noref              don't build .3/.4.ht2 (packed reference) portion"                    
#> [15] "    -3/--justref            just build .3/.4.ht2 (packed reference) portion"                     
#> [16] "    -o/--offrate <int>      SA is sampled every 2^offRate BWT chars (default: 5)"                
#> [17] "    -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)"                 
#> [18] "    --localoffrate <int>    SA (local) is sampled every 2^offRate BWT chars (default: 3)"        
#> [19] "    --localftabchars <int>  # of chars consumed in initial lookup in a local index (default: 6)" 
#> [20] "    --snp <path>            SNP file name"                                                       
#> [21] "    --haplotype <path>      haplotype file name"                                                 
#> [22] "    --ss <path>             Splice site file name"                                               
#> [23] "    --exon <path>           Exon file name"                                                      
#> [24] "    --seed <int>            seed for random number generator"                                    
#> [25] "    -q/--quiet              verbose output (for debugging)"                                      
#> [26] "    -h/--help               print detailed description of tool and its options"                  
#> [27] "    --usage                 print this usage message"                                            
#> [28] "    --version               print version information and quit"

And similarly to see the usage of hisat2():

hisat2_usage()
#>   [1] "HISAT2 version 2.1.0 by Daehwan Kim ([email protected], www.ccb.jhu.edu/people/infphilo)"                  
#>   [2] "Usage: "                                                                                                    
#>   [3] "  hisat2-align [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]"                               
#>   [4] ""                                                                                                           
#>   [5] "  <ht2-idx>  Index filename prefix (minus trailing .X.ht2)."                                                
#>   [6] "  <m1>       Files with #1 mates, paired with files in <m2>."                                               
#>   [7] "  <m2>       Files with #2 mates, paired with files in <m1>."                                               
#>   [8] "  <r>        Files with unpaired reads."                                                                    
#>   [9] "  <sam>      File for SAM output (default: stdout)"                                                         
#>  [10] ""                                                                                                           
#>  [11] "  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be"                                  
#>  [12] "  specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'."                                          
#>  [13] ""                                                                                                           
#>  [14] "Options (defaults in parentheses):"                                                                         
#>  [15] ""                                                                                                           
#>  [16] " Input:"                                                                                                    
#>  [17] "  -q                 query input files are FASTQ .fq/.fastq (default)"                                      
#>  [18] "  --qseq             query input files are in Illumina's qseq format"                                       
#>  [19] "  -f                 query input files are (multi-)FASTA .fa/.mfa"                                          
#>  [20] "  -r                 query input files are raw one-sequence-per-line"                                       
#>  [21] "  -c                 <m1>, <m2>, <r> are sequences themselves, not files"                                   
#>  [22] "  -s/--skip <int>    skip the first <int> reads/pairs in the input (none)"                                  
#>  [23] "  -u/--upto <int>    stop after first <int> reads/pairs (no limit)"                                         
#>  [24] "  -5/--trim5 <int>   trim <int> bases from 5'/left end of reads (0)"                                        
#>  [25] "  -3/--trim3 <int>   trim <int> bases from 3'/right end of reads (0)"                                       
#>  [26] "  --phred33          qualities are Phred+33 (default)"                                                      
#>  [27] "  --phred64          qualities are Phred+64"                                                                
#>  [28] "  --int-quals        qualities encoded as space-delimited integers"                                         
#>  [29] ""                                                                                                           
#>  [30] " Alignment:"                                                                                                
#>  [31] "  --n-ceil <func>    func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)"                               
#>  [32] "  --ignore-quals     treat all quality values as 30 on Phred scale (off)"                                   
#>  [33] "  --nofw             do not align forward (original) version of read (off)"                                 
#>  [34] "  --norc             do not align reverse-complement version of read (off)"                                 
#>  [35] ""                                                                                                           
#>  [36] " Spliced Alignment:"                                                                                        
#>  [37] "  --pen-cansplice <int>              penalty for a canonical splice site (0)"                               
#>  [38] "  --pen-noncansplice <int>           penalty for a non-canonical splice site (12)"                          
#>  [39] "  --pen-canintronlen <func>          penalty for long introns (G,-8,1) with canonical splice sites"         
#>  [40] "  --pen-noncanintronlen <func>       penalty for long introns (G,-8,1) with noncanonical splice sites"      
#>  [41] "  --min-intronlen <int>              minimum intron length (20)"                                            
#>  [42] "  --max-intronlen <int>              maximum intron length (500000)"                                        
#>  [43] "  --known-splicesite-infile <path>   provide a list of known splice sites"                                  
#>  [44] "  --novel-splicesite-outfile <path>  report a list of splice sites"                                         
#>  [45] "  --novel-splicesite-infile <path>   provide a list of novel splice sites"                                  
#>  [46] "  --no-temp-splicesite               disable the use of splice sites found"                                 
#>  [47] "  --no-spliced-alignment             disable spliced alignment"                                             
#>  [48] "  --rna-strandness <string>          specify strand-specific information (unstranded)"                      
#>  [49] "  --tmo                              reports only those alignments within known transcriptome"              
#>  [50] "  --dta                              reports alignments tailored for transcript assemblers"                 
#>  [51] "  --dta-cufflinks                    reports alignments tailored specifically for cufflinks"                
#>  [52] "  --avoid-pseudogene                 tries to avoid aligning reads to pseudogenes (experimental option)\023"
#>  [53] "  --no-templatelen-adjustment        disables template length adjustment for RNA-seq reads"                 
#>  [54] ""                                                                                                           
#>  [55] " Scoring:"                                                                                                  
#>  [56] "  --mp <int>,<int>   max and min penalties for mismatch; lower qual = lower penalty <6,2>"                  
#>  [57] "  --sp <int>,<int>   max and min penalties for soft-clipping; lower qual = lower penalty <2,1>"             
#>  [58] "  --no-softclip      no soft-clipping"                                                                      
#>  [59] "  --np <int>         penalty for non-A/C/G/Ts in read/ref (1)"                                              
#>  [60] "  --rdg <int>,<int>  read gap open, extend penalties (5,3)"                                                 
#>  [61] "  --rfg <int>,<int>  reference gap open, extend penalties (5,3)"                                            
#>  [62] "  --score-min <func> min acceptable alignment score w/r/t read length"                                      
#>  [63] "                     (L,0.0,-0.2)"                                                                          
#>  [64] ""                                                                                                           
#>  [65] " Reporting:"                                                                                                
#>  [66] "  -k <int> (default: 5) report up to <int> alns per read"                                                   
#>  [67] ""                                                                                                           
#>  [68] " Paired-end:"                                                                                               
#>  [69] "  -I/--minins <int>  minimum fragment length (0), only valid with --no-spliced-alignment"                   
#>  [70] "  -X/--maxins <int>  maximum fragment length (500), only valid with --no-spliced-alignment"                 
#>  [71] "  --fr/--rf/--ff     -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)"                                       
#>  [72] "  --no-mixed         suppress unpaired alignments for paired reads"                                         
#>  [73] "  --no-discordant    suppress discordant alignments for paired reads"                                       
#>  [74] ""                                                                                                           
#>  [75] " Output:"                                                                                                   
#>  [76] "  -t/--time          print wall-clock time taken by search phases"                                          
#>  [77] "  --summary-file     print alignment summary to this file."                                                 
#>  [78] "  --new-summary      print alignment summary in a new style, which is more machine-friendly."               
#>  [79] "  --quiet            print nothing to stderr except serious errors"                                         
#>  [80] "  --met-file <path>  send metrics to file at <path> (off)"                                                  
#>  [81] "  --met-stderr       send metrics to stderr (off)"                                                          
#>  [82] "  --met <int>        report internal counters & metrics every <int> secs (1)"                               
#>  [83] "  --no-head          supppress header lines, i.e. lines starting with @"                                    
#>  [84] "  --no-sq            supppress @SQ header lines"                                                            
#>  [85] "  --rg-id <text>     set read group id, reflected in @RG line and RG:Z: opt field"                          
#>  [86] "  --rg <text>        add <text> (\"lab:value\") to @RG line of SAM header."                                 
#>  [87] "                     Note: @RG line only printed when --rg-id is set."                                      
#>  [88] "  --omit-sec-seq     put '*' in SEQ and QUAL fields for secondary alignments."                              
#>  [89] ""                                                                                                           
#>  [90] " Performance:"                                                                                              
#>  [91] "  -o/--offrate <int> override offrate of index; must be >= index's offrate"                                 
#>  [92] "  -p/--threads <int> number of alignment threads to launch (1)"                                             
#>  [93] "  --reorder          force SAM output order to match order of input reads"                                  
#>  [94] "  --mm               use memory-mapped I/O for index; many 'hisat2's can share"                             
#>  [95] ""                                                                                                           
#>  [96] " Other:"                                                                                                    
#>  [97] "  --qc-filter        filter out reads that are bad according to QSEQ filter"                                
#>  [98] "  --seed <int>       seed for random number generator (0)"                                                  
#>  [99] "  --non-deterministic seed rand. gen. arbitrarily instead of using read attributes"                         
#> [100] "  --remove-chrname   remove 'chr' from reference names in alignment"                                        
#> [101] "  --add-chrname      add 'chr' to reference names in alignment "                                            
#> [102] "  --version          print version information and quit"                                                    
#> [103] "  -h/--help          print this usage message"

5 Session info

sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
#>  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
#> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] Rhisat2_1.4.0    BiocStyle_2.16.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4.6                lattice_0.20-41             prettyunits_1.1.1          
#>  [4] Rsamtools_2.4.0             Biostrings_2.56.0           assertthat_0.2.1           
#>  [7] digest_0.6.25               SGSeq_1.22.0                BiocFileCache_1.12.0       
#> [10] R6_2.4.1                    GenomeInfoDb_1.24.0         stats4_4.0.0               
#> [13] RSQLite_2.2.0               evaluate_0.14               httr_1.4.1                 
#> [16] pillar_1.4.3                zlibbioc_1.34.0             rlang_0.4.5                
#> [19] GenomicFeatures_1.40.0      progress_1.2.2              curl_4.3                   
#> [22] blob_1.2.1                  S4Vectors_0.26.0            RUnit_0.4.32               
#> [25] Matrix_1.2-18               rmarkdown_2.1               BiocParallel_1.22.0        
#> [28] stringr_1.4.0               igraph_1.2.5                RCurl_1.98-1.2             
#> [31] bit_1.1-15.2                biomaRt_2.44.0              DelayedArray_0.14.0        
#> [34] compiler_4.0.0              rtracklayer_1.48.0          xfun_0.13                  
#> [37] pkgconfig_2.0.3             askpass_1.1                 BiocGenerics_0.34.0        
#> [40] htmltools_0.4.0             openssl_1.4.1               tidyselect_1.0.0           
#> [43] SummarizedExperiment_1.18.0 tibble_3.0.1                GenomeInfoDbData_1.2.3     
#> [46] bookdown_0.18               matrixStats_0.56.0          IRanges_2.22.0             
#> [49] XML_3.99-0.3                crayon_1.3.4                dplyr_0.8.5                
#> [52] dbplyr_1.4.3                GenomicAlignments_1.24.0    bitops_1.0-6               
#> [55] rappdirs_0.3.1              grid_4.0.0                  lifecycle_0.2.0            
#> [58] DBI_1.1.0                   magrittr_1.5                stringi_1.4.6              
#> [61] XVector_0.28.0              ellipsis_0.3.0              vctrs_0.2.4                
#> [64] tools_4.0.0                 bit64_0.9-7                 Biobase_2.48.0             
#> [67] glue_1.4.0                  purrr_0.3.4                 hms_0.5.3                  
#> [70] parallel_4.0.0              yaml_2.2.1                  AnnotationDbi_1.50.0       
#> [73] BiocManager_1.30.10         GenomicRanges_1.40.0        memoise_1.1.0              
#> [76] knitr_1.28

References

Kim, Daehwan, Ben Langmead, and Steven L Salzberg. 2015. “HISAT: A Fast Spliced Aligner with Low Memory Requirements.” Nat. Methods 12:357.

Aligning reads with Rhisat2

2020-04-27

Package

Contents