Rhisat2 1.4.0
The Rhisat2 package provides an interface to the HISAT2 short-read aligner software (Kim, Langmead, and Salzberg 2015).
Use the BiocManager
package to download and install the package from
Bioconductor as follows:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Rhisat2")
If required, the latest development version of the package can also be installed from GitHub.
BiocManager::install("fmicompbio/Rhisat2")
Once the package is installed, load it into your R session:
library(Rhisat2)
To build an index for the alignment, use the hisat2_build
function. You need
to provide one or more fasta file with reference sequences, as well as an output
directory where the index will be stored, and a “prefix” (that will determine
the name of the index files in the output directory). Any additional arguments
to the hisat2-build
binary can also be supplied to the function (see
hisat2_build_usage()
or
https://ccb.jhu.edu/software/hisat2/manual.shtml#the-hisat2-build-indexer
for an overview of the available options).
The package contains three example reference sequences, corresponding to short
pieces of three different chromosomes. We show how to create an index (named
myindex
) based on these sequences.
list.files(system.file("extdata/refs", package="Rhisat2"), pattern="\\.fa$")
#> [1] "chr1.fa" "chr2.fa" "chr3.fa"
refs <- list.files(system.file("extdata/refs", package="Rhisat2"),
full.names=TRUE, pattern="\\.fa$")
td <- tempdir()
hisat2_build(references=refs, outdir=td, prefix="myindex",
force=TRUE, strict=TRUE, execute=TRUE)
By instead setting execute=FALSE
in the command above, hisat2_build()
will
return the index building shell command for inspection, without executing it.
print(hisat2_build(references=refs, outdir=td, prefix="myindex",
force=TRUE, strict=TRUE, execute=FALSE))
#> [1] "'/tmp/Rtmpv5m6vC/Rinst6fe4218cf25a/Rhisat2/hisat2-build' '/tmp/Rtmpv5m6vC/Rinst6fe4218cf25a/Rhisat2/extdata/refs/chr1.fa','/tmp/Rtmpv5m6vC/Rinst6fe4218cf25a/Rhisat2/extdata/refs/chr2.fa','/tmp/Rtmpv5m6vC/Rinst6fe4218cf25a/Rhisat2/extdata/refs/chr3.fa' '/tmp/RtmpOWnjsc/myindex'"
After creating the index, reads can be aligned using the hisat2
wrapper
function. Most commonly, the reads will be provided in fastq
files (one file
for single-end reads, two files for paired-end reads). The names of these files
can be provided directly to the hisat2
function, either as a vector (for
single-end reads) or as a list of length 2 (for paired-end reads, each list
element corresponds to one mate). You also need to provide the path to the index
generated by hisat2_build
(the output directory combined with the prefix), and
the output file name where the alignments should be written.
list.files(system.file("extdata/reads", package="Rhisat2"),
pattern="\\.fastq$")
#> [1] "reads1.fastq" "reads2.fastq"
reads <- list.files(system.file("extdata/reads", package="Rhisat2"),
pattern="\\.fastq$", full.names=TRUE)
hisat2(sequences=as.list(reads), index=file.path(td, "myindex"),
type="paired", outfile=file.path(td, "out.sam"),
force=TRUE, strict=TRUE, execute=TRUE)
As for hisat2_build()
, any additional arguments to the hisat2
binary can be
provided to the hisat2()
function (see hisat2_usage()
or
https://ccb.jhu.edu/software/hisat2/manual.shtml#running-hisat2
for an overview of the available options).
With HISAT2, you can provide a file with known splice sites at the alignment
step, which can help in finding the correct alignments across known splice
junctions. A text file in the required format can be generated using the
extract_splice_sites()
function, starting from an annotation file in gtf or
gff3 format, or from a GRanges
or TxDb
object.
spsfile <- tempfile()
gtf <- system.file("extdata/refs/genes.gtf", package="Rhisat2")
extract_splice_sites(features=gtf, outfile=spsfile)
#> Import genomic features from the file as a GRanges object ... OK
#> Prepare the 'metadata' data frame ... OK
#> Make the TxDb object ... OK
hisat2(sequences=as.list(reads), index=file.path(td, "myindex"),
type="paired", outfile=file.path(td, "out_sps.sam"),
`known-splicesite-infile`=spsfile,
force=TRUE, strict=TRUE, execute=TRUE)
To get the version of HISAT2
:
hisat2_version()
#> [1] "/tmp/Rtmpv5m6vC/Rinst6fe4218cf25a/Rhisat2/hisat2 version 2.1.0"
#> [2] "64-bit"
#> [3] "Built on malbec2"
#> [4] "Mon Apr 27 23:53:45 EDT 2020"
#> [5] "Compiler: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) "
#> [6] "Options: -O3 -m64 -msse2 -funroll-loops -g3 -DPOPCNT_CAPABILITY"
#> [7] "Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}"
To see the usage of hisat2_build()
:
hisat2_build_usage()
#> [1] "HISAT2 version 2.1.0 by Daehwan Kim ([email protected], http://www.ccb.jhu.edu/people/infphilo)"
#> [2] "Usage: hisat2-build [options]* <reference_in> <ht2_index_base>"
#> [3] " reference_in comma-separated list of files with ref sequences"
#> [4] " hisat2_index_base write ht2 data to files with this dir/basename"
#> [5] "Options:"
#> [6] " -c reference sequences given on cmd line (as"
#> [7] " <reference_in>)"
#> [8] " -a/--noauto disable automatic -p/--bmax/--dcv memory-fitting"
#> [9] " -p number of threads"
#> [10] " --bmax <int> max bucket sz for blockwise suffix-array builder"
#> [11] " --bmaxdivn <int> max bucket sz as divisor of ref len (default: 4)"
#> [12] " --dcv <int> diff-cover period for blockwise (default: 1024)"
#> [13] " --nodc disable diff-cover (algorithm becomes quadratic)"
#> [14] " -r/--noref don't build .3/.4.ht2 (packed reference) portion"
#> [15] " -3/--justref just build .3/.4.ht2 (packed reference) portion"
#> [16] " -o/--offrate <int> SA is sampled every 2^offRate BWT chars (default: 5)"
#> [17] " -t/--ftabchars <int> # of chars consumed in initial lookup (default: 10)"
#> [18] " --localoffrate <int> SA (local) is sampled every 2^offRate BWT chars (default: 3)"
#> [19] " --localftabchars <int> # of chars consumed in initial lookup in a local index (default: 6)"
#> [20] " --snp <path> SNP file name"
#> [21] " --haplotype <path> haplotype file name"
#> [22] " --ss <path> Splice site file name"
#> [23] " --exon <path> Exon file name"
#> [24] " --seed <int> seed for random number generator"
#> [25] " -q/--quiet verbose output (for debugging)"
#> [26] " -h/--help print detailed description of tool and its options"
#> [27] " --usage print this usage message"
#> [28] " --version print version information and quit"
And similarly to see the usage of hisat2()
:
hisat2_usage()
#> [1] "HISAT2 version 2.1.0 by Daehwan Kim ([email protected], www.ccb.jhu.edu/people/infphilo)"
#> [2] "Usage: "
#> [3] " hisat2-align [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]"
#> [4] ""
#> [5] " <ht2-idx> Index filename prefix (minus trailing .X.ht2)."
#> [6] " <m1> Files with #1 mates, paired with files in <m2>."
#> [7] " <m2> Files with #2 mates, paired with files in <m1>."
#> [8] " <r> Files with unpaired reads."
#> [9] " <sam> File for SAM output (default: stdout)"
#> [10] ""
#> [11] " <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be"
#> [12] " specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'."
#> [13] ""
#> [14] "Options (defaults in parentheses):"
#> [15] ""
#> [16] " Input:"
#> [17] " -q query input files are FASTQ .fq/.fastq (default)"
#> [18] " --qseq query input files are in Illumina's qseq format"
#> [19] " -f query input files are (multi-)FASTA .fa/.mfa"
#> [20] " -r query input files are raw one-sequence-per-line"
#> [21] " -c <m1>, <m2>, <r> are sequences themselves, not files"
#> [22] " -s/--skip <int> skip the first <int> reads/pairs in the input (none)"
#> [23] " -u/--upto <int> stop after first <int> reads/pairs (no limit)"
#> [24] " -5/--trim5 <int> trim <int> bases from 5'/left end of reads (0)"
#> [25] " -3/--trim3 <int> trim <int> bases from 3'/right end of reads (0)"
#> [26] " --phred33 qualities are Phred+33 (default)"
#> [27] " --phred64 qualities are Phred+64"
#> [28] " --int-quals qualities encoded as space-delimited integers"
#> [29] ""
#> [30] " Alignment:"
#> [31] " --n-ceil <func> func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)"
#> [32] " --ignore-quals treat all quality values as 30 on Phred scale (off)"
#> [33] " --nofw do not align forward (original) version of read (off)"
#> [34] " --norc do not align reverse-complement version of read (off)"
#> [35] ""
#> [36] " Spliced Alignment:"
#> [37] " --pen-cansplice <int> penalty for a canonical splice site (0)"
#> [38] " --pen-noncansplice <int> penalty for a non-canonical splice site (12)"
#> [39] " --pen-canintronlen <func> penalty for long introns (G,-8,1) with canonical splice sites"
#> [40] " --pen-noncanintronlen <func> penalty for long introns (G,-8,1) with noncanonical splice sites"
#> [41] " --min-intronlen <int> minimum intron length (20)"
#> [42] " --max-intronlen <int> maximum intron length (500000)"
#> [43] " --known-splicesite-infile <path> provide a list of known splice sites"
#> [44] " --novel-splicesite-outfile <path> report a list of splice sites"
#> [45] " --novel-splicesite-infile <path> provide a list of novel splice sites"
#> [46] " --no-temp-splicesite disable the use of splice sites found"
#> [47] " --no-spliced-alignment disable spliced alignment"
#> [48] " --rna-strandness <string> specify strand-specific information (unstranded)"
#> [49] " --tmo reports only those alignments within known transcriptome"
#> [50] " --dta reports alignments tailored for transcript assemblers"
#> [51] " --dta-cufflinks reports alignments tailored specifically for cufflinks"
#> [52] " --avoid-pseudogene tries to avoid aligning reads to pseudogenes (experimental option)\023"
#> [53] " --no-templatelen-adjustment disables template length adjustment for RNA-seq reads"
#> [54] ""
#> [55] " Scoring:"
#> [56] " --mp <int>,<int> max and min penalties for mismatch; lower qual = lower penalty <6,2>"
#> [57] " --sp <int>,<int> max and min penalties for soft-clipping; lower qual = lower penalty <2,1>"
#> [58] " --no-softclip no soft-clipping"
#> [59] " --np <int> penalty for non-A/C/G/Ts in read/ref (1)"
#> [60] " --rdg <int>,<int> read gap open, extend penalties (5,3)"
#> [61] " --rfg <int>,<int> reference gap open, extend penalties (5,3)"
#> [62] " --score-min <func> min acceptable alignment score w/r/t read length"
#> [63] " (L,0.0,-0.2)"
#> [64] ""
#> [65] " Reporting:"
#> [66] " -k <int> (default: 5) report up to <int> alns per read"
#> [67] ""
#> [68] " Paired-end:"
#> [69] " -I/--minins <int> minimum fragment length (0), only valid with --no-spliced-alignment"
#> [70] " -X/--maxins <int> maximum fragment length (500), only valid with --no-spliced-alignment"
#> [71] " --fr/--rf/--ff -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)"
#> [72] " --no-mixed suppress unpaired alignments for paired reads"
#> [73] " --no-discordant suppress discordant alignments for paired reads"
#> [74] ""
#> [75] " Output:"
#> [76] " -t/--time print wall-clock time taken by search phases"
#> [77] " --summary-file print alignment summary to this file."
#> [78] " --new-summary print alignment summary in a new style, which is more machine-friendly."
#> [79] " --quiet print nothing to stderr except serious errors"
#> [80] " --met-file <path> send metrics to file at <path> (off)"
#> [81] " --met-stderr send metrics to stderr (off)"
#> [82] " --met <int> report internal counters & metrics every <int> secs (1)"
#> [83] " --no-head supppress header lines, i.e. lines starting with @"
#> [84] " --no-sq supppress @SQ header lines"
#> [85] " --rg-id <text> set read group id, reflected in @RG line and RG:Z: opt field"
#> [86] " --rg <text> add <text> (\"lab:value\") to @RG line of SAM header."
#> [87] " Note: @RG line only printed when --rg-id is set."
#> [88] " --omit-sec-seq put '*' in SEQ and QUAL fields for secondary alignments."
#> [89] ""
#> [90] " Performance:"
#> [91] " -o/--offrate <int> override offrate of index; must be >= index's offrate"
#> [92] " -p/--threads <int> number of alignment threads to launch (1)"
#> [93] " --reorder force SAM output order to match order of input reads"
#> [94] " --mm use memory-mapped I/O for index; many 'hisat2's can share"
#> [95] ""
#> [96] " Other:"
#> [97] " --qc-filter filter out reads that are bad according to QSEQ filter"
#> [98] " --seed <int> seed for random number generator (0)"
#> [99] " --non-deterministic seed rand. gen. arbitrarily instead of using read attributes"
#> [100] " --remove-chrname remove 'chr' from reference names in alignment"
#> [101] " --add-chrname add 'chr' to reference names in alignment "
#> [102] " --version print version information and quit"
#> [103] " -h/--help print this usage message"
sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
#> [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] Rhisat2_1.4.0 BiocStyle_2.16.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.4.6 lattice_0.20-41 prettyunits_1.1.1
#> [4] Rsamtools_2.4.0 Biostrings_2.56.0 assertthat_0.2.1
#> [7] digest_0.6.25 SGSeq_1.22.0 BiocFileCache_1.12.0
#> [10] R6_2.4.1 GenomeInfoDb_1.24.0 stats4_4.0.0
#> [13] RSQLite_2.2.0 evaluate_0.14 httr_1.4.1
#> [16] pillar_1.4.3 zlibbioc_1.34.0 rlang_0.4.5
#> [19] GenomicFeatures_1.40.0 progress_1.2.2 curl_4.3
#> [22] blob_1.2.1 S4Vectors_0.26.0 RUnit_0.4.32
#> [25] Matrix_1.2-18 rmarkdown_2.1 BiocParallel_1.22.0
#> [28] stringr_1.4.0 igraph_1.2.5 RCurl_1.98-1.2
#> [31] bit_1.1-15.2 biomaRt_2.44.0 DelayedArray_0.14.0
#> [34] compiler_4.0.0 rtracklayer_1.48.0 xfun_0.13
#> [37] pkgconfig_2.0.3 askpass_1.1 BiocGenerics_0.34.0
#> [40] htmltools_0.4.0 openssl_1.4.1 tidyselect_1.0.0
#> [43] SummarizedExperiment_1.18.0 tibble_3.0.1 GenomeInfoDbData_1.2.3
#> [46] bookdown_0.18 matrixStats_0.56.0 IRanges_2.22.0
#> [49] XML_3.99-0.3 crayon_1.3.4 dplyr_0.8.5
#> [52] dbplyr_1.4.3 GenomicAlignments_1.24.0 bitops_1.0-6
#> [55] rappdirs_0.3.1 grid_4.0.0 lifecycle_0.2.0
#> [58] DBI_1.1.0 magrittr_1.5 stringi_1.4.6
#> [61] XVector_0.28.0 ellipsis_0.3.0 vctrs_0.2.4
#> [64] tools_4.0.0 bit64_0.9-7 Biobase_2.48.0
#> [67] glue_1.4.0 purrr_0.3.4 hms_0.5.3
#> [70] parallel_4.0.0 yaml_2.2.1 AnnotationDbi_1.50.0
#> [73] BiocManager_1.30.10 GenomicRanges_1.40.0 memoise_1.1.0
#> [76] knitr_1.28
Kim, Daehwan, Ben Langmead, and Steven L Salzberg. 2015. “HISAT: A Fast Spliced Aligner with Low Memory Requirements.” Nat. Methods 12:357.