HISAT-3N (hierarchical indexing for spliced alignment of transcripts - 3 nucleotides)
is designed for nucleotide conversion sequencing technologies and implemented based on HISAT2.
There are two strategies for HISAT-3N to align nucleotide conversion sequencing reads: standard mode and repeat mode.
The standard mode aligns reads with a standard 3N index only, so it is fast and requires only a small amount of memory (~9GB for human genome alignment).
The repeat mode aligns reads with both a standard 3N index and a repeat 3N index, then outputs up to 1,000 alignment results (the number of outputted alignments can be adjusted by --repeat-limit).
The repeat mode can also align nucleotide conversion reads more accurately,
and it is only 10% slower than the standard mode with slightly higher memory requirements (about 10.5GB).
The HISAT-3N alignment process requires a 64-bit computer running either Linux or Mac OS and at least 16GB of RAM.
A few notes:
- Building the standard 3N index requires 16GB of RAM or less.
- Building the repeat 3N index requires 256GB of RAM.
- The alignment process using either the standard or repeat index requires less than 16GB of RAM.
- SAMtools is required to sort SAM files in order to generate a HISAT-3N table.
git clone https://github.com/DaehwanKimLab/hisat2.git
cd hisat2
git checkout -b hisat-3n origin/hisat-3n
make
Make sure that you select the hisat-3n branch.
Build a 3N index with hisat-3n-build
hisat-3n-build builds a 3N-index, which internally contains two HISAT2 indexes for a set of DNA sequences. For the standard 3N-index,
each index contains 16 files with suffix
For the repeat 3N-index, there are 16 more files in addition to the standard 3N-index, and these files have the suffix
These files constitute the entirety of the HISAT-3N index.
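As a quick sanity check after building, the index files for a given basename can be counted. This is a minimal sketch that assumes the files follow the .3n.*.*.ht2 suffix pattern mentioned in the description of the index basename; the function name count_3n_index_files is hypothetical and not part of HISAT-3N.

```shell
# Hypothetical helper (not part of HISAT-3N): count the files that match
# <basename>.3n.*.*.ht2. The suffix pattern is an assumption taken from
# the description of the index basename in this manual.
count_3n_index_files() {
    count=0
    for f in "$1".3n.*.*.ht2; do
        # An unmatched glob stays literal, so test that the file exists.
        [ -e "$f" ] && count=$((count + 1))
    done
    echo "$count"
}

# Example: count_3n_index_files genome
```

A count of 0 suggests that the basename is wrong or that the build did not finish.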
An example for building a standard HISAT-3N index:
hisat-3n-build genome.fa genome
An example for building a repeat HISAT-3N index, which requires 256GB memory:
hisat-3n-build --repeat-index genome.fa genome
It is optional to make a graph index and add SNP or splice site information to the index, which can increase the alignment accuracy. For more details, please refer to the HISAT2 manual.
# Standard HISAT-3N index with SNPs included
hisat-3n-build --exons genome.exon genome.fa genome

# Standard HISAT-3N index with splice sites included
hisat-3n-build --ss genome.ss genome.fa genome

# Repeat HISAT-3N index with SNPs included
hisat-3n-build --repeat-index --exons genome.exon genome.fa genome

# Repeat HISAT-3N index with splice sites included
hisat-3n-build --repeat-index --ss genome.ss genome.fa genome
After building the HISAT-3N index, you are ready to use hisat-3n for alignment.
HISAT-3N has the same set of parameters as in HISAT2 with some additional arguments. Please refer to the HISAT2 manual for more details.
For the human reference genome, HISAT-3N requires about 9GB for alignment with the standard 3N-index and 10.5GB for the repeat 3N-index.
--base-change <nt1,nt2>
Specify the nucleotide conversion type (e.g., C to T in bisulfite-sequencing reads). The argument is two characters separated by ‘,’: the original nucleotide (nt1) first, then the converted nucleotide (nt2). For example, for SLAM-seq, where some ‘T’s are converted to ‘C’s, input --base-change T,C. For bisulfite-seq, where some ‘C’s are converted to ‘T’s, input --base-change C,T. To align non-converted reads to a regular HISAT2 index, omit this option.
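A wrapper script can check the conversion argument before starting a long alignment run. The sketch below is hypothetical and not part of HISAT-3N: it assumes that the value must be two different uppercase nucleotides from A, C, G, T separated by a comma, which is stricter than anything this manual states.

```shell
# Hypothetical pre-flight check for a --base-change value such as "C,T".
# Assumption: uppercase A/C/G/T only, and the two nucleotides must differ.
valid_base_change() {
    case "$1" in
        [ACGT],[ACGT]) ;;      # exactly "<nt1>,<nt2>"
        *) return 1 ;;
    esac
    # Compare the part before the comma with the part after it.
    [ "${1%,*}" != "${1#*,}" ]
}

# Example: valid_base_change "C,T" && echo "ok"
```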
--index <basename>
Specify the index file basename for HISAT-3N. The basename is the name of the index files up to but not including the suffix (.3n.*.*.ht2, etc.). For example, if you built your index with basename ‘genome’ using hisat-3n-build, input --index genome.
--repeat-limit <int>
Set the number of alignments to check for each repeat alignment. Increase this number to direct hisat-3n to output more alignments when a read has multiple mapping locations. We suggest limiting the repeat number for paired-end read alignment to no more than 1,000,000. Default: 1000.
--unique-only
Only output uniquely aligned reads.
Single-end SLAM-seq read (T to C conversion) alignment with standard 3N-index:
hisat-3n --index genome -f -U read.fa -S output.sam --base-change T,C
Paired-end bisulfite-seq read (C to T conversion) alignment with repeat 3N-index:
hisat-3n --index genome -f -1 read_1.fa -2 read_2.fa -S output.sam --base-change C,T
Single-end TAPS read (C to T conversion) alignment with repeat 3N-index, outputting only uniquely aligned results:
hisat-3n --index genome -q -U read.fq -S output.sam --base-change C,T --unique-only
Extra SAM tags generated by HISAT-3N:
Yf:i:<N>: Number of conversions detected in the read.
YZ:A:<A>: The value + or - indicates whether the read is mapped to REF-3N (+) or REF-RC-3N (-).
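The two tags can be pulled out of an alignment file with standard text tools. The following is a minimal sketch and not part of HISAT-3N: the function name sam_3n_tags is hypothetical, and it relies only on the SAM convention that optional tags start at column 12 of a tab-separated record.

```shell
# Hypothetical helper: print read name, Yf value and YZ value per alignment.
# Reads without a given tag are reported as NA.
sam_3n_tags() {
    awk -F'\t' '!/^@/ {                      # skip SAM header lines
        yf = "NA"; yz = "NA"
        for (i = 12; i <= NF; i++) {         # optional tags start at column 12
            if ($i ~ /^Yf:i:/) yf = substr($i, 6)
            else if ($i ~ /^YZ:A:/) yz = substr($i, 6)
        }
        print $1 "\t" yf "\t" yz
    }' "$1"
}

# Example: sam_3n_tags output.sam | head
```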
Generate a 3N-conversion-table with hisat-3n-table
To generate a 3N-conversion-table, users need to sort the SAM alignment file generated by hisat-3n.
SAMtools is required for this sorting process.
Use samtools sort to convert the SAM file into a sorted SAM file.
samtools sort output.sam -o output_sorted.sam -O sam
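Before building the table, it can be worth confirming that the file really is sorted. This sketch is not part of HISAT-3N or SAMtools: the function name is hypothetical, and it assumes that the header was written by samtools sort, which records coordinate sorting as SO:coordinate on the @HD line.

```shell
# Hypothetical check: succeed only if the SAM header declares coordinate order.
is_coordinate_sorted() {
    awk '/^@HD/ { found = ($0 ~ /SO:coordinate/); exit }
         END    { exit !found }' "$1"
}

# Example:
#   is_coordinate_sorted output_sorted.sam || echo "sort the SAM file first" >&2
```

The check reads only the header, so it does not verify the actual order of the records.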
Then generate the 3N-conversion-table with hisat-3n-table:
hisat-3n-table [options]* --sam <samFile> --ref <refFile> --table-name <tableFile> --base-change <char1,char2>
--sam <samFile>
Specify the sorted SAM filename.
--ref <refFile>
Specify the reference genome file (FASTA format) that was used to generate the HISAT-3N index.
--table-name <tableFile>
Specify the filename in which to write the 3N-conversion-table (TSV format).
--base-change <char1,char2>
Specify the base-change rule. Enter exactly the same --base-change argument that you used in hisat-3n. For example, input --base-change C,T for bisulfite-seq reads.
--unique-only
This directs the program to only count the uniquely aligned reads when constructing the 3N-conversion-table.
--multiple-only
This directs the program to only count the multiply aligned reads when constructing the 3N-conversion-table.
--CG-only
This directs the program to only count bases in CpG islands in the reference genome. This option is designed especially for bisulfite-seq reads.
-p <int>
This directs the program to launch <int> parallel threads for building the table (default: 1).
-h/--help
Display the program’s usage information and quit.
Generate a 3N conversion table for bisulfite sequencing data:
hisat-3n-table -p 16 --sam output_sorted.sam --ref genome.fa --table-name output.tsv --base-change C,T
Generate a 3N-conversion-table for TAPS data and only count bases in CpG islands covered by uniquely aligned reads:
hisat-3n-table -p 16 --sam output_sorted.sam --ref genome.fa --table-name output.tsv --base-change C,T --CG-only --unique-only
There are 7 columns in the 3N-conversion-table:
ref: the chromosome name.
pos: 1-based position in ref.
strand: ‘+’ for forward strand, ‘-’ for reverse strand.
convertedBaseQualities: the qualities of the converted bases in read-level measurement. The length of this string is equal to the number of converted bases.
convertedBaseCount: the number of distinct read positions where converted bases in read-level measurements were found. This number is equal to the length of convertedBaseQualities.
unconvertedBaseQualities: the qualities of the unconverted bases in read-level measurement. The length of this string is equal to the number of unconverted bases in read-level measurement.
unconvertedBaseCount: the number of distinct read positions where unconverted bases in read-level measurements were found. This number is equal to the length of unconvertedBaseQualities.
ref  pos    strand  convertedBaseQualities  convertedBaseCount  unconvertedBaseQualities  unconvertedBaseCount
1    11874  +       FFFFFB<BF<F             11                                            0
1    11877  -       FFFFFF<                 7                                             0
1    11878  +       FFFBB//F/BB             11                                            0
1    11879  +                               0                   FFFBB//FB/                10
1    11880  -       F                       1                   FFFF/                     5

[SAMtools]: http://samtools.sourceforge.net
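A common next step is to turn the two count columns into a per-position conversion fraction. The sketch below is not part of HISAT-3N: the function name conversion_rate is hypothetical, and it assumes a tab-separated file whose first line is the header shown in the example table.

```shell
# Hypothetical helper: print ref, pos, strand and
# convertedBaseCount / (convertedBaseCount + unconvertedBaseCount).
# Columns 5 and 7 hold the two counts; positions with no coverage are skipped.
conversion_rate() {
    awk -F'\t' 'NR > 1 {                     # skip the header line
        total = $5 + $7
        if (total > 0)
            printf "%s\t%s\t%s\t%.3f\n", $1, $2, $3, $5 / total
    }' "$1"
}

# Example: conversion_rate output.tsv > conversion_rate.tsv
```

For bisulfite-seq with --base-change C,T this fraction is the share of reads in which the base was converted. Turning it into a methylation estimate depends on the chemistry and is left to the user.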
Yun Zhang, Chanhee Park, Christopher Bennett, Micah Thornton, and Daehwan Kim
HISAT-3N: a rapid and accurate three-nucleotide sequence aligner. bioRxiv (2020)
Daehwan Kim, Joseph Paggi, Chanhee Park, Christopher Bennett, and Steven Salzberg
Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37, 907–915 (2019)