- Introduction
- HISAT-genotype Set-up
- Running HISAT-genotype
Introduction
Using the human reference genome, referred to as the linear reference (e.g. GRCh38 and hg38), for genomic analysis would be rather straightforward if our variants were uniformly distributed with only one nucleotide difference every 1,000 nucleotides, which most of the currently used alignment programs could handle with great accuracy. However, the distribution of variants is not uniform. Some genomic regions such as HLA genes and DNA fingerprinting loci are highly polymorphic. So using the reference genome for analyzing such highly polymorphic regions may not be the most effective approach, and this is where our graph reference comes into play.
HISAT-genotype Set-up
We use HISAT2 for graph representation and alignment, which is currently the most practical and quickest program available. We refer to hisat-genotype as our top directory where all of our programs are located. hisat-genotype is a place holder that you can change to whatever name you’d like to use.
Requirements
List
- Python 3<
- Samtools 1.3<
- C++ Compiler
Description
HISAT-genotype is encoded in PYTHON 3 and contains standard python libraries. A python 3.7 release is recommended. Additional software required is Samtools version 1.3 or later.
Building HISAT2 from source requires a GNU-like environment with GCC, GNU Make and other basics. It should be possible to build HISAT2 on most vanilla Linux installations or on a Mac installation with Xcode installed. HISAT2 can also be built on Windows using Cygwin or MinGW (MinGW recommended). For a MinGW build the choice of what compiler is to be used is important since this will determine if a 32 or 64 bit code can be successfully compiled using it. If there is a need to generate both 32 and 64 bit on the same machine then a multilib MinGW has to be properly installed. MSYS, the zlib library, and depending on architecture pthreads library are also required. We are recommending a 64 bit build since it has some clear advantages in real life research problems. In order to simplify the MinGW setup it might be worth investigating popular MinGW personal builds since these are coming already prepared with most of the toolchains needed.
Automated Install - (Mac/Linux)
This is a simple automated method for installing and getting HISAT-genotype setup on a Linux/Mac system using Bash. Replace the ~ which whichever directory you’d like to store HISAT-genotype in. The -r option in the setup script will pre-download all of the basic requited indicies into the HISAT-genotype source directory. If you want to manually direct where these indicies are downloaded use the -x option followed by the absolute path to the desired location.
git clone https://github.com/DaehwanKimLab/hisat-genotype.git ~/hisatgenotype
cd hisatgenotype
bash setup.sh -r
Check if setup was successful using the following commands:
hisatgenotype --help
hisat2 --help
If there is an error then something did not set-up properly and you’ll need to run a manual install
setup.sh options
- -h | Default : none
Show help screen
- -b | Default : False
Do not try to automatically add HISAT-genotype and HISAT2 to your Path environment
- -r | Default : False
Pre-download the base indicies for HISAT-genotype
- -x | Default : [PATH_TO_HISATGENOTYPE]/indicies
If -r option is set, set desired location for indicies if different than default
NOTE: If you want to predownload all indicies before running HISAT-genotype but after HISAT-genotype install and not automatically during HISAT-genotype run, you can use bash setup.sh -brx PATH_TO_DIR
while in HISAT-genotype install directory.
Manual Install - (Mac/Linux/Windows)
Downloading HISAT-genotype and Building HISAT2 from source
This download example will place HISAT-genotype in your home (~) directory if you are using a linux system. Change the ~ to whichever directory you desire if this is not the behavior you want.
git clone --recurse-submodules https://github.com/DaehwanKimLab/hisat-genotype ~/hisatgenotype
cd ~/hisatgenotype/hisat2
$ make
Adding HISAT-genotype to PATH
Add the above directory (hisat-genotype) to your PATH environment variable (e.g. ~/.bashrc) to make the binaries built above and other python scripts available everywhere:
$ export PATH=~/hisatgenotype:~/hisatgenotype/hisat2:$PATH
$ export PYTHONPATH=~/hisatgenotype/hisatgenotype_modules:$PYTHONPATH
Running HISAT-genotype
Past Manuals
hisatgenotype - Analysis of a whole human genome
The hisatgenotype.py python script will analyze a whole human genome using whole genome sequencing reads. It will align reads to genotype genome, extract reads belonging to each locus of interest, and perform typing and assembly.
Usage:
$ hisatgenotype -x [GENOME] --base [GENE_GROUP] -z [INDEX_DIR] [OPTIONS] -1 [FASTQ_PAIR1] -2 [FASTQ_PAIR2]
Standard Options
- -x / --ref-genome | Default : None
Base name for genome index if not genotype_genome. Generally reserved for custom graph genomes end user may want to use
Example:-x custome_genome
- --base / --base-fname | Default : (empty) all databases
Base file name for index, variants, haplotypes, etc. (e.g. hla, rbg, codis). This will be anything in the hisatgenotype_db folder
Example:--base hla
- --locus-list | Default : (empty) all genes
A comma-separated list of gene names Example:
--locus-list A,B,C,DRB1,DQA1,DQB1
- -z / --index_dir | Default : pre-downloaded directory or link file (hg_ix.link) location
Set location for the indecies HISATgenotype requires Example:
-z ~/hisatgenotype/indicies
- -f / --fasta | Default :
False
Bool to indicate if reads are provided in FASTA format
Example:-f
- -U | Default : None
Single-end read file name
Example:-U read.fq.gz
- -1 | Default : None
Paired-end read file name 1
Example:-1 read.1.fq.gz
- -2 | Default : None
Paired-end read file name 2
Example:-2 read.2.fq.gz
- --keep-alignemnt | Default :
False
Bool to keep the alignment BAM file if typing from FASTQ(A). If typing from a BAM file this is irrelevent.
Example:--keep-alignment
- --in-dir | Default : None
Directory HISATgenotpye will search for FASTQ(A) files and batch process. HISAT-genotype will attempt to automatically pair the files if
--single-end
isn’t set. Try to have the names be similar between the pairs with a single difference to make it easier for HISAT-genotype to pair the files (e.g. hg_granulocyte_samp1_L.fastq and hg_granulocyte_samp1_R.fastq)
Example:--in-dir input_fastq_dir
- --out-dir | Default :
/hisatgenotype_out
Directory where all resulting files will be placed. This includes the typing results, BAM files, and assembly files.
Example:--out-dir results
- --bamfile | Default : None
BAM file name if using already aligned reads
Example:--bamfile hg_granulocyte_samp1.bam
- --single-end | Default :
False
Bool to indicate if file(s) is/are single ended. Only needed with
--in-dir
or--bamfile
Example:--single-end
- --assembly | Default : disabled
Perform assembly of each locus of interest
Example:--assembly
- --assembly-name | Default :
assemply_graph
Assembly base file name to use
Example:--assembly-name hg_granulocyte
- --assembly-verbose | Default :
False
Bool to output additional assembly information
Example:--assembly-verbose
- -p / --threads | Default : 1
Number of threads to be used
Example:-p 4
- --verbose | Default :
False
Provide more information Example:
--verbose
- -h / --help | Default :
False
Output help message
Example:--help
- --advanced-help | Default :
False
Output help message for advanced options
Example:--advanced-help
Advanced Options
- --keep-extract | Default :
False
Bool to keep extracted read fastq files
Example:--keep-extract
- --build-base | Default :
False
Build the indexes listed in --base-fname
- --aligner | Default :
hisat2
Set aligner to use (ex. hisat2, bowtie2)
- --linear-index | Default :
False
Use linear index
- --num-mismatch | Default :
0
Maximum number of mismatches per read alignment to be considered
- --inter-gap
Maximum distance for variants to be in the same haplotype
- --intra-gap
Break a haplotype into several haplotypes
- --whole-haplotype
Include partial alleles (e.g. A_nuc.fasta)
- --min-var-freq | Default :
0.0
Exclude variants whose freq is below than this value in percentage
- --ext-seq | Default :
0
Length of extra sequences flanking backbone sequences
- --leftshift | Default :
False
Shift deletions to the leftmost
- --suffix | Default :
fq.gz
Read file suffix
- --simulation | Default :
False
Simulated reads (Default: False)
- --pp, --threads-aprocess | Default :
1
Number of threads a process
- --max-sample | Default :
sys.maxint
Number of samples to be extracted
- --job-range | Default :
0,1
two numbers (e.g. 1,3)
- --extract-whole | Default :
False
Extract all reads
- --no-partial | Default :
False
Include partial alleles (e.g. A_nuc.fasta)
- --simulate-interval | Default :
10
Reads simulated at every these base pairs
- --read-len | Default :
100
Length of simulated reads
- --fragment-len | Default : `350
Length of fragments
- --best-alleles | Default :
False
Placeholder
- --random-seed | Default :
1
A seeding number for randomness
- --num-editdist | Default :
2
Maximum number of mismatches per read alignment to be considered
- --perbase-errorrate | Default :
0.0
Per basepair error rate in percentage when simulating
- --perbase-snprate | Default :
0.0
Per basepair SNP rate in percentage when simulating
- --skip-fragment-regions | Default : None
A comma-separated list of regions from which no reads originate, e.g., 500-600,1200-1400
- --verbose-level | Default :
0
also print some statistics to stderr
- --no-error-correction | Default :
False
Correct sequencing errors
- --only-locus-list | Default : (empty) all genes
A comma-separated list of genes
- --discordant | Default :
False
Allow discordantly mapped pairs or singletons
- --type-primary-exons | Default :
False
Look at primary exons first
- --keep-low-abundance-alleles | Default :
False
Do not remove alleles with low abundance while performing typing
- --display-alleles | Default : None
A comma-separated list of alleles to display in HTML
- --debug | Default : None
Test database or code
(options: basic, pair, full, single-end, test_list, test_id)
Example:--debug test_id:10,basic
hisatgenotype_toolkit - Individual Scripts and Tools for Custom Pipelines
Work In Progress to document
Usage:
$ hisatgenotype_toolkit <BASE_TOOL> [TOOL_OPTIONS]
build-genome
(hisatgenotype_build_genome.py)
call-variants
(hisatgenotype_call_variants.py)
convert-codis
(hisatgenotype_convert_codis.py)
extract-RBG
(hisatgenotype_extract_RBG.py)
extract-codis-data
(hisatgenotype_extract_codis_data.py)
extract-cyp-data
(hisatgenotype_extract_cyp_data.py)
extract-reads
(hisatgenotype_extract_reads.py)
extract-vars
(hisatgenotype_extract_vars.py)
legacy
(hisatgenotype_legacy.py)
locus
(hisatgenotype_locus.py)
locus-samples
(hisatgenotype_locus_samples.py)
parse-results
(hisatgenotype_parse_results.py)
- --in-dir | Default : Current Directory
Input directory where HISAT-genotype .report files can be found
- -t / --trim | Default : 4/All
Trim the reported alleles to 1 (A*01), 2 (A*01:01), 3 (A*01:01:01), or all (A*01:01:01:01) fields
- --csv | Default : False
Return the formated output in a tab deliminated csv file (tsv) for use in spreadsheets
- --output-file | Default : HG_report_results.csv
Name of csv file