Introduction

Using the human reference genome, often called the linear reference (e.g. GRCh38 or hg38), for genomic analysis would be straightforward if variants were uniformly distributed, with only one nucleotide difference every 1,000 nucleotides. Most alignment programs in current use handle that level of variation with great accuracy. However, the distribution of variants is not uniform: some genomic regions, such as HLA genes and DNA fingerprinting loci, are highly polymorphic. A single linear reference may therefore not be the most effective basis for analyzing such regions, and this is where our graph reference comes into play.

HISAT-genotype Set-up

We use HISAT2, currently the most practical and quickest program available, for graph representation and alignment. We refer to hisat-genotype as the top directory where all of our programs are located. hisat-genotype is a placeholder that you can change to whatever name you’d like to use.

Requirements

List

Python 3 or later
Samtools 1.3 or later
C++ Compiler

Description

HISAT-genotype is written in Python 3 and uses only standard Python libraries. Python 3.7 is recommended. Samtools version 1.3 or later is also required.
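The requirements above can be checked before installing. The following is a minimal sketch, not part of HISAT-genotype itself: the helper names (parse_samtools_version, samtools_ok) are hypothetical, and it assumes that the first line of `samtools --version` has the form `samtools <major>.<minor>...`.

```python
import re
import shutil
import subprocess
import sys

# Minimum Samtools version stated in the requirements list.
MIN_SAMTOOLS = (1, 3)


def parse_samtools_version(banner):
    """Extract (major, minor) from the output of `samtools --version`.

    Returns None if no version number can be found.
    """
    match = re.search(r"samtools\s+(\d+)\.(\d+)", banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def samtools_ok():
    """Return True if samtools is on PATH and meets the minimum version."""
    exe = shutil.which("samtools")
    if exe is None:
        return False
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    version = parse_samtools_version(result.stdout)
    return version is not None and version >= MIN_SAMTOOLS


if __name__ == "__main__":
    print("Python 3 or later:", sys.version_info >= (3,))
    print("Samtools 1.3 or later:", samtools_ok())
```

Versions are compared as integer tuples so that, for example, 1.10 is correctly treated as newer than 1.3.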

Building HISAT2 from source requires a GNU-like environment with GCC, GNU Make, and other basics. It should be possible to build HISAT2 on most vanilla Linux installations or on a Mac with Xcode installed. HISAT2 can also be built on Windows using Cygwin or MinGW (MinGW recommended). For a MinGW build, the choice of compiler is important because it determines whether 32-bit or 64-bit code can be compiled successfully. If both 32-bit and 64-bit builds are needed on the same machine, a multilib MinGW must be properly installed. MSYS, the zlib library, and, depending on the architecture, the pthreads library are also required. We recommend a 64-bit build since it has clear advantages for real-life research problems. To simplify the MinGW setup, it may be worth investigating popular MinGW personal builds, since these come already prepared with most of the toolchains needed.

Automated Install - (Mac/Linux)

This is a simple automated method for installing and setting up HISAT-genotype on a Linux/Mac system using Bash. Replace the ~ with whichever directory you’d like to store HISAT-genotype in. The -r option in the setup script will pre-download all of the basic required indices into the HISAT-genotype source directory. If you want to choose where these indices are downloaded, use the -x option followed by the absolute path to the desired location.

git clone https://github.com/DaehwanKimLab/hisat-genotype.git ~/hisatgenotype
cd ~/hisatgenotype
bash setup.sh -r

Check if setup was successful using the following commands:

hisatgenotype --help
hisat2 --help

If either command produces an error, something was not set up properly and you’ll need to run a manual install.
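The same check can be scripted. This is a small sketch using only the Python standard library; the function name missing_commands is hypothetical, and it only tests whether the two commands can be found on PATH, not whether they run correctly.

```python
import shutil


def missing_commands(commands=("hisatgenotype", "hisat2")):
    """Return the commands that cannot be found on the current PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("hisatgenotype and hisat2 were both found on PATH")
```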

setup.sh options

-h | Default : none

Show help screen
-b | Default : False

Do not try to automatically add HISAT-genotype and HISAT2 to your PATH environment variable
-r | Default : False

Pre-download the base indices for HISAT-genotype
-x | Default : [PATH_TO_HISATGENOTYPE]/indicies

If the -r option is set, use this location for the indices instead of the default

NOTE: To pre-download all indices after HISAT-genotype is installed, rather than having them downloaded automatically during a HISAT-genotype run, use bash setup.sh -brx PATH_TO_DIR from within the HISAT-genotype install directory.

Manual Install - (Mac/Linux/Windows)

Downloading HISAT-genotype and Building HISAT2 from source

This download example will place HISAT-genotype in your home (~) directory if you are using a Linux system. Change the ~ to whichever directory you prefer if this is not the behavior you want.

git clone --recurse-submodules https://github.com/DaehwanKimLab/hisat-genotype ~/hisatgenotype
cd ~/hisatgenotype/hisat2
make

Adding HISAT-genotype to PATH

Add the above directory (hisat-genotype) to your PATH environment variable (e.g. in ~/.bashrc) to make the binaries built above and the Python scripts available everywhere:

$ export PATH=~/hisatgenotype:~/hisatgenotype/hisat2:$PATH
$ export PYTHONPATH=~/hisatgenotype/hisatgenotype_modules:$PYTHONPATH

Running HISAT-genotype

Past Manuals

Manual Pre v1.1

hisatgenotype - Analysis of a whole human genome

The hisatgenotype.py Python script analyzes a whole human genome using whole-genome sequencing reads. It aligns reads to the genotype genome, extracts the reads belonging to each locus of interest, and performs typing and assembly.

Usage:

$ hisatgenotype -x [GENOME] --base [GENE_GROUP] -z [INDEX_DIR] [OPTIONS] -1 [FASTQ_PAIR1] -2 [FASTQ_PAIR2]
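For batch scripts it can help to assemble this command line programmatically. The sketch below builds the argument list for a paired-end run using only flags documented in this manual (--base, -1, -2, -z, -p, --out-dir). The function name build_command and its parameters are hypothetical, and the sketch only constructs and prints the command; it does not run it.

```python
import shlex


def build_command(base, read1, read2, index_dir=None, threads=1, out_dir=None):
    """Build a hisatgenotype argument list for one paired-end sample."""
    cmd = ["hisatgenotype", "--base", base, "-1", read1, "-2", read2]
    if index_dir is not None:
        cmd += ["-z", index_dir]        # location of the indices
    if threads != 1:
        cmd += ["-p", str(threads)]     # number of threads
    if out_dir is not None:
        cmd += ["--out-dir", out_dir]   # where results are written
    return cmd


if __name__ == "__main__":
    cmd = build_command("hla", "read.1.fq.gz", "read.2.fq.gz",
                        threads=4, out_dir="results")
    # Print a shell-quoted form for inspection.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the command as a list rather than a single string means it can later be passed to subprocess.run without shell quoting problems.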

Standard Options

-x / --ref-genome | Default : None

Base name for the genome index if it is not genotype_genome. Generally reserved for custom graph genomes that the end user may want to use
Example: -x custom_genome
--base / --base-fname | Default : (empty) all databases

Base file name for index, variants, haplotypes, etc. (e.g. hla, rbg, codis). This can be any database in the hisatgenotype_db folder
Example: --base hla
--locus-list | Default : (empty) all genes

A comma-separated list of gene names
Example: --locus-list A,B,C,DRB1,DQA1,DQB1
-z / --index_dir | Default : pre-downloaded directory or link file (hg_ix.link) location

Set the location of the indices that HISAT-genotype requires
Example: -z ~/hisatgenotype/indicies
-f / --fasta | Default : False

Bool to indicate if reads are provided in FASTA format
Example: -f
-U | Default : None

Single-end read file name
Example: -U read.fq.gz
-1 | Default : None

Paired-end read file name 1
Example: -1 read.1.fq.gz
-2 | Default : None

Paired-end read file name 2
Example: -2 read.2.fq.gz
--keep-alignment | Default : False

Bool to keep the alignment BAM file if typing from FASTQ(A). If typing from a BAM file this is irrelevant.
Example: --keep-alignment
--in-dir | Default : None

Directory that HISAT-genotype will search for FASTQ(A) files to batch process. HISAT-genotype will attempt to pair the files automatically if --single-end isn’t set. Name the two files of each pair so that they differ in only one place, which makes it easier for HISAT-genotype to pair them (e.g. hg_granulocyte_samp1_L.fastq and hg_granulocyte_samp1_R.fastq)
Example: --in-dir input_fastq_dir
--out-dir | Default : /hisatgenotype_out

Directory where all resulting files will be placed. This includes the typing results, BAM files, and assembly files.
Example: --out-dir results
--bamfile | Default : None

BAM file name if using already aligned reads
Example: --bamfile hg_granulocyte_samp1.bam
--single-end | Default : False

Bool to indicate that the input file(s) contain single-end reads. Only needed with --in-dir or --bamfile
Example: --single-end
--assembly | Default : disabled

Perform assembly of each locus of interest
Example: --assembly
--assembly-name | Default : assembly_graph

Assembly base file name to use
Example: --assembly-name hg_granulocyte
--assembly-verbose | Default : False

Bool to output additional assembly information
Example: --assembly-verbose
-p / --threads | Default : 1

Number of threads to be used
Example: -p 4
--verbose | Default : False

Provide more information
Example: --verbose
-h / --help | Default : False

Output help message
Example: --help
--advanced-help | Default : False

Output help message for advanced options
Example: --advanced-help
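The --in-dir option above relies on paired files having names that differ in a single place. The sketch below illustrates that naming idea only; it is not HISAT-genotype’s actual pairing code. The function names (differ_by_one, pair_files) are hypothetical, and the rule used here, equal-length names that differ at exactly one character, is an assumption made for illustration.

```python
def differ_by_one(a, b):
    """True if two names have equal length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def pair_files(names):
    """Greedily pair file names that differ by a single character.

    Returns (pairs, singles): a list of 2-tuples and a list of unpaired names.
    """
    remaining = sorted(names)
    pairs, singles = [], []
    while remaining:
        first = remaining.pop(0)
        # Take the first remaining name that differs from `first` in one place.
        mate = next((n for n in remaining if differ_by_one(first, n)), None)
        if mate is None:
            singles.append(first)
        else:
            remaining.remove(mate)
            pairs.append((first, mate))
    return pairs, singles


if __name__ == "__main__":
    files = [
        "hg_granulocyte_samp1_R.fastq",
        "hg_granulocyte_samp1_L.fastq",
        "unpaired_sample.fastq",
    ]
    pairs, singles = pair_files(files)
    print("pairs:", pairs)
    print("singles:", singles)
```

A rule this simple can mispair files when sample names themselves differ by one character (for example samp1_L and samp2_L), which is one reason to keep pair names as distinctive as the example in the option description.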

Advanced Options

--keep-extract | Default : False

Bool to keep extracted read fastq files
Example: --keep-extract
--build-base | Default : False

Build the indexes listed in --base-fname
--aligner | Default : hisat2

Set the aligner to use (e.g. hisat2, bowtie2)
--linear-index | Default : False

Use linear index
--num-mismatch | Default : 0

Maximum number of mismatches per read alignment to be considered
--inter-gap

Maximum distance for variants to be in the same haplotype
--intra-gap

Break a haplotype into several haplotypes
--whole-haplotype

Include partial alleles (e.g. A_nuc.fasta)
--min-var-freq | Default : 0.0

Exclude variants whose frequency is below this value (in percent)
--ext-seq | Default : 0

Length of extra sequences flanking backbone sequences
--leftshift | Default : False

Shift deletions to the leftmost position
--suffix | Default : fq.gz

Read file suffix
--simulation | Default : False

Use simulated reads
--pp, --threads-aprocess | Default : 1

Number of threads per process
--max-sample | Default : sys.maxint

Number of samples to be extracted
--job-range | Default : 0,1

Two comma-separated numbers (e.g. 1,3)
--extract-whole | Default : False

Extract all reads
--no-partial | Default : False

Do not include partial alleles (e.g. A_nuc.fasta)
--simulate-interval | Default : 10

Simulate reads at intervals of this many base pairs
--read-len | Default : 100

Length of simulated reads
--fragment-len | Default : 350

Length of fragments
--best-alleles | Default : False

Placeholder
--random-seed | Default : 1

A seeding number for randomness
--num-editdist | Default : 2

Maximum number of mismatches per read alignment to be considered
--perbase-errorrate | Default : 0.0

Per basepair error rate in percentage when simulating
--perbase-snprate | Default : 0.0

Per basepair SNP rate in percentage when simulating
--skip-fragment-regions | Default : None

A comma-separated list of regions from which no reads originate, e.g., 500-600,1200-1400
--verbose-level | Default : 0

Also print some statistics to stderr
--no-error-correction | Default : False

Do not correct sequencing errors
--only-locus-list | Default : (empty) all genes

A comma-separated list of genes
--discordant | Default : False

Allow discordantly mapped pairs or singletons
--type-primary-exons | Default : False

Look at primary exons first
--keep-low-abundance-alleles | Default : False

Do not remove alleles with low abundance while performing typing
--display-alleles | Default : None

A comma-separated list of alleles to display in HTML
--debug | Default : None

Test database or code
(options: basic, pair, full, single-end, test_list, test_id)
Example: --debug test_id:10,basic
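Several of the advanced options above take comma-separated values. As an illustration of the --skip-fragment-regions format (e.g. 500-600,1200-1400), the sketch below parses such a string into a list of (start, end) integer pairs. The function name parse_regions is hypothetical, and this is not HISAT-genotype’s own parser; whether the end coordinate is inclusive is not specified here and is left to the tool.

```python
def parse_regions(text):
    """Parse a string such as '500-600,1200-1400' into [(500, 600), (1200, 1400)]."""
    regions = []
    for item in text.split(","):
        start_text, end_text = item.strip().split("-")
        start, end = int(start_text), int(end_text)
        if start > end:
            raise ValueError("region start is after its end: " + item)
        regions.append((start, end))
    return regions


if __name__ == "__main__":
    print(parse_regions("500-600,1200-1400"))
```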

hisatgenotype_toolkit - Individual Scripts and Tools for Custom Pipelines

Documentation for these tools is a work in progress.

Usage:

$ hisatgenotype_toolkit <BASE_TOOL> [TOOL_OPTIONS]

build-genome

(hisatgenotype_build_genome.py)

call-variants

(hisatgenotype_call_variants.py)

convert-codis

(hisatgenotype_convert_codis.py)

extract-RBG

(hisatgenotype_extract_RBG.py)

extract-codis-data

(hisatgenotype_extract_codis_data.py)

extract-cyp-data

(hisatgenotype_extract_cyp_data.py)

extract-reads

(hisatgenotype_extract_reads.py)

extract-vars

(hisatgenotype_extract_vars.py)

legacy

(hisatgenotype_legacy.py)

locus

(hisatgenotype_locus.py)

locus-samples

(hisatgenotype_locus_samples.py)

parse-results

(hisatgenotype_parse_results.py)

--in-dir | Default : Current Directory

Input directory where HISAT-genotype .report files can be found
-t / --trim | Default : 4/All

Trim the reported alleles to 1 (A*01), 2 (A*01:01), 3 (A*01:01:01), or all (A*01:01:01:01) fields
--csv | Default : False

Return the formatted output as a tab-delimited file (TSV) for use in spreadsheets
--output-file | Default : HG_report_results.csv

Name of the output CSV file
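The effect of -t / --trim can be illustrated with a few lines of Python. This is a sketch of the field trimming described above, not the code used by parse-results; the function name trim_allele is hypothetical, and it assumes allele names of the form GENE*field1:field2:... with colon-separated fields.

```python
def trim_allele(allele, fields):
    """Keep only the first `fields` colon-separated fields of an allele name."""
    gene, star, rest = allele.partition("*")
    if not star:
        # No '*' separator: return the name unchanged.
        return allele
    return gene + "*" + ":".join(rest.split(":")[:fields])


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, trim_allele("A*01:01:01:01", n))
```

Any suffix attached to the last field of an allele name is dropped along with that field when it is trimmed away, so trimmed names should be compared with that in mind.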