Introduction

Using the human reference genome, referred to as the linear reference (e.g. GRCh38 and hg38), for genomic analysis would be rather straightforward if our variants were uniformly distributed with only one nucleotide difference every 1,000 nucleotides, which most of the currently used alignment programs could handle with great accuracy. However, the distribution of variants is not uniform. Some genomic regions such as HLA genes and DNA fingerprinting loci are highly polymorphic. So using the reference genome for analyzing such highly polymorphic regions may not be the most effective approach, and this is where our graph reference comes into play.


HISAT-genotype Set-up

We use HISAT2 for graph representation and alignment, which is currently the most practical and quickest program available. We refer to hisat-genotype as our top directory where all of our programs are located. hisat-genotype is a place holder that you can change to whatever name you’d like to use.

Requirements

HISAT-genotype is encoded in python2 and contains standard python libraries. A python 2.7 release is recommended. Additional software required is Samtools version 1.3 or later.

Building HISAT2 from source requires a GNU-like environment with GCC, GNU Make and other basics. It should be possible to build HISAT2 on most vanilla Linux installations or on a Mac installation with Xcode installed. HISAT2 can also be built on Windows using Cygwin or MinGW (MinGW recommended). For a MinGW build the choice of what compiler is to be used is important since this will determine if a 32 or 64 bit code can be successfully compiled using it. If there is a need to generate both 32 and 64 bit on the same machine then a multilib MinGW has to be properly installed. MSYS, the zlib library, and depending on architecture pthreads library are also required. We are recommending a 64 bit build since it has some clear advantages in real life research problems. In order to simplify the MinGW setup it might be worth investigating popular MinGW personal builds since these are coming already prepared with most of the toolchains needed.

Downloading HISAT-genotype and Building HISAT2 from source

This download example will place HISAT-genotype in your home (~) directory if you are using a linux system. Change the ~ to whichever directory you desire if this is not the behavior you want.

$ git clone https://github.com/DaehwanKimLab/hisat-genotype ~/hisat-genotype
$ cd ~/hisat-genotype
hisat-genotype$ git checkout hisatgenotype_v1.1.2_beta
$ make hisat2-align-s hisat2-build-s hisat2-inspect-s

Adding HISAT-genotype to PATH

Add the above directory (hisat-genotype) to your PATH environment variable (e.g. ~/.bashrc) to make the binaries built above and other python scripts available everywhere:

$ export PATH=~/hisat-genotype:~/hisat-genotype/hisatgenotype_scripts:$PATH
$ export PYTHONPATH=~/hisat-genotype/hisatgenotype_modules:$PYTHONPATH

Running HISAT-genotype

Past Manuals

Manual Pre v1.1

hisatgenotype - Analysis of a whole human genome

The hisatgenotype.py python script will analyze a whole human genome using whole genome sequencing reads. It will align reads to genotype genome, extract reads belonging to each locus of interest, and perform typing and assembly.

Usage:

$ hisatgenotype -x [GENOME] [OPTIONS] -1 [FASTQ_PAIR1] -2 [FASTQ_PAIR2]

Standard Options

  • -x / --ref-genome | Default : None

    Base name for genome index if not genotype_genome. Generally reserved for custom graph genomes end user may want to use
    Example: -x custome_genome

  • --base / --base-fname | Default : (empty) all databases

    Base file name for index, variants, haplotypes, etc. (e.g. hla, rbg, codis). This will be anything in the hisatgenotype_db folder
    Example: --base hla

  • --locus-list | Default : (empty) all genes

    A comma-separated list of gene names Example: --locus-list A,B,C,DRB1,DQA1,DQB1

  • -f / --fasta | Default : False

    Bool to indicate if reads are provided in FASTA format
    Example: -f

  • -U | Default : None

    Single-end read file name
    Example: -U read.fq.gz

  • -1 | Default : None

    Paired-end read file name 1
    Example: -1 read.1.fq.gz

  • -2 | Default : None

    Paired-end read file name 2
    Example: -2 read.2.fq.gz

  • --keep-alignemnt | Default : False

    Bool to keep the alignment BAM file if typing from FASTQ(A). If typing from a BAM file this is irrelevent.
    Example: --keep-alignment

  • --in-dir | Default : None

    Directory HISATgenotpye will search for FASTQ(A) files and batch process. HISAT-genotype will attempt to automatically pair the files if --single-end isn’t set. Try to have the names be similar between the pairs with a single difference to make it easier for HISAT-genotype to pair the files (e.g. hg_granulocyte_samp1_L.fastq and hg_granulocyte_samp1_R.fastq)
    Example: --in-dir input_fastq_dir

  • --out-dir | Default : /hisatgenotype_out

    Directory where all resulting files will be placed. This includes the typing results, BAM files, and assembly files.
    Example: --out-dir results

  • --bamfile | Default : None

    BAM file name if using already aligned reads
    Example: --bamfile hg_granulocyte_samp1.bam

  • --single-end | Default : False

    Bool to indicate if file(s) is/are single ended. Only needed with --in-dir or --bamfile
    Example: --single-end

  • --assembly | Default : disabled

    Perform assembly of each locus of interest
    Example: --assembly

  • --assembly-name | Default : assemply_graph

    Assembly base file name to use
    Example: --assembly-name hg_granulocyte

  • --assembly-verbose | Default : False

    Bool to output additional assembly information
    Example: --assembly-verbose

  • -p / --threads | Default : 1

    Number of threads to be used
    Example: -p 4

  • --verbose | Default : False

    Provide more information Example: --verbose

  • -h / --help | Default : False

    Output help message
    Example: --help

  • --advanced-help | Default : False

    Output help message for advanced options
    Example: --advanced-help

Advanced Options

  • --keep-extract | Default : False

    Bool to keep extracted read fastq files
    Example: --keep-extract

  • --build-base | Default : False

    Build the indexes listed in --base-fname

  • --aligner | Default : hisat2

    Set aligner to use (ex. hisat2, bowtie2)

  • --linear-index | Default : False

    Use linear index

  • --num-mismatch | Default : 0

    Maximum number of mismatches per read alignment to be considered

  • --inter-gap

    Maximum distance for variants to be in the same haplotype

  • --intra-gap

    Break a haplotype into several haplotypes

  • --whole-haplotype

    Include partial alleles (e.g. A_nuc.fasta)

  • --min-var-freq | Default : 0.0

    Exclude variants whose freq is below than this value in percentage

  • --ext-seq | Default : 0

    Length of extra sequences flanking backbone sequences

  • --leftshift | Default : False

    Shift deletions to the leftmost

  • --suffix | Default : fq.gz

    Read file suffix

  • --simulation | Default : False

    Simulated reads (Default: False)

  • --pp, --threads-aprocess | Default : 1

    Number of threads a process

  • --max-sample | Default : sys.maxint

    Number of samples to be extracted

  • --job-range | Default : 0,1

    two numbers (e.g. 1,3)

  • --extract-whole | Default : False

    Extract all reads

  • --no-partial | Default : False

    Include partial alleles (e.g. A_nuc.fasta)

  • --simulate-interval | Default : 10

    Reads simulated at every these base pairs

  • --read-len | Default : 100

    Length of simulated reads

  • --fragment-len | Default : `350

    Length of fragments

  • --best-alleles | Default : False

    Placeholder

  • --random-seed | Default : 1

    A seeding number for randomness

  • --num-editdist | Default : 2

    Maximum number of mismatches per read alignment to be considered

  • --perbase-errorrate | Default : 0.0

    Per basepair error rate in percentage when simulating

  • --perbase-snprate | Default : 0.0

    Per basepair SNP rate in percentage when simulating

  • --skip-fragment-regions | Default : None

    A comma-separated list of regions from which no reads originate, e.g., 500-600,1200-1400

  • --verbose-level | Default : 0

    also print some statistics to stderr

  • --no-error-correction | Default : False

    Correct sequencing errors

  • --only-locus-list | Default : (empty) all genes

    A comma-separated list of genes

  • --discordant | Default : False

    Allow discordantly mapped pairs or singletons

  • --type-primary-exons | Default : False

    Look at primary exons first

  • --keep-low-abundance-alleles | Default : False

    Do not remove alleles with low abundance while performing typing

  • --display-alleles | Default : None

    A comma-separated list of alleles to display in HTML

  • --debug | Default : None

    Test database or code
    (options: basic, pair, full, single-end, test_list, test_id)
    Example: --debug test_id:10,basic

hisatgenotype_toolkit - Individual Scripts and Tools for Custom Pipelines

Work In Progress to document

Usage:

$ hisatgenotype_toolkit <BASE_TOOL> [TOOL_OPTIONS]

build-genome

(hisatgenotype_build_genome.py)

call-variants

(hisatgenotype_call_variants.py)

convert-codis

(hisatgenotype_convert_codis.py)

extract-RBG

(hisatgenotype_extract_RBG.py)

extract-codis-data

(hisatgenotype_extract_codis_data.py)

extract-cyp-data

(hisatgenotype_extract_cyp_data.py)

extract-reads

(hisatgenotype_extract_reads.py)

extract-vars

(hisatgenotype_extract_vars.py)

legacy

(hisatgenotype_legacy.py)

locus

(hisatgenotype_locus.py)

locus-samples

(hisatgenotype_locus_samples.py)