Introduction

This is the old manual for versions earlier than 1.1. Please use this information for those versions only.

Scripts

hisatgenotype.py – Analysis of a whole human genome

The hisatgenotype.py python script will analyze a whole human genome using whole genome sequencing reads. It will align reads to genotype genome, extract reads belonging to each locus of interest, and perform typing and assembly.

Argument	Description	Default	Example
--base	Base file name for index, variants, haplotypes, etc.	genotype_genome	`--base genotype_genome`
--region-list	Comma separated list of regions	(empty) meaning every region available	`--region-list hla.A,hla.B,hla.C,codis.FGA,cyp.2D6`
-p / --threads	Number of threads to be used	1	`-p 4`
-U	Single-end read file name	None	`-U read.fq.gz`
-1	Paired-end read file name 1	None	`-1 read.1.fq.gz`
-2	Paired-end read file name 2	None	`-2 read.2.fq.gz`
-f / --fasta	Reads are provided in FASTA format	None	`-f`
--alignment-file	Sorted bam alignment file converted from HISAT2’s sam alignment output	None	`--alignment-file NA12878.bam`
--num-editdist	Number of maximum edit distance or mismatches allowed in the alignment, typing, and assembly of a read (not a pair)	2	`--num-editdist 0`
--assembly	Perform assembly of each locus of interest	disabled	`--assembly`
--verbose	Provide more information	disabled	`--verbose`
-h / --help	Output help message	disabled	`--help`

Examples:

$ hisatgenotype.py --base genotype_genome -p 4 -1 read.1.fq.gz -2 read.2.fq.gz
$ hisatgenotype.py --verbose --base genotype_genome -p 4 --region-list hla.A -1 read.1.fq.gz -2 read.2.fq.gz
$ hisatgenotype.py --base genotype_genome --alignment-file NA12878.bam  
$ hisatgenotype.py --base genotype_genome --region-list hla.A,hla.B,hla.C,hla.DQA1,hla.DQB1,hla.DRB1,codis,cyp.2D6 --alignment-file NA12878.bam

hisatgenotype_locus.py – Analysis of a specific gene, genomic region, or gene family

The hisatgenotype_locus.py python script will analyze a particular locus or a set of genes or genomic regions using extracted reads. As the name of the script (locus) implies, it will align reads to a part of genotype genome and perform typing and assembly.

Argument	Description	Default	Example
--base	Base file name for index, variants, haplotypes, etc.	hla	`--base codis`
--locus-list	Comma separated list of loci	(empty) meaning every locus available	`--locus-list A,B,C,DQA1,DQB1,DRB1`
-p / --threads	Number of threads to be used	1	`-p 4`
-U	Single-end read file name	None	`-U read.fq.gz`
-1	Paired-end read file name 1	None	`-1 read.1.fq.gz`
-2	Paired-end read file name 2	None	`-2 read.2.fq.gz`
-f / --fasta	Reads are provided in FASTA format	false	`-f`
--num-editdist	Number of maximum edit distance or mismatches allowed in the alignment, typing, and assembly of a read (not a pair)	2	`--num-editdist 0`
--assembly	Perform assembly of each locus of interest	disabled	`--assembly`
--genotype-genome	Base name for genotype genome, which the program will use instead of region-based small indexes	None	`--genotype-genome genotype_genome`
--verbose	Provide more information	disabled	`--verbose`
-h / --help	Output help message	disabled	`--help`

Examples:

 $ hisatgenotype_locus.py --base hla --locus-list A --num-editdist 2 --assembly -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz
 $ hisatgenotype_locus.py --base hla --locus-list A,B,C,DQA1,DQB1,DRB1 --num-editdist 2 -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz

For advanced users

In addition to analyzing real sequencing reads, the script comes with additional testing capabilities, which might come in handy when determining whether a problematic result is due to noise in sequencing reads or algorithms. To do so, the script can generate simulation reads with sequencing errors and SNPs, and test and evaluate the accuracy of its typing and assembly algorithms, with additional options described below.

Argument	Description	Default	Example
--read-len	Length of a simulated read	100	`--read-len 150`
--fragment-len	Length of a fragment	350	`--fragment-len 500`
--perbase-errorrate	Sequencing error rate per base in percentage	0.0 (%)	`--perbase-errorrate 0.2`
--perbase-snprate	SNP rate per base in percentage	0.0 (%)	`--perbase-snprate 0.1`
--skip-fragment-regions	Comma separated list of regions from which no reads originate	None	`--skip-fragment-regions 500-600,1200-1400`
--random-seed	Seeding number for random number generation	1	`--random-seed 10`
--debug	Comma separated list of debugging parameters	None	`--debug "pair,test_list:A03:01:01:05,A24:02:01:01"`

Examples:

 $ hisatgenotype_locus.py --base hla --locus-list A 
 $ hisatgenotype_locus.py --base hla --locus-list A --debug "pair"
 $ hisatgenotype_locus.py --base hla --locus-list B --assembly --debug "pair,test_list:A*03:01:01:05,A*24:02:01:01"
 $ hisatgenotype_locus.py --base hla --locus-list A --assembly --simulate-interval 10 --perbase-errorrate 0.3 --perbase-snprate 0.1 --num-editdist 2 --debug "pair,full"
 $ hisatgenotype_locus.py --base hla --locus-list A --assembly --simulate-interval 10 --perbase-errorrate 0.3 --perbase-snprate 0.1 --num-editdist 2 --debug "pair,full" --skip-fragment-regions 1000-1020 --random-seed 0

hisatgenotype_build_genome.py – Building a genotype genome

The hisatgenotype_build_genome.py python script will create a genotype genome using the human reference genome (GRCh38) and several genomic databases. A tutorial for the script can be found here.

Argument	Description	Default	Example
--base	Base file name for index, variants, haplotypes, etc.	genotype_genome	`--base genotype_genome`
--database-list	Comma separated list of databases	hla	`--database-list hla,codis,cyp`
--commonvar	Include common variants from dbSNP (about 13 million)	false	`--commonvar`
--clinvar	Include clinically relevant variants from NCBI ClinVar database	false	`--clinvar`
-p / --threads	Number of threads to be used	1	`-p 4`
--verbose	Provide more information	disabled	`--verbose`
-h / --help	Output help message	disabled	`--help`

Examples:

 $ hisatgenotype_build_genome.py -p 4 --base genotype_genome --database-list hla --commonvar
 $ hisatgenotype_build_genome.py -p 4 --base genotype_genome --database-list hla,codis,cyp --commonvar

hisatgenotype_extract_vars.py – Extraction of variants, haplotypes, etc.

This hisatgenotype_extract_vars.py python script is typically run by other scripts, though instructions are provided here for users who wish to directly run the script. The script will extract allele sequences, variants, haplotypes, etc. from multiple sequence alignments available in hisatgenotype_db. Tutorials for the script can be found at HLA typing and DNA fingerprinting anlaysis.

Argument	Description	Default	Example
--base	Base file name for index, variants, haplotypes, etc.	hla	`--base codis`
--locus-list	Comma separated list of loci	(empty) meaning every locus available	`--locus-list A,B,C,DQA1,DQB1,DRB1`
--min-var-freq	Exclude variants whose freq is below than this value in percentage	0.0 (%)	`--min-var-freq 0.1`
--leftshift	Shift deletions to the leftmost	disabled	`--leftmost`
--inter-gap	Maximum distance for variants to be in the same haplotype	30	`--inter-gap 30`
--intra-gap	Break a haplotype into several haplotypes	50	`--intra-gap 50`
--verbose	Provide more information	disabled	`--verbose`
-h / --help	Output help message	disabled	`--help`

Examples:

 $ hisatgenotype_extract_vars.py --base hla --locus-list A,B,C,DQA1,DQB1,DRB1 --min-var-freq 0.1
 $ hisatgenotype_extract_vars.py --base codis

hisatgenotype_extract_reads.py – Extraction of reads that belong to genomic regions of interest

The hisatgenotype_extract_reads.py python script will extract reads that belong to loci of interest from whole sequencing reads. The script can also be used to extract reads from many samples. A tutorial for the script can be found here.

Argument	Description	Default	Example
--base	Base file name for index, variants, haplotypes, etc.	genotype_genome	`--base genotype_genome`
--database-list	Comma separated list of databases	(empty) meaning every database available	`--database-list hla,codis,cyp`
-U	Single-end read file name	None	`-U read.fq.gz`
-1	Paired-end read file name 1	None	`-1 read.1.fq.gz`
-2	Paired-end read file name 2	None	`-2 read.2.fq.gz`
--read-dir	Name of directory where read files are placed	None	`--read-dir test_input`
--out-dir	Name of directory where extracted read files will be placed	None	`--out-dir test_output`
--suffix	Read file suffix name	fq.gz	`--suffix fastq.gz`
-f / --fasta	Reads are provided in FASTA format	false	`-f`
--num-editdist	Number of maximum edit distance or mismatches allowed in the alignment, typing, and assembly of a read (not a pair)	2	`--num-editdist 0`
-p / --threads	Number of threads to be used	1	`-p 4`
--max-sample	Number of samples to be extracted	sys.maxint	`--max-sample 917`
--job-range	Comma separated two numbers representing a range	0,1	`--job-range 1,4`
--verbose	Provide more information	disabled	`--verbose`
-h / --help	Output help message	disabled	`--help`

Examples:

 $ hisatgenotype_extract_reads.py --base genotype_genome --database-list hla,codis,cyp -1 test.1.fq.gz -2 test.2.fq.gz
 $ hisatgenotype_extract_reads.py --base genotype_genome --database-list hla,codis,cyp --read-dir test_input --out-dir test_output

hisatgenotype_locus_samples.py – Analysis of many samples

The hisatgenotype_locus_samples.py python script will analyze genomic regions of interest using reads that are extracted from using hisatgenotype_extract_reads.py. Please avoid directly running the script on whole genomic sequencing reads as doing so will lead to inaccurate results due to reasons partly described here. A tutorial for the script can be found at HLA typing. The script is designed to analyze many samples.

Argument	Description	Default	Example
--base	Base file name for index, variants, haplotypes, etc.	genotype_genome	`--base genotype_genome`
--region-list	Comma separated list of regions	(empty) meaning every region available	`--region-list hla.A,hla.B,hla.C,codis.FGA,cyp.2D6`
--read-dir	Name of directory where read files are placed	None	`--read-dir test_input`
--out-dir	Output directory name	None	`--out-dir test_output`
-f / --fasta	Reads are provided in FASTA format	false	`-f`
--num-editdist	Number of maximum edit distance or mismatches allowed in the alignment, typing, and assembly of a read (not a pair)	2	`--num-editdist 0`
-p / --threads	Number of threads to be used	1	`-p 4`
--assembly	Perform assembly of each locus of interest	disabled	`--assembly`
--verbose	Provide more information	disabled	`--verbose`
-h / --help	Output help message	disabled	`--help`

Examples:

 $ hisatgenotype_locus_samples.py -p 4 --region-list hla --read-dir test_input --out-dir test_output
 $ hisatgenotype_locus_samples.py -p 4 --region-list hla.A,hla.B --read-dir test_input --out-dir test_output
 $ hisatgenotype_locus_samples.py -p 4 --region-list codis --read-dir test_input --out-dir test_output

HISAT-genotype
next-generation genomic analysis software platform

Manual Pre v1.1

Introduction

Scripts

hisatgenotype.py – Analysis of a whole human genome

hisatgenotype_locus.py – Analysis of a specific gene, genomic region, or gene family

For advanced users

hisatgenotype_build_genome.py – Building a genotype genome

hisatgenotype_extract_vars.py – Extraction of variants, haplotypes, etc.

hisatgenotype_extract_reads.py – Extraction of reads that belong to genomic regions of interest

hisatgenotype_locus_samples.py – Analysis of many samples