Introduction

SeqWho has two major functions: model training and file classification. In both cases, SeqWho functions by k-merizing the reads in FASTQ(A) files and counting the 1-5mers to build a frequency array. SeqWho can then utilize either a full HISAT2 repeat index during training or the internal repeat index in the SeqWho.ix file during file classification. These features are used to train a Random Forest classifier, which is the core mode of classification. Theoretically, there is no limit to the number of classifications possible in SeqWho; in practice, it depends mostly on having enough data to properly train the models.
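
As a rough illustration of the k-merizing step described above, the following Python sketch counts all 1- to 5-mers in a few reads and turns the counts into a frequency array. The helper name kmer_frequencies, the fixed A/C/G/T alphabet, and the per-size normalization are assumptions made for this example; SeqWho's internal implementation may differ.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(reads, max_k=5):
    """Return (labels, freqs): relative frequency of every 1..max_k-mer.

    Hypothetical helper for illustration only, not SeqWho's own code.
    K-mers containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for k in range(1, max_k + 1):
            for i in range(len(read) - k + 1):
                counts[read[i:i + k]] += 1

    labels, freqs = [], []
    for k in range(1, max_k + 1):
        kmers = ["".join(p) for p in product("ACGT", repeat=k)]
        total = sum(counts[kmer] for kmer in kmers)
        for kmer in kmers:
            labels.append(kmer)
            # Assumption: normalize within each k-mer size so each size sums to 1.
            freqs.append(counts[kmer] / total if total else 0.0)
    return labels, freqs


labels, freqs = kmer_frequencies(["ACGTACGT", "GGGTTTAN"])
print(freqs[labels.index("A")])  # 0.2 (3 of the 15 unambiguous bases are A)
```

A fixed ordering of the k-mer labels keeps the array comparable across files, which is what a classifier trained on these features would need.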


SeqWho Set-up

Requirements

SeqWho is written in Python 3 and will not function in Python 2. While it is certainly possible to manually set up Python 3 to use SeqWho, we recommend using a conda environment built from the environment.yml included with SeqWho for optimal performance. This is the method we will describe here.

  • Python >3
  • Conda >4.7

Install - (Mac/Linux/Windows)

SeqWho provides two installation methods:

  • PIP
  • Downloading SeqWho from GitHub

PIP

We recommend installing SeqWho in a separate conda environment.

conda create -n seqwho_v1 python=3.7 pip
conda activate seqwho_v1
pip install seqwho

Downloading SeqWho and setting up conda environment

This download example will place SeqWho in your home (~) directory if you are using a Linux system. If this is not the behavior you want, change the ~ to whichever directory you desire.

git clone https://github.com/DaehwanKimLab/seqwho ~/seqwho

cd ~/seqwho

conda env create -f base_conda.yml

Adding SeqWho to PATH

Add the SeqWho script directory to your PATH environment variable (e.g. in ~/.bashrc) to make the SeqWho Python scripts available everywhere:

$ export PATH=~/seqwho/seqwho_lib:$PATH

Running SeqWho

We have separated this section into model building and file classification.

Please activate the conda environment before running SeqWho:

$ conda activate seqwho_v1

Model Building

Usage:

$ seqwho_buildindex.py -r [REPEAT FILES] -l [LABEL FILE] [OPTIONS]

Note: this build step requires that all training files and supporting files be in the current working directory where seqwho_buildindex.py is run.

Required Arguments

  • -r / --repeats | Default : None

    A comma-separated list of repeat indices to use, typically from HISAT2.
    Example: -r mouse.rep,human.rep

  • -l / --labels | Default : None

    CSV file with labels and file names for training. Format: file_name,species,sequence_type
    Example: -l labels.csv
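
To make the label file format concrete, here is a minimal Python sketch that writes a labels.csv with one file_name,species,sequence_type line per training file. The file names are hypothetical, and writing the file without a header row is an assumption; check it against your SeqWho version before training.

```python
import csv

# Hypothetical training files; replace with your own file names and labels.
rows = [
    ("mouse_sample1.fastq.gz", "mouse", "rnaseq"),
    ("human_sample1.fastq.gz", "human", "whole_genome_sequencing"),
]

# Assumption: no header row, one line per training file.
with open("labels.csv", "w", newline="") as handle:
    csv.writer(handle).writerows(rows)
```

The resulting file can then be passed as -l labels.csv.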

Optional Arguments

  • -m / --mask | Default : (empty)

    A comma-separated list of any file types to omit from the training.
    Example: -m whole_genome_sequencing,rnaseq

  • -k / --ksize | Default : 5

    Set the maximum size of the k-mers to count.
    Example: -k 5 counts 1- to 5-mers

  • -j / --repeat-kmer | Default : 31

    Size of repeat k-mers to use. Default is 31-mer
    Example: -j 31

  • -o / --out | Default : SeqWho.ix

    Name of the output index file.
    Example: -o SeqWho

  • --rebuild | Default : FALSE

    Overwrite any existing index that has the same name as the one given with -o

  • -v / --verbose | Default : FALSE

    Show output messages
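
To give a feel for how -k affects the size of the frequency array, the sketch below counts how many distinct 1- to k-mers exist over the four bases A, C, G, and T. It assumes every possible k-mer gets its own slot, which may not match how SeqWho lays out its array internally; the function name feature_count is hypothetical.

```python
def feature_count(max_k, alphabet_size=4):
    """Number of distinct k-mers of sizes 1..max_k over the given alphabet."""
    return sum(alphabet_size ** k for k in range(1, max_k + 1))


for k in (3, 5, 7):
    print(k, feature_count(k))
# 3 84
# 5 1364
# 7 21844
```

The count grows by roughly a factor of four with each increase in k, so larger values make the array much bigger for every file processed.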

Generate Repeat Files for seqwho_buildindex.py

To generate the repeat files for seqwho_buildindex.py, install HISAT2 and pass the genome file to hisat2-repeat.
Example:

$ hisat2-repeat mouse_genome.fa mouse

The file mouse.rep.100-65535.seed is the repeat file to pass to the -r option of seqwho_buildindex.py.

File Calling

Usage:

$ seqwho.py -x [SEQWHO INDEX] -f [FILE(S) ... ] [OPTIONS]

Required Arguments

  • -x / --index | Default : None

    Path to and name of the SeqWho index file.
    Example: -x SeqWho.ix

  • -f / --files | Default : None

    A space-separated list of files, a single file, or a glob pattern to classify. The files can be gzipped or uncompressed. Paired-end reads should be treated as single-end reads.
    Example: -f file1.fq.gz file2.fq.gz or -f ~/files/*.fq

Optional Arguments

  • -o / --out | Default : SeqWho_call

    Directory name for the output calls of SeqWho.
    Example: -o SeqWho_call

Results and output files

SeqWho generates three kinds of files:

1) A PNG containing four plots: the first with single-nucleotide quality score counts, the second with average quality score by position, the third with the count of reads of a given length, and the fourth a heat map of nucleotide frequency by position in the read.

2) A JSON file with all of the data needed to generate the plots in 1, and the following data:

  • Estimated Read Number: Number of reads estimated to be in the file
  • File format: File format (fasta or fastq)
  • Biased 5’ end 6-mers: Any 6-mers at the 5’ end that are more frequent than expected by chance
  • Mean Quality: Average quality of bases
  • Mean Read Len: Average length of reads
  • Perc PolyA: Percent reads that are poly-A
  • Perc GC Cont: Percent GC content
  • Perc Reads w N: Percent of reads with an N
  • Reads Passed: Number of reads passing into model
  • Reads Omitted: Number of reads filtered out because they are low quality or poly-A
  • Perc Passed: Percent of reads that passed to model testing
  • Call: Call of the file

3) A TSV file with all of the information in 2 above

A PNG is generated for each file tested and placed in a separate results folder, while only one JSON file and one TSV file are generated, with all of the results embedded.
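
As one way to work with the combined TSV, the Python sketch below flags files where only a small share of reads passed into the model. It runs on a made-up in-memory sample rather than a real output file: the "File" column name, the call values, and the 50% cutoff are all assumptions, and only "Perc Passed" and "Call" are taken from the field list above. Adjust the column names to match the header of your actual TSV.

```python
import csv
import io


def low_pass_files(handle, min_perc_passed=50.0):
    """Yield (file, call) for rows whose 'Perc Passed' is below the cutoff.

    Hypothetical helper; the 'File' column name is an assumption.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["Perc Passed"]) < min_perc_passed:
            yield row["File"], row["Call"]


# Made-up sample standing in for the real TSV; values are illustrative only.
sample = io.StringIO(
    "File\tPerc Passed\tCall\n"
    "a.fq.gz\t97.5\texample_call_1\n"
    "b.fq.gz\t12.0\texample_call_2\n"
)
flagged = list(low_pass_files(sample))
print(flagged)
```

To use it on real output, replace the sample with an open file handle pointing at the TSV in your output directory.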