Sleipnir
Seqs2Ngrams

Seqs2Ngrams reads an input FASTA file and computes pairwise sequence similarity scores between genes. It takes a deliberately simple approach, comparing each gene pair's vectors of k-mer counts, but could easily be modified to use a more sophisticated measure.

Usage

Basic Usage

 Seqs2Ngrams -i <sequences.fasta> -o <similarities.dab> -n <ngram>

Breaks the sequences in sequences.fasta into k-mers of size ngram, computes k-mer frequency counts for each gene, and writes gene pair similarities, based on comparisons of these frequency vectors, to similarities.dab.
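
The steps described above can be illustrated with a minimal Python sketch. It is not the Seqs2Ngrams implementation: the function names are hypothetical, and cosine similarity is an assumption chosen for illustration, since this page does not state which measure the tool applies to the k-mer count vectors.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kmer_counts(sequence, n):
    """Count every overlapping k-mer of length n in one sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors (an assumed measure)."""
    # Counter returns 0 for k-mers that are absent from b.
    dot = sum(count * b[kmer] for kmer, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a sequence shorter than n has no k-mers
    return dot / (norm_a * norm_b)


def pairwise_similarities(sequences, n):
    """Map each unordered gene pair to the similarity of its k-mer vectors."""
    counts = {gene: kmer_counts(seq, n) for gene, seq in sequences.items()}
    return {
        (g1, g2): cosine_similarity(counts[g1], counts[g2])
        for g1, g2 in combinations(sorted(counts), 2)
    }
```

Calling pairwise_similarities({"YAL001C": "ACGTACGT", "GENE2": "TTTTTTTT"}, 3), where GENE2 is a made-up identifier, yields one score per gene pair. Real input would be read from the FASTA file, and the scores written to a DAT/DAB file.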

Detailed Usage

package "Seqs2Ngrams"
version "1.0"
purpose "Convert sequence data to n-gram counts"

section "Main"
option  "input"     i   "Sequence input FASTA file"
                        string  typestr="filename"
option  "output"    o   "DAT/DAB output file"
                        string  typestr="filename"
option  "n"         n   "N-gram size"
                        int default="7"

section "Miscellaneous"
option  "normalize" m   "Normalize to the range [0,1]"
                        flag    off
option  "zscore"    z   "Convert values to z-scores"
                        flag    off
option  "genes"     g   "Gene inclusion file"
                        string  typestr="filename"

section "Optional"
option  "verbosity" v   "Message verbosity"
                        int default="5"

Flag  Default  Type             Description
-i    stdin    FASTA text file  Input FASTA file containing gene IDs and sequences. Identifiers after each > record header should consist solely of the unique gene ID without any additional information, e.g. > YAL001C.
-o    stdout   DAT/DAB file     Output DAT/DAB file containing pairwise sequence similarity scores calculated from the given sequences.
-n    7        Integer          Size of sequence n-grams to use when calculating pairwise similarity.
-m    off      Flag             If on, normalize output edges to the range [0,1].
-z    off      Flag             If on, normalize output edges to z-scores (subtract mean, divide by standard deviation).
-g    None     Text gene list   If given, use only genes in the list.
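
The two normalization flags can be sketched in Python as follows. This only illustrates the arithmetic described above and is not the tool's code: the function names are hypothetical, the scores are assumed to be a non-empty mapping from gene pairs to values, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def normalize_unit_range(scores):
    """Rescale scores linearly so the minimum maps to 0 and the maximum to 1."""
    values = list(scores.values())
    lo, hi = min(values), max(values)
    span = hi - lo
    # If all scores are equal there is no range to scale; return 0.0 for each.
    return {pair: (v - lo) / span if span else 0.0 for pair, v in scores.items()}


def to_zscores(scores):
    """Subtract the mean and divide by the standard deviation."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return {pair: (v - mu) / sigma if sigma else 0.0 for pair, v in scores.items()}
```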