Sleipnir
Seqs2Ngrams reads an input FASTA file and uses it to compute pairwise sequence similarity scores between genes. The approach is deliberately simple: each gene's sequence is reduced to a vector of k-mer counts, and each gene pair is scored by the similarity of its two vectors. It could easily be modified to do something more complex.
Seqs2Ngrams -i <sequences.fasta> -o <similarities.dab> -n <ngram>
Breaks the sequences in sequences.fasta into k-mers of size ngram, computes k-mer frequency counts for each gene, and writes gene pair similarities, based on comparisons of these frequency vectors, to similarities.dab.
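
The steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the tool's actual implementation: the function names (`parse_fasta`, `kmer_counts`, `pairwise_similarities`) are hypothetical, and cosine similarity is an assumption, since the documentation does not say which measure Seqs2Ngrams uses to compare the frequency vectors.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def parse_fasta(text):
    """Parse FASTA text into {gene_id: sequence} (hypothetical helper)."""
    sequences = {}
    gene_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The header is assumed to hold only the gene ID.
            gene_id = line[1:].strip()
            sequences[gene_id] = ""
        elif gene_id is not None:
            sequences[gene_id] += line.upper()
    return sequences


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def cosine_similarity(a, b):
    """Cosine similarity of two k-mer count vectors (assumed measure)."""
    dot = sum(count * b[kmer] for kmer, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a sequence shorter than k has no k-mers
    return dot / (norm_a * norm_b)


def pairwise_similarities(sequences, k):
    """Score every unordered gene pair by the similarity of its k-mer counts."""
    counts = {gene: kmer_counts(seq, k) for gene, seq in sequences.items()}
    return {
        (g1, g2): cosine_similarity(counts[g1], counts[g2])
        for g1, g2 in combinations(sorted(counts), 2)
    }
```

Calling `pairwise_similarities(parse_fasta(text), 7)` on the contents of a FASTA file would correspond roughly to running the tool with its default n-gram size, minus the DAT/DAB output step.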
package "Seqs2Ngrams"
version "1.0"
purpose "Convert sequence data to n-gram counts"
section "Main"
option "input" i "Sequence input FASTA file"
string typestr="filename"
option "output" o "DAT/DAB output file"
string typestr="filename"
option "n" n "N-gram size"
int default="7"
section "Miscellaneous"
option "normalize" m "Normalize to the range [0,1]"
flag off
option "zscore" z "Convert values to z-scores"
flag off
option "genes" g "Gene inclusion file"
string typestr="filename"
section "Optional"
option "verbosity" v "Message verbosity"
int default="5"
Flag | Default | Type | Description |
---|---|---|---|
-i | stdin | FASTA text file | Input FASTA file containing gene IDs and sequences. Identifiers after each > record header should consist solely of the unique gene ID without any additional information, e.g. > YAL001C. |
-o | stdout | DAT/DAB file | Output DAT/DAB file containing pairwise sequence similarity scores calculated from the given sequences. |
-n | 7 | Integer | Size of sequence n-grams to use when calculating pairwise similarity. |
-m | off | Flag | If on, normalize output edges to the range [0,1]. |
-z | off | Flag | If on, normalize output edges to z-scores (subtract mean, divide by standard deviation). |
-g | None | Text gene list | If given, use only genes in the list. |
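
The three post-processing options in the table (normalization to [0,1], z-scoring, and the gene inclusion list) can be illustrated with a short Python sketch. All function names here are hypothetical, and several details are assumptions rather than documented behavior: min-max scaling for the [0,1] normalization, the population standard deviation for z-scores, and the rule that a pair is kept only when both of its genes appear in the list.

```python
from statistics import mean, pstdev


def normalize_unit_range(scores):
    """Rescale edge scores to [0, 1] (assumed to be min-max scaling)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pair: 0.0 for pair in scores}  # avoid division by zero
    return {pair: (value - lo) / (hi - lo) for pair, value in scores.items()}


def to_zscores(scores):
    """Subtract the mean and divide by the standard deviation.

    Uses the population standard deviation; this choice is an assumption.
    """
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {pair: 0.0 for pair in scores}
    return {pair: (value - mu) / sigma for pair, value in scores.items()}


def filter_genes(scores, genes):
    """Keep only pairs whose genes are both in the inclusion list (assumed rule)."""
    allowed = set(genes)
    return {
        pair: value
        for pair, value in scores.items()
        if pair[0] in allowed and pair[1] in allowed
    }
```

Each function takes a dictionary mapping gene pairs to scores, such as `{("YAL001C", "YAL002W"): 0.42}`, and returns a new dictionary without modifying its input.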