Sleipnir: Data2Svm

Data2Svm learns a support vector machine (SVM) model that classifies individual genes as inside or outside (positive or negative) a given gene set. It constructs features for each example (gene) from the data in an input PCL file. It is similar to SVMer.

Usage

Basic Usage

 Data2Svm -i <data.pcl> -m <learned.svm> -g <context.txt> -G <holdout.txt>

Learn a support vector machine model (saved as learned.svm) using the microarray expression values in data.pcl as features, labeling the genes in context.txt as positive examples (and all other genes as negatives), and holding out the genes in holdout.txt from training. After learning, the predicted SVM classifications for all genes are written to standard output.
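The labeling scheme described above can be illustrated with a short Python sketch. This is not Data2Svm's actual code: the function name label_examples and the +1/-1 label encoding are assumptions made for illustration. It only shows how a positive gene list and a held-out gene list might partition the genes of a PCL into training examples and test genes.

```python
def label_examples(all_genes, positives, holdout=()):
    """Split genes into labeled training examples and held-out test genes.

    Hypothetical helper, not part of Sleipnir. Returns (train, test), where
    train maps each training gene to +1 (positive) or -1 (negative) and
    test lists the held-out genes in input order.
    """
    positives = set(positives)
    holdout = set(holdout)
    train = {}
    test = []
    for gene in all_genes:
        if gene in holdout:
            # Held-out genes are excluded from training, even if positive.
            test.append(gene)
        else:
            # Every gene not listed as positive is treated as a negative.
            train[gene] = 1 if gene in positives else -1
    return train, test


genes = ["YAL001C", "YAL002W", "YAL003W", "YAL005C"]
train, test = label_examples(
    genes, positives=["YAL001C", "YAL005C"], holdout=["YAL005C"]
)
print(train)
print(test)
```

In this sketch a gene that appears in both lists is treated as held out, so it never contributes to training.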

Detailed Usage

package "Data2Svm"
version "1.0"
purpose "SVM evaluation of data for GO term prediction"

section "Main"
option  "input"             i   "Data set to analyze (PCL)"
                                string  typestr="filename"  yes
option  "model"             m   "SVM model file"
                                string  typestr="filename"

section "Learning/Evaluation"
option  "genes"             g   "List of positive genes"
                                string  typestr="filename"
option  "genex"             G   "List of test genes"
                                string  typestr="filename"
option  "heldout"           l   "Evaluate only test genes"
                                flag    off
option  "random_features"   z   "Randomize input features"
                                flag    off
option  "random_output"     Z   "Randomize output values"
                                flag    off

section "SVM"
option  "cache"             e   "SVM cache size"
                                int default="40"
option  "kernel"            k   "SVM kernel function"
                                values="linear","poly","rbf"    default="linear"
option  "tradeoff"          C   "Classification tradeoff"
                                float
option  "gamma"             M   "RBF gamma"
                                float   default="1"
option  "degree"            d   "Polynomial degree"
                                int default="3"
option  "alphas"            a   "SVM alphas file"
                                string  typestr="filename"
option  "iterations"        t   "SVM iterations"
                                int default="100000"

section "Optional"
option  "normalize"         n   "Z-score normalize feature values"
                                flag    off
option  "skip"              s   "Columns to skip in input PCL"
                                int default="2"
option  "random"            r   "Seed random generator"
                                int default="0"
option  "verbosity"         v   "Message verbosity"
                                int default="5"

Flag	Default	Type	Description
-i	stdin	PCL text file	Input PCL file from which features will be drawn to construct SVM examples.
-m	stdout	SVM model file	Output learned SVM model.
-g	None	Gene text file	Set of genes to be labeled as positive examples.
-G	None	Gene text file	If given, set of genes to be held out of training and evaluated as test examples.
-l	off	Flag	If on, evaluate and output SVM predictions only for test genes; if off, evaluate all genes.
-z	off	Flag	If on, randomize input feature values within each row (gene).
-Z	off	Flag	If on, randomize output SVM prediction labels across all genes.
-e	40	Integer (MB)	SVM cache size in megabytes.
-k	linear	linear, poly, or rbf	SVM kernel type: linear, polynomial, or radial basis function.
-C	None	Float	SVM tradeoff between misclassification and margin; an appropriate default is calculated if no value is given.
-M	1	Float	Gamma parameter for RBF kernel.
-d	3	Integer	Degree parameter for polynomial kernel.
-a	None	Alphas file	If given, SVM Light alphas file used to initialize the SVM model.
-t	100000	Integer	Maximum number of iterations to run per SVM learning epoch.
-n	off	Flag	If on, normalize input feature values to z-scores (subtract mean, divide by standard deviation) before processing.
-s	2	Integer	Number of columns to skip between the initial ID column and the first experimental (data) column in the input PCL.
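
To make the -s and -n options concrete, the following Python sketch reads a small PCL, skips the columns between the ID column and the first data column, and z-scores each gene's values. It is an illustration under stated assumptions, not Data2Svm's implementation: the names read_pcl and zscore are hypothetical, normalizing within each row (gene) and using the sample standard deviation are assumptions, and the handling of an EWEIGHT row and of empty (missing) cells follows common PCL conventions rather than anything stated above.

```python
import io
import math


def read_pcl(handle, skip=2):
    """Parse a tab-delimited PCL into (experiment names, {gene: values}).

    skip is the number of columns between the ID column and the first
    data column (for example NAME and GWEIGHT when skip is 2).
    Empty cells are kept as None to mark missing values.
    """
    first = 1 + skip  # index of the first data column
    header = handle.readline().rstrip("\n").split("\t")
    experiments = header[first:]
    rows = {}
    for line in handle:
        cells = line.rstrip("\n").split("\t")
        if not cells[0] or cells[0] == "EWEIGHT":
            continue  # skip blank lines and the experiment-weight row
        rows[cells[0]] = [float(c) if c.strip() else None for c in cells[first:]]
    return experiments, rows


def zscore(values):
    """Subtract the mean and divide by the sample standard deviation.

    Missing values (None) are ignored in the statistics and preserved.
    """
    present = [v for v in values if v is not None]
    if len(present) < 2:
        return list(values)  # too few values to estimate a deviation
    mean = sum(present) / len(present)
    sd = math.sqrt(sum((v - mean) ** 2 for v in present) / (len(present) - 1))
    if sd == 0:
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mean) / sd for v in values]


PCL_TEXT = (
    "ID\tNAME\tGWEIGHT\texp1\texp2\texp3\n"
    "EWEIGHT\t\t\t1\t1\t1\n"
    "YAL001C\tTFC3\t1\t1.0\t2.0\t3.0\n"
    "YAL002W\tVPS8\t1\t0.5\t\t1.5\n"
)

experiments, rows = read_pcl(io.StringIO(PCL_TEXT), skip=2)
normalized = {gene: zscore(values) for gene, values in rows.items()}
print(experiments)
print(normalized["YAL001C"])
```

With skip set to 0, the NAME and GWEIGHT columns would be read as data, which is why the default of 2 matches the usual PCL layout.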