Sleipnir: Data2Features

Data2Features converts a collection of DAT/DAB or PCL files into a features file appropriate for use with the Weka machine learning package. Data2Features focuses on machine learning about entire datasets, collapsing each dataset's collection of per-gene or per-gene-pair values into a single summary feature. It is similar to Data2Bnt.

Usage

Basic Usage

 Data2Features -p <positives.txt> -e <features.txt> -d <data.txt> <data.pcl/dab>*

Outputs (to standard output) an XRFF feature file appropriate for use with Weka. Each example represents one dataset, drawn from the microarray PCL files data.pcl or the DAT/DAB files data.dab. Features and their default values are specified in features.txt, per-dataset feature values are given in data.txt, and the summary value for each dataset is calculated as the average across gene pairs in positives.txt.
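
The averaging step described above can be illustrated with a short Python sketch. This is not Sleipnir's implementation. It assumes that "gene pairs in positives.txt" means every pair in which both genes are on the positive list, and that pairs with no score in the dataset are skipped. The function name, the gene identifiers, and the scores are hypothetical.

```python
from itertools import combinations
from statistics import mean


def average_positive_pair_score(scores, positives):
    """Average the pairwise scores over all pairs of positive genes.

    scores:    dict mapping frozenset({gene_a, gene_b}) -> float
    positives: iterable of gene identifiers
    Returns None when no positive pair has a score (assumed behavior).
    """
    values = [
        scores[frozenset(pair)]
        for pair in combinations(sorted(set(positives)), 2)
        if frozenset(pair) in scores
    ]
    return mean(values) if values else None


# Toy dataset with three scored gene pairs (hypothetical values).
scores = {
    frozenset({"YAL001C", "YAL002W"}): 0.5,
    frozenset({"YAL001C", "YAL003W"}): 1.5,
    frozenset({"YAL002W", "YAL004C"}): 9.0,  # YAL004C is not a positive
}
positives = ["YAL001C", "YAL002W", "YAL003W"]

# Only the first two pairs are positive-positive and scored.
feature_value = average_positive_pair_score(scores, positives)
```

In this sketch one such value would be computed per input dataset, giving one example per dataset in the output file.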

Detailed Usage

package "Data2Features"
version "1.0"
purpose "Data transformation to feature sets for machine learning"

section "Main"
option  "positives"     p   "Positive gene list"
                            string  typestr="filename"
option  "environment"   e   "List of environment features and default values"
                            string  typestr="filename"  yes
option  "data"          d   "Feature values for each data set"
                            string  typestr="filename"  yes

section "Miscellaneous"
option  "genome"        g   "SGD features file"
                            string  typestr="filename"

section "PCL Processing"
option  "distance"      D   "Similarity measure"
                            values="pearson","euclidean","kendalls","kolm-smir","spearman","pearnorm",
                            "hypergeom","innerprod","bininnerprod","quickpear","mi" default="pearnorm"
option  "normalize"     N   "Normalize distances"
                            flag    off
option  "zscore"        Z   "Convert correlations to z-scores"
                            flag    on
option  "skip"          S   "PCL columns to skip after ID"
                            int default="2"

section "Optional"
option  "memmap"        m   "Memory map input DABs"
                            flag    off
option  "verbosity"     v   "Message verbosity"
                            int default="5"

Flag	Default	Type	Description
None	None	PCL or DAT/DAB files	Input PCL or DAT/DAB files from which data is drawn for features in the output Weka file.
-p	stdin	Gene text file	List of genes labeled as positives for machine learning; can be drawn from the same pathway/process/complex/GO term/etc.
-e	None	Text file	Tab-delimited text file containing three columns: feature name, |-delimited feature values, and an optional default value. Lines starting with # are ignored as comments.
-d	None	Text file	Tab-delimited text file containing one dataset per line. The first tab-delimited token of each line should be a dataset name, with all subsequent tokens of the form <feature name>|<feature value>.
-g	None	SGD features text file	SGD_features.tab file; if given, process only genes appearing in this file.
-D	pearnorm	pearson, euclidean, kendalls, kolm-smir, spearman, pearnorm, hypergeom, innerprod, bininnerprod, quickpear, or mi	Similarity measure to be used for converting PCL inputs into pairwise scores.
-N	off	Flag	If on, normalize input edges to the range [0,1] before processing.
-Z	on	Flag	If on, normalize input edges to z-scores (subtract mean, divide by standard deviation) before processing.
-S	2	Integer	Number of columns to skip between the initial ID column and the first experimental (data) column in the input PCL.
-m	off	Flag	If given, memory map the input files when possible. DAT and PCL inputs cannot be memory mapped.
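
To make the -e and -d file formats concrete, the following Python sketch parses both and fills in defaults for features that a dataset omits. It is illustrative only and is not how Data2Features itself reads these files. It assumes the value delimiter is a literal | character and that blank lines can be skipped. The function names, feature names (platform, timecourse), and dataset names are hypothetical.

```python
def parse_environment(lines):
    """Parse -e lines: name <TAB> value1|value2|... [<TAB> default]."""
    features = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines (assumed) and # comments
        fields = line.split("\t")
        features[fields[0]] = {
            "values": fields[1].split("|"),
            # The third column is optional.
            "default": fields[2] if len(fields) > 2 else None,
        }
    return features


def parse_data(lines, features):
    """Parse -d lines: dataset <TAB> feature|value <TAB> feature|value ..."""
    datasets = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        name, *tokens = line.split("\t")
        # Start every dataset from the per-feature defaults.
        row = {feat: spec["default"] for feat, spec in features.items()}
        for token in tokens:
            feat, _, value = token.partition("|")
            row[feat] = value
        datasets[name] = row
    return datasets


# Hypothetical file contents, one string per line.
env_lines = [
    "# feature\tvalues\tdefault",
    "platform\taffy|cdna\tcdna",
    "timecourse\tyes|no",
]
data_lines = [
    "dataset_a\tplatform|affy\ttimecourse|yes",
    "dataset_b\ttimecourse|no",
]

features = parse_environment(env_lines)
datasets = parse_data(data_lines, features)
```

Here dataset_b does not list a platform, so the sketch assigns it the default from the environment file (cdna).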