Sleipnir
Data2Features

Data2Features converts a collection of DAT/DAB or PCL files into a features file appropriate for use with the Weka machine learning package. It treats entire datasets as the unit of machine learning, collapsing each dataset's per-gene or per-gene-pair values into a single summary feature. It is similar to Data2Bnt.

Usage

Basic Usage

 Data2Features -p <positives.txt> -e <features.txt> -d <data.txt> <data.pcl/dab>*

Writes an XRFF feature file appropriate for use with Weka to standard output. Each example represents one dataset, taken from the microarray PCL files data.pcl or the DAT/DAB files data.dab. The features and their optional defaults are specified in features.txt, the per-dataset feature values come from data.txt, and the summary values are calculated as the average across gene pairs in positives.txt.
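As a rough illustration of the averaging step described above, the sketch below collapses one dataset's pairwise scores into a single value by averaging over every pair of genes in a positive list. It is not Sleipnir's implementation: the function name summarize_dataset, the in-memory dictionary standing in for a DAT/DAB file, and the gene names are all assumptions made for this example, and pairs without a score are simply skipped.

```python
from itertools import combinations
from statistics import mean


def summarize_dataset(pair_scores, positives):
    """Average the scores of all scored pairs among the positive genes.

    pair_scores maps frozenset({gene_a, gene_b}) -> float; it is a
    hypothetical in-memory stand-in for one DAT/DAB file.
    Returns None when no positive pair has a score.
    """
    values = [
        pair_scores[frozenset(pair)]
        for pair in combinations(sorted(set(positives)), 2)
        if frozenset(pair) in pair_scores
    ]
    return mean(values) if values else None


# Hypothetical scores for one dataset.
scores = {
    frozenset({"G1", "G2"}): 0.75,
    frozenset({"G1", "G3"}): 0.25,
    frozenset({"G2", "G4"}): -0.5,  # G4 is not a positive, so this pair is ignored
}
summary = summarize_dataset(scores, ["G1", "G2", "G3"])
print(summary)
```

Only the two scored pairs among G1, G2, and G3 contribute to the result; the unscored pair G2/G3 and the pair involving G4 do not.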

Detailed Usage

package "Data2Features"
version "1.0"
purpose "Data transformation to feature sets for machine learning"

section "Main"
option  "positives"     p   "Positive gene list"
                            string  typestr="filename"
option  "environment"   e   "List of environment features and default values"
                            string  typestr="filename"  yes
option  "data"          d   "Feature values for each data set"
                            string  typestr="filename"  yes

section "Miscellaneous"
option  "genome"        g   "SGD features file"
                            string  typestr="filename"

section "PCL Processing"
option  "distance"      D   "Similarity measure"
                            values="pearson","euclidean","kendalls","kolm-smir","spearman","pearnorm",
                            "hypergeom","innerprod","bininnerprod","quickpear","mi" default="pearnorm"
option  "normalize"     N   "Normalize distances"
                            flag    off
option  "zscore"        Z   "Convert correlations to z-scores"
                            flag    on
option  "skip"          S   "PCL columns to skip after ID"
                            int default="2"

section "Optional"
option  "memmap"        m   "Memory map input DABs"
                            flag    off
option  "verbosity"     v   "Message verbosity"
                            int default="5"
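
For intuition about the normalize (-N) and zscore (-Z) options, here is a minimal sketch of the two rescalings: mapping edge values to the range [0,1], and converting them to z-scores by subtracting the mean and dividing by the standard deviation. The function names are hypothetical. Min-max scaling and the population standard deviation are assumptions, since this page does not say which forms Sleipnir uses.

```python
from statistics import mean, pstdev


def to_unit_range(values):
    """Min-max scale values to [0, 1] (an assumed form of the -N rescaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def to_zscores(values):
    """Subtract the mean and divide by the standard deviation (as for -Z)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation: an assumption
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


edges = [2.0, 4.0, 6.0]  # hypothetical edge weights
unit = to_unit_range(edges)
z = to_zscores(edges)
print(unit)
print(z)
```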

Flag	Default	Type	Description
None	None	DAT/DAB files	Input DAT/DAB files from which data is drawn for features in the output Weka file.
-p	stdin	Gene text file	List of genes labeled as positives for machine learning; can be drawn from the same pathway/process/complex/GO term/etc.
-e	None	Text file	Tab-delimited text file containing three columns: feature name, |-delimited feature values, and an optional default value. Lines starting with # are ignored as comments.
-d	None	Text file	Tab-delimited text file containing one dataset per line. The first tab-delimited token of each line should be a dataset name, with all subsequent tokens of the form <feature name>|<feature value>.
-g	None	SGD features text file	SGD_features.tab file; if given, process only genes appearing in this file.
-D	pearnorm	pearson, euclidean, kendalls, kolm-smir, spearman, pearnorm, hypergeom, innerprod, bininnerprod, quickpear, or mi	Similarity measure to be used for converting PCL inputs into pairwise scores.
-N	off	Flag	If on, normalize input edges to the range [0,1] before processing.
-Z	on	Flag	If on, normalize input edges to z-scores (subtract mean, divide by standard deviation) before processing.
-S	2	Integer	Number of columns to skip between the initial ID column and the first experimental (data) column in the input PCL.
-m	off	Flag	If given, memory map the input files when possible. DAT and PCL inputs cannot be memmapped.
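
The two text formats accepted by -e and -d can be illustrated with a short parsing sketch. This is not Sleipnir's parser: the function names, the sample feature names, and the sample values are hypothetical. Falling back to the default from the -e file when a dataset omits a feature is also an assumption about how the defaults are applied.

```python
def parse_features(lines):
    """Parse -e style lines: name <TAB> v1|v2|... [<TAB> default]."""
    features = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        default = cols[2] if len(cols) > 2 and cols[2] else None
        features[cols[0]] = {"values": cols[1].split("|"), "default": default}
    return features


def parse_datasets(lines):
    """Parse -d style lines: dataset <TAB> feature|value <TAB> ..."""
    datasets = {}
    for line in lines:
        tokens = line.rstrip("\n").split("\t")
        if not tokens[0]:
            continue  # skip blank lines
        # A token without "|" raises ValueError in this sketch.
        datasets[tokens[0]] = dict(t.split("|", 1) for t in tokens[1:] if t)
    return datasets


def resolve(features, dataset_values):
    """Take each feature from the dataset, else its default (assumed behavior)."""
    return {
        name: dataset_values.get(name, spec["default"])
        for name, spec in features.items()
    }


# Hypothetical file contents, one string per line.
feature_lines = [
    "# name\tvalues\tdefault",
    "platform\taffy|cdna\tcdna",
    "timecourse\tyes|no",
]
dataset_lines = [
    "data1\tplatform|affy\ttimecourse|yes",
    "data2\ttimecourse|no",
]

features = parse_features(feature_lines)
datasets = parse_datasets(dataset_lines)
resolved = {name: resolve(features, vals) for name, vals in datasets.items()}
print(resolved)
```

Here data2 does not list a platform, so the sketch fills in the default cdna from the features file.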