Sleipnir
Data2Features converts a collection of DAT/DAB or PCL files into a features file appropriate for use with the Weka machine learning package. It focuses on machine learning over entire datasets: each dataset's per-gene or per-gene-pair values are collapsed into a single summary feature. It is similar to Data2Bnt.
Data2Features -p <positives.txt> -e <features.txt> -d <data.txt> <data.pcl/dab>*
Outputs (to standard output) an XRFF feature file appropriate for use with Weka. Each example represents one dataset from the microarray PCL files data.pcl or DAT/DAB files data.dab, with features and their optional default values as specified in features.txt, per-dataset feature values from data.txt, and data values calculated as the average across pairs of genes listed in positives.txt.
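The "average across gene pairs" step can be illustrated with a minimal Python sketch. It is not Sleipnir's implementation. It assumes that a pair counts only when both genes appear in the positives list and that pairs missing from the dataset are skipped. The `scores` dictionary layout, the function name `average_positive_pairs`, and the gene names are hypothetical placeholders.

```python
from itertools import combinations
from statistics import mean


def average_positive_pairs(scores, positives):
    """Average the pairwise scores over all pairs of positive genes.

    scores: dict mapping frozenset({gene_a, gene_b}) -> float (hypothetical layout)
    positives: iterable of gene names
    Pairs without a score are skipped; returns None if no pair has a score.
    """
    genes = sorted(set(positives))
    values = [
        scores[frozenset(pair)]
        for pair in combinations(genes, 2)
        if frozenset(pair) in scores
    ]
    return mean(values) if values else None


# Toy dataset: three positive genes plus one gene outside the positive set.
scores = {
    frozenset({"YAL001C", "YAL002W"}): 0.8,
    frozenset({"YAL001C", "YAL003W"}): 0.4,
    frozenset({"YAL002W", "YAL003W"}): 0.6,
    frozenset({"YAL001C", "YBR001C"}): -0.9,  # one gene is not a positive, so ignored
}
positives = ["YAL001C", "YAL002W", "YAL003W"]

summary = average_positive_pairs(scores, positives)
print(round(summary, 6))
```

Under these assumptions, each input dataset contributes one such summary value to its example in the output file.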
package "Data2Features"
version "1.0"
purpose "Data transformation to feature sets for machine learning"
section "Main"
option "positives" p "Positive gene list"
string typestr="filename"
option "environment" e "List of environment features and default values"
string typestr="filename" yes
option "data" d "Feature values for each data set"
string typestr="filename" yes
section "Miscellaneous"
option "genome" g "SGD features file"
string typestr="filename"
section "PCL Processing"
option "distance" D "Similarity measure"
values="pearson","euclidean","kendalls","kolm-smir","spearman","pearnorm",
"hypergeom","innerprod","bininnerprod","quickpear","mi" default="pearnorm"
option "normalize" N "Normalize distances"
flag off
option "zscore" Z "Convert correlations to z-scores"
flag on
option "skip" S "PCL columns to skip after ID"
int default="2"
section "Optional"
option "memmap" m "Memory map input DABs"
flag off
option "verbosity" v "Message verbosity"
int default="5"
Flag | Default | Type | Description |
---|---|---|---|
None | None | PCL or DAT/DAB files | Input PCL or DAT/DAB files from which data is drawn for features in the output Weka file. |
-p | stdin | Gene text file | List of genes labeled as positives for machine learning; can be drawn from the same pathway/process/complex/GO term/etc. |
-e | None | Text file | Tab-delimited text file containing three columns: feature name, |-delimited feature values, and an optional default value. Lines starting with # are ignored as comments. |
-d | None | Text file | Tab-delimited text file containing one dataset per line. The first tab-delimited token of each line should be a dataset name, with all subsequent tokens of the form <feature name>|<feature value>. |
-g | None | SGD features text file | SGD_features.tab file; if given, process only genes appearing in this file. |
-D | pearnorm | pearson, euclidean, kendalls, kolm-smir, spearman, pearnorm, hypergeom, innerprod, bininnerprod, quickpear, or mi | Similarity measure to be used for converting PCL inputs into pairwise scores. |
-N | off | Flag | If on, normalize input edges to the range [0,1] before processing. |
-Z | on | Flag | If on, normalize input edges to z-scores (subtract mean, divide by standard deviation) before processing. |
-S | 2 | Integer | Number of columns to skip between the initial ID column and the first experimental (data) column in the input PCL. |
-m | off | Flag | If given, memory map the input files when possible. DAT and PCL inputs cannot be memmapped. |
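
The -e and -d file layouts described in the table can be read with a few lines of Python. The following is a sketch of one plausible reading, not the parser Data2Features uses. It assumes that a dataset missing a feature falls back to that feature's default from the -e file. The function names, feature names, and dataset names are all hypothetical.

```python
def parse_features(lines):
    """Parse -e style lines into {name: (allowed_values, default_or_None)}."""
    features = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        columns = line.split("\t")
        name = columns[0]
        values = columns[1].split("|") if len(columns) > 1 else []
        default = columns[2] if len(columns) > 2 and columns[2] else None
        features[name] = (values, default)
    return features


def parse_datasets(lines, features):
    """Parse -d style lines into {dataset: {feature: value}}, filling defaults."""
    datasets = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        tokens = line.split("\t")
        # Assumption: start from the defaults declared in the features file.
        row = {name: default for name, (_, default) in features.items()}
        for token in tokens[1:]:
            name, _, value = token.partition("|")  # <feature name>|<feature value>
            row[name] = value
        datasets[tokens[0]] = row
    return datasets


# Hypothetical file contents, given as lists of lines.
features_text = [
    "# name\tvalues\tdefault\n",
    "platform\taffy|agilent|cdna\tcdna\n",
    "channels\t1|2\n",
]
data_text = [
    "dataset_a\tplatform|affy\tchannels|1\n",
    "dataset_b\tchannels|2\n",
]

features = parse_features(features_text)
datasets = parse_datasets(data_text, features)
print(datasets["dataset_b"])
```

Here `dataset_b` does not list a `platform` value, so the sketch fills in the default `cdna` from the features file.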