Sleipnir
Data2Features converts a collection of DAT/DAB or PCL files into a features file appropriate for use with the Weka machine learning package. Data2Features focuses on machine learning about entire datasets: it collapses each dataset's collection of per-gene or per-gene-pair values into a single summary feature. It is similar to Data2Bnt.
Data2Features -p <positives.txt> -e <features.txt> -d <data.txt> <data.pcl/dab>*
Outputs (to standard output) an XRFF feature file appropriate for use with Weka. Each example represents one dataset from the microarray PCL files data.pcl or the DAT/DAB files data.dab, with features and default values as specified in features.txt, per-dataset feature values from data.txt, and a summary value calculated as the average across gene pairs in positives.txt.
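
The averaging step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Sleipnir's implementation: the function name `average_positive_score`, the dictionary-of-pairs representation, and the gene names are hypothetical, and pairs without a score are assumed to be skipped.

```python
from itertools import combinations
from statistics import mean


def average_positive_score(scores, positives):
    """Mean score over all unordered pairs of positive genes that have a value.

    `scores` maps frozenset({gene_a, gene_b}) to a float.
    Pairs missing from `scores` are skipped (an assumption of this sketch).
    Returns None when no positive pair has a score.
    """
    values = [
        scores[frozenset(pair)]
        for pair in combinations(sorted(set(positives)), 2)
        if frozenset(pair) in scores
    ]
    return mean(values) if values else None


# Hypothetical toy dataset: three scored gene pairs.
toy_scores = {
    frozenset({"YFG1", "YFG2"}): 0.8,
    frozenset({"YFG1", "YFG3"}): 0.4,
    frozenset({"YFG2", "YFG4"}): -0.2,  # YFG4 is not a positive, so this pair is ignored
}
toy_positives = ["YFG1", "YFG2", "YFG3"]

# One summary value for the whole dataset.
dataset_value = average_positive_score(toy_scores, toy_positives)
```

Only the two pairs made up entirely of positive genes contribute here, so the dataset is summarized by the mean of their two scores.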
package "Data2Features"
version "1.0"
purpose "Data transformation to feature sets for machine learning"

section "Main"
option "positives" p "Positive gene list"
    string typestr="filename"
option "environment" e "List of environment features and default values"
    string typestr="filename" yes
option "data" d "Feature values for each data set"
    string typestr="filename" yes

section "Miscellaneous"
option "genome" g "SGD features file"
    string typestr="filename"

section "PCL Processing"
option "distance" D "Similarity measure"
    values="pearson","euclidean","kendalls","kolm-smir","spearman","pearnorm",
    "hypergeom","innerprod","bininnerprod","quickpear","mi" default="pearnorm"
option "normalize" N "Normalize distances"
    flag off
option "zscore" Z "Convert correlations to z-scores"
    flag on
option "skip" S "PCL columns to skip after ID"
    int default="2"

section "Optional"
option "memmap" m "Memory map input DABs"
    flag off
option "verbosity" v "Message verbosity"
    int default="5"
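
The two score transformations named by the normalize and zscore options (rescaling to the range [0,1], and subtracting the mean then dividing by the standard deviation) can be sketched as follows. This is an illustration only: the function names are hypothetical, min-max scaling and the population standard deviation are assumptions of the sketch, and Sleipnir's own handling of missing values and edge cases may differ.

```python
from statistics import mean, pstdev


def normalize_unit_range(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale (sketch-specific choice).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_zscore(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Zero variance: every z-score is defined as 0 in this sketch.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical edge scores.
unit = normalize_unit_range([2.0, 4.0, 6.0])
z = normalize_zscore([1.0, 2.0, 3.0])
```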
| Flag | Default | Type | Description |
|---|---|---|---|
| None | None | DAT/DAB or PCL files | Input DAT/DAB or PCL files from which data is drawn for features in the output Weka file. |
| -p | stdin | Gene text file | List of genes labeled as positives for machine learning; can be drawn from the same pathway/process/complex/GO term/etc. |
| -e | None | Text file | Tab-delimited text file containing three columns: feature name, `\|`-delimited feature values, and an optional default value. Lines starting with # are ignored as comments. |
| -d | None | Text file | Tab-delimited text file containing one dataset per line. The first tab-delimited token of each line should be a dataset name, with all subsequent tokens of the form `<feature name>\|<feature value>`. |
| -g | None | SGD features text file | SGD_features.tab file; if given, process only genes appearing in this file. |
| -D | pearnorm | pearson, euclidean, kendalls, kolm-smir, spearman, pearnorm, hypergeom, innerprod, bininnerprod, quickpear, or mi | Similarity measure to be used for converting PCL inputs into pairwise scores. |
| -N | off | Flag | If on, normalize input edges to the range [0,1] before processing. |
| -Z | on | Flag | If on, normalize input edges to z-scores (subtract mean, divide by standard deviation) before processing. |
| -S | 2 | Integer | Number of columns to skip between the initial ID column and the first experimental (data) column in the input PCL. |
| -m | off | Flag | If given, memory map the input files when possible. DAT and PCL inputs cannot be memmapped. |
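
The -e and -d file formats described in the table can be illustrated with a small Python parser. This is a sketch under assumptions, not part of Sleipnir: the function names, feature names, and dataset names are hypothetical, and applying the features-file default to any feature a dataset line omits is this sketch's reading of the format.

```python
def parse_features(lines):
    """Parse -e style lines into {feature name: (allowed values, default or None)}."""
    features = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        columns = line.split("\t")
        name = columns[0]
        values = columns[1].split("|")
        # The third column (default value) is optional.
        default = columns[2] if len(columns) > 2 and columns[2] else None
        features[name] = (values, default)
    return features


def parse_datasets(lines, features):
    """Parse -d style lines into {dataset name: {feature name: value}}."""
    datasets = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, *tokens = line.split("\t")
        # Start from the defaults, then override with the values given on the line.
        row = {feat: default for feat, (_, default) in features.items()}
        for token in tokens:
            feat, _, value = token.partition("|")
            row[feat] = value
        datasets[name] = row
    return datasets


# Hypothetical file contents.
features_text = [
    "# name\tvalues\tdefault",
    "platform\taffy|agilent\taffy",
    "condition\tstress|cellcycle",
]
data_text = [
    "dataset1\tplatform|agilent\tcondition|stress",
    "dataset2\tcondition|cellcycle",
]

features = parse_features(features_text)
datasets = parse_datasets(data_text, features)
```

In this toy input, `dataset2` does not list a platform, so the sketch fills in the default `affy` from the features file.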