Sleipnir
KNNImputer imputes missing values in microarray data as described in Troyanskaya et al. 2001. Given an input PCL, for each gene with missing values, it finds a configurable number of nearest neighbors (by a configurable similarity measure) and replaces each missing value with a weighted average of the corresponding values in those neighbors. KNNImputer can optionally remove genes with too many missing values to impute.
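The procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Sleipnir's implementation: the function names `euclidean` and `knn_impute` are hypothetical, missing values are represented as `None`, the distance is averaged over the columns both genes share, and neighbors are weighted by inverse distance. All of these are assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Root-mean-square difference over columns present in both rows.

    Returns None when the rows share no non-missing columns.
    (Averaging over shared columns is an assumption of this sketch.)
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

def knn_impute(rows, k=10):
    """Return a copy of rows with each None replaced by a weighted
    average of that column's value in the k nearest neighbors."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        if None not in row:
            continue
        # Rank every other gene by distance to this gene, closest first.
        dists = []
        for j, other in enumerate(rows):
            if j == i:
                continue
            d = euclidean(row, other)
            if d is not None:
                dists.append((d, j))
        dists.sort()
        for col, value in enumerate(row):
            if value is not None:
                continue
            # Keep the k closest genes that actually have a value here.
            neighbors = [(d, rows[j][col]) for d, j in dists
                         if rows[j][col] is not None][:k]
            if not neighbors:
                continue  # nothing to impute from; leave the gap
            # Inverse-distance weights (assumed weighting scheme).
            weights = [1.0 / (d + 1e-9) for d, _ in neighbors]
            total = sum(w * v for w, (_, v) in zip(weights, neighbors))
            imputed[i][col] = total / sum(weights)
    return imputed

data = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [5.0, 6.0, 7.0],
]
result = knn_impute(data, k=1)
```

With `k=1`, the missing value in the first gene is taken from its single closest neighbor, the second gene. Neighbor values are always read from the original data, so imputed values never feed into later imputations.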
KNNImputer -i <data.pcl> -o <imputed.pcl>
Replace missing values in the microarray data file data.pcl based on their nearest neighbors, remove any genes with too many missing values, and save the result in imputed.pcl.
package "KNNImputer"
version "1.0"
purpose "More modern version of KNNImpute."
section "Main"
option "input" i "Input PCL file"
string typestr="filename"
option "output" o "Output PCL file"
string typestr="filename"
section "Genes/Neighbors"
option "neighbors" k "Nearest neighbors to use"
int default="10"
option "distance" d "Similarity measure"
values="pearson","euclidean","kendalls","kolm-smir","spearman",
"pearnorm","hypergeom" default="euclidean"
option "missing" m "Fraction of conditions which must be present"
double default="0.7"
section "Miscellaneous"
option "genes" g "Gene inclusion file"
string typestr="filename"
option "weights" w "Input weights file"
string typestr="filename"
option "autocorrelate" a "Autocorrelate distances"
flag off
section "Optional"
option "skip" s "Columns to skip in input PCL"
int default="2"
option "limit" l "Gene count limit for caching"
int default="-1"
option "verbosity" v "Message verbosity"
int default="5"
Flag | Default | Type | Description |
---|---|---|---|
-i | stdin | PCL text file | Input PCL file in which missing values are to be imputed. |
-o | stdout | PCL text file | Output PCL file in which missing values have been replaced and genes with too many missing values have been removed. |
-k | 10 | Integer | Number of neighbors to use for each missing value imputation. |
-d | euclidean | euclidean, pearson, kendalls, kolm-smir, spearman, pearnorm, or hypergeom | Similarity measure to use for finding nearest neighbors. The default (Euclidean distance) is highly recommended. |
-m | 0.7 | Double | Fraction of a gene's expression vector that must be present; genes with less than this fraction of non-missing values are removed from the output. For example, in a PCL with 10 data columns, genes with more than three missing values would be removed by default. |
-g | None | Gene text file | If given, only genes in the given gene set are included in the output. |
-w | None | PCL text file | If given, a PCL file with the same dimensions as the data given with -i, in which each cell holds the relative weight given to the corresponding gene/experiment pair. If no weights file is given, all weights default to 1. |
-a | off | Flag | If on, autocorrelate similarity scores (find the maximum similarity score over all possible lags of the two vectors; see Sleipnir::CMeasureAutocorrelate). |
-s | 2 | Integer | Number of columns to skip between the initial ID column and the first experimental (data) column in the input PCL. |
-l | -1 | Integer | Maximum number of genes in input file before in-memory score caching is disabled. If -1, caching is never performed. Caching greatly speeds up processing, but can consume large amounts of memory for inputs with many genes (rows). |
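
To make the interaction of `-s` and `-m` concrete, here is a minimal Python sketch of how a PCL row might be split into an ID and data values, and how the presence threshold might then be applied. It is not Sleipnir's code: `parse_row` and `filter_genes` are hypothetical names, and the sketch assumes tab-delimited rows in which an empty cell denotes a missing value and a gene is kept when its present fraction is at least the threshold.

```python
def parse_row(line, skip=2):
    """Split one tab-delimited PCL data row into (gene_id, values).

    `skip` columns after the ID column are ignored, mirroring -s.
    Empty cells become None (assumed missing-value convention).
    """
    fields = line.rstrip("\n").split("\t")
    gene_id = fields[0]
    values = [float(f) if f.strip() else None for f in fields[1 + skip:]]
    return gene_id, values

def filter_genes(genes, min_present=0.7):
    """Keep genes whose fraction of non-missing values is >= min_present,
    mirroring the -m threshold (the >= comparison is an assumption)."""
    kept = []
    for gene_id, values in genes:
        if not values:
            continue  # no data columns at all
        present = sum(1 for v in values if v is not None)
        if present / len(values) >= min_present:
            kept.append((gene_id, values))
    return kept

# Ten data columns: G1 has three gaps (kept), G2 has four (removed).
rows = [
    "G1\tname1\t1\t" + "\t".join(["0.5"] * 7 + [""] * 3),
    "G2\tname2\t1\t" + "\t".join(["0.5"] * 6 + [""] * 4),
]
genes = [parse_row(r) for r in rows]
kept_ids = [gene_id for gene_id, _ in filter_genes(genes)]
```

This reproduces the example given for `-m`: with ten data columns and the default of 0.7, a gene with three missing values survives and a gene with four is dropped.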