Sleipnir: DChecker

DChecker takes a gold standard answer file and a DAT/DAB of predicted functional relationships (or other interactions) and outputs the information needed for a performance analysis (ROC curve, precision/recall curve, or AUC score) of those predictions.

Usage

Basic Usage

 DChecker -w <answers.dab> -i <predictions.dab>

Output (to standard output) positive gene counts, true and false positive pair counts, true and false negative pair counts, and an overall AUC score using the default binning of continuous data in predictions.dab and the binary gold standard answers in answers.dab.
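
To make the reported quantities concrete, the following is a minimal Python sketch of how true/false positive and negative pair counts at one cutoff, and a rank-based AUC, can be computed from scored gene pairs. It is an illustration only, not DChecker's own code: the function names, the `(score, answer)` pair layout, the "score >= cutoff is a predicted positive" convention, and the use of the Mann-Whitney rank-sum formula for AUC are all assumptions. Pairs with a missing answer are assumed to have been dropped beforehand.

```python
def confusion_counts(pairs, cutoff):
    """Count (tp, fp, tn, fn) for (score, answer) pairs at one cutoff.

    Assumption: a pair is a predicted positive when score >= cutoff,
    and answer is 1 (related) or 0 (unrelated).
    """
    tp = fp = tn = fn = 0
    for score, answer in pairs:
        if score >= cutoff:
            if answer == 1:
                tp += 1
            else:
                fp += 1
        else:
            if answer == 1:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def auc_rank(pairs):
    """AUC from the Mann-Whitney rank sum; tied scores share an average rank."""
    ordered = sorted(pairs, key=lambda p: p[0])
    n = len(ordered)
    n_pos = sum(1 for _, answer in ordered if answer == 1)
    n_neg = n - n_pos
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j < n and ordered[j][0] == ordered[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        positives_in_run = sum(1 for k in range(i, j) if ordered[k][1] == 1)
        rank_sum_pos += avg_rank * positives_in_run
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Hypothetical toy data: six gene pairs with prediction scores and answers.
pairs = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.2, 0), (0.1, 0)]
print(confusion_counts(pairs, 0.65))
print(round(auc_rank(pairs), 3))
```

Sweeping the cutoff over every bin and recording the four counts each time gives the points of a ROC or precision/recall curve.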

 DChecker -w <answers.dab> -i <predictions.dab> -f

Output true/false positive and negative pair counts assuming that predictions.dab contains only a finite number of different values and using these as bins. This is appropriate for inherently discrete data, e.g. cocluster counts.
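
The idea of using the distinct input values themselves as bins can be sketched as follows. This is a hedged Python illustration rather than DChecker's implementation: `finite_cutoff_counts` is a hypothetical name, and treating "score >= value" as a predicted positive is an assumption.

```python
def finite_cutoff_counts(pairs):
    """Return one (cutoff, tp, fp, tn, fn) row per distinct score.

    pairs is a list of (score, answer) tuples with answer 0 or 1.
    Assumption: a pair is a predicted positive when score >= cutoff.
    """
    rows = []
    for cutoff in sorted({score for score, _ in pairs}):
        tp = sum(1 for s, a in pairs if s >= cutoff and a == 1)
        fp = sum(1 for s, a in pairs if s >= cutoff and a == 0)
        fn = sum(1 for s, a in pairs if s < cutoff and a == 1)
        tn = sum(1 for s, a in pairs if s < cutoff and a == 0)
        rows.append((cutoff, tp, fp, tn, fn))
    return rows


# Hypothetical cocluster counts (0 to 3) for six gene pairs.
cocluster_pairs = [(0, 0), (0, 0), (1, 1), (1, 0), (2, 1), (3, 1)]
for row in finite_cutoff_counts(cocluster_pairs):
    print(row)
```

With only four distinct values there are only four rows, which is why this mode suits discrete data and becomes unwieldy when nearly every pair has its own score.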

 DChecker -w <answers.dab> -i <predictions.dab> -b 0 -n -m 0 -M 1 -e 0.01

Output true/false positive and negative pair counts by normalizing the given data predictions.dab to the range [0,1] and creating bin cutoffs at 0.01 increments between 0 and 1. Setting -b 0 turns off quantile binning so that the -m, -M, and -e settings take effect. This can provide a finer-grained binning than the default -b setting for some prediction/data sets (and a coarser binning for others; when in doubt, try both).
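
As a rough illustration of this mode, the Python sketch below rescales scores to [0,1] and builds evenly spaced cutoffs from a minimum, maximum, and step size. The function names are hypothetical, and min-max scaling is an assumption about what "normalize" means here; DChecker's own normalization may differ.

```python
def normalize_min_max(scores):
    """Rescale scores linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: nothing to spread out, map everything to 0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def step_cutoffs(minimum=0.0, maximum=1.0, delta=0.01):
    """Evenly spaced cutoffs from minimum to maximum inclusive.

    Cutoffs are computed from an integer step index to avoid the drift
    that repeated floating-point addition would accumulate.
    """
    steps = int(round((maximum - minimum) / delta))
    return [minimum + i * delta for i in range(steps + 1)]


scaled = normalize_min_max([2.0, 4.0, 6.0])
cutoffs = step_cutoffs(0.0, 1.0, 0.01)
print(scaled)
print(len(cutoffs), cutoffs[0], cutoffs[-1])
```

Fixed-width cutoffs follow the score scale, whereas quantile bins follow the score distribution, which is why one can be finer than the other depending on how the predictions are spread.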

 DChecker -w <answers.dab> -i <predictions.dab> -c <context.txt>

Output true/false positive and negative pair counts for the given predictions.dab using only the gene pairs relevant to the given biological function context.txt. This is appropriate for evaluating context-specific functional relationship predictions.
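
One way to picture which gene pairs count as "relevant" to a context is the Python sketch below. It assumes the context filter behaves like the default values of the -q, -Q, -j, -J, -u, and -U flags listed under Detailed Usage: pairs inside the context are kept whether positive or negative, bridging pairs are kept only when negative, and pairs entirely outside the context are dropped. The function name and data layout are hypothetical; see Sleipnir::CDat::FilterGenes for the authoritative behavior.

```python
def keep_in_context(gene_a, gene_b, answer, context):
    """Decide whether a gold standard pair is used for a context evaluation.

    answer is 1 (related) or 0 (unrelated); context is a set of gene names.
    Assumption: mirrors the documented flag defaults, not DChecker's source.
    """
    in_a = gene_a in context
    in_b = gene_b in context
    if in_a and in_b:
        return True          # -q and -Q default on: both genes in context
    if in_a or in_b:
        return answer == 0   # -J on, -j off: bridging negatives only
    return False             # -u and -U default off: outside the context


# Hypothetical context of two genes.
context = {"A", "B"}
print(keep_in_context("A", "B", 1, context))
print(keep_in_context("A", "C", 1, context))
print(keep_in_context("A", "C", 0, context))
```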

Detailed Usage

package "DChecker"
version "1.0"
purpose "Similarity to answer file checker"

section "Main"
option  "input"         i   "Similarity DAT/DAB file"
                            string  typestr="filename"  yes
option  "answers"       w   "Answer DAT/DAB file"
                            string  typestr="filename"  yes

section "Miscellaneous"
option  "directory"     d   "Output directory"
                            string  typestr="directory" default="."
option  "auc"           a   "Use alternative AUCn calculation"
                            float   default="0"
option  "randomize"     R   "Calculate specified number of randomized scores"
                            int default="0"

section "Ranking Method"
option  "bins"          b   "Bins for quantile sorting"
                            int default="1000"
option  "finite"        f   "Count finitely many bins"
                            flag    off
option  "min"           m   "Minimum correlation to process"
                            float   default="0"
option  "max"           M   "Maximum correlation to process"
                            float   default="1"
option  "delta"         e   "Size of correlation bins"
                            double  default="0.01"

section "Learning/Evaluation"
option  "genes"         g   "Gene inclusion file"
                            string  typestr="filename"
option  "genex"         G   "Gene exclusion file"
                            string  typestr="filename"
option  "ubiqg"         P   "Ubiquitous gene file (-j and -J refer to connections to ubiq instead of all bridging pairs)"
                            string  typestr="filename"
option  "genet"         c   "Term inclusion file"
                            string  typestr="filename"
option  "genee"         C   "Edge inclusion file"
                            string  typestr="filename"
option  "genep"         l   "Gene inclusion file for positives"
                            string  typestr="filename"
option  "ctxtpos"       q   "Use positive edges between context genes"
                            flag    on
option  "ctxtneg"       Q   "Use negative edges between context genes"
                            flag    on
option  "bridgepos"     j   "Use bridging positives between context and non-context genes"
                            flag    off
option  "bridgeneg"     J   "Use bridging negatives between context and non-context genes"
                            flag    on
option  "outpos"        u   "Use positive edges outside the context"
                            flag    off
option  "outneg"        U   "Use negative edges outside the context"
                            flag    off
option  "weights"       W   "Weight file"
                            string  typestr="filename"
option  "flipneg"       F   "Flip weights(one minus original) for negative standards"
                            flag    on

section "Preprocessing"
option  "normalize"     n   "Normalize scores before processing"
                            flag    off
option  "invert"        t   "Invert correlations to distances"
                            flag    off
option  "abs"           A   "Convert input to its absolute values"
                            float   default="0.0"

section "Optional"
option  "sse"           s   "Calculate sum of squared errors"
                            flag    off
option  "memmap"        p   "Memory map input DABs"
                            flag    off
option  "verbosity"     v   "Message verbosity"
                            int default="5"

Flag	Default	Type	Description
None	None	Gene text files	If given, contexts in which multiple context-specific evaluations are performed. Each gene set is read, treated as a "term" filter (see Sleipnir::CDat::FilterGenes) on the given answer file, and a context-specific evaluation is saved in the directory `-d`.
-i	stdin	DAT/DAB file	Input DAT, DAB, DAS, or PCL file.
-w	None	DAT/DAB file	Functional gold standard against which the predictions are evaluated. Should consist of gene pairs with scores of 0 (unrelated), 1 (related), or missing (NaN).
-d	.	Directory	If multiple contexts are being checked, output directory in which individual contexts' score files are placed.
-b	1000	Integer	If nonzero, number of quantile bins into which input scores are sorted. Each bin is then used as a cutoff for predicted positives and negatives.
-f	off	Flag	If on, assume the input predictions contain a small, finite number of distinct values and bin quantiles appropriately. Do not turn `-f` on when the input actually contains a large number of distinct values, since each distinct value is then used as its own bin.
-m	0	Float	If `-b` is zero and `-f` is off, minimum input score to treat as a positive/negative cutoff.
-M	1	Float	If `-b` is zero and `-f` is off, maximum input score to treat as a positive/negative cutoff.
-e	0.01	Double	If `-b` is zero and `-f` is off, size of step to take for cutoffs between `-m` and `-M`.
-g	None	Text gene list	If given, use only gene pairs for which both genes are in the list. For details, see Sleipnir::CDat::FilterGenes.
-G	None	Text gene list	If given, use only gene pairs for which neither gene is in the list. For details, see Sleipnir::CDat::FilterGenes.
-c	None	Text gene list	If given, use only gene pairs passing a "term" filter against the list. For details, see Sleipnir::CDat::FilterGenes.
-C	None	Text gene list	If given, use only gene pairs passing an "edge" filter against the list. For details, see Sleipnir::CDat::FilterGenes.
-n	off	Flag	If on, normalize input edges to the range [0,1] before processing.
-t	off	Flag	If on, output one minus the input's values.
-s	off	Flag	If on, output sum of squared error between input predictions and answer file (assumes a continuous rather than discrete answer file).
-p	off	Flag	If given, memory map the input files when possible. DAT and PCL inputs cannot be memmapped.
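
The quantile binning controlled by `-b` can be pictured with the Python sketch below, which sorts the input scores and takes one cutoff from the start of each equal-sized rank bin. This is a simplified illustration under stated assumptions, not DChecker's algorithm: `quantile_cutoffs` is a hypothetical name, and the exact way Sleipnir assigns pairs to quantile bins may differ.

```python
def quantile_cutoffs(scores, bins=1000):
    """Pick up to `bins` cutoffs so each bin holds about the same number of scores.

    Assumption: the cutoff for bin i is the score at rank i * n // bins.
    Duplicate cutoffs (when bins exceeds the number of distinct scores)
    are collapsed.
    """
    ordered = sorted(scores)
    n = len(ordered)
    picked = {ordered[i * n // bins] for i in range(bins)}
    return sorted(picked)


# Hypothetical scores; five bins over ten values pick every second score.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
print(quantile_cutoffs(scores, bins=5))
print(quantile_cutoffs([0.2, 0.9, 0.5], bins=1000))
```

Because the cutoffs follow the ranks of the scores, densely populated score ranges receive more cutoffs than sparse ones, in contrast to the fixed-width steps set by `-m`, `-M`, and `-e`.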