Sleipnir: Answerer

Answerer generates gold standard DAT or DAB files for machine learning or evaluation. These are usually DAB files in which each functionally related gene pair is given the value 1, each unrelated gene pair the value 0, and any uncertain gene pairs are left with missing values (NaNs).

Overview

Given sets of known related genes - pathways, complexes, GO terms, etc. - Answerer generates a gold standard Sleipnir::CDat. By considering every pair of genes coannotated to one of these sets to be related, the gold standard answers will include a collection of known functionally related pairs. If Answerer is only provided with positive gene sets, it will only generate positive (related) pairs, modulo any uncertain pairs introduced by the overlap option (see below).

In addition to these positive gene sets, Answerer optionally also takes one or more negative sets. These represent "minimally related" genes, such that gene pairs not coannotated to a negative set are definitely unrelated. This is intended to give two thresholds: positive gene sets should be fairly specific (e.g. "mitotic cell cycle" or "aldehyde metabolism" in GO), such that genes coannotated to these processes are known to be doing something biologically similar. Negative gene sets should be fairly general (e.g. "physiological process" or "translation" in GO), such that any genes not similar enough to be coannotated at this level are definitely unrelated. Any genes coannotated with specificity "between" these levels (i.e. above the positive level but below the negative level) are uncertain and not included in Answerer's gold standard.

For example, suppose Answerer got the following two positive sets:

 A
 B
 C

and

 A
 D
 E

and one negative set:

 A
 B
 C
 E

Then it would generate the answer file:

 A  B   1
 A  C   1
 A  D   1
 A  E   1
 B  C   1
 B  D   0
 C  D   0
 D  E   1

Note that the pairs B E and C E are missing from this answer file: they are neither positive nor negative, since B, C, and E are all coannotated to a negative set but neither pair shares a positive set.
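
The rule used in this example can be sketched in a few lines of Python. This is only an illustration of the logic described above, not Answerer's actual implementation; the function name gold_standard and the dictionary output are assumptions made for the sketch, and options such as overlap and genome are ignored.

```python
from itertools import combinations


def gold_standard(positives, negatives):
    """Return {(gene_a, gene_b): 1 or 0}; uncertain pairs are left out."""

    def coannotated(gene_sets):
        # Every pair of genes sharing at least one set, as sorted tuples.
        pairs = set()
        for genes in gene_sets:
            pairs.update(combinations(sorted(genes), 2))
        return pairs

    positive_pairs = coannotated(positives)
    negative_set_pairs = coannotated(negatives)
    genes = sorted(set().union(*positives, *negatives))

    answers = {}
    for pair in combinations(genes, 2):
        if pair in positive_pairs:
            answers[pair] = 1
        elif negatives and pair not in negative_set_pairs:
            # Not even minimally related: a definite negative.
            answers[pair] = 0
        # Otherwise the pair is uncertain and stays missing.
    return answers


answers = gold_standard(
    positives=[{"A", "B", "C"}, {"A", "D", "E"}],
    negatives=[{"A", "B", "C", "E"}],
)
for (gene_a, gene_b), value in sorted(answers.items()):
    print(gene_a, gene_b, value)
```

Run on the sets above, the sketch prints the same eight pairs as the answer file, and B E and C E are absent. With no negative sets it yields positive pairs only.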

Usage

Basic Usage

 Answerer -p <positives_dir> [-n <negatives_dir>] -o <answers.dab>

Generates the answer file answers.dab from the positive gene sets (text files, one gene per line) in positives_dir and, optionally, the negative sets in negatives_dir. If -o is omitted, answers are saved as a DAT on standard output.
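
As a rough sketch of how the inputs might be prepared, the Python below writes gene sets to directories as text files with one gene per line and assembles the matching command line. The helper write_gene_sets, the file names, and the .txt extension are assumptions chosen for illustration; the sketch only builds the command and does not run Answerer.

```python
import tempfile
from pathlib import Path


def write_gene_sets(directory, gene_sets):
    """Write each gene set to its own text file, one gene ID per line."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for name, genes in gene_sets.items():
        # File names and the .txt extension are hypothetical.
        (directory / f"{name}.txt").write_text("\n".join(sorted(genes)) + "\n")


work = Path(tempfile.mkdtemp())
write_gene_sets(work / "positives", {"set1": {"A", "B", "C"}, "set2": {"A", "D", "E"}})
write_gene_sets(work / "negatives", {"general": {"A", "B", "C", "E"}})

# Same shape as the basic usage line above.
command = [
    "Answerer",
    "-p", str(work / "positives"),
    "-n", str(work / "negatives"),
    "-o", str(work / "answers.dab"),
]
print(" ".join(command))
```

The printed command can be pasted into a shell on a machine where Answerer is installed.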

Detailed Usage

package "Answerer"
version "1.0"
purpose "Generates an answer file given positives, negatives, and a genome"

option  "output"            o   "Output DAB file"
                                string  typestr="filename"

defgroup "Positives"    yes
groupoption "positives"     p   "Directory containing related gene lists"
                                string  typestr="directory" group="Positives"
groupoption "input"         i   "Pre-existing positive DAT file"
                                string  typestr="filename"  group="Positives"

section "Negatives"
option  "negatives"     n   "Directory containing minimally related gene lists"
                                string  typestr="directory"
option  "interactions"      x   "Expected interactions per gene"
                                double
option  "prior"         P   "Target prior for the answer file. This prior is only a target, may turn out to be lower."
                                double

section "Modifications"
option  "incident"          c   "Require negative pairs to include an annotated gene"
                                flag    off
option  "exclude"           e   "DAT/DAB file of gene pairs to exclude from the standard"
                                string  typestr="filename"
option  "scramble"          s   "Fraction of gene pairs to set randomly"
                                double  default="0"

section "Miscellaneous"
option  "overlap"           l   "P-value cutoff for negative term overlap"
                                double  default="0"
option  "genome"            g   "List of all genes to be considered"
                                string  typestr="filename"
option  "test"              t   "Fraction of genes to randomly select for testing"
                                double  default="0"

section "Optional"
option  "random"            r   "Seed random generator"
                                int default="0"
option  "verbosity"         v   "Message verbosity"
                                int default="5"

Flag	Default	Type	Description
-p	None	Directory	Directory containing related (positive) gene lists. Each gene list is a text file containing one systematic gene ID per line.
-i	None	DAT/DAB file	File containing known related pairs. If given, positive gene pairs will be drawn directly from the given Sleipnir::CDat rather than calculated from coannotation to gene sets. Any gene pair with a non-missing, non-zero value in the given Sleipnir::CDat will become a positive pair.
-n	None	Directory	Directory containing minimally related (negative) gene lists. Each gene list is a text file containing one systematic gene ID per line.
-x	None	Double	Expected number of positive functional relationships per gene. If given, negative gene pairs are chosen at random from the non-positive pairs, with probability equal to the prior of functional relationship times the size of the genome divided by the requested number of positive interactions. For example, if yeast has 6000 genes and you want a gene pair to have a 5% chance of being functionally related, choose an interaction number 0.05 * 6000 = 300.
-g	None	Gene file	Text file containing gene IDs, one per line. Only genes in this list will be used; gene pairs containing genes not in the list will be ignored.
-t	0	Double	If nonzero, this fraction of the genome is omitted as genes for future holdout (exclusion) sets. In practice, this will print a list of the requested number of genes to standard output (if -o is given) or standard error (if it is not). These can be saved to a file and later used as a holdout/test set.
-l	0	Double	If nonzero, gene pairs annotated to different positive sets whose hypergeometric p-value of overlap is less than this value will be considered uncertain (i.e. missing, NaN) in the output answers instead of unrelated. In other words, if genes A and B are annotated to two different positive sets, but these two sets have significant overlap, the gene pair will be neither positive nor negative in the output answer file. 0.05 is a good value for producing generally sane answer sets.
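
To make the -l cutoff more concrete, here is a minimal Python sketch of a hypergeometric overlap test between two gene sets. It assumes the p-value is the upper tail, i.e. the probability of seeing at least the observed number of shared genes by chance; whether Answerer computes exactly this tail is an assumption, and the function name overlap_pvalue and the example sizes are hypothetical.

```python
from math import comb


def overlap_pvalue(genome_size, size_a, size_b, shared):
    """P(X >= shared) when size_b genes are drawn at random from a genome
    of genome_size genes, size_a of which belong to set A."""
    total = comb(genome_size, size_b)
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_a, k) * comb(genome_size - size_a, size_b - k)
        for k in range(shared, upper + 1)
    )
    return tail / total


# Hypothetical sizes: two positive sets of 100 and 120 genes sharing 40
# genes, in a genome of 6000 genes.
p = overlap_pvalue(6000, 100, 120, 40)
print("uncertain" if p < 0.05 else "unrelated")
```

Under this reading, a pair of genes drawn from two such heavily overlapping sets would be left missing rather than marked 0 when -l 0.05 is given.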