Sleipnir
Answerer generates gold standard DAT or DAB files for machine learning or evaluation. These are usually DAB files in which each functionally related gene pair is given the value 1, each unrelated gene pair the value 0, and any uncertain gene pairs are left with missing values (NaNs).
Given sets of known related genes - pathways, complexes, GO terms, etc. - Answerer generates a gold standard Sleipnir::CDat. By considering every pair of genes coannotated to one of these sets to be related, the gold standard answers will include a collection of known functionally related pairs. If Answerer is only provided with positive gene sets, it will only generate positive (related) pairs, modulo any uncertain pairs introduced by the overlap option (see below).
In addition to these positive gene sets, Answerer optionally also takes one or more negative sets. These represent "minimally related" genes, such that gene pairs not coannotated to a negative set are definitely unrelated. This is intended to give two thresholds: positive gene sets should be fairly specific (e.g. "mitotic cell cycle" or "aldehyde metabolism" in GO), such that genes coannotated to these processes are known to be doing something biologically similar. Negative gene sets should be fairly general (e.g. "physiological process" or "translation" in GO), such that any genes not similar enough to be coannotated at this level are definitely unrelated. Any genes coannotated with specificity "between" these levels (i.e. above the positive level but below the negative level) are uncertain and not included in Answerer's gold standard.
For example, suppose Answerer got the following two positive sets:
A B C
and
A D E
and one negative set:
A B C E
Then it would generate the answer file:
A B 1
A C 1
A D 1
A E 1
B C 1
B D 0
C D 0
D E 1
Note that the pairs B E and C E are missing from this answer file: they are neither positive nor negative, since B, C, and E are all coannotated to a negative set.
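The rule described above can be sketched in a few lines of Python. This is only an illustration of the positive/negative/uncertain logic as this page states it, not Answerer's actual implementation: the function name `gold_standard` is hypothetical, the overlap option is ignored, and the tab-separated output is an assumed rendering of the text (DAT) form.

```python
from itertools import combinations


def gold_standard(positives, negatives):
    """Return {(gene_a, gene_b): 1 or 0}; uncertain pairs are left out.

    Hypothetical helper illustrating the rule in the text, not Sleipnir code.
    """
    genes = sorted(set().union(*positives, *negatives))

    def coannotated(a, b, gene_sets):
        # True if some set contains both genes.
        return any(a in s and b in s for s in gene_sets)

    answers = {}
    for a, b in combinations(genes, 2):
        if coannotated(a, b, positives):
            answers[(a, b)] = 1  # related
        elif negatives and not coannotated(a, b, negatives):
            answers[(a, b)] = 0  # definitely unrelated
        # Otherwise the pair is uncertain and stays missing.
    return answers


positives = [{"A", "B", "C"}, {"A", "D", "E"}]
negatives = [{"A", "B", "C", "E"}]
answers = gold_standard(positives, negatives)
for (a, b), value in answers.items():
    # Assumed text layout: one tab-separated pair and value per line.
    print(f"{a}\t{b}\t{value}")
```

Run on the sets from the example, the sketch yields the same eight pairs as the answer file, and B E and C E are absent. With no negative sets it produces only positive pairs, matching the behavior described earlier.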
Answerer -p <positives_dir> [-n <negatives_dir>] -o <answers.dab>
Generates the answer file answers.dab from the positive gene sets (text files, one gene per line) in positives_dir and, optionally, the negative sets in negatives_dir. If -o is omitted, answers are saved as a DAT on standard output.
package "Answerer"
version "1.0"
purpose "Generates an answer file given positives, negatives, and a genome"
option "output" o "Output DAB file"
string typestr="filename"
defgroup "Positives" yes
groupoption "positives" p "Directory containing related gene lists"
string typestr="directory" group="Positives"
groupoption "input" i "Pre-existing positive DAT file"
string typestr="filename" group="Positives"
section "Negatives"
option "negatives" n "Directory containing minimally related gene lists"
string typestr="directory"
option "interactions" x "Expected interactions per gene"
double
option "prior" P "Target prior for the answer file. This prior is only a target, may turn out to be lower."
double
section "Modifications"
option "incident" c "Require negative pairs to include an annotated gene"
flag off
option "exclude" e "DAT/DAB file of gene pairs to exclude from the standard"
string typestr="filename"
option "scramble" s "Fraction of gene pairs to set randomly"
double default="0"
section "Miscellaneous"
option "overlap" l "P-value cutoff for negative term overlap"
double default="0"
option "genome" g "List of all genes to be considered"
string typestr="filename"
option "test" t "Fraction of genes to randomly select for testing"
double default="0"
section "Optional"
option "random" r "Seed random generator"
int default="0"
option "verbosity" v "Message verbosity"
int default="5"
Flag | Default | Type | Description |
---|---|---|---|
-p | None | Directory | Directory containing related (positive) gene lists. Each gene list is a text file containing one systematic gene ID per line. |
-i | None | DAT/DAB file | File containing known related pairs. If given, positive gene pairs will be drawn directly from the given Sleipnir::CDat rather than calculated from coannotation to gene sets. Any gene pair with a non-missing, non-zero value in the given Sleipnir::CDat will become a positive pair. |
-n | None | Directory | Directory containing minimally related (negative) gene lists. Gene pairs not coannotated to any of these lists are treated as unrelated. Each gene list is a text file containing one systematic gene ID per line. |
-x | None | Double | Expected number of positive functional relationships per gene. If given, negative gene pairs are chosen at random from the non-positive pairs, with probability equal to the prior of functional relationship times the size of the genome divided by the requested number of positive interactions. For example, if yeast has 6000 genes and you want a gene pair to have a 5% chance of being functionally related, choose an interaction number 0.05 * 6000 = 300. |
-g | None | Gene file | Text file containing gene IDs, one per line. Only genes in this list will be used; gene pairs containing genes not in the list will be ignored. |
-t | 0 | Double | If nonzero, this fraction of the genome is randomly set aside as genes for future holdout (exclusion) sets. In practice, this writes a list of the selected genes to standard output (if -o is given) or standard error (if it is not). These can be saved to a file and later used as a holdout/test set. |
-l | 0 | Double | If nonzero, genes coannotated to positive sets with hypergeometric p-value of overlap less than this value will be considered uncertain (i.e. missing, NaN) in the output answers instead of unrelated. In other words, if genes A and B are annotated to two different positive sets, but these two sets have significant overlap, the gene pair will be neither positive nor negative in the output answer file. 0.05 is a good value for producing generally sane answer sets. |
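Two of the options in the table involve small calculations, sketched here in Python. `expected_interactions` reproduces the arithmetic from the -x row. `overlap_p_value` computes an upper-tail hypergeometric p-value for the overlap between two gene sets; that this matches the exact test behind -l is an assumption, and both function names are hypothetical.

```python
from math import comb


def expected_interactions(prior, genome_size):
    """Value for -x given a target prior, following the -x example."""
    return prior * genome_size


def overlap_p_value(genome_size, size_a, size_b, shared):
    """P(overlap >= shared) when size_b genes are drawn at random from a
    genome in which size_a genes belong to the other set.

    Standard upper-tail hypergeometric test; assumed, not confirmed, to be
    the statistic that -l thresholds.
    """
    total = comb(genome_size, size_b)
    tail = sum(
        comb(size_a, k) * comb(genome_size - size_a, size_b - k)
        for k in range(shared, min(size_a, size_b) + 1)
    )
    return tail / total


# The yeast example from the -x row: 5% prior, 6000 genes.
print(expected_interactions(0.05, 6000))

# Two illustrative sets of 100 and 120 genes sharing 30, in a 6000-gene genome.
p = overlap_p_value(6000, 100, 120, 30)
print(p < 0.05)
```

Under this reading, a pair of genes annotated to two different positive sets whose overlap p-value falls below the -l cutoff would be left missing rather than marked unrelated.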