Sleipnir
BNFunc produces gene sets (and, optionally, answer files) from functional catalog slims (i.e. lists of individual ontology terms). These gene sets are essentially lists of all genes annotated under the given terms, and can be used with tools such as Answerer to build more complex functional gold standards.
BNFunc provides a quick way to retrieve gene sets from a functional catalog. It consumes a slim file, which is a list of functional catalog terms, and produces one gene set per term in the slim, each containing the genes annotated at or below that term. BNFunc has limited abilities to produce a functional gold standard directly (by marking coannotated gene pairs as related, 1, and non-coannotated gene pairs as unrelated, 0); alternatively, the gene sets can be used with Answerer to construct a gold standard.
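To make "annotated at or below" and the 1/0 coannotation labeling concrete, here is a minimal Python sketch. It is not BNFunc's implementation: the toy ontology, the gene names, and the helper functions `ancestors`, `gene_sets`, and `coannotation_labels` are all hypothetical and exist only for illustration.

```python
from itertools import combinations

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "GO:B": ["GO:A"],
    "GO:C": ["GO:A"],
    "GO:D": ["GO:B"],
}

# Hypothetical direct annotations: gene -> terms it is annotated to.
ANNOTATIONS = {
    "G1": ["GO:D"],
    "G2": ["GO:B"],
    "G3": ["GO:C"],
    "G4": ["GO:C"],
}


def ancestors(term, parents):
    """Return the term itself plus every term above it in the ontology."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def gene_sets(slim_ids, parents, annotations):
    """Collect, for each slim term, the genes annotated at or below it."""
    sets = {term: set() for term in slim_ids}
    for gene, terms in annotations.items():
        for term in terms:
            for anc in ancestors(term, parents):
                if anc in sets:
                    sets[anc].add(gene)
    return sets


def coannotation_labels(sets):
    """Label each gene pair 1 if some set contains both genes, else 0."""
    genes = sorted(set().union(*sets.values()))
    labels = {}
    for a, b in combinations(genes, 2):
        together = any(a in s and b in s for s in sets.values())
        labels[(a, b)] = 1 if together else 0
    return labels


sets = gene_sets(["GO:B", "GO:C"], PARENTS, ANNOTATIONS)
labels = coannotation_labels(sets)
```

In this toy example G1 is annotated only to GO:D, but because GO:D sits below GO:B it still lands in the GO:B gene set alongside G2.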
Suppose we've obtained the functional gold slim from GRIFn and transformed it into a file positive_slim.txt in the appropriate format:
translation	GO:0043037
cytoskeleton organization and biogenesis	GO:0007010
transcription from RNA polymerase II promoter	GO:0006366
...
boron transport	GO:0046713
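As a rough illustration of the slim format (a text description and a term ID separated by a tab, one term per line), the following Python sketch parses such a file and rejects malformed rows. The `read_slim` helper is hypothetical, and the sample text contains only the four terms shown here, not the elided ones.

```python
import csv
import io

# Only the terms quoted in this document; the elided ones are not guessed.
SLIM_TEXT = (
    "translation\tGO:0043037\n"
    "cytoskeleton organization and biogenesis\tGO:0007010\n"
    "transcription from RNA polymerase II promoter\tGO:0006366\n"
    "boron transport\tGO:0046713\n"
)


def read_slim(handle):
    """Parse a two-column, tab-delimited slim into (description, id) pairs."""
    terms = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-delimited columns, got {row!r}")
        description, term_id = (field.strip() for field in row)
        terms.append((description, term_id))
    return terms


terms = read_slim(io.StringIO(SLIM_TEXT))
```

To check a real file, pass `open("positive_slim.txt", newline="")` in place of the `io.StringIO` object.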
You should first create a directory to hold the results (e.g. ./positive_sets/) and download the Gene Ontology structure and yeast annotation files. Then run:
BNFunc -i positive_slim.txt -d ./positive_sets/ -y gene_ontology.obo -g gene_association.sgd
This will produce one gene file per term in the slim, e.g. translation, containing:
YOR335C
YJR047C
YGL105W
...
down through boron transport, containing:
YNL275W
If you want to go on to create a functional gold standard as in Myers et al 2005 or Huttenhower et al 2006, you'll need negative gene sets (i.e. the "minimally related" gene sets) in addition to positive ones. You can obtain a negative functional slim from the MEFIT download site, and transform it into a tab-delimited file negative_slim.txt of the proper form:
development	GO:0007275
nitrogen compound metabolism	GO:0006807
catabolism	GO:0009056
...
regulation of biological process	GO:0050789
Now create another new directory ./negative_sets/ and run:
BNFunc -i negative_slim.txt -d ./negative_sets/ -y gene_ontology.obo -g gene_association.sgd
Now you have positive and negative gene sets that you can easily use with Answerer.
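The two runs above differ only in the slim file and output directory, so they can be scripted. The Python sketch below creates each output directory and assembles the argument list using only the flags shown in this walkthrough; the `bnfunc_command` helper and the temporary working directory are hypothetical, and the sketch does not invoke BNFunc itself.

```python
import os
import tempfile


def bnfunc_command(slim, out_dir,
                   obo="gene_ontology.obo",
                   annotations="gene_association.sgd"):
    """Create out_dir if needed and return the BNFunc argument list."""
    os.makedirs(out_dir, exist_ok=True)
    return ["BNFunc", "-i", slim, "-d", out_dir, "-y", obo, "-g", annotations]


# A scratch directory keeps this example from touching the current directory.
work = tempfile.mkdtemp()
positive_cmd = bnfunc_command("positive_slim.txt",
                              os.path.join(work, "positive_sets"))
negative_cmd = bnfunc_command("negative_slim.txt",
                              os.path.join(work, "negative_sets"))

# To actually run the tool (assumes BNFunc is on PATH and inputs exist):
# import subprocess
# subprocess.run(positive_cmd, check=True)
# subprocess.run(negative_cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.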
BNFunc -i <slim.txt> -d <output_dir> -y gene_ontology.obo -g <gene_association.sgd> -k ko -K <ORG> -m funcat-2.0_scheme -a <funcat-2.0_data_18052006>
Saves gene lists for the terms specified in slim.txt into the directory output_dir. The slim file must list IDs from one or more of the provided functional catalogs. Only a subset of these need be used: the Gene Ontology (arguments -y and -g), the KEGG orthology (arguments -k and -K, with organism codes SCE, HSA, etc.), or the MIPS funcat (-m and -a).
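Since only the catalogs actually referenced by the slim need to be supplied, it can help to check which ones a slim uses before assembling the command line. The Python sketch below guesses the catalog from the shape of each ID. The regular expressions are assumptions based on the example IDs in this document (GO:0006796, ko00361, 01.04.01), not rules taken from BNFunc, and `relevant_flags` is a hypothetical helper.

```python
import re

# Assumed ID shapes, inferred from the example IDs in this document.
CATALOG_PATTERNS = [
    ("GO", re.compile(r"GO:\d{7}"), ("-y", "-g")),
    ("KEGG", re.compile(r"ko\d{5}"), ("-k", "-K")),
    ("MIPS", re.compile(r"\d{2}(\.\d{2})*"), ("-m", "-a")),
]


def relevant_flags(term_ids):
    """Return the catalog flags relevant to the given term IDs, in order."""
    flags = []
    for term_id in term_ids:
        for _name, pattern, catalog_flags in CATALOG_PATTERNS:
            if pattern.fullmatch(term_id):
                for flag in catalog_flags:
                    if flag not in flags:
                        flags.append(flag)
                break
        else:
            # No pattern matched this ID.
            raise ValueError(f"unrecognized term ID: {term_id!r}")
    return flags


go_only = relevant_flags(["GO:0043037", "GO:0007010"])
mixed = relevant_flags(["GO:0006796", "ko00361", "01.04.01"])
```

Note that -K has a default value (SCE), so it only needs to be given explicitly for other organisms.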
package "BNFunc"
version "1.0"
purpose "Functional Bayes net preparation"
section "Main"
option "input" i "Ontology slim file"
string typestr="filename" yes
section "Miscellaneous"
option "directory" d "Output directory"
string typestr="directory" default="."
option "output" o "Answer file"
string typestr="filename"
option "negatives" I "Negative slim file"
string typestr="filename"
section "Function Catalogs"
option "onto" y "ontology (obo file)"
string typestr="filename"
option "obo_anno" g "Gene annotations that correspond to the OBO ontology for the organism of interest."
string typestr="filename"
option "namespace" n "Namespace (the gene ontology namespaces can be abbreviated bp, mf, and cc)"
string typestr="namespace" default=""
option "kegg" k "KEGG ontology"
string typestr="filename"
option "kegg_org" K "KEGG organism"
string default="SCE"
option "mips_onto" m "MIPS ontology"
string typestr="filename"
option "mips_anno" a "MIPS annotations"
string typestr="filename"
section "Gene Names"
option "synonyms" s "Prefer synonym names"
flag off
option "dbids" b "Include GO database IDs"
flag off
option "allids" l "Output all available IDs"
flag off
section "Optional"
option "test" t "Test fraction"
double default="0"
option "sql" q "File in which to save SQL tables"
string typestr="filename"
option "nsets" N "Generate negative sets for input slim"
flag off
option "nsetlap" L "P-value of overlap for negative rejection"
double default="0.05"
option "config" c "Command line config file"
string typestr="filename" default="BNFunc.ini"
option "random" r "Seed random generator"
int default="0"
option "verbosity" v "Message verbosity"
int default="5"
option "annotations" f "File for propogated annotations"
string typestr="filename"
Flag | Default | Type | Description |
---|---|---|---|
-i | None | Slim text file | Tab-delimited text file containing two columns with one functional catalog term per line. The first column specifies a text description of the term, and the second column gives the ID of the term (e.g. GO:0006796, ko00361, 01.04.01, etc.) |
-d | . | Directory | Directory in which gene set files are created. Large slims can create lots of files; use the default (current directory .) with caution! |
-o | None | DAB file | If given, produce a functional gold standard from the given positive (and, optionally, negative) slim in addition to outputting gene lists. For details, see Answerer. |
-I | None | Slim text file | If given, use the given slim as negative (minimally related) gene sets when producing a functional gold standard. For details, see Answerer. |
-y | None | OBO text file | OBO file containing the structure of the Gene Ontology. |
-g | None | Annotation text file | Gene Ontology annotation file for the desired organism. |
-n | bp | String | Gene Ontology namespace to be used for term ID lookups. "bp", "cc", and "mf" can be used as abbreviations for the three common namespaces (biological process, cellular component, and molecular function). |
-k | None | KEGG orthology text file | ko file containing the structure and annotations of the KEGG orthology. |
-K | SCE | KEGG organism code | Three letter organism code of annotations to be read from the ko file. Options include SCE for yeast, HSA for human, DME for fly, CEL for worm, and MMU for mouse. |
-m | None | MIPS schema text file | File containing the schema (structure) of the MIPS functional catalog. |
-a | None | MIPS annotation text file | File containing the annotations to be used with the MIPS functional catalog. |
-s | off | Flag | If on, output the first common gene name (if present) rather than the systematic name from the annotation file. |
-b | off | Flag | If on, output each gene's database ID rather than the systematic name from the annotation file. |
-l | off | Flag | If on, output every available ID and synonym for each gene (tab delimited, one gene per line). |
-t | 0 | Double | Fraction of genes to be reserved for testing. If nonzero, genes will randomly be selected, not placed in any output set, and printed to standard output. These can be saved and used for later holdout evaluation. |
-q | None | SQL text file | If given, gene sets will also be saved as SQL tables in addition to the text file lists. |
-N | None | Directory | If given, gene sets indicating which genes are functionally unrelated to each slim term will be produced in the requested directory. This is rarely directly useful, since it's easier to produce negative gene sets from a second slim file. |
-L | 0.05 | Double | If two input terms have a hypergeometric p-value of overlap below this threshold, genes annotated to the two terms cannot be considered unrelated. They will either be related (if coannotated to some other term) or missing (neither related nor unrelated). Only applies if -o or -N is given. |
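The -L option depends on a hypergeometric p-value of overlap between two terms' gene sets. As a rough illustration, the Python sketch below computes the upper-tail probability of seeing at least the observed overlap by chance. Whether BNFunc uses exactly this tail, and which gene population it uses as background, are assumptions; `overlap_pvalue` is a hypothetical helper.

```python
from math import comb


def overlap_pvalue(population, size_a, size_b, overlap):
    """P(X >= overlap) when size_b genes are drawn from a population
    in which size_a genes belong to the other term (assumed upper tail)."""
    total = comb(population, size_b)
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_a, i) * comb(population - size_a, size_b - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Two 3-gene terms in a 10-gene background that share all 3 genes.
p_full = overlap_pvalue(10, 3, 3, 3)
# With no observed overlap, every outcome is at least as extreme.
p_none = overlap_pvalue(10, 3, 3, 0)
# Under this sketch, a pair is barred from the negatives when p < 0.05.
blocked = p_full < 0.05
```

A small p-value means the two terms share more genes than chance would predict, which is why such pairs are not treated as unrelated.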