Sleipnir: BNFunc

BNFunc produces gene sets (and, optionally, answer files) from functional catalog slims (i.e. lists of individual ontology terms). These gene sets are essentially lists of all genes annotated under the given terms, and can be used with tools such as Answerer to build more complex functional gold standards.

Overview

BNFunc provides a quick and easy way to retrieve gene sets from a functional catalog. It consumes a slim file, which is a list of functional catalog terms, and produces one gene set for each term in the slim, each containing the genes annotated at or below that term. BNFunc has a limited ability to produce a functional gold standard directly (by marking coannotated gene pairs as related, 1, and non-coannotated gene pairs as unrelated, 0); alternatively, the gene sets can be used with Answerer to construct a gold standard.
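
The coannotation rule described above can be illustrated with a minimal Python sketch. This is not BNFunc's implementation: the function name coannotation_labels and the dictionary layout are assumptions made for illustration, and the sketch ignores the overlap-based rejection controlled by -L.

```python
from itertools import combinations


def coannotation_labels(gene_sets):
    """Label every gene pair 1 if the genes share a set, otherwise 0.

    gene_sets maps a term name to the set of genes annotated under it.
    Hypothetical helper for illustration only, not part of BNFunc.
    """
    all_genes = sorted(set().union(*gene_sets.values()))
    related = set()
    for genes in gene_sets.values():
        # Sorting keeps each pair in the same order as the pairs built below.
        related.update(combinations(sorted(genes), 2))
    return {pair: int(pair in related) for pair in combinations(all_genes, 2)}
```

Only genes that appear in at least one set receive a label here; genes absent from every set are simply missing from the result.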

Suppose we've obtained the functional gold slim from GRIFn and transformed it into a file positive_slim.txt in the appropriate format:

 translation    GO:0043037
 cytoskeleton organization and biogenesis   GO:0007010
 transcription from RNA polymerase II promoter  GO:0006366
 ...
 boron transport    GO:0046713
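
Because the slim must be tab-delimited with exactly two columns, it can be worth checking a hand-edited file before running BNFunc. The following Python sketch is one way to do that; read_slim is a hypothetical helper, and skipping blank lines is an assumption of the sketch rather than documented BNFunc behavior.

```python
def read_slim(lines):
    """Parse slim lines of the form 'description<TAB>ID' into (description, ID) pairs.

    Hypothetical checker for illustration; it raises ValueError on any line
    that does not have exactly two non-empty tab-separated fields.
    """
    terms = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = [field.strip() for field in line.rstrip("\n").split("\t")]
        if len(fields) != 2 or not all(fields):
            raise ValueError(f"line {number}: expected 'description<TAB>ID', got {line!r}")
        terms.append((fields[0], fields[1]))
    return terms
```

Passing an open file handle, e.g. read_slim(open("positive_slim.txt")), checks the whole file and reports the first malformed line.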

You should first create a directory to hold the results (e.g. ./positive_sets/) and download the Gene Ontology structure and yeast annotation files. Then run:

 BNFunc -i positive_slim.txt -d ./positive_sets/ -y gene_ontology.obo -g gene_association.sgd

This will produce one gene file per term in the slim, e.g. translation containing:

 YOR335C
 YJR047C
 YGL105W
 ...

This continues down through boron transport, which contains:

 YNL275W

If you want to go on to create a functional gold standard as in Myers et al 2005 or Huttenhower et al 2006, you'll need negative gene sets (i.e. the "minimally related" gene sets) in addition to positive ones. You can obtain a negative functional slim from the MEFIT download site, and transform it into a tab-delimited file negative_slim.txt of the proper form:

 development    GO:0007275
 nitrogen compound metabolism   GO:0006807
 catabolism GO:0009056
 ...
 regulation of biological process   GO:0050789

Now create another new directory, ./negative_sets/, and run:

 BNFunc -i negative_slim.txt -d ./negative_sets/ -y gene_ontology.obo -g gene_association.sgd

Now you have positive and negative gene sets that you can easily use with Answerer.
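
Before handing the two directories to Answerer, it can be useful to inspect them. The Python sketch below loads each directory into memory and reports genes that occur in both positive and negative sets. It assumes the default output format (one gene ID per line, one file per term, with the file name used as the term label); load_gene_sets and shared_genes are hypothetical helpers, not part of Sleipnir.

```python
from pathlib import Path


def load_gene_sets(directory):
    """Read every file in directory as a gene set keyed by file name."""
    sets = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            lines = path.read_text().splitlines()
            sets[path.name] = {line.strip() for line in lines if line.strip()}
    return sets


def shared_genes(positives, negatives):
    """Return genes that appear in at least one positive and one negative set."""
    pos = set().union(*positives.values())
    neg = set().union(*negatives.values())
    return pos & neg
```

For example, shared_genes(load_gene_sets("./positive_sets/"), load_gene_sets("./negative_sets/")) lists the genes present on both sides.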

Usage

Basic Usage

 BNFunc -i <slim.txt> -d <output_dir> -y <gene_ontology.obo> -g <gene_association.sgd>
        -k <ko> -K <ORG> -m <funcat-2.0_scheme> -a <funcat-2.0_data_18052006>

Saves gene lists for the terms specified in slim.txt into the directory output_dir. The slim file must list IDs from one or more of the provided functional catalogs. Only a subset of these need be used: the Gene Ontology (arguments -y and -g), the KEGG orthology (arguments -k and -K, with organism codes SCE, HSA, etc.), or the MIPS funcat (-m and -a).
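
When BNFunc is driven from a script, the choice of catalogs can be made explicit by assembling the argument list programmatically. The Python sketch below uses only the flags documented on this page; the function bnfunc_args and its parameter layout are hypothetical, and the check that at least one catalog is supplied is an assumption of the sketch.

```python
def bnfunc_args(slim, out_dir, go=None, kegg=None, mips=None):
    """Build a BNFunc argument list from the catalogs that are supplied.

    go   = (obo_file, annotation_file)      -> -y and -g
    kegg = (ko_file, organism_code)         -> -k and -K
    mips = (scheme_file, annotation_file)   -> -m and -a
    """
    if not (go or kegg or mips):
        raise ValueError("supply at least one functional catalog")
    args = ["BNFunc", "-i", slim, "-d", out_dir]
    if go:
        args += ["-y", go[0], "-g", go[1]]
    if kegg:
        args += ["-k", kegg[0], "-K", kegg[1]]
    if mips:
        args += ["-m", mips[0], "-a", mips[1]]
    return args
```

The resulting list can be passed to subprocess.run(args, check=True), provided the BNFunc executable is on the PATH.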

Detailed Usage

package "BNFunc"
version "1.0"
purpose "Functional Bayes net preparation"

section "Main"
option  "input"     i   "Ontology slim file"
                        string  typestr="filename"  yes

section "Miscellaneous"
option  "directory" d   "Output directory"
                        string  typestr="directory" default="."
option  "output"    o   "Answer file"
                        string  typestr="filename"
option  "negatives" I   "Negative slim file"
                        string  typestr="filename"

section "Function Catalogs"
option  "onto"      y   "ontology (obo file)"
                        string  typestr="filename"
option  "obo_anno"  g   "Gene annotations that correspond to the OBO ontology for the organism of interest."
                        string  typestr="filename"
option  "namespace" n   "Namespace (the gene ontology namespaces can be abbreviated bp, mf, and cc)"
                        string  typestr="namespace" default=""
option  "kegg"      k   "KEGG ontology"
                        string  typestr="filename"
option  "kegg_org"  K   "KEGG organism"
                        string  default="SCE"
option  "mips_onto" m   "MIPS ontology"
                        string  typestr="filename"
option  "mips_anno" a   "MIPS annotations"
                        string  typestr="filename"

section "Gene Names"
option  "synonyms"  s   "Prefer synonym names"
                        flag    off
option  "dbids"     b   "Include GO database IDs"
                        flag    off
option  "allids"    l   "Output all available IDs"
                        flag    off

section "Optional"
option  "test"      t   "Test fraction"
                        double  default="0"
option  "sql"       q   "File in which to save SQL tables"
                        string  typestr="filename"
option  "nsets"     N   "Generate negative sets for input slim"
                        flag    off
option  "nsetlap"   L   "P-value of overlap for negative rejection"
                        double  default="0.05"
option  "config"    c   "Command line config file"
                        string  typestr="filename"  default="BNFunc.ini"
option  "random"    r   "Seed random generator"
                        int default="0"
option  "verbosity" v   "Message verbosity"
                        int default="5"
option  "annotations"   f   "File for propagated annotations"
                        string  typestr="filename"

Flag	Default	Type	Description
-i	None	Slim text file	Tab-delimited text file containing two columns with one functional catalog term per line. The first column specifies a text description of the term, and the second column gives the ID of the term (e.g. GO:0006796, ko00361, 01.04.01, etc.)
-d	.	Directory	Directory in which gene set files are created. Large slims can create lots of files; use the default (current directory .) with caution!
-o	None	DAB file	If given, produce a functional gold standard from the given positive (and, optionally, negative) slim in addition to outputting gene lists. For details, see Answerer.
-I	None	Slim text file	If given, use the given slim as negative (minimally related) gene sets when producing a functional gold standard. For details, see Answerer.
-y	None	OBO text file	OBO file containing the structure of the Gene Ontology.
-g	None	Annotation text file	Gene Ontology annotation file for the desired organism.
-n	bp	String	Gene Ontology namespace to be used for term ID lookups. "bp", "cc", and "mf" can be used as abbreviations for the three common namespaces (biological process, cellular component, and molecular function).
-k	None	KEGG orthology text file	`ko` file containing the structure and annotations of the KEGG orthology.
-K	SCE	KEGG organism code	Three letter organism code of annotations to be read from the `ko` file. Options include SCE for yeast, HSA for human, DME for fly, CEL for worm, and MMU for mouse.
-m	None	MIPS schema text file	File containing the schema (structure) of the MIPS functional catalog.
-a	None	MIPS annotation text file	File containing the annotations to be used with the MIPS functional catalog.
-s	off	Flag	If on, output the first common gene name (if present) rather than the systematic name from the annotation file.
-b	off	Flag	If on, output each gene's database ID rather than the systematic name from the annotation file.
-l	off	Flag	If on, output every available ID and synonym for each gene (tab delimited, one gene per line).
-t	0	Double	Fraction of genes to be reserved for testing. If nonzero, genes will randomly be selected, not placed in any output set, and printed to standard output. These can be saved and used for later holdout evaluation.
-q	None	SQL text file	If given, gene sets will also be saved as SQL tables in addition to the text file lists.
-N	None	Directory	If given, gene sets indicating which genes are functionally unrelated to each slim term will be produced in the requested directory. This is rarely directly useful, since it's easier to produce negative gene sets from a second slim file.
-L	0.05	Double	If two input terms have a hypergeometric p-value of overlap below this threshold, genes annotated to the two terms cannot be considered unrelated. They will either be related (if coannotated to some other term) or missing (neither related nor unrelated). Only applies if `-o` or `-N` is given.
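
To make the -L criterion concrete, the Python sketch below computes a standard upper-tail hypergeometric p-value for the overlap between two terms. The exact form of the test used inside BNFunc is not stated here, so the choice of tail and the names overlap_pvalue and can_be_unrelated are assumptions made for illustration.

```python
from math import comb


def overlap_pvalue(total_genes, size_a, size_b, overlap):
    """P(X >= overlap) when size_b genes are drawn from total_genes,
    of which size_a belong to the first term (hypergeometric upper tail)."""
    upper = min(size_a, size_b)
    favorable = sum(
        comb(size_a, k) * comb(total_genes - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / comb(total_genes, size_b)


def can_be_unrelated(pvalue, threshold=0.05):
    """Terms whose overlap p-value falls below the threshold are too
    similar for their genes to be treated as unrelated."""
    return pvalue >= threshold
```

A small p-value means the two terms share more genes than expected by chance, so pairs drawn from them are not used as negatives.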