Sleipnir
BNServer is a complex tool that provides a multithreaded TCP/IP interface to real-time Bayesian data integration and inference. A running BNServer can service client requests over a network for values from specific biological datasets, predicted functional relationships, queries into a functional relationship network, and graph visualization using Graphviz.
As the name implies, BNServer is a network server which can service client requests for information using a simple, binary TCP/IP protocol based on Sleipnir::CServer. The server loads a variety of information at startup, including one or more biological datasets (stored in a Sleipnir::CDatabase rather than standard Sleipnir::CDat files) and one or more naive classifiers, and can provide various related pieces of information to a client, as described by the message types listed further down.
Since BNServer is a very complex program, let's first go over the pieces of data necessary to start it running. First, convert a collection of biological datasets (usually DAB/QUANT file pairs) into a Sleipnir::CDatabase directory. Let's call that directory ./db/, which will contain a bunch of files:
00000000.db
00000001.db
00000002.db
...
You'll also need a tab-delimited text file listing each biological context of interest (usually a functional catalog term) in three columns: a one-based integer index, a textual description, and an ID. Let's call this file contexts.txt:
1	activation of NF-kappaB transcription factor	GO:0051092
2	adult locomotory behavior	GO:0008344
3	aging	GO:0007568
...
Next, create a tab-delimited text file with two columns, a one-based integer index and a gene ID. This file (genes.txt) will contain every gene of interest for your organism, e.g.:
1	YPL149W
2	YHR171W
3	YBR217W
Now tie these two files together by creating a relational mapping between the two lists. You'll need a tab-delimited contexts_genes.txt file with two columns. The first column is a context index, the second a gene index, and there should be one row for each gene annotation to a context term. For example, if context #1 contains genes #3, 9, and 16, and context #2 contains genes #3 and 4, your contexts_genes.txt file might start:
1	3
1	9
1	16
2	3
2	4
...
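The mapping rows can be generated from an in-memory annotation table. The following is a minimal Python sketch, not part of Sleipnir; the function name `mapping_rows` and the dictionary layout are assumptions made for illustration, and the indices are the ones from the example above.

```python
def mapping_rows(context_to_genes):
    """Return tab-delimited "context index<TAB>gene index" rows.

    context_to_genes maps a one-based context index to the list of
    one-based gene indices annotated to that context.
    """
    rows = []
    for context_index in sorted(context_to_genes):
        for gene_index in context_to_genes[context_index]:
            rows.append("%d\t%d" % (context_index, gene_index))
    return rows


# Context #1 contains genes #3, 9, and 16; context #2 contains genes #3 and 4.
rows = mapping_rows({1: [3, 9, 16], 2: [3, 4]})
print("\n".join(rows))
```

Writing the joined rows to contexts_genes.txt gives a file in the layout shown above.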
If you have fully predicted, context-specific functional relationship networks produced by BNUnraveler, you can use Hubber to generate background connectivities for each gene from them. Let's call this binary file of gene "hubbiness" backgrounds.bin. If you don't have such a file, don't worry; it's optional.
If you have knowledge of disease-associated genes, you can also create a tab-delimited diseases_genes.txt file providing relational mappings between disease and gene IDs. The diseases don't need names, so there's no separate diseases file. Like the contexts/genes mapping, diseases_genes.txt should be a tab-delimited text file; unlike the other file, this one's optional. If you have it, it should look like:
1	18719
2	7473
2	19634
3	4117
...
Now, for each context you listed in your contexts file, you should have a context-specific Bayes net stored in an (X)DSL file. This must be a naive classifier with parameters that have already been learned; BNWeaver is ideal for this. Put all of these files in a single directory, e.g. ./contexts/, each named after its textual description in contexts.txt with non-alphanumeric characters replaced by underscores. For example, ./contexts/ might contain:
activation_of_NF_kappaB_transcription_factor.xdsl
adult_locomotory_behavior.xdsl
aging.xdsl
...
Note that every non-alphanumeric character must be replaced with an underscore. You should also have one more non-context-specific, global Bayesian network containing the "default" probabilities to be used outside of a specific context. This can be generated with BNWeaver or, better yet, BNCreator. Call that file default.xdsl.
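The naming rule for the context-specific files is easy to get wrong by hand, so here is a minimal Python sketch of it. The helper name `context_filename` is hypothetical, and treating "alphanumeric" as ASCII letters and digits only is an assumption; the descriptions are the ones from the contexts.txt example.

```python
import re


def context_filename(description, extension=".xdsl"):
    """Replace every character that is not an ASCII letter or digit with '_'."""
    return re.sub(r"[^A-Za-z0-9]", "_", description) + extension


descriptions = [
    "activation of NF-kappaB transcription factor",
    "adult locomotory behavior",
    "aging",
]
names = [context_filename(d) for d in descriptions]
print("\n".join(names))
```

Each character is replaced individually, so a run of several non-alphanumeric characters produces the same number of underscores rather than a single one.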
We're on the home stretch here. Collect Gene Ontology structure and annotation files and the KEGG orthology ko file. Make sure Graphviz is installed and in your current path, and you can finally run:
BNServer -d ./db/ -i contexts.txt -c contexts_genes.txt -a backgrounds.bin -s diseases_genes.txt \
    -n ./contexts/ -b default.xdsl -g gene_ontology.obo -G gene_association.sgd -k ko -K SCE
This will start a server running on the default port, using the default path to Graphviz and generating output files in the current directory (all of which can be modified using other options; see below). What can you do with such a thing? BNServer will listen on the specified port and service incoming network requests. As detailed in Sleipnir::CServer, all Sleipnir network communication is prefixed by a byte count for the incoming message. BNServer accepts several different message types; each should consist of the byte count for the whole message followed by the opcode (a single byte) and appropriate one- or four-byte arguments:
Opcode | Name | Arguments | Description |
---|---|---|---|
0 | Inference | Four-byte context ID, zero or more four-byte gene IDs | For each input gene, returns one four-byte floating point value per gene in the genome, each representing the probability of functional relationship with the input gene in the given context. Uninferrable pairs are marked with NaNs. For example, in a three-gene genome, an input request of 9 0 0 2 would return probabilities for the second gene (ID #2) in the default context (which has no ID, hence #0): 12 0.1 NaN 0.9 |
1 | Data | Two four-byte gene IDs | Returns discretized data values for the given gene pair across each dataset in the database. For example, suppose the database contained six datasets. Requesting data for genes #1 and #2, 9 1 1 2 , would return something of the form 6 0 1 0 2 3 1 , with each dataset's value encoded in a one-indexed byte (and zero representing a missing value). |
2 | Graph | One-byte boolean, four-byte context ID, four-byte neighbor count, zero or more four-byte gene IDs | For the given gene set, perform Bayesian inference in the given context and retrieve the requested number of neighbors most related to the query set in the resulting functional relationship graph. If the given boolean is true, the resulting graph will be saved in DOT format in the server's file directory and the filename returned over the network. If it's false, the contents of the DOT themselves will be sent back to the caller. |
3 | Contexts | Zero or more four-byte gene IDs | For each gene in the request, return two four-byte floating point values per context indicating the gene's in-connectivity and background-connectivity in each context. A gene's in-connectivity is its average probability of functional relationship with a gene in the context; its background-connectivity is its average probability of functional relationship with any gene. For example, for a server with three contexts, a query of 5 3 2 would retrieve gene #2's in- and background-connectivities with each context: 24 0.9 0.1 0.15 0.12 0.3 0.35 . |
4 | TermFinder | Four-byte ontology ID, four-byte floating point p-value, zero or more four-byte gene IDs | Return the given gene set's functional enrichments in the given ontology below the given p-value cutoff. Ontology IDs are 0 for GO BP, 1 for GO MF, 2 for GO CC, and 3 for KEGG. Results are prepended with the number of terms followed by, for each term, the null-terminated ID string, null-terminated description string, four-byte floating point p-value, four-byte integers in-term hits, term size, query size, background size, number of genes annotated to term, IDs of genes annotated to term. Thus, a query of the form 21 4 0 0.05 1 2 3 might result in two enriched terms: 149 2 GO:0006412 translation 0.01 3 773 3 7455 5 1 2 3 4 5 GO:0006081 aldehyde metabolic process 0.02 2 26 3 7455 4 1 2 8 9 |
5 | Diseases | Four-byte context ID, zero or more four-byte gene IDs | For each gene in the request, return two four-byte floating point values per disease indicating the gene's in-connectivity and background-connectivity to each disease in the requested context. A gene's in-connectivity is its average probability of functional relationship with a gene in the disease; its background-connectivity is its average probability of functional relationship with any gene. For example, for a server with three diseases, a query of 9 5 0 2 would retrieve gene #2's in- and background-connectivities with each disease in the default context, e.g. 24 0.9 0.1 0.15 0.12 0.3 0.35 . |
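To make the framing concrete, here is a minimal Python sketch that builds Data and Contexts requests and decodes the two kinds of replies shown in the table. It is not taken from Sleipnir: the helper names are hypothetical, and it assumes a four-byte little-endian byte count that excludes itself, four-byte little-endian unsigned gene IDs, and IEEE 754 single-precision floats. Check these assumptions against Sleipnir::CServer before relying on them.

```python
import struct

# Opcodes from the table above.
OP_DATA = 1
OP_CONTEXTS = 3


def frame(opcode, payload=b""):
    """Prefix opcode + payload with a four-byte count (assumed little-endian)."""
    body = struct.pack("<B", opcode) + payload
    return struct.pack("<I", len(body)) + body


def data_request(gene_a, gene_b):
    """Data query: two four-byte gene IDs."""
    return frame(OP_DATA, struct.pack("<II", gene_a, gene_b))


def contexts_request(gene_ids):
    """Contexts query: zero or more four-byte gene IDs."""
    return frame(OP_CONTEXTS, struct.pack("<%dI" % len(gene_ids), *gene_ids))


def reply_body(reply):
    """Strip the leading byte count and return exactly that many bytes."""
    (count,) = struct.unpack_from("<I", reply, 0)
    return reply[4:4 + count]


def decode_floats(body):
    """Interpret a reply body as consecutive four-byte floats."""
    return list(struct.unpack("<%df" % (len(body) // 4), body))


def decode_data_values(body):
    """Interpret a Data reply body as one byte per dataset (0 = missing)."""
    return list(body)


# "9 1 1 2": nine bytes follow the count (opcode, gene #1, gene #2).
request = data_request(1, 2)
print(struct.unpack_from("<I", request, 0)[0])   # 9

# "5 3 2": five bytes follow the count (opcode, gene #2).
print(struct.unpack_from("<I", contexts_request([2]), 0)[0])   # 5

# A simulated six-dataset Data reply: "6 0 1 0 2 3 1".
simulated = struct.pack("<I", 6) + bytes([0, 1, 0, 2, 3, 1])
print(decode_data_values(reply_body(simulated)))  # [0, 1, 0, 2, 3, 1]
```

In a real client the framed request would be written to a TCP socket connected to the server's port, and the reply read by first receiving the four-byte count and then that many bytes.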
BNServer -d <database_dir> -i <contexts.txt> -c <contexts_genes.txt> -n <contexts_dir> -b <default.xdsl> \
    -g <gene_ontology.obo> -G <gene_association.sgd> -k <ko> -K <ORG>
Starts a server on the default port using data from a Sleipnir::CDatabase in database_dir, a context list from contexts.txt, context/gene mapping definitions from contexts_genes.txt, context-specific Bayesian classifiers from contexts_dir, a default global classifier default.xdsl, the given Gene Ontology structure and annotation files, and the KEGG orthology ko with the given organism code.
package "BNServer"
version "1.0"
purpose "Real time Bayes net calculation from DB data"
section "Input"
option "database" d "Database directory"
string typestr="directory" default="."
option "input" i "Context IDs and names"
string typestr="filename"
option "contexts" c "Context/gene mapping"
string typestr="filename" yes
option "diseases" s "Disease/gene mapping"
string typestr="filename"
option "is_nibble" N "Specify whether the database is nibble type"
flag on
section "Bayes nets"
option "networks" n "Bayes net directory"
string typestr="directory" default="."
option "default" b "Bayes net for no context"
string typestr="filename"
option "xdsl" x "Use XDSL files instead of DSL"
flag on
option "minimal_in" m "Read stored contexts and minimal Bayes nets"
flag off
option "minimal_out" M "Store contexts and minimal Bayes nets"
string typestr="filename"
section "P-values"
option "global" P "Parameter file for global context p-values"
string typestr="filename" yes
option "within_c" w "Within sets matrix for contexts"
string typestr="filename"
option "within_d" W "Within sets matrix for diseases"
string typestr="filename"
option "between_cc" e "Between sets matrix for contexts"
string typestr="filename"
option "between_dd" E "Between sets matrix for diseases"
string typestr="filename"
option "between_dc" B "Between sets matrix for diseases to contexts"
string typestr="filename"
option "backgrounds" a "Background connectivities for all genes"
string typestr="filename"
section "Ontologies"
option "go_onto" g "GO ontology"
string typestr="filename"
option "go_anno" G "GO annotations"
string typestr="filename"
option "kegg" k "KEGG ontology"
string typestr="filename"
option "kegg_org" K "KEGG organism"
string default="HSA"
section "Server"
option "port" p "Server port"
int default="1234"
option "timeout" t "Server timeout"
int default="100"
section "Precalculation"
option "networklets" l "Generate mini-network icons"
flag off
option "assoc_diseases" r "Disease names to generate disease/process associations"
string typestr="filename"
option "assoc_context" R "Context in which associations are computed"
int default="0"
section "Optional"
option "limit" L "Maximum genes to process per set"
int default="500"
option "files" f "File directory"
string typestr="directory" default="."
option "graphviz" z "Graphviz executable path"
string typestr="filename" default="fdp"
option "config" C "Command line config file"
string typestr="filename" default="BNServer.ini"
option "verbosity" v "Message verbosity"
int default="5"
Flag | Default | Type | Description |
---|---|---|---|
-d | . | Directory | Directory from which Sleipnir::CDatabase data files are read. |
-i | stdin | Context text file | Tab-delimited text file from which context indices, names, and string IDs are read. |
-c | None | Mapping text file | Tab-delimited text file from which context indices and associated gene indices are read. |
-a | None | Binary file | Binary file from which each gene's background connectivity for each context is read. See Hubber for details. |
-s | None | Mapping text file | Tab-delimited text file from which disease indices and associated gene indices are read. |
-n | . | Directory | Directory from which context-specific Bayesian classifiers ((X)DSL files) are read. |
-b | None | (X)DSL file | Bayesian classifier for the default (global) context. |
-x | on | Flag | If on, assume XDSL files will be used instead of DSL files. |
-m | off | Flag | If on, read binary stored Bayesian classifiers from a file specified by -n ; if off, assume -n specifies a directory of (X)DSL files. |
-M | None | Binary file | If given, store Bayesian classifiers in a custom binary format in the given filename. Cannot be used with -m on. Loading the classifiers from a binary file can be faster than loading several hundred separate (X)DSLs. If your classifiers aren't changing, you can load them from (X)DSL files once, leaving -m off and saving them with -M , then on subsequent runs turn -M off and load the binary file with -m on. |
-g | None | OBO text file | OBO file containing the structure of the Gene Ontology. |
-G | None | Annotation text file | Gene Ontology annotation file for the desired organism. |
-k | None | KEGG orthology text file | ko file containing the structure and annotations of the KEGG orthology. |
-K | HSA | KEGG organism code | Three letter organism code of annotations to be read from the ko file. Options include SCE for yeast, HSA for human, DME for fly, CEL for worm, and MMU for mouse. |
-p | 1234 | Integer | TCP/IP port on which the server should listen. |
-t | 100 | Integer | Millisecond timeout between server listen polls. The default should always be fine. |
-f | . | Directory | Directory into which the server will place all generated files (e.g. DOTs from graph queries). |
-z | fdp | Executable file | Path to the Graphviz executable used to render DOT files for graph queries. If fdp is in your path, the default should work fine; otherwise, you should provide an absolute path, e.g. /usr/bin/fdp . Either neato or fdp should work interchangeably. |
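The -m / -M caching workflow described in the table can be sketched as a small Python helper that picks the classifier-related arguments for a run. This is an illustration only: the helper name `classifier_args` and the file name nets.bin are hypothetical, and it assumes that passing -m on the command line switches that flag on. The remaining BNServer options are omitted.

```python
def classifier_args(xdsl_dir, binary_cache, cache_exists):
    """Choose the Bayes net arguments for one BNServer run.

    First run (no cache yet): read (X)DSL files from xdsl_dir and store
    them in binary_cache with -M.
    Later runs: turn -m on and point -n at the binary file instead.
    """
    if cache_exists:
        return ["-m", "-n", binary_cache]
    return ["-n", xdsl_dir, "-M", binary_cache]


first_run = classifier_args("./contexts/", "nets.bin", cache_exists=False)
later_run = classifier_args("./contexts/", "nets.bin", cache_exists=True)
print(" ".join(first_run))
print(" ".join(later_run))
```

Because -M cannot be combined with -m on, the helper never returns both in the same argument list.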