Sleipnir: BNServer

BNServer is a complex tool that provides a multithreaded TCP/IP interface to real-time Bayesian data integration and inference. A running BNServer can service client requests over a network for values from specific biological datasets, predicted functional relationships, queries into a functional relationship network, and graph visualization using Graphviz.

Overview

As the name implies, BNServer is a network server which can service client requests for information using a simple, binary TCP/IP protocol based on Sleipnir::CServer. The server loads a variety of information at startup, including one or more biological datasets (stored in a Sleipnir::CDatabase rather than standard Sleipnir::CDat files) and one or more naive classifiers, and can provide various related pieces of information to a client:

Given two genes, retrieve all data values for the pair from all datasets in the database.
Given a single gene, perform inference and return probabilities of functional relationship for all gene pairs involving that gene.
Given a single gene, indicate which contexts it's most functionally related to.
Given a set of genes, perform inference for all of them and return the portion of the resulting functional relationship network most related to the query set.
Given a set of genes, perform a TermFinder functional enrichment query.

Since BNServer is a very complex program, let's first go over the pieces of data necessary to start it running. First, convert a collection of biological datasets (usually DAB/QUANT file pairs) into a Sleipnir::CDatabase directory. Let's call that directory ./db/, which will contain a bunch of files:

 00000000.db
 00000001.db
 00000002.db
 ...

You'll also need a tab-delimited text file listing each biological context of interest (usually a functional catalog term) in three columns: a one-based integer index, a textual description, and an ID. Let's call this file contexts.txt:

 1  activation of NF-kappaB transcription factor    GO:0051092
 2  adult locomotory behavior   GO:0008344
 3  aging   GO:0007568
 ...

Next, create a tab-delimited text file with two columns, a one-based integer index and a gene ID. This file (genes.txt) will contain every gene of interest for your organism, e.g.:

 1  YPL149W
 2  YHR171W
 3  YBR217W
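
Client requests identify genes by number, so it can help to keep a lookup from gene ID to index on hand. The following is a minimal Python sketch, assuming those numbers are the one-based indices in genes.txt; the function name `read_gene_index` is hypothetical and not part of Sleipnir.

```python
def read_gene_index(lines):
    """Map each gene ID to its one-based integer index."""
    index = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # ignore blank or malformed lines
        index[fields[1]] = int(fields[0])
    return index


# The three example rows shown above.
sample = ["1\tYPL149W\n", "2\tYHR171W\n", "3\tYBR217W\n"]
genes = read_gene_index(sample)
print(genes["YHR171W"])  # prints 2
```

For a real file, pass an open file handle instead of the sample list, e.g. `read_gene_index(open("genes.txt"))`.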

Now tie these two files together by creating a relational mapping between the two lists. You'll need a tab-delimited contexts_genes.txt file with two columns. The first column is a context index, the second a gene index, and there should be one row for each gene annotation to a context term. For example, if context #1 contains genes #3, 9, and 16, and context #2 contains genes #3 and 4, your contexts_genes.txt file might start:

 1  3
 1  9
 1  16
 2  3
 2  4
 ...
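
If your annotations already live in a script, a mapping file like this can be generated rather than typed by hand. The following is a minimal Python sketch, assuming the annotations are held in a dictionary from context index to gene indices; the function names and the dictionary layout are illustrative and not part of Sleipnir.

```python
def mapping_rows(annotations):
    """Return one tab-delimited "context<TAB>gene" row per annotation."""
    rows = []
    for context in sorted(annotations):
        for gene in sorted(annotations[context]):
            rows.append(f"{context}\t{gene}")
    return rows


def write_mapping(path, annotations):
    """Write the rows to a file such as contexts_genes.txt."""
    with open(path, "w") as handle:
        for row in mapping_rows(annotations):
            handle.write(row + "\n")


# Context #1 contains genes #3, 9, and 16; context #2 contains genes #3 and 4.
example = {1: [3, 9, 16], 2: [3, 4]}
print("\n".join(mapping_rows(example)))
```

Calling `write_mapping("contexts_genes.txt", example)` would then write those five rows to disk.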

If you have fully predicted, context-specific functional relationship networks produced by BNUnraveler, you can use Hubber to generate background connectivities for each gene from them. Let's call this binary file of gene "hubbiness" backgrounds.bin. If you don't have such a file, don't worry; it's optional.

If you have knowledge of disease-associated genes, you can also create a tab-delimited diseases_genes.txt file providing relational mappings between disease and gene IDs. The diseases don't need names, so there's no separate diseases file. Like the contexts/genes mapping, diseases_genes.txt should be a tab-delimited text file; unlike that file, this one is optional. If you have it, it should follow the same two-column layout, with disease IDs in place of context IDs.

Now, for each context you listed in your contexts file, you should have a context-specific Bayes net stored in an (X)DSL file. This must be a naive classifier with parameters that have already been learned; BNWeaver is ideal for this. Put all of these files in a single directory, e.g. ./contexts/, named the same as their textual description in contexts.txt with non-alphanumeric characters substituted with underscores. For example, ./contexts/ might contain:

 activation_of_NF_kappaB_transcription_factor.xdsl
 adult_locomotory_behavior.xdsl
 aging.xdsl
 ...

Note that every non-alphanumeric character must be replaced with an underscore. You should also have one more non-context-specific, global Bayesian network containing the "default" probabilities to be used outside of a specific context. This can be generated with BNWeaver or, better yet, BNCreator. Call that file default.xdsl.
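
Because a misnamed classifier file is an easy mistake to make, it may be worth checking the directory before starting the server. The following Python sketch applies the renaming rule described above and lists any expected files that are missing. The function names are hypothetical, and treating only ASCII letters and digits as alphanumeric is an assumption.

```python
import os
import re


def context_filename(description, extension=".xdsl"):
    """Replace every non-alphanumeric character with an underscore."""
    # Assumption: only ASCII letters and digits count as alphanumeric.
    return re.sub(r"[^A-Za-z0-9]", "_", description) + extension


def missing_networks(contexts_path, networks_dir):
    """List expected classifier files that are absent from networks_dir."""
    missing = []
    with open(contexts_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            name = context_filename(fields[1])
            if not os.path.isfile(os.path.join(networks_dir, name)):
                missing.append(name)
    return missing


print(context_filename("activation of NF-kappaB transcription factor"))
# prints activation_of_NF_kappaB_transcription_factor.xdsl
```

Running `missing_networks("contexts.txt", "./contexts/")` returns an empty list when every context has a matching file.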

We're on the home stretch here. Collect Gene Ontology structure and annotation files and the KEGG orthology ko file. Make sure Graphviz is installed and in your current path, and you can finally run:

 BNServer -d ./db/ -i contexts.txt -c contexts_genes.txt -a backgrounds.bin -s diseases_genes.txt
        -n ./contexts/ -b default.xdsl -g gene_ontology.obo -G gene_association.sgd -k ko -K SCE

This will start a server running on the default port, using the default path to Graphviz and generating output files in the current directory (all of which can be modified using other options; see below).

What can you do with such a thing? BNServer will listen on the specified port and service incoming network requests. As detailed in Sleipnir::CServer, all Sleipnir network communication is prefixed by a byte count for the incoming message. BNServer accepts several different message types; each should consist of the byte count for the rest of the message (opcode plus arguments, not counting the prefix itself) followed by the opcode (a single byte) and appropriate one- or four-byte arguments:

Opcode	Name	Arguments	Description
0	Inference	Four-byte context ID, zero or more four-byte gene IDs	For each input gene, returns one four-byte floating point value per gene in the genome, each representing the probability of functional relationship with the input gene in the given context. Uninferrable pairs are marked with NaNs. For example, in a three-gene genome, an input request of `9 0 0 2` (byte count, opcode, context ID, gene ID) would return probabilities for the second gene (ID #2) in the default context (which has no ID, hence #0): `12 0.1 NaN 0.9`
1	Data	Two four-byte gene IDs	Returns discretized data values for the given gene pair across each dataset in the database. For example, suppose the database contained six datasets. Requesting data for genes #1 and #2, `9 1 1 2`, would return something of the form `6 0 1 0 2 3 1`, with each dataset's value encoded in a one-indexed byte (and zero representing a missing value).
2	Graph	One-byte boolean, four-byte context ID, four-byte neighbor count, zero or more four-byte gene IDs	For the given gene set, perform Bayesian inference in the given context and retrieve the requested number of neighbors most related to the query set in the resulting functional relationship graph. If the given boolean is true, the resulting graph will be saved in DOT format in the server's file directory and the filename returned over the network. If it's false, the contents of the DOT file itself will be sent back to the caller.
3	Contexts	Zero or more four-byte gene IDs	For each gene in the request, return two four-byte floating point values per context indicating the gene's in-connectivity and background-connectivity in each context. A gene's in-connectivity is its average probability of functional relationship with a gene in the context; its background-connectivity is its average probability of functional relationship with any gene. For example, for a server with three contexts, a query of `5 3 2` would retrieve gene #2's in- and background-connectivities with each context: `24 0.9 0.1 0.15 0.12 0.3 0.35`.
4	TermFinder	Four-byte ontology ID, four-byte floating point p-value, zero or more four-byte gene IDs	Return the given gene set's functional enrichments in the given ontology below the given p-value cutoff. Ontology IDs are 0 for GO BP, 1 for GO MF, 2 for GO CC, and 3 for KEGG. Results begin with the four-byte number of terms, followed for each term by the null-terminated ID string, the null-terminated description string, a four-byte floating point p-value, four-byte integers giving the in-term hits, term size, query size, background size, and number of genes annotated to the term, and finally the four-byte IDs of the genes annotated to the term. Thus, a query of the form `21 4 0 0.05 1 2 3` might result in two enriched terms: `149 2 GO:0006412 translation 0.01 3 773 3 7455 5 1 2 3 4 5 GO:0006081 aldehyde metabolic process 0.02 2 26 3 7455 4 1 2 8 9`
5	Diseases	Four-byte context ID, zero or more four-byte gene IDs	For each gene in the request, return two four-byte floating point values per disease indicating the gene's in-connectivity and background-connectivity to each disease in the requested context. A gene's in-connectivity is its average probability of functional relationship with a gene in the disease; its background-connectivity is its average probability of functional relationship with any gene. For example, for a server with three diseases, a query of `9 5 0 2` would retrieve gene #2's in- and background-connectivities with each disease in the default context, e.g. `24 0.9 0.1 0.15 0.12 0.3 0.35`.
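
To make the framing in the table concrete, here is a minimal Python client sketch that builds Data and Contexts requests and decodes a connectivity reply. It rests on several assumptions not stated above: the byte count is a four-byte unsigned integer, all multi-byte values are little-endian, and the server answers each request on the same connection. All function names are hypothetical.

```python
import socket
import struct

# Opcodes as listed in the table above.
INFERENCE, DATA, GRAPH, CONTEXTS, TERMFINDER, DISEASES = range(6)


def frame(opcode, payload=b""):
    """Prefix a one-byte opcode and its arguments with their byte count.

    Assumption: the count is a four-byte unsigned little-endian integer
    that does not include the prefix itself.
    """
    body = struct.pack("<B", opcode) + payload
    return struct.pack("<I", len(body)) + body


def pack_ids(*ids):
    """Pack four-byte integer arguments (assumed unsigned, little-endian)."""
    return struct.pack("<%dI" % len(ids), *ids)


def data_request(gene_one, gene_two):
    """Opcode 1: two four-byte gene IDs."""
    return frame(DATA, pack_ids(gene_one, gene_two))


def contexts_request(genes):
    """Opcode 3: zero or more four-byte gene IDs."""
    return frame(CONTEXTS, pack_ids(*genes))


def connectivity_pairs(reply):
    """Split a framed Contexts or Diseases reply into (in, background) pairs."""
    (count,) = struct.unpack_from("<I", reply, 0)
    values = struct.unpack_from("<%df" % (count // 4), reply, 4)
    return list(zip(values[0::2], values[1::2]))


def read_exactly(connection, size):
    """Read exactly size bytes, since recv may return fewer than requested."""
    chunks = []
    while size > 0:
        chunk = connection.recv(size)
        if not chunk:
            raise ConnectionError("server closed the connection early")
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)


def exchange(host, port, message):
    """Send one framed request and return the framed reply, prefix included."""
    with socket.create_connection((host, port)) as connection:
        connection.sendall(message)
        prefix = read_exactly(connection, 4)
        (count,) = struct.unpack("<I", prefix)
        return prefix + read_exactly(connection, count)


request = data_request(1, 2)
print(len(request))  # 4-byte prefix + 1-byte opcode + two 4-byte gene IDs = 13
```

With a server running locally on the default port, `connectivity_pairs(exchange("localhost", 1234, contexts_request([2])))` would return one (in-connectivity, background-connectivity) pair per context for gene #2.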

Usage

Basic Usage

 BNServer -d <database_dir> -i <contexts.txt> -c <contexts_genes.txt> -n <contexts_dir> -b <default.xdsl>
        -g <gene_ontology.obo> -G <gene_association.sgd> -k <ko> -K <ORG>

Starts a server on the default port using data from a Sleipnir::CDatabase in database_dir, a context list from contexts.txt, context/gene mapping definitions from contexts_genes.txt, context-specific Bayesian classifiers from contexts_dir, a default global classifier default.xdsl, the given Gene Ontology structure and annotation files, and the KEGG orthology ko with the given organism code.
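
When the server is launched from a script, the command line can be assembled from a dictionary of options. This is a minimal Python sketch using the file names from the overview; the helper `server_command` is hypothetical, handles only options that take a value, and assumes the BNServer executable is on your path.

```python
import shlex


def server_command(options, executable="BNServer"):
    """Build an argument list from a {flag: value} dictionary."""
    command = [executable]
    for flag, value in options.items():
        command += [flag, str(value)]
    return command


options = {
    "-d": "./db/",
    "-i": "contexts.txt",
    "-c": "contexts_genes.txt",
    "-n": "./contexts/",
    "-b": "default.xdsl",
    "-g": "gene_ontology.obo",
    "-G": "gene_association.sgd",
    "-k": "ko",
    "-K": "SCE",
}
command = server_command(options)
print(shlex.join(command))
```

The resulting list can be passed directly to `subprocess.Popen(command)` to start the server in the background.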

Detailed Usage

package "BNServer"
version "1.0"
purpose "Real time Bayes net calculation from DB data"

section "Input"
option  "database"      d   "Database directory"
                            string  typestr="directory" default="."
option  "input"         i   "Context IDs and names"
                            string  typestr="filename"
option  "contexts"      c   "Context/gene mapping"
                            string  typestr="filename"  yes
option  "diseases"      s   "Disease/gene mapping"
                            string  typestr="filename"
option  "is_nibble"     N   "Specify whether the database is nibble type"
                            flag    on

section "Bayes nets"
option  "networks"      n   "Bayes net directory"
                            string  typestr="directory" default="."
option  "default"       b   "Bayes net for no context"
                            string  typestr="filename"
option  "xdsl"          x   "Use XDSL files instead of DSL"
                            flag    on
option  "minimal_in"    m   "Read stored contexts and minimal Bayes nets"
                            flag    off
option  "minimal_out"   M   "Store contexts and minimal Bayes nets"
                            string  typestr="filename"

section "P-values"
option  "global"        P   "Parameter file for global context p-values"
                            string  typestr="filename"  yes
option  "within_c"      w   "Within sets matrix for contexts"
                            string  typestr="filename"
option  "within_d"      W   "Within sets matrix for diseases"
                            string  typestr="filename"
option  "between_cc"    e   "Between sets matrix for contexts"
                            string  typestr="filename"
option  "between_dd"    E   "Between sets matrix for diseases"
                            string  typestr="filename"
option  "between_dc"    B   "Between sets matrix for diseases to contexts"
                            string  typestr="filename"
option  "backgrounds"   a   "Background connectivities for all genes"
                            string  typestr="filename"

section "Ontologies"
option  "go_onto"       g   "GO ontology"
                            string  typestr="filename"
option  "go_anno"       G   "GO annotations"
                            string  typestr="filename"
option  "kegg"          k   "KEGG ontology"
                            string  typestr="filename"
option  "kegg_org"      K   "KEGG organism"
                            string  default="HSA"

section "Server"
option  "port"          p   "Server port"
                            int default="1234"
option  "timeout"       t   "Server timeout"
                            int default="100"

section "Precalculation"
option  "networklets"   l   "Generate mini-network icons"
                            flag    off
option  "assoc_diseases"    r   "Disease names to generate disease/process associations"
                            string  typestr="filename"
option  "assoc_context" R   "Context in which associations are computed"
                            int default="0"

section "Optional"
option  "limit"         L   "Maximum genes to process per set"
                            int default="500"
option  "files"         f   "File directory"
                            string  typestr="directory" default="."
option  "graphviz"      z   "Graphviz executable path"
                            string  typestr="filename"  default="fdp"
option  "config"        C   "Command line config file"
                            string  typestr="filename"  default="BNServer.ini"
option  "verbosity"     v   "Message verbosity"
                            int default="5"

Flag	Default	Type	Description
-d	.	Directory	Directory from which Sleipnir::CDatabase data files are read.
-i	stdin	Context text file	Tab-delimited text file from which context indices, names, and string IDs are read.
-c	None	Mapping text file	Tab-delimited text file from which context indices and associated gene indices are read.
-a	None	Binary file	Binary file from which each gene's background connectivity for each context is read. See Hubber for details.
-s	None	Mapping text file	Tab-delimited text file from which disease indices and associated gene indices are read.
-n	.	Directory	Directory from which context-specific Bayesian classifiers ((X)DSL files) are read.
-b	None	(X)DSL file	Bayesian classifier for the default (global) context.
-x	on	Flag	If on, assume XDSL files will be used instead of DSL files.
-m	off	Flag	If on, read binary stored Bayesian classifiers from a file specified by `-n`; if off, assume `-n` specifies a directory of (X)DSL files.
-M	None	Binary file	If given, store Bayesian classifiers in a custom binary format in the given filename. Cannot be used with `-m` on. Loading the classifiers from a binary file can be faster than loading several hundred separate (X)DSLs. If your classifiers aren't changing, you can load them from (X)DSL files once, leaving `-m` off and saving them with `-M`, then on subsequent runs omit `-M` and load the binary file (passed via `-n`) with `-m` on.
-g	None	OBO text file	OBO file containing the structure of the Gene Ontology.
-G	None	Annotation text file	Gene Ontology annotation file for the desired organism.
-k	None	KEGG orthology text file	`ko` file containing the structure and annotations of the KEGG orthology.
-K	HSA	KEGG organism code	Three-letter organism code of annotations to be read from the `ko` file. Options include SCE for yeast, HSA for human, DME for fly, CEL for worm, and MMU for mouse.
-p	1234	Integer	TCP/IP port on which the server should listen.
-t	100	Integer	Millisecond timeout between server listen polls. The default should always be fine.
-f	.	Directory	Directory into which the server will place all generated files (e.g. DOTs from graph queries).
-z	fdp	Executable file	Path to the Graphviz executable used to render DOT files for graph queries. If `fdp` is in your path, the default should work fine; otherwise, you should provide an absolute path, e.g. `/usr/bin/fdp`. Either `neato` or `fdp` should work interchangeably.