Sleipnir
BNTester evaluates arbitrarily structured Bayesian networks to produce predicted probabilities of functional relationship; these networks are often learned with BNConverter.
Although BNConverter can perform some limited evaluation on the Bayesian networks it learns, BNTester is the primary tool for performing Bayesian inference on arbitrary network structures. Like BNConverter, BNTester can handle networks with complex structure, unobserved (hidden) nodes, or continuous values.
Bayesian integration generally entails assigning one Bayesian network node to each available biological dataset. Groups of related datasets (e.g. all physical binding datasets) can be collected under a single unobserved "parent" node, and the network is capped by a single Functional Relationship (FR) node representing whether a particular observation (e.g. gene pair) is functionally related. This process is detailed in Troyanskaya et al 2003.
Given such a network (usually as a SMILE DSL or XDSL file) with reasonable parameter values and a collection of discretized biological datasets (usually Sleipnir::CDat objects stored as DAT or DAB files with associated QUANT files), BNTester will infer probabilities of functional relationship, optionally for every gene pair in the genome. These predictions can be used to evaluate performance on known related/unrelated gene pairs or to predict functions and functional relationships for uncharacterized genes and gene pairs.
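For intuition about what this inference computes, the sketch below (illustrative C++ only, not Sleipnir code) evaluates a posterior probability of functional relationship for a single gene pair under a simplified naive structure in which every dataset node is a direct child of FR. The prior and conditional probabilities are invented for the example; real BNTester runs handle arbitrary structure and hidden nodes via SMILE.

```cpp
// Illustrative only: a naive-Bayes approximation of P(FR | evidence) for one
// gene pair.  All probability values are invented for the example.
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    // Prior over the binary FR node: P(FR=0), P(FR=1).
    const double priorFR[2] = {0.95, 0.05};

    // One table per dataset node: cpt[d][fr][bin] = P(bin | FR=fr) for a
    // 3-bin discretization.
    const std::vector<std::vector<std::vector<double>>> cpt = {
        {{0.70, 0.20, 0.10}, {0.30, 0.30, 0.40}},   // e.g. MICROARRAY
        {{0.90, 0.08, 0.02}, {0.50, 0.30, 0.20}}};  // e.g. TF

    // Observed (discretized) bin for each dataset for one gene pair;
    // -1 marks missing data, which simply contributes nothing.
    const std::vector<int> evidence = {2, -1};

    double joint[2] = {priorFR[0], priorFR[1]};
    for (std::size_t d = 0; d < cpt.size(); ++d) {
        if (evidence[d] < 0)
            continue;                       // missing value: skip this dataset
        joint[0] *= cpt[d][0][evidence[d]];
        joint[1] *= cpt[d][1][evidence[d]];
    }

    // Normalize to obtain the posterior probability of functional relationship.
    std::cout << "P(FR=1 | evidence) = "
              << joint[1] / (joint[0] + joint[1]) << "\n";
    return 0;
}
```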
For example, consider the Bayesian network used in Myers et al 2005, which we'll assume we've saved as biopixie_learned.xdsl with reasonable parameter values (either learned or assigned manually). Each node has a name, and SMILE associates an ID with each name; this might be FR for FunctionalRelationship, COREGULATION for Coregulation, MICROARRAY for Microarray Correlation, and so forth. Each leaf node corresponds to a single dataset, and each non-leaf node is unobserved (hidden) and has no associated dataset. To evaluate this network, we should assemble a directory of data files:
MICROARRAY.dab MICROARRAY.quant TF.dab TF.quant CUR_COMPLEX.dab CUR_COMPLEX.quant ... SYNL_TRAD.dab SYNL_TRAD.quant
Each data file is a Sleipnir::CDat, either a DAT or a DAB, containing experimental results. Each QUANT file describes how to discretize that data for use with the Bayesian network (the number of bins in the QUANT must be the same as the number of values taken by the corresponding node in the Bayesian network). Once we've placed all of these files in a directory (e.g. ./data/), we can perform Bayesian inference to evaluate the network:
BNTester -d ./data/ -i biopixie_learned.xdsl -o predicted_relationships.dab
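The per-dataset discretization that feeds this inference comes from the QUANT files: each is a single tab-delimited line of bin edges. The sketch below illustrates one plausible edge convention (an assumption here, not a statement of Sleipnir's exact rule): a value falls into the first bin whose edge is greater than or equal to it, with anything larger going into the last bin.

```cpp
// Sketch of QUANT-style discretization: read tab-delimited bin edges and map a
// continuous value to a bin index.  The edge convention is an assumption for
// illustration; check Sleipnir's own discretization code for the real rule.
#include <cstddef>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Read a QUANT file: a single line of tab-delimited bin edges.
static std::vector<double> readQuant(const std::string& path) {
    std::ifstream in(path);
    std::string line;
    std::getline(in, line);
    std::vector<double> edges;
    std::istringstream tokens(line);
    std::string field;
    while (std::getline(tokens, field, '\t'))
        edges.push_back(std::stod(field));
    return edges;
}

// Assumed convention: first bin whose edge is >= the value; last bin catches
// anything larger.
static std::size_t discretize(double value, const std::vector<double>& edges) {
    for (std::size_t i = 0; i < edges.size(); ++i)
        if (value <= edges[i])
            return i;
    return edges.size() - 1;
}

int main() {
    // Hypothetical 3-bin QUANT for the MICROARRAY dataset.
    const std::vector<double> edges = readQuant("./data/MICROARRAY.quant");
    if (edges.empty()) {
        std::cerr << "could not read QUANT file\n";
        return 1;
    }
    std::cout << "0.42 falls into bin " << discretize(0.42, edges) << "\n";
    return 0;
}
```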
The predicted_relationships.dab file now contains a Sleipnir::CDat in which each pairwise score represents a probability of functional relationship; it can be mined with tools such as Dat2Dab or Dat2Graph.
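Programmatic access works too; the sketch below scans the predictions for high-scoring pairs using Sleipnir's CDat class. The method names (Open, GetGenes, GetGene, Get) follow their usage in other Sleipnir tools, but treat the exact signatures as assumptions and check the Sleipnir::CDat documentation before relying on them.

```cpp
// Sketch: report gene pairs predicted to be functionally related with high
// probability.  Method signatures are assumptions based on typical CDat usage.
#include <cmath>
#include <iostream>
#include <string>
#include "dat.h"   // Sleipnir::CDat

int main() {
    Sleipnir::CDat Dat;
    // Assumed: CDat::Open can load a DAB file by name.
    if (!Dat.Open("predicted_relationships.dab"))
        return 1;
    for (size_t i = 0; i < Dat.GetGenes(); ++i)
        for (size_t j = i + 1; j < Dat.GetGenes(); ++j) {
            const float d = Dat.Get(i, j);
            // Missing pairs are NaN; 0.9 is an arbitrary cutoff for illustration.
            if (!std::isnan(d) && d > 0.9f)
                std::cout << Dat.GetGene(i) << '\t' << Dat.GetGene(j)
                          << '\t' << d << '\n';
        }
    return 0;
}
```

BNConverter, which learns the parameters of such a network in the first place, is invoked as follows: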
BNConverter -d <data_dir> -i <network.xdsl> -o <learned.xdsl> -w <answers.dab> -t <frac> -e <test_predictions.dab> -E <train_predictions.dab>
Saves learned parameters for the network network.xdsl in the new network learned.xdsl, based on the data in data_dir (containing files with names corresponding to the network node IDs) and the functional gold standard in answers.dab. Holds the fraction frac of gene pairs out of training; stores predicted probabilities of functional relationship for these held-out pairs in test_predictions.dab and the remaining inferred probabilities in train_predictions.dab.
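For instance, a run that withholds 20% of the gene pairs for testing might look like the following (all file names here are placeholders):

BNConverter -d ./data/ -i biopixie.xdsl -o biopixie_learned.xdsl -w answers.dab -t 0.2 -e test_predictions.dab -E train_predictions.dab

The full gengetopt option listing and a flag-by-flag summary follow.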
package "BNConverter"
version "1.0"
purpose "Bayes net training and testing"
defgroup "Data" yes
groupoption "datadir" d "Data directory"
string typestr="directory" group="Data"
groupoption "dataset" D "Dataset DAD file"
string typestr="filename" group="Data"
section "Main"
option "input" i "Input (X)DSL file"
string typestr="filename" yes
option "output" o "Output (X)DSL or DAT/DAB file"
string typestr="filename" yes
option "answers" w "Answer DAT/DAB file"
string typestr="filename"
section "Learning/Evaluation"
option "genes" g "Gene inclusion file"
string typestr="filename"
option "genex" G "Gene exclusion file"
string typestr="filename"
option "genet" c "Term inclusion file"
string typestr="filename"
option "randomize" a "Randomize CPTs before training"
flag off
option "murder" m "Kill the specified CPT before evaluation"
int
option "test" t "Test fraction"
double default="0"
option "eval_train" E "Training evaluation results"
string typestr="filename"
option "eval_test" e "Test evaluation results"
string typestr="filename"
section "Network Features"
option "default" b "Bayes net containing defaults for cases with missing data"
string typestr="filename"
option "zero" z "Zero missing values"
flag off
option "elr" l "Use ELR algorithm for learning"
flag off
option "pnl" p "Use PNL library"
flag off
option "function" f "Use function-fitting networks"
flag off
section "Optional"
option "group" u "Group identical inputs"
flag on
option "iterations" s "EM iterations"
int default="20"
option "checkpoint" k "Checkpoint outputs after each iteration"
flag off
option "random" r "Seed random generator"
int default="0"
option "verbosity" v "Message verbosity"
int default="5"
Flag | Default | Type | Description |
---|---|---|---|
-d | None | Directory | Directory containing data files. Must be DAB, DAT, DAS, or PCL files with associated QUANT files (unless a continuous network is being learned) and names corresponding to the network node IDs. |
-D | None | DAD file | DAD file containing data and/or answers for Bayesian learning or evaluation. Generally constructed using Dab2Dad. |
-i | None | (X)DSL file | File from which Bayesian network structure and/or parameters are determined. During learning, only the structure is used; during evaluation, both structure and parameters are used. |
-o | None | (X)DSL or DAT/DAB file | During learning, (X)DSL file into which a copy of the Bayesian network with learned parameters is stored. During evaluation, DAT or DAB file in which predicted probabilities of functional relationship are saved. |
-w | None | DAT/DAB file | Functional gold standard for learning. Should consist of gene pairs with scores of 0 (unrelated), 1 (related), or missing (NaN). |
-b | None | (X)DSL file | If present during learning, parameters from the given (X)DSL file are used instead of learned parameters for probability tables with too few examples. For details, see Sleipnir::CBayesNetSmile::SetDefault. |
-g | None | Text gene list | If given, use only gene pairs for which both genes are in the list. For details, see Sleipnir::CDat::FilterGenes. |
-G | None | Text gene list | If given, use only gene pairs for which neither gene is in the list. For details, see Sleipnir::CDat::FilterGenes. |
-c | None | Text gene list | If given, use only gene pairs passing a "term" filter against the list. For details, see Sleipnir::CDat::FilterGenes. |
-a | off | Flag | If on, randomize all parameters before learning or evaluation. |
-m | None | Integer | If given, randomize the parameters of the network node at the given index. |
-t | 0 | Double | Fraction of available gene pairs to randomly withhold from training and use for evaluation. |
-E | None | DAT/DAB file | If given, save predicted probabilities of functional relationship for the training gene pairs in the requested file. |
-e | None | DAT/DAB file | If given, save predicted probabilities of functional relationship for the test gene pairs in the requested file. |
-z | off | Flag | If on, assume that all missing gene pairs in all datasets have a value of 0 (i.e. the first bin). |
-l | off | Flag | If on, use the Extended Logistic Regression (ELR) algorithm for learning (due to Greiner and Zhou 2005) in place of EM. This will learn a discriminative model, whereas EM will learn a generative one. |
-p | off | Flag | If on, use Intel's PNL library for Bayesian network manipulation rather than SMILE. Note that Sleipnir must be compiled with PNL support for this to function correctly! |
-f | off | Flag | If on, assume the given (X)DSL file represents a custom function-fitting Bayesian network. For details, see Sleipnir::CBayesNetFN. |
-u | on | Flag | If on, group identical examples into one heavily weighted example. This greatly improves efficiency, and there's essentially never a reason to deactivate it. |
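The efficiency gain from -u comes from collapsing identical discretized examples into counts before parameter estimation, so learning processes each distinct configuration once rather than once per gene pair. A simplified conceptual sketch (not BNConverter's actual implementation):

```cpp
// Conceptual sketch of the -u optimization: identical discretized examples are
// collapsed into a single weighted example.  Simplified illustration only.
#include <cstddef>
#include <iostream>
#include <map>
#include <vector>

int main() {
    // Each example is the vector of discretized dataset values (plus the
    // answer) for one gene pair; -1 denotes missing data.  Values invented.
    const std::vector<std::vector<int>> examples = {
        {2, 0, 1}, {2, 0, 1}, {0, -1, 0}, {2, 0, 1}, {0, -1, 0}};

    // Group identical examples and count how many gene pairs share each one.
    std::map<std::vector<int>, std::size_t> weighted;
    for (const auto& example : examples)
        ++weighted[example];

    // Downstream learning would weight each distinct example by its count.
    for (const auto& entry : weighted) {
        for (int value : entry.first)
            std::cout << value << ' ';
        std::cout << "x " << entry.second << '\n';
    }
    return 0;
}
```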