Sleipnir: BNCreator

BNCreator will construct a naive Bayesian classifier from data and learn its parameters or, given an existing classifier and data, evaluate the classifier to predict probabilities of functional relationship. This can be performed extremely efficiently and in a context-specific manner.

Overview

Given one or more biological datasets (stored as Sleipnir::CDat objects in DAT/DAB/etc. files) and a functional gold standard, BNCreator will construct a naive Bayesian classifier to probabilistically integrate the given data. This implies the construction of a Bayesian network with a single class node (corresponding to functional relationships drawn from the gold standard) with one child node per dataset. The values taken by these child nodes are determined from the discretization (QUANT files) of the given datasets.
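
As a rough illustration of how discretization maps a continuous score to a node value, the following Python sketch assumes that a QUANT file holds one line of tab-delimited, ascending bin boundaries, and that a value falls into the first bin whose boundary is greater than or equal to it, with the last bin catching anything larger. Both the file layout and the boundary rule are assumptions made for illustration, and the function names are hypothetical rather than part of Sleipnir.

```python
from bisect import bisect_left


def parse_quant(line):
    """Parse one line of tab-delimited bin boundaries (assumed QUANT layout)."""
    return [float(tok) for tok in line.strip().split("\t")]


def quantize(value, edges):
    """Return the zero-indexed bin for value.

    Assumed rule: the first bin whose boundary is >= value; anything above
    the last boundary lands in the last bin.
    """
    return min(bisect_left(edges, value), len(edges) - 1)


# Hypothetical boundaries giving a three-valued node.
edges = parse_quant("-1.5\t0.5\t2.5\n")
bins = [quantize(v, edges) for v in (-3.0, 0.5, 1.0, 9.0)]
print(bins)
```

Under these assumptions, three boundaries yield a node with three values, which is why the number of bins has to match the number of values taken by the corresponding node.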

BNCreator behaves similarly to BNConverter, but in a much simpler and more efficient manner. Moreover, Huttenhower et al. 2006 showed that more complex models (such as those learned by BNConverter) have essentially no benefit for integration of this type, so BNCreator is generally the better choice.

Suppose you have a directory containing data files of the form:

 MICROARRAY.dab
 MICROARRAY.quant
 TF.dab
 TF.quant
 CUR_COMPLEX.dab
 CUR_COMPLEX.quant
 ...
 SYNL_TRAD.dab
 SYNL_TRAD.quant

Each data file is a Sleipnir::CDat, either a DAT or a DAB, containing experimental results. Each QUANT file describes how to discretize that data for use with the Bayesian network (the number of bins in the QUANT must be the same as the number of values taken by the corresponding node in the Bayesian network). Once we've placed all of these files in a directory (e.g. ./data/) and assembled a functional gold standard (e.g. ANSWERS.dab, possibly constructed by Answerer), we can learn a naive classifier as follows:

 BNCreator -w ANSWERS.dab -o learned.xdsl ./data/*.dab

This produces a "learned" classifier with probabilities that model (as accurately as possible) the relationship between the given data and functional gold standard. This model can be used to predict functional relationships between new genes not in the standard (but with experimental data) by running BNCreator again in evaluation mode:

 BNCreator -i learned.xdsl -o predicted_relationships.dab -d ./data/

The predicted_relationships.dab file now contains a Sleipnir::CDat in which each pairwise score represents a probability of functional relationship, and it can be mined with tools such as Dat2Dab or Dat2Graph.
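
To make the meaning of these scores concrete, here is a minimal Python sketch of the standard naive Bayes posterior for one gene pair: the class prior multiplied by one conditional probability per dataset node, then normalized. It is not BNCreator's implementation. The function name, the data layout, and all probability values are hypothetical, and skipping a dataset with no value for the pair is simply the textbook way to marginalize an unobserved child node.

```python
def posterior_related(prior, cpts, observed):
    """P(related | observed bins) under a naive Bayes model.

    prior:    [P(unrelated), P(related)]
    cpts:     {node_id: [[P(bin | unrelated), ...], [P(bin | related), ...]]}
    observed: {node_id: bin index, or None when the pair has no data}
    """
    joint = list(prior)
    for node, bin_index in observed.items():
        if bin_index is None:
            continue  # an unobserved child drops out of the product
        for cls in (0, 1):
            joint[cls] *= cpts[node][cls][bin_index]
    return joint[1] / (joint[0] + joint[1])


# Hypothetical parameters for two nodes; not taken from any real network.
prior = [0.9, 0.1]
cpts = {
    "MICROARRAY": [[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]],
    "TF": [[0.95, 0.05], [0.6, 0.4]],
}
p = posterior_related(prior, cpts, {"MICROARRAY": 2, "TF": 1})
p_missing = posterior_related(prior, cpts, {"MICROARRAY": None, "TF": None})
print(p, p_missing)
```

With no data at all, the sketch falls back to the prior probability of a functional relationship.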

Usage

Basic Usage

 BNCreator -w <answers.dab> -o <learned.xdsl> <files.dab>*

Construct a naive classifier, learn its parameters, and store these in learned.xdsl, based on one or more given data files (files.dab, which must have associated QUANT files) and the functional gold standard in answers.dab.
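
Conceptually, learning the parameters of a naive classifier amounts to counting, for each dataset, how often each discretized value occurs among related and unrelated gold standard pairs. The Python sketch below shows that idea for a single node. It is an illustration under stated assumptions, not BNCreator's code: the function name is hypothetical, and the pseudocount is added here only to avoid zero probabilities.

```python
def learn_cpt(examples, num_bins, pseudocount=1.0):
    """Estimate P(bin | class) for one dataset node from labeled gene pairs.

    examples: iterable of (label, bin_index); label is 0 or 1 from the gold
              standard, bin_index is the discretized data value.
    Pairs missing from either the standard or the data should be left out.
    The pseudocount is an illustrative assumption, not documented behavior.
    """
    counts = [[pseudocount] * num_bins for _ in range(2)]
    for label, bin_index in examples:
        counts[label][bin_index] += 1
    # Normalize each class row so that it sums to one.
    return [[c / sum(row) for c in row] for row in counts]


# Six hypothetical labeled pairs for a three-valued node.
examples = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 2), (1, 2)]
cpt = learn_cpt(examples, num_bins=3)
print(cpt)
```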

 BNCreator -d <data_dir> -i <learned.xdsl> -o <predictions.dab>

Saves predicted probabilities of functional relationships in predictions.dab, based on the parameters in the classifier learned.xdsl and the data in data_dir.

More realistically, a classifier can be learned and evaluated as:

 BNCreator -w <answers.dab> -o <learned.xdsl> -b <defaults.xdsl> -Z <zeros.txt> -m <files.dab>*
 BNCreator -d <data_dir> -i <learned.xdsl> -o <predictions.dab> -Z <zeros.txt> -m

This reads default data values for missing gene pairs from zeros.txt and default probability distributions for sparse datasets from defaults.xdsl, and it memory maps data files using the -m flag for increased efficiency.
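
The zeros file is described under -Z in the table further down as two tab-delimited columns: a node ID and a zero-indexed bin number. The Python sketch below shows one way such a file could be read and applied to missing values. The function names and the use of None to mark a missing value are assumptions for illustration, not Sleipnir's API.

```python
def parse_zeros(text):
    """Map node IDs to default bin numbers from tab-delimited lines."""
    zeros = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        node_id, bin_number = line.split("\t")
        zeros[node_id] = int(bin_number)
    return zeros


def fill_missing(node_id, bin_index, zeros):
    """Substitute the default bin for a missing (None) value, if one is listed."""
    if bin_index is None:
        return zeros.get(node_id)  # stays None for nodes not in the file
    return bin_index


# Hypothetical zeros file contents.
zeros = parse_zeros("TF\t2\nSYNL_TRAD\t0\n")
print(fill_missing("TF", None, zeros))
print(fill_missing("MICROARRAY", None, zeros))
```

Nodes not listed in the file keep their missing values, so only the datasets for which a default is meaningful need an entry.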

Detailed Usage

package "BNCreator"
version "1.0"
purpose "Bayes net construction and training from data"

defgroup "Input" yes
groupoption "answers"   w   "Answer file"
                            string  typestr="filename"  group="Input"
groupoption "input"     i   "Input (X)DSL file"
                            string  typestr="filename"  group="Input"

section "Main"
option  "output"        o   "Output DAB/DSL file"
                            string  typestr="filename"  yes
option  "directory"     d   "Data directory"
                            string  typestr="directory" default="."

section "Learning/Evaluation"
option  "genes"         g   "Gene inclusion file"
                            string  typestr="filename"
option  "genex"         G   "Gene exclusion file"
                            string  typestr="filename"
option  "genet"         c   "Term inclusion file"
                            string  typestr="filename"
option  "genee"         C   "Edge inclusion file"
                            string  typestr="filename"

section "Network Features"
option  "default"       b   "Bayes net containing defaults for cases with missing data"
                            string  typestr="filename"
option  "zero"          z   "Zero missing values"
                            flag    off
option  "zeros"         Z   "Read zeroed node IDs/outputs from the given file"
                            string  typestr="filename"

section "Optional"
option  "memmap"        m   "Memory map input files"
                            flag    off
option  "skip"          s   "Columns to skip for PCL inputs"
                            int default="2"
option  "zscore"        e   "Convert PCL correlations to z-scores"
                            flag    on
option  "terms"         r   "Term inclusion directory"
                            string  typestr="directory"
option  "group"         u   "Group identical inputs"
                            flag    on
option  "threads"       t   "Maximum number of threads to spawn"
                            int default="1"
option  "verbosity"     v   "Message verbosity"
                            int default="5"

Flag	Default	Type	Description
None	None	DAT/DAB files	Datasets used when learning a Bayesian classifier. One node will be created in the learned classifier for each DAT/DAB file on the command line, with the ID of the given filename (which should be alphanumeric). Must be accompanied by appropriate QUANT files.
-w	None	DAT/DAB file	Functional gold standard for learning. Should consist of gene pairs with scores of 0 (unrelated), 1 (related), or missing (NaN).
-i	None	(X)DSL file	Naive classifier for evaluation. BNCreator will look for data files in the given directory with filenames corresponding to the given network's node IDs.
-o	None	(X)DSL or DAT/DAB file	During learning, (X)DSL file into which the Bayesian classifier and learned parameters are stored. During evaluation, DAT or DAB file in which predicted probabilities of functional relationship are saved.
-d	.	Directory	Used only during evaluation as the directory containing data files. Must be DAB, DAT, or DAS files with associated QUANT files and names corresponding to the network node IDs. For learning, files must be given directly on the command line.
-g	None	Text gene list	If given, use only gene pairs for which both genes are in the list. For details, see Sleipnir::CDat::FilterGenes.
-G	None	Text gene list	If given, use only gene pairs for which neither gene is in the list. For details, see Sleipnir::CDat::FilterGenes.
-c	None	Text gene list	If given, use only gene pairs passing a "term" filter against the list. For details, see Sleipnir::CDat::FilterGenes.
-C	None	Text gene list	If given, use only gene pairs passing an "edge" filter against the list. For details, see Sleipnir::CDat::FilterGenes.
-b	None	(X)DSL file	If present during learning, parameters from the given (X)DSL file are used instead of learned parameters for probability tables with too few examples. For details, see Sleipnir::CBayesNetSmile::SetDefault.
-z	off	Flag	If on, assume that all missing gene pairs in all datasets have a value of 0 (i.e. the first bin).
-Z	None	Tab-delimited text file	If given, argument must be a tab-delimited text file containing two columns, the first node IDs (e.g. `MICROARRAY` or `TF` in the example above) and the second bin numbers (zero indexed). For each node ID present in this file, missing values will be substituted with the given bin number. For example, if a zeros file contained the line `TF 2`, each missing gene pair in the `TF` data file (probably `TF.dab`) would be assumed to have the discretized value 2 during learning/evaluation.
-m	off	Flag	If given, memory map the input files when possible. DAT and PCL inputs cannot be memmapped.
-s	2	Integer	Number of columns to skip in any PCL data files between the initial ID column and the experimental data columns. Must be the same number for all PCL files.
-e	on	Flag	If on, convert any PCL data files to z-scores instead of z-transformed Pearson correlations. Only used if PCL data files (instead of DAT/DAB/etc.) are present.
-r	None	Directory	If given, learn multiple context-specific Bayesian classifiers from the given data and answers, assuming that each file in the given directory is a text gene set specifying a functional context. This is much more easily done using BNWeaver.
-u	on	Flag	If on, group identical examples into one heavily weighted example. This greatly improves efficiency, and there's essentially never a reason to deactivate it.
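
The grouping performed by -u can be pictured as follows: because every dataset is discretized, many gene pairs share exactly the same combination of gold standard label and bin values, so they can be counted once and weighted. This Python sketch illustrates the idea only. The tuple layout and function name are hypothetical, and BNCreator's internal representation may differ.

```python
from collections import Counter


def group_examples(examples):
    """Collapse identical examples into (example, weight) pairs."""
    return list(Counter(examples).items())


# Each example: (gold standard label, bin for MICROARRAY, bin for TF).
examples = [(0, 0, 0), (0, 0, 0), (1, 2, 1), (0, 0, 0), (1, 2, 1), (0, 1, 0)]
grouped = group_examples(examples)
print(grouped)
```

Learning from the grouped list touches three weighted examples instead of six, and the saving grows quickly as the number of gene pairs increases while the number of distinct bin combinations stays bounded.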