Sleipnir
BNCreator constructs a naive Bayesian classifier from data and learns its parameters or, given an existing classifier and data, evaluates the classifier to predict probabilities of functional relationship. This can be performed extremely efficiently and in a context-specific manner.
Given one or more biological datasets (stored as Sleipnir::CDat objects in DAT/DAB/etc. files) and a functional gold standard, BNCreator will construct a naive Bayesian classifier to probabilistically integrate the given data. This implies the construction of a Bayesian network with a single class node (corresponding to functional relationships drawn from the gold standard) with one child node per dataset. The values taken by these child nodes are determined from the discretization (QUANT files) of the given datasets.
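To make the discretization step concrete, here is a minimal Python sketch of how a continuous score could be mapped to a bin index. It is an illustration only, not Sleipnir code. It assumes that a QUANT file holds one line of increasing, tab-delimited upper bin edges, that a value lands in the first bin whose edge is greater than or equal to it, and that values above the last edge fall into the last bin. The function name `discretize` and the example edges are hypothetical.

```python
import bisect
import math

def discretize(value, edges):
    """Return the zero-based bin index for value, or None if it is missing (NaN).

    Assumption: edges are increasing upper bin edges, and anything above the
    last edge is placed in the last bin.
    """
    if value is None or math.isnan(value):
        return None
    # bisect_left returns the index of the first edge >= value.
    index = bisect.bisect_left(edges, value)
    # Clamp values beyond the last edge into the final bin.
    return min(index, len(edges) - 1)

# Hypothetical contents of a three-bin QUANT file.
edges = [float(x) for x in "-0.5\t0.5\t1.5".split("\t")]
bins = [discretize(v, edges) for v in (-2.0, 0.5, 0.7, 9.0, float("nan"))]
print(bins)
```

Under these assumptions, the number of edges equals the number of bins, which matches the requirement that the bin count agree with the number of values taken by the corresponding node.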
BNCreator behaves similarly to BNConverter, but in a much simpler and more efficient manner. Moreover, Huttenhower et al. (2006) showed that more complex models (such as those learned by BNConverter) have essentially no benefit for integration of this type, so BNCreator is generally the better choice.
Suppose you have a directory containing data files of the form:
MICROARRAY.dab MICROARRAY.quant TF.dab TF.quant CUR_COMPLEX.dab CUR_COMPLEX.quant ... SYNL_TRAD.dab SYNL_TRAD.quant
Each data file is a Sleipnir::CDat, either a DAT or a DAB, containing experimental results. Each QUANT file describes how to discretize that data for use with the Bayesian network (the number of bins in the QUANT must be the same as the number of values taken by the corresponding node in the Bayesian network). Once we've placed all of these files in a directory (e.g. ./data/) and assembled a functional gold standard (e.g. ANSWERS.dab, possibly constructed by Answerer), we can learn a naive classifier as follows:
BNCreator -w ANSWERS.dab -o learned.xdsl ./data/*.dab
This produces a "learned" classifier with probabilities that model (as accurately as possible) the relationship between the given data and functional gold standard. This model can be used to predict functional relationships between new genes not in the standard (but with experimental data) by running BNCreator again in evaluation mode:
BNCreator -i learned.xdsl -o predicted_relationships.dab -d ./data/
The predicted_relationships.dab file now contains a Sleipnir::CDat in which each pairwise score represents a probability of functional relationship, and it can be mined with tools such as Dat2Dab or Dat2Graph.
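As a rough illustration of where such a pairwise probability comes from, the following Python sketch computes a naive Bayes posterior for a single gene pair from its discretized dataset values. It is a sketch under stated assumptions, not BNCreator's implementation. The function `posterior_related`, the prior, and the probability tables are all hypothetical, and datasets with missing values are assumed to contribute no evidence.

```python
import math

def posterior_related(prior, cpts, observed):
    """Return P(related | observed bins) for one gene pair.

    prior:    P(related) for the class node (hypothetical value).
    cpts:     node ID -> (P(bin | unrelated), P(bin | related)) as two lists.
    observed: node ID -> bin index, or None when the pair is missing.
    """
    log_unrelated = math.log(1.0 - prior)
    log_related = math.log(prior)
    for node_id, bin_index in observed.items():
        if bin_index is None:
            continue  # Assumption: missing data adds no evidence.
        p_unrelated, p_related = cpts[node_id]
        log_unrelated += math.log(p_unrelated[bin_index])
        log_related += math.log(p_related[bin_index])
    # Normalize in log space to avoid underflow with many datasets.
    top = max(log_unrelated, log_related)
    related = math.exp(log_related - top)
    unrelated = math.exp(log_unrelated - top)
    return related / (related + unrelated)

# Hypothetical tables for two of the example datasets.
cpts = {
    "MICROARRAY": ([0.7, 0.2, 0.1], [0.3, 0.3, 0.4]),
    "TF": ([0.9, 0.1], [0.6, 0.4]),
}
score = posterior_related(0.1, cpts, {"MICROARRAY": 2, "TF": 1})
no_data = posterior_related(0.1, cpts, {"MICROARRAY": None, "TF": None})
print(round(score, 4), round(no_data, 4))
```

With no data for a pair, the sketch simply returns the prior, which is the expected behavior for a naive classifier that skips unobserved children.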
BNCreator -w <answers.dab> -o <learned.xdsl> <files.dab>*
Construct a naive classifier, learn its parameters, and store these in learned.xdsl, based on one or more given data files (files.dab, which must have associated QUANT files) and the functional gold standard in answers.dab.
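Conceptually, learning the parameters of a naive classifier amounts to counting how often each discretized value co-occurs with each gold standard answer. The Python sketch below shows this for a single dataset node. It is illustrative only: the function `learn_cpt` is hypothetical, and the pseudocount smoothing is an assumption rather than a description of how Sleipnir estimates its tables.

```python
from collections import Counter

def learn_cpt(examples, num_bins, pseudocount=1.0):
    """Estimate [P(bin | answer=0), P(bin | answer=1)] for one dataset node.

    examples: iterable of (answer, bin) pairs with answer in {0, 1};
              gene pairs missing from the data or the standard are excluded.
    pseudocount: smoothing constant (an assumption, not Sleipnir's scheme).
    """
    counts = Counter(examples)
    cpt = []
    for answer in (0, 1):
        row = [counts[(answer, b)] + pseudocount for b in range(num_bins)]
        total = sum(row)
        cpt.append([c / total for c in row])
    return cpt

# Six hypothetical gene pairs: (gold standard answer, discretized value).
examples = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]
cpt = learn_cpt(examples, num_bins=2)
print(cpt)
```

Each dataset gets one such table, and the class node's prior can be estimated the same way from the answer counts alone.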
BNCreator -d <data_dir> -i <learned.xdsl> -o <predictions.dab>
Saves predicted probabilities of functional relationships in predictions.dab, based on the parameters in the classifier learned.xdsl and the data in data_dir.
More realistically, a classifier can be learned and evaluated as:
BNCreator -w <answers.dab> -o <learned.xdsl> -b <defaults.xdsl> -Z <zeros.txt> -m <files.dab>*
BNCreator -d <data_dir> -i <learned.xdsl> -o <predictions.dab> -Z <zeros.txt> -m
This reads default data values for missing gene pairs from zeros.txt and default probability distributions for sparse datasets from defaults.xdsl, and it memory maps data files using the -m flag for increased efficiency.
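The effect of a zeros file can be sketched in a few lines of Python. The example assumes the two-column, tab-delimited layout used by the -Z option (node ID, then a zero-indexed bin number). The helper names `parse_zeros` and `fill_missing` are hypothetical, and the sketch only mimics the substitution step, not anything else BNCreator does with the file.

```python
def parse_zeros(text):
    """Parse 'node ID<TAB>bin number' lines into a dict of default bins."""
    zeros = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        node_id, bin_number = line.split("\t")
        zeros[node_id] = int(bin_number)
    return zeros

def fill_missing(observed, zeros):
    """Replace missing (None) bins with the node's default bin, if it has one."""
    return {
        node_id: (zeros.get(node_id) if bin_index is None else bin_index)
        for node_id, bin_index in observed.items()
    }

# Hypothetical zeros file contents and one gene pair's discretized values.
zeros = parse_zeros("TF\t2\nMICROARRAY\t0\n")
filled = fill_missing({"TF": None, "MICROARRAY": 1, "SYNL_TRAD": None}, zeros)
print(filled)
```

Nodes that do not appear in the zeros file, such as SYNL_TRAD here, keep their missing values.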
package "BNCreator"
version "1.0"
purpose "Bayes net construction and training from data"
defgroup "Input" yes
groupoption "answers" w "Answer file"
string typestr="filename" group="Input"
groupoption "input" i "Input (X)DSL file"
string typestr="filename" group="Input"
section "Main"
option "output" o "Output DAB/DSL file"
string typestr="filename" yes
option "directory" d "Data directory"
string typestr="directory" default="."
section "Learning/Evaluation"
option "genes" g "Gene inclusion file"
string typestr="filename"
option "genex" G "Gene exclusion file"
string typestr="filename"
option "genet" c "Term inclusion file"
string typestr="filename"
option "genee" C "Edge inclusion file"
string typestr="filename"
section "Network Features"
option "default" b "Bayes net containing defaults for cases with missing data"
string typestr="filename"
option "zero" z "Zero missing values"
flag off
option "zeros" Z "Read zeroed node IDs/outputs from the given file"
string typestr="filename"
section "Optional"
option "memmap" m "Memory map input files"
flag off
option "skip" s "Columns to skip for PCL inputs"
int default="2"
option "zscore" e "Convert PCL correlations to z-scores"
flag on
option "terms" r "Term inclusion directory"
string typestr="directory"
option "group" u "Group identical inputs"
flag on
option "threads" t "Maximum number of threads to spawn"
int default="1"
option "verbosity" v "Message verbosity"
int default="5"
Flag | Default | Type | Description |
---|---|---|---|
None | None | DAT/DAB files | Datasets used when learning a Bayesian classifier. One node will be created in the learned classifier for each DAT/DAB file on the command line, with the ID of the given filename (which should be alphanumeric). Must be accompanied by appropriate QUANT files. |
-w | None | DAT/DAB file | Functional gold standard for learning. Should consist of gene pairs with scores of 0 (unrelated), 1 (related), or missing (NaN). |
-i | None | (X)DSL file | Naive classifier for evaluation. BNCreator will look for data files in the given directory with filenames corresponding to the given network's node IDs. |
-o | None | (X)DSL or DAT/DAB file | During learning, (X)DSL file into which the Bayesian classifier and learned parameters are stored. During evaluation, DAT or DAB file in which predicted probabilities of functional relationship are saved. |
-d | . | Directory | Used only during evaluation as the directory containing data files. Must be DAB, DAT, or DAS files with associated QUANT files and names corresponding to the network node IDs. For learning, files must be given directly on the command line. |
-g | None | Text gene list | If given, use only gene pairs for which both genes are in the list. For details, see Sleipnir::CDat::FilterGenes. |
-G | None | Text gene list | If given, use only gene pairs for which neither gene is in the list. For details, see Sleipnir::CDat::FilterGenes. |
-c | None | Text gene list | If given, use only gene pairs passing a "term" filter against the list. For details, see Sleipnir::CDat::FilterGenes. |
-C | None | Text gene list | If given, use only gene pairs passing an "edge" filter against the list. For details, see Sleipnir::CDat::FilterGenes. |
-b | None | (X)DSL file | If present during learning, parameters from the given (X)DSL file are used instead of learned parameters for probability tables with too few examples. For details, see Sleipnir::CBayesNetSmile::SetDefault. |
-z | off | Flag | If on, assume that all missing gene pairs in all datasets have a value of 0 (i.e. the first bin). |
-Z | None | Tab-delimited text file | If given, argument must be a tab-delimited text file containing two columns, the first node IDs (e.g. MICROARRAY or TF in the example above) and the second bin numbers (zero indexed). For each node ID present in this file, missing values will be substituted with the given bin number. For example, if a zeros file contained the line TF 2 , each missing gene pair in the TF data file (probably TF.dab ) would be assumed to have the discretized value 2 during learning/evaluation. |
-m | off | Flag | If given, memory map the input files when possible. DAT and PCL inputs cannot be memmapped. |
-s | 2 | Integer | Number of columns to skip in any PCL data files between the initial ID column and the experimental data columns. Must be the same number for all PCL files. |
-e | on | Flag | If on, convert any PCL data files to z-scores instead of z-transformed Pearson correlations. Only used if PCL data files (instead of DAT/DAB/etc.) are present. |
-r | None | Directory | If given, learn multiple context-specific Bayesian classifiers from the given data and answers, assuming that each file in the given directory is a text gene set specifying a functional context. This is much more easily done using BNWeaver. |
-u | on | Flag | If on, group identical examples into one heavily weighted example. This greatly improves efficiency, and there's essentially never a reason to deactivate it. |
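
The grouping performed by -u can be pictured with the short Python sketch below, which collapses identical examples into one weighted example. It is a hypothetical illustration of the idea, assuming each example is a tuple of the gold standard answer followed by one discretized value per dataset. The function `group_examples` is not part of Sleipnir.

```python
from collections import Counter

def group_examples(examples):
    """Collapse identical (answer, bin, bin, ...) tuples into (example, weight) pairs."""
    return sorted(Counter(examples).items())

# Six hypothetical gene pairs over two datasets: (answer, bin_1, bin_2).
examples = [(1, 0, 2), (0, 0, 0), (1, 0, 2), (0, 0, 0), (0, 0, 0), (0, 1, 0)]
grouped = group_examples(examples)
print(grouped)
```

Because discretized data take only a few distinct values, the number of unique examples is typically far smaller than the number of gene pairs, so counting over weighted groups gives the same totals with much less work.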