Sleipnir
BNWeaver is a multithreaded tool for learning context-specific naive Bayesian classifiers from datasets (DAT/DAB files). It is generally paired with BNUnraveler to learn classifiers from data and then infer context-specific functional relationships.
BNWeaver -w <answers.dab> -o <contexts_dir> -d <data_dir> [-b <global.xdsl>] [-t <threads>] <contexts.txt>*
Using the gold standard answers in <answers.dab> and all DAT/DAB files in <data_dir>, create one context-specific Bayesian classifier in <contexts_dir> for each biological context <contexts.txt>; optionally, use probability tables from <global.xdsl> as fallbacks when insufficient data is available for context-specific learning, and use <threads> parallel threads.
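The usage line above can be sketched as a small shell script. The file and directory names here (answers.dab, data, contexts, context1.txt, context2.txt) and the thread count are hypothetical placeholders, and the script assumes the BNWeaver binary is on the PATH; only options documented on this page are used.

```shell
# Hypothetical inputs; substitute your own files and directories.
answers="answers.dab"      # gold standard (-w)
data_dir="data"            # directory of DAT/DAB datasets (-d)
contexts_dir="contexts"    # output directory for learned classifiers (-o)
threads=4                  # maximum number of threads (-t)

# Make sure the output directory exists before learning.
mkdir -p "$contexts_dir"

# Assemble the command as positional parameters so each argument stays quoted.
# One classifier is learned per context file listed at the end.
set -- BNWeaver -w "$answers" -d "$data_dir" -o "$contexts_dir" \
    -t "$threads" context1.txt context2.txt

# Run only if BNWeaver is installed; otherwise just show the command.
if command -v BNWeaver >/dev/null 2>&1; then
    "$@"
else
    printf '%s\n' "$*"
fi
```

To use global fallback parameters as described above, add -b with the path to an (X)DSL file before the context files.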
package "BNWeaver"
version "1.0"
purpose "Bayes net construction and training from data"

section "Main"
option "answers" w "Answer file"
    string typestr="filename" yes
option "output" o "Output directory"
    string typestr="directory" default="."
option "directory" d "Data directory"
    string typestr="directory" default="."

section "Learning/Evaluation"
option "genex" G "Gene exclusion file"
    string typestr="filename"
option "negatives" n "Gene set for negative pairs"
    string typestr="filename"
option "randomize" a "Randomize data before training"
    flag off

section "Network Features"
option "default" b "Bayes net containing defaults for cases with missing data"
    string typestr="filename"
option "zero" z "Zero missing values"
    flag off
option "zeros" Z "Read zeroed node IDs/outputs from the given file"
    string typestr="filename"

section "Optional"
option "memmap" m "Memory map input files"
    flag off
option "threads" t "Maximum number of threads to spawn"
    int default="-1"
option "xdsl" x "Generate XDSL output rather than DSL"
    flag on
option "group" u "Group identical inputs"
    flag on
option "random" r "Seed random generator"
    int default="0"
option "verbosity" v "Message verbosity"
    int default="5"
Flag | Default | Type | Description |
---|---|---|---|
None | None | Gene text files | Gene sets representing biological contexts (sets of related genes) for which Bayesian classifiers will be learned. |
-w | None | DAT/DAB file | Functional gold standard for learning. Should consist of gene pairs with scores of 0 (unrelated), 1 (related), or missing (NaN). |
-o | . | Directory | Directory into which learned naive Bayesian classifiers ((X)DSL files) are placed. |
-d | . | Directory | Directory from which data files are read. Must be DAT/DAB files with names from which the node IDs of the Bayesian classifiers can be created. |
-G | None | Text gene list | If given, use only gene pairs for which neither gene is in the list. For details, see Sleipnir::CDat::FilterGenes. |
-n | None | Text gene list | If given, use only gene pairs including at least one gene from the given set. For details, see Sleipnir::CDat::FilterGenes. |
-a | off | Flag | If on, randomly shuffle all data values (by gene pair) before learning. |
-b | None | (X)DSL file | If present during learning, parameters from the given (X)DSL file are used instead of learned parameters for probability tables with too few examples. For details, see Sleipnir::CBayesNetSmile::SetDefault. |
-z | off | Flag | If on, assume that all missing gene pairs in all datasets have a value of 0 (i.e. the first bin). |
-Z | None | Tab-delimited text file | If given, argument must be a tab-delimited text file containing two columns, the first node IDs (see BNCreator) and the second bin numbers (zero indexed). For each node ID present in this file, missing values will be substituted with the given bin number. |
-m | off | Flag | If given, memory map the input files when possible. DAT and PCL inputs cannot be memmapped. |
-t | 1 | Integer | Number of simultaneous threads to use for individual CPT learning. Threads are per classifier node (dataset), so the number of threads actually used is the minimum of -t and the number of datasets. |
-x | on | Flag | If on, assume XDSL files will be used instead of DSL files. |
-u | on | Flag | If on, group identical examples into one heavily weighted example. This greatly improves efficiency, and there's essentially never a reason to deactivate it. |
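
The -Z option expects a two-column, tab-delimited file of node IDs and zero-indexed bin numbers. The following is a minimal sketch of creating and sanity-checking such a file. The node IDs (coexpression, interactions), the bin numbers, and the file name zeros.txt are hypothetical; real node IDs must match those of your classifiers (see BNCreator).

```shell
# Hypothetical file name for the -Z argument.
zeros_file="zeros.txt"

# Write one "node ID <TAB> bin number" pair per line.
# printf reuses the format for each pair of arguments, so this emits two lines.
printf '%s\t%s\n' \
    "coexpression" 0 \
    "interactions" 1 > "$zeros_file"

# Sanity check: every line has exactly two tab-separated fields and the
# second field is a non-negative integer (a zero-indexed bin number).
if awk -F '\t' 'NF != 2 || $2 !~ /^[0-9]+$/ { bad = 1 } END { exit bad + 0 }' "$zeros_file"; then
    echo "zeros file looks well formed"
else
    echo "zeros file is malformed" >&2
fi
```

With this file, missing values for the node coexpression would be substituted with bin 0 and those for interactions with bin 1, while all other nodes keep their missing values.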