Sleipnir
Counter performs three related tasks: counting values in data, constructing probability tables from these counts, and using those probability tables for Bayesian inference.
Counter provides fine-grained control over each of these steps. Like BNCreator, given one or more biological datasets (stored as Sleipnir::CDat objects in DAT/DAB/etc. files) and a functional gold standard, Counter can construct naive Bayesian classifiers to probabilistically integrate the given data. However, it does so by exposing the intermediate count information describing the number of times each data value occurs, and these counts can be weighted (i.e. regularized; see Heckerman/Geiger/Chickering 1995 or Steck/Jaakkola 2002) before they are normalized into probability distributions. This allows datasets known to provide more diverse or novel information to be upweighted, and it provides a way of performing very rapid Bayesian inference with naive classifiers independently of the SMILE library.
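Because a naive classifier factors over datasets, the rapid SMILE-independent inference mentioned above reduces to multiplying per-dataset CPT entries and normalizing. The following is a minimal sketch of that computation; the function name, toy CPTs, and prior are illustrative assumptions, not Sleipnir's API:

```python
import math

def naive_bayes_posterior(prior_related, cpts, observations):
    """Posterior P(related | data) for one gene pair under a naive classifier.

    prior_related: P(FR=1), e.g. from the answer-file class totals.
    cpts: dataset -> (p_bins_unrelated, p_bins_related), each a list of
          per-bin probabilities (one entry per QUANT bin).
    observations: dataset -> observed bin index; missing datasets are
          simply skipped, as a naive classifier allows.
    """
    log_rel = math.log(prior_related)
    log_unrel = math.log(1.0 - prior_related)
    for dataset, bin_index in observations.items():
        p_unrel, p_rel = cpts[dataset]
        log_rel += math.log(p_rel[bin_index])
        log_unrel += math.log(p_unrel[bin_index])
    # Normalize the two log-scores into a posterior probability.
    m = max(log_rel, log_unrel)
    rel, unrel = math.exp(log_rel - m), math.exp(log_unrel - m)
    return rel / (rel + unrel)

# Toy example with two binary datasets (made-up probabilities):
cpts = {
    "MICROARRAY": ([0.7, 0.3], [0.2, 0.8]),
    "CUR_COMPLEX": ([0.9, 0.1], [0.5, 0.5]),
}
posterior = naive_bayes_posterior(
    0.01, cpts, {"MICROARRAY": 1, "CUR_COMPLEX": 1})
```

Even with a 1% prior, two datasets observed in their "related"-enriched bins lift the posterior by an order of magnitude, which is the effect the per-context classifiers below exploit.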
To use Counter, suppose you have a directory containing data files of the form:
MICROARRAY.dab
MICROARRAY.quant
CUR_COMPLEX.dab
CUR_COMPLEX.quant
TF.dab
TF.quant
...
SYNL_TRAD.dab
SYNL_TRAD.quant
Each data file is a Sleipnir::CDat, either a DAT or a DAB, containing experimental results. Each QUANT file describes how to discretize that data for use with the Bayesian network (the number of bins in the QUANT file must equal the number of values taken by the corresponding node in the Bayesian network). These files should all be placed in the same directory (e.g. ./data/); in a different location, you should assemble a functional gold standard (e.g. ANSWERS.dab, possibly constructed by Answerer).
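In Sleipnir, a QUANT file is a single tab-delimited line of ascending bin edges. As a sketch of how such a file drives discretization (the bin-assignment rule and the example edges here are assumptions for illustration, not Sleipnir's exact code):

```python
def quantize(value, edges):
    """Map a continuous data value to a discrete bin index.

    edges: ascending bin upper bounds, as read from a QUANT file's
    single tab-delimited line. A value is assigned to the first bin
    whose edge it does not exceed; values above the last edge fall
    into the last bin.
    """
    for index, edge in enumerate(edges):
        if value <= edge:
            return index
    return len(edges) - 1

# e.g. a hypothetical seven-bin MICROARRAY.quant line:
# -0.5<TAB>0<TAB>0.5<TAB>1<TAB>1.5<TAB>2<TAB>2.5
edges = [-0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
```

The number of edges on that line is what must match the value count of the corresponding network node.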
Counter is most useful for context-specific learning and evaluation, since it allows parallelization and storage of counts from many datasets over many biological contexts. Each context generally represents a pathway, process, or other biological area in which different datasets are expected to behave differently (e.g. a microarray dataset will be informative regarding protein translation, since ribosomes are highly transcriptionally regulated, but it will not measure post-transcriptional regulation such as phosphorylation in MAPK cascades). Each context is provided to Sleipnir as a single text file containing one gene per line, all collected in the same directory (e.g. ./contexts/):
DNA_catabolism.txt
DNA_integration.txt
...
mitochondrion_organization_and_biogenesis.txt
mitotic_cell_cycle.txt
translation.txt
To generate context-specific data counts from this information, you might create an empty ./output/ directory and run:
Counter -w ANSWERS.dab -d ./data/ -m -t 4 -o ./output/ ./contexts/*.txt
where -m indicates that the input data files should be memory mapped (generally improving performance) and -t 4 uses four threads in parallel. This will generate one file per context in the ./output/ directory, each of the form:
DNA_catabolism	5
79000	190
MICROARRAY
1000	4900	9100	5000	930	130	7
3	15	28	18	2	0	0
CUR_COMPLEX
79000	2
190	0
...
SYNL_TRAD
79000	10
190	15
...
Here, each file is named for a context (e.g. DNA_catabolism.txt) and begins with a header giving its name (e.g. DNA_catabolism) and the number of datasets it contains (e.g. 5). The values below this header give the total counts for unrelated and related pairs in the relevant subset of the answer file (e.g. 79000 and 190 in the context of DNA catabolism). The appropriate number of dataset blocks follow this, each giving the count of values found in each of that dataset's discretized bins (based on its QUANT file) for the unrelated and related pairs, respectively. To generate a global.txt file for the global context (i.e. for the entire answer file), run Counter with no context arguments on the command line:
Counter -w ANSWERS.dab -d ./data/ -m -o .
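These counts files are plain text and easy to post-process. The sketch below parses one, assuming the layout illustrated above: a name/dataset-count header line, a line of total unrelated/related pair counts, then per-dataset blocks of a name line plus one line of unrelated and one line of related bin counts (the field layout is inferred from the example, not from Sleipnir's source):

```python
def read_counts(text):
    """Parse a Counter context counts file into its header and blocks."""
    lines = text.splitlines()
    name, dataset_count = lines[0].split("\t")
    total_unrelated, total_related = map(int, lines[1].split("\t"))
    datasets = {}
    cursor = 2
    for _ in range(int(dataset_count)):
        dataset = lines[cursor]
        unrelated = [int(x) for x in lines[cursor + 1].split("\t")]
        related = [int(x) for x in lines[cursor + 2].split("\t")]
        datasets[dataset] = (unrelated, related)
        cursor += 3
    return name, (total_unrelated, total_related), datasets

# A reduced two-dataset example in the same layout:
sample = "\n".join([
    "DNA_catabolism\t2",
    "79000\t190",
    "MICROARRAY",
    "1000\t4900",
    "3\t15",
    "CUR_COMPLEX",
    "79000\t2",
    "190\t0",
])
name, totals, blocks = read_counts(sample)
```

Because the per-context files are independent, this kind of parsing is also what makes it cheap to merge or reweight counts across parallel runs.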
Now, given a directory with count files for each context, you can create regularized Bayesian classifiers from them, either in human-readable (X)DSL format (for use with SMILE/GeNIe) or in a compact binary format for rapid inference. To generate (X)DSL files, create an empty ./networks/ directory and run:
Counter -k ./output/ -o ./networks/ -s datasets.txt -b ./global.txt -l
This will generate one (X)DSL file per context in the ./networks/ directory (including global.xdsl). To instead store these classifiers in a binary format for later Bayesian inference, run:
Counter -k ./output/ -o ./networks.bin -s datasets.txt -b ./global.txt
In these commands, datasets.txt is a tab-delimited text file containing two columns, the first a one-based integer index and the second the name of each dataset:
1	MICROARRAY
2	CUR_COMPLEX
...
5	SYNL_TRAD
This is the same format as is used with other tools such as BNServer.
One of Counter's unique features is the ability to regularize the parameters of the Bayesian classifiers constructed from data counts. This means that each dataset's probability distributions can be weighted according to a prior trust in that dataset: probability distributions from trusted datasets will be used unchanged, but less trusted datasets will be made closer to a uniform distribution (and thus have less impact on the eventual predictions). Weights are provided as a combination of a pseudocount parameter, which determines the effective number of counts in each CPT, and a file of alphas, which give the weight (relative to the pseudocounts) to give to a uniform prior for each dataset. For example, suppose we normalize to a pseudocount total of 100. Then we might have a tab-delimited text file alphas.txt containing:
MICROARRAY	100
CUR_COMPLEX	0
...
SYNL_TRAD	10
This means that the probabilities for the MICROARRAY node will be equally weighted between the actual counts and a uniform prior, those for the CUR_COMPLEX node will be drawn entirely from the data, and those for the SYNL_TRAD node will use ~91% information from the data (100/110) and ~9% a uniform prior. To generate Bayesian classifiers reflecting these weights, run:
Counter -k ./output/ -o ./networks.bin -s datasets.txt -b ./global.txt -p 100 -a ./alphas.txt
For more information, see Sleipnir::CBayesNetMinimal::OpenCounts.
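Under this scheme, each regularized bin probability is a pseudocount-weighted mixture of the count-derived distribution and a uniform prior. The sketch below implements that mixture; the formula is inferred from the worked percentages above, not taken from Sleipnir's code:

```python
def regularize(counts, pseudocounts, alpha):
    """Blend a count-derived distribution with a uniform prior.

    The data distribution carries `pseudocounts` worth of weight and
    the uniform prior carries `alpha`, matching the -p/-a description:
    alpha == pseudocounts gives a 50/50 blend, alpha == 0 keeps the
    data unchanged.
    """
    total = sum(counts)
    bins = len(counts)
    return [
        (pseudocounts * (c / total) + alpha * (1.0 / bins))
        / (pseudocounts + alpha)
        for c in counts
    ]

# alpha == pseudocounts: data [0.9, 0.1] and uniform [0.5, 0.5]
# are equally weighted, giving [0.7, 0.3].
p = regularize([90, 10], pseudocounts=100, alpha=100)
```

With alpha=10 and pseudocounts=100 the data weight is 100/110 ≈ 91%, reproducing the SYNL_TRAD example above.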
Finally, to use your learned Bayesian classifiers and genomic data to infer functional relationships, you can create an empty ./predictions/ directory and run:
Counter -n ./networks.bin -o ./predictions/ -d ./data/ -s datasets.txt -e genes.txt -m -t 4 ./contexts/*.txt
This will generate one DAB file per context (e.g. ./predictions/DNA_catabolism.dab), each containing the probability of functional relationship for each gene pair predicted from the datasets in ./data/ and the classifiers in ./networks.bin. The genes.txt file is of the same format as datasets.txt and lists each gene in the genome, e.g.
1	YAL068C
2	YAL066W
3	YAL065C
...
Counter -w <answers.dab> -d <data_dir> -o <output_dir> <contexts.txt>*
For each context contexts.txt, generate a counts file in output_dir summarizing the number of each data value for the DAT/DAB files in data_dir (which must have associated QUANT files) relative to the functional gold standard in answers.dab. A global count file is generated if no contexts are provided on the command line.
Counter -k <counts_dir> -o <networks.bin> -s <datasets.txt> -b <global.txt> -p <pseudocounts> -a <alphas.txt>
Using the counts previously output to counts_dir and the global counts file global.txt, save a set of Bayesian classifiers in networks.bin (one per context plus a global default classifier), each containing one node per dataset as specified in datasets.txt. Probability distributions can optionally be regularized using the effective pseudocount number pseudocounts and the relative weight of a uniform prior for each node given in alphas.txt.
Counter -n <networks.bin> -o <output_dir> -d <data_dir> -s <datasets.txt> -e <genes.txt> <contexts.txt>*
Performs Bayesian inference for each classifier previously saved in networks.bin, producing one predicted functional relationship network DAT/DAB file per contexts.txt in the directory output_dir, using data from data_dir, the dataset list in datasets.txt, and the genome list in genes.txt. A global relationship network is produced if no contexts are provided on the command line.
package "Counter"
version "1.0"
purpose "Pre-Bayesian learning tool; counts distributions of values in data"

defgroup "Mode" yes
groupoption "answers" w "Answer file (-w triggers counts mode)" string typestr="filename" group="Mode"
groupoption "counts" k "Directory containing count files (-k triggers learning mode)" string typestr="directory" group="Mode"
groupoption "networks" n "Bayes nets (-n triggers inference mode)" string typestr="filename" group="Mode"

section "Main"
option "output" o "Output count directory, Bayes nets, or inferences" string typestr="filename or directory" yes
option "countname" O "For learning stage, what the count file should be called if no contexts are used (default: global)" string typestr="filename" default="global"
option "directory" d "Data directory" string typestr="directory" default="."
option "datasets" s "Dataset ID text file" string typestr="filename"
option "genome" e "Gene ID text file" string typestr="filename"
option "contexts" X "Context ID text file" string typestr="filename"

section "Learning/Evaluation"
option "genes" g "Gene inclusion file" string typestr="filename"
option "genex" G "Gene exclusion file" string typestr="filename"
option "ubiqg" P "Ubiquitous gene file (-j and -J refer to connections to ubiq instead of all bridging pairs)" string typestr="filename"
option "genet" c "Term inclusion file" string typestr="filename"
option "genee" C "Edge inclusion file" string typestr="filename"
option "ctxtpos" q "Use positive edges between context genes" flag on
option "ctxtneg" Q "Use negative edges between context genes" flag on
option "bridgepos" j "Use bridging positives between context and non-context genes" flag off
option "bridgeneg" J "Use bridging negatives between context and non-context genes" flag on
option "outpos" u "Use positive edges outside the context" flag off
option "outneg" U "Use negative edges outside the context" flag off
option "weights" W "Use weighted context file" flag off
option "flipneg" F "Flip weights (one minus original) for negative standards" flag on

section "Network Features"
option "default" b "Count file containing defaults for cases with missing data" string typestr="filename"
option "zeros" Z "Read zeroed node IDs/outputs from the given file" string typestr="filename"
option "genewise" S "Evaluate networks assuming genewise contexts" flag off

section "Bayesian Regularization"
option "pseudocounts" p "Effective number of pseudocounts to use" float default="-1"
option "alphas" a "File containing equivalent sample sizes (alphas) for each node" string typestr="filename"
option "regularize" r "Automatically regularize based on similarity" flag off
option "reggroups" R "Automatically regularize based on given groups" string typestr="filename"

section "Optional"
option "temporary" y "Directory for temporary files" string typestr="directory" default="."
option "smile" l "Output SMILE (X)DSL files rather than minimal networks" flag off
option "xdsl" x "Generate XDSL output rather than DSL" flag on
option "memmap" m "Memory map input files" flag off
option "memmapout" M "Memory map output files (only for inference mode)" flag off
option "threads" t "Maximum number of threads to spawn" int default="-1"
option "verbosity" v "Message verbosity" int default="5"
option "logratio" L "Output log ratios (instead of posteriors)" flag off
Flag | Default | Type | Description |
---|---|---|---|
None | None | Text files | Contexts used for calculating context-specific counts or producing context-specific functional relationship predictions. Each is a text file containing one gene per line. When no contexts are provided, global (context-independent) calculations are performed. |
-w | None | DAT/DAB file | Activates count generation mode. Functional gold standard for counting. Should consist of gene pairs with scores of 0 (unrelated), 1 (related), or missing (NaN). |
-k | None | Directory | Activates Bayesian classifier generation mode. Directory containing previously calculated count files, one text file per context. Should not contain the global count file. |
-n | None | Binary file | Activates Bayesian inference (functional relationship prediction) mode. Binary file containing previously calculated Bayesian classifiers. |
-o | None | Directory or binary file | In count generation mode, directory in which data value count files are placed. In classifier generation mode, file in which binary classifiers are saved or directory in which (X)DSL files are saved. In inference mode, directory in which context-specific functional relationship DAT/DAB files are created. |
-d | . | Directory | Directory from which data DAT/DAB files (and accompanying QUANT files) are read. |
-s | None | Text file | Tab-delimited text file containing two columns, the first a one-based integer index and the second the name of each dataset to be used (excluding the DAT/DAB suffix, e.g. MICROARRAY, CUR_COMPLEX, etc.) |
-e | None | Text file | Tab-delimited text file containing two columns, the first a one-based integer index and the second the unique identifier of each gene in the genome. |
-g | None | Text gene list | If given, use only gene pairs for which both genes are in the list. For details, see Sleipnir::CDat::FilterGenes. |
-G | None | Text gene list | If given, use only gene pairs for which neither gene is in the list. For details, see Sleipnir::CDat::FilterGenes. |
-c | None | Text gene list | If given, use only gene pairs passing a "term" filter against the list. For details, see Sleipnir::CDat::FilterGenes. |
-C | None | Text gene list | If given, use only gene pairs passing an "edge" filter against the list. For details, see Sleipnir::CDat::FilterGenes. |
-b | None | Text file | Count file containing default (global) values for global inference or for fallback in contexts with too little data. |
-Z | None | Tab-delimited text file | If given, argument must be a tab-delimited text file containing two columns, the first node IDs (see BNCreator) and the second bin numbers (zero indexed). For each node ID present in this file, missing values will be substituted with the given bin number. |
-p | -1 | Float | If not -1, the effective number of pseudocounts to use relative to the weights in the given alphas file (if any). |
-a | None | Text file | If given, tab-delimited text file containing dataset IDs and the relative weight given to a uniform prior for each dataset. |
-y | . | Directory | Directory in which temporary files are generated during inference mode. |
-l | off | Flag | If on, SMILE (X)DSL files are created in classifier generation mode rather than a single binary file. |
-x | on | Flag | If on, XDSL files are generated instead of DSL files. |
-m | off | Flag | If given, memory map the input files when possible. DAT and PCL inputs cannot be memmapped. |
-t | 1 | Integer | Number of simultaneous threads to use for individual CPT learning. Threads are per classifier node (dataset), so the number of threads actually used is the minimum of -t and the number of datasets. |