Sleipnir
Counter performs three related tasks: counting values in data, constructing probability tables from these counts, and using those probability tables for Bayesian inference.
Counter provides fine-grained control over each of these steps. Like BNCreator, given one or more biological datasets (stored as Sleipnir::CDat objects in DAT/DAB/etc. files) and a functional gold standard, Counter can construct naive Bayesian classifiers to probabilistically integrate the given data. However, it does so by exposing the intermediate count information describing the number of times each data value occurs, and these counts can be weighted (i.e. regularized; see Heckerman/Geiger/Chickering 1995 or Steck/Jaakkola 2002) before they are normalized into probability distributions. This allows datasets known to provide more diverse or novel information to be upweighted, and it provides a way of performing very rapid Bayesian inference with naive classifiers independently of the SMILE library.
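Because a naive classifier factors over datasets, the rapid SMILE-independent inference mentioned above reduces to multiplying per-dataset CPT entries and normalizing. The following is a minimal sketch of that computation; the function name, toy CPTs, and prior are illustrative assumptions, not Sleipnir's API:

```python
import math

def naive_bayes_posterior(prior_related, cpts, observations):
    """Posterior P(related | data) for one gene pair under a naive classifier.

    prior_related: P(FR=1), e.g. from the answer-file class totals.
    cpts: dataset -> (p_bins_unrelated, p_bins_related), each a list of
          per-bin probabilities (one entry per QUANT bin).
    observations: dataset -> observed bin index; missing datasets are
          simply skipped, as a naive classifier allows.
    """
    log_rel = math.log(prior_related)
    log_unrel = math.log(1.0 - prior_related)
    for dataset, bin_index in observations.items():
        p_unrel, p_rel = cpts[dataset]
        log_rel += math.log(p_rel[bin_index])
        log_unrel += math.log(p_unrel[bin_index])
    # Normalize the two log-scores into a posterior probability.
    m = max(log_rel, log_unrel)
    rel, unrel = math.exp(log_rel - m), math.exp(log_unrel - m)
    return rel / (rel + unrel)

# Toy example with two binary datasets (made-up probabilities):
cpts = {
    "MICROARRAY": ([0.7, 0.3], [0.2, 0.8]),
    "CUR_COMPLEX": ([0.9, 0.1], [0.5, 0.5]),
}
posterior = naive_bayes_posterior(
    0.01, cpts, {"MICROARRAY": 1, "CUR_COMPLEX": 1})
```

Even with a 1% prior, two datasets observed in their "related"-enriched bins lift the posterior by an order of magnitude, which is the effect the per-context classifiers below exploit.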
To use Counter, suppose you have a directory containing data files of the form:
MICROARRAY.dab
MICROARRAY.quant
CUR_COMPLEX.dab
CUR_COMPLEX.quant
TF.dab
TF.quant
...
SYNL_TRAD.dab
SYNL_TRAD.quant
Each data file is a Sleipnir::CDat, either a DAT or a DAB, containing experimental results. Each QUANT file describes how to discretize that data for use with the Bayesian network (the number of bins in the QUANT file must equal the number of values taken by the corresponding node in the Bayesian network). These files should all be placed in the same directory (e.g. ./data/); in a different location, you should assemble a functional gold standard (e.g. ANSWERS.dab, possibly constructed by Answerer).
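In Sleipnir, a QUANT file is a single tab-delimited line of ascending bin edges. As a sketch of how such a file drives discretization (the bin-assignment rule and the example edges here are assumptions for illustration, not Sleipnir's exact code):

```python
def quantize(value, edges):
    """Map a continuous data value to a discrete bin index.

    edges: ascending bin upper bounds, as read from a QUANT file's
    single tab-delimited line. A value is assigned to the first bin
    whose edge it does not exceed; values above the last edge fall
    into the last bin.
    """
    for index, edge in enumerate(edges):
        if value <= edge:
            return index
    return len(edges) - 1

# e.g. a hypothetical seven-bin MICROARRAY.quant line:
# -0.5<TAB>0<TAB>0.5<TAB>1<TAB>1.5<TAB>2<TAB>2.5
edges = [-0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
```

The number of edges on that line is what must match the value count of the corresponding network node.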
Counter is most useful for context-specific learning and evaluation, since it allows parallelization and storage of counts from many datasets over many biological contexts. Each context generally represents a pathway, process, or other biological area in which different datasets are expected to behave differently (e.g. a microarray dataset will be informative regarding protein translation, since ribosomes are highly transcriptionally regulated, but it will not measure post-transcriptional regulation such as phosphorylation in MAPK cascades). Each context is provided to Sleipnir as a single text file containing one gene per line, all collected in the same directory (e.g. ./contexts/):
DNA_catabolism.txt
DNA_integration.txt
...
mitochondrion_organization_and_biogenesis.txt
mitotic_cell_cycle.txt
translation.txt
To generate context-specific data counts from this information, you might create an empty ./output/ directory and run:
Counter -w ANSWERS.dab -d ./data/ -m -t 4 -o ./output/ ./contexts/*.txt
where -m indicates that the input data files should be memory mapped (generally improving performance) and -t 4 uses four threads in parallel. This will generate one file per context in the ./output/ directory, each of the form:
DNA_catabolism	5
79000	190
MICROARRAY
1000	4900	9100	5000	930	130	7
3	15	28	18	2	0	0
CUR_COMPLEX
79000	2
190	0
...
SYNL_TRAD
79000	10
190	15
...
Here, each file is named for a context (e.g. DNA_catabolism.txt) and begins with a header giving its name (e.g. DNA_catabolism) and the number of datasets it contains (e.g. 5). The values below this header give the total counts for unrelated and related pairs in the relevant subset of the answer file (e.g. 79000 and 190 in the context of DNA catabolism). The appropriate number of dataset blocks follow this, each giving the count of values found in each of that dataset's discretized bins (based on its QUANT file) for the unrelated and related pairs, respectively. To generate a global.txt file for the global context (i.e. for the entire answer file), run Counter with no context arguments on the command line:
Counter -w ANSWERS.dab -d ./data/ -m -o .
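These counts files are plain text and easy to post-process. The sketch below parses one, assuming the layout illustrated above: a name/dataset-count header line, a line of total unrelated/related pair counts, then per-dataset blocks of a name line plus one line of unrelated and one line of related bin counts (the field layout is inferred from the example, not from Sleipnir's source):

```python
def read_counts(text):
    """Parse a Counter context counts file into its header and blocks."""
    lines = text.splitlines()
    name, dataset_count = lines[0].split("\t")
    total_unrelated, total_related = map(int, lines[1].split("\t"))
    datasets = {}
    cursor = 2
    for _ in range(int(dataset_count)):
        dataset = lines[cursor]
        unrelated = [int(x) for x in lines[cursor + 1].split("\t")]
        related = [int(x) for x in lines[cursor + 2].split("\t")]
        datasets[dataset] = (unrelated, related)
        cursor += 3
    return name, (total_unrelated, total_related), datasets

# A reduced two-dataset example in the same layout:
sample = "\n".join([
    "DNA_catabolism\t2",
    "79000\t190",
    "MICROARRAY",
    "1000\t4900",
    "3\t15",
    "CUR_COMPLEX",
    "79000\t2",
    "190\t0",
])
name, totals, blocks = read_counts(sample)
```

Because the per-context files are independent, this kind of parsing is also what makes it cheap to merge or reweight counts across parallel runs.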
Now, given a directory with count files for each context, you can create regularized Bayesian classifiers from them, either in human-readable (X)DSL format (for use with SMILE/GeNIe) or in a compact binary format for rapid inference. To generate (X)DSL files, create an empty ./networks/ directory and run:
Counter -k ./output/ -o ./networks/ -s datasets.txt -b ./global.txt -l
This will generate one (X)DSL file per context in the ./networks/ directory (including global.xdsl). To instead store these classifiers in a binary format for later Bayesian inference, run:
Counter -k ./output/ -o ./networks.bin -s datasets.txt -b ./global.txt
In these commands, datasets.txt is a tab-delimited text file containing two columns, the first a one-based integer index and the second the name of each dataset:
1	MICROARRAY
2	CUR_COMPLEX
...
5	SYNL_TRAD
This is the same format as is used with other tools such as BNServer.
One of Counter's unique features is the ability to regularize the parameters of the Bayesian classifiers constructed from data counts. This means that each dataset's probability distributions can be weighted according to a prior trust in that dataset: probability distributions from trusted datasets will be used unchanged, but less trusted datasets will be made closer to a uniform distribution (and thus have less impact on the eventual predictions). Weights are provided as a combination of a pseudocount parameter, which determines the effective number of counts in each CPT, and a file of alphas, which give the weight (relative to the pseudocounts) to give to a uniform prior for each dataset. For example, suppose we normalize to a pseudocount total of 100. Then we might have a tab-delimited text file alphas.txt containing:
MICROARRAY	100
CUR_COMPLEX	0
...
SYNL_TRAD	10
This means that the probabilities for the MICROARRAY node will be equally weighted between the actual counts and a uniform prior, those for the CUR_COMPLEX node will be drawn entirely from the data, and those for the SYNL_TRAD node will use ~91% information from the data (100/110) and ~9% a uniform prior. To generate Bayesian classifiers reflecting these weights, run:
Counter -k ./output/ -o ./networks.bin -s datasets.txt -b ./global.txt -p 100 -a ./alphas.txt
For more information, see Sleipnir::CBayesNetMinimal::OpenCounts.
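Under this scheme, each regularized bin probability is a pseudocount-weighted mixture of the count-derived distribution and a uniform prior. The sketch below implements that mixture; the formula is inferred from the worked percentages above, not taken from Sleipnir's code:

```python
def regularize(counts, pseudocounts, alpha):
    """Blend a count-derived distribution with a uniform prior.

    The data distribution carries `pseudocounts` worth of weight and
    the uniform prior carries `alpha`, matching the -p/-a description:
    alpha == pseudocounts gives a 50/50 blend, alpha == 0 keeps the
    data unchanged.
    """
    total = sum(counts)
    bins = len(counts)
    return [
        (pseudocounts * (c / total) + alpha * (1.0 / bins))
        / (pseudocounts + alpha)
        for c in counts
    ]

# alpha == pseudocounts: data [0.9, 0.1] and uniform [0.5, 0.5]
# are equally weighted, giving [0.7, 0.3].
p = regularize([90, 10], pseudocounts=100, alpha=100)
```

With alpha=10 and pseudocounts=100 the data weight is 100/110 ≈ 91%, reproducing the SYNL_TRAD example above.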
Finally, to use your learned Bayesian classifiers and genomic data to infer functional relationships, you can create an empty ./predictions/ directory and run:
Counter -n ./networks.bin -o ./predictions/ -d ./data/ -s datasets.txt -e genes.txt -m -t 4 ./contexts/*.txt
This will generate one DAB file per context (e.g. ./predictions/DNA_catabolism.dab), each containing the probability of functional relationship for each gene pair predicted from the datasets in ./data/ and the classifiers in ./networks.bin. The genes.txt file is of the same format as datasets.txt and lists each gene in the genome, e.g.
1	YAL068C
2	YAL066W
3	YAL065C
...
Counter -w <answers.dab> -d <data_dir> -o <output_dir> <contexts.txt>*
For each context contexts.txt, generate a counts file in output_dir summarizing the number of each data value for the DAT/DAB files in data_dir (which must have associated QUANT files) relative to the functional gold standard in answers.dab. A global count file is generated if no contexts are provided on the command line.
Counter -k <counts_dir> -o <networks.bin> -s <datasets.txt> -b <global.txt> -p <pseudocounts> -a <alphas.txt>
Using the counts previously output to counts_dir and the global counts file global.txt, save a set of Bayesian classifiers in networks.bin (one per context plus a global default classifier), each containing one node per dataset as specified in datasets.txt. Probability distributions can optionally be regularized using the effective pseudocount number pseudocounts and the relative weight of a uniform prior for each node given in alphas.txt.
Counter -n <networks.bin> -o <output_dir> -d <data_dir> -s <datasets.txt> -e <genes.txt> <contexts.txt>*
Performs Bayesian inference for each classifier previously saved in networks.bin, producing one predicted functional relationship network DAT/DAB file per contexts.txt in the directory output_dir, using data from data_dir, the dataset list in datasets.txt, and the genome list in genes.txt. A global relationship network is produced if no contexts are provided on the command line.
package "Counter"
version "1.0"
purpose "Pre-Bayesian learning tool; counts distributions of values in data"

defgroup "Mode" yes
groupoption "answers" w "Answer file (-w triggers counts mode)" string typestr="filename" group="Mode"
groupoption "counts" k "Directory containing count files (-k triggers learning mode)" string typestr="directory" group="Mode"
groupoption "networks" n "Bayes nets (-n triggers inference mode)" string typestr="filename" group="Mode"

section "Main"
option "output" o "Output count directory, Bayes nets, or inferences" string typestr="filename or directory" yes
option "countname" O "For learning stage, what the count file should be called if no contexts are used (default: global)" string typestr="filename" default="global"
option "directory" d "Data directory" string typestr="directory" default="."
option "datasets" s "Dataset ID text file" string typestr="filename"
option "genome" e "Gene ID text file" string typestr="filename"
option "contexts" X "Context ID text file" string typestr="filename"

section "Learning/Evaluation"
option "genes" g "Gene inclusion file" string typestr="filename"
option "genex" G "Gene exclusion file" string typestr="filename"
option "ubiqg" P "Ubiquitous gene file (-j and -J refer to connections to ubiq instead of all bridging pairs)" string typestr="filename"
option "genet" c "Term inclusion file" string typestr="filename"
option "genee" C "Edge inclusion file" string typestr="filename"
option "ctxtpos" q "Use positive edges between context genes" flag on
option "ctxtneg" Q "Use negative edges between context genes" flag on
option "bridgepos" j "Use bridging positives between context and non-context genes" flag off
option "bridgeneg" J "Use bridging negatives between context and non-context genes" flag on
option "outpos" u "Use positive edges outside the context" flag off
option "outneg" U "Use negative edges outside the context" flag off
option "weights" W "Use weighted context file" flag off
option "flipneg" F "Flip weights (one minus original) for negative standards" flag on

section "Network Features"
option "default" b "Count file containing defaults for cases with missing data" string typestr="filename"
option "zeros" Z "Read zeroed node IDs/outputs from the given file" string typestr="filename"
option "genewise" S "Evaluate networks assuming genewise contexts" flag off

section "Bayesian Regularization"
option "pseudocounts" p "Effective number of pseudocounts to use" float default="-1"
option "alphas" a "File containing equivalent sample sizes (alphas) for each node" string typestr="filename"
option "regularize" r "Automatically regularize based on similarity" flag off
option "reggroups" R "Automatically regularize based on given groups" string typestr="filename"

section "Optional"
option "temporary" y "Directory for temporary files" string typestr="directory" default="."
option "smile" l "Output SMILE (X)DSL files rather than minimal networks" flag off
option "xdsl" x "Generate XDSL output rather than DSL" flag on
option "memmap" m "Memory map input files" flag off
option "memmapout" M "Memory map output files (only for inference mode)" flag off
option "threads" t "Maximum number of threads to spawn" int default="-1"
option "verbosity" v "Message verbosity" int default="5"
option "logratio" L "Output log ratios (instead of posteriors)" flag off
Flag | Default | Type | Description |
---|---|---|---|
None | None | Text files | Contexts used for calculating context-specific counts or producing context-specific functional relationship predictions. Each is a text file containing one gene per line. When no contexts are provided, global (context-independent) calculations are performed. |
-w | None | DAT/DAB file | Activates count generation mode. Functional gold standard for counting. Should consist of gene pairs with scores of 0 (unrelated), 1 (related), or missing (NaN). |
-k | None | Directory | Activates Bayesian classifier generation mode. Directory containing previously calculated count files, one text file per context. Should not contain the global count file. |
-n | None | Binary file | Activates Bayesian inference (functional relationship prediction) mode. Binary file containing previously calculated Bayesian classifiers. |
-o | None | Directory or binary file | In count generation mode, directory in which data value count files are placed. In classifier generation mode, file in which binary classifiers are saved or directory in which (X)DSL files are saved. In inference mode, directory in which context-specific functional relationship DAT/DAB files are created. |
-d | . | Directory | Directory from which data DAT/DAB files (and accompanying QUANT files) are read. |
-s | None | Text file | Tab-delimited text file containing two columns, the first a one-based integer index and the second the name of each dataset to be used (excluding the DAT/DAB suffix, e.g. MICROARRAY, CUR_COMPLEX, etc.) |
-e | None | Text file | Tab-delimited text file containing two columns, the first a one-based integer index and the second the unique identifier of each gene in the genome. |
-g | None | Text gene list | If given, use only gene pairs for which both genes are in the list. For details, see Sleipnir::CDat::FilterGenes. |
-G | None | Text gene list | If given, use only gene pairs for which neither gene is in the list. For details, see Sleipnir::CDat::FilterGenes. |
-c | None | Text gene list | If given, use only gene pairs passing a "term" filter against the list. For details, see Sleipnir::CDat::FilterGenes. |
-C | None | Text gene list | If given, use only gene pairs passing an "edge" filter against the list. For details, see Sleipnir::CDat::FilterGenes. |
-b | None | Text file | Count file containing default (global) values for global inference or for fallback in contexts with too little data. |
-Z | None | Tab-delimited text file | If given, argument must be a tab-delimited text file containing two columns, the first node IDs (see BNCreator) and the second bin numbers (zero indexed). For each node ID present in this file, missing values will be substituted with the given bin number. |
-p | -1 | Float | If not -1, the effective number of pseudocounts to use relative to the weights in the given alphas file (if any). |
-a | None | Text file | If given, tab-delimited text file containing dataset IDs and the relative weight given to a uniform prior for each dataset. |
-y | . | Directory | Directory in which temporary files are generated during inference mode. |
-l | off | Flag | If on, SMILE (X)DSL files are created in classifier generation mode rather than a single binary file. |
-x | on | Flag | If on, XDSL files are generated instead of DSL files. |
-m | off | Flag | If given, memory map the input files when possible. DAT and PCL inputs cannot be memmapped. |
-t | 1 | Integer | Number of simultaneous threads to use for individual CPT learning. Threads are per classifier node (dataset), so the number of threads actually used is the minimum of -t and the number of datasets. |