Sleipnir: Data2DB

Data2DB converts a collection of DAT/DAB files (Sleipnir::CDat) into a simple flatfile database (Sleipnir::CDatabase). DAT/DAB files organize data so that values for all gene pairs within a single dataset can be accessed efficiently; database files organize data so that values from all datasets for a single gene or gene pair can be accessed efficiently. This is critical for real-time Bayesian inference (e.g., by BNServer) and for Seek coexpression search (e.g., by SeekMiner or SeekServer).

Usage

Basic Usage

 Data2DB -n <classifier.xdsl> -i <genes.txt> -d <data_dir> -D <database_dir>

Construct a Sleipnir::CDatabase in the directory database_dir containing the data from DAT/DAB files in data_dir corresponding to nodes in the Bayesian network classifier.xdsl and organized using the gene index/name pairs in genes.txt (identical in format to Data2Sql). If many datasets are being processed or the target genome is large, blocking should be used (-b and -B).
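The effect of blocking can be pictured as splitting the output files and the input datasets into consecutive ranges that are processed one combination at a time. The following is a minimal sketch of that partition arithmetic only; it is not Sleipnir's implementation, the function name blocks is hypothetical, and the sizes are made up for illustration. A block size of -1 is treated as a single pass, matching the documented defaults of -b and -B.

```python
def blocks(total, block_size):
    """Yield (start, end) half-open ranges covering `total` items.

    A block_size of -1 mirrors the documented default: one single pass.
    """
    if total <= 0:
        return
    if block_size == -1 or block_size >= total:
        yield (0, total)
        return
    if block_size <= 0:
        raise ValueError("block_size must be positive or -1")
    for start in range(0, total, block_size):
        yield (start, min(start + block_size, total))


# Hypothetical sizes: 1000 database files and 250 datasets.
file_blocks = list(blocks(1000, 400))     # as if -b 400 were given
dataset_blocks = list(blocks(250, 100))   # as if -B 100 were given

print(file_blocks)
print(dataset_blocks)
```

Smaller blocks keep fewer files and datasets in play at once, at the cost of more passes; the order in which Data2DB actually visits the blocks is not specified here.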

 Data2DB -x <dataset_file_list.txt> -i <gene_map.txt> -D <database_dir>

Construct a Sleipnir::CDatabase in database_dir containing the data from the DAB files listed in dataset_file_list.txt. Genes are indexed according to gene_map.txt. By default, 1000 Sleipnir::CDatabaselet files (DB files) are generated, each containing roughly N / 1000 genes, where N is the total number of genes. The -f option controls the number of generated DB files and, indirectly, the number of genes stored in each.
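The two text inputs for this mode are simple enough to generate with a short script. The sketch below assumes the formats described for -i (tab-delimited, one-based numerical gene IDs followed by unique gene names) and -x (one DAB path per line, in dataset order). The helper names, gene names, and DAB paths are hypothetical and serve only as an illustration.

```python
import os
import tempfile


def write_gene_map(path, gene_names):
    """Write one-based numerical gene IDs and unique gene names, tab-delimited."""
    if len(set(gene_names)) != len(gene_names):
        raise ValueError("gene names must be unique")
    with open(path, "w") as handle:
        for index, name in enumerate(gene_names, start=1):
            handle.write(f"{index}\t{name}\n")


def write_dataset_list(path, dab_paths):
    """Write one DAB path per line; line order defines dataset order."""
    with open(path, "w") as handle:
        for dab in dab_paths:
            handle.write(f"{dab}\n")


workdir = tempfile.mkdtemp()
gene_map_path = os.path.join(workdir, "gene_map.txt")
dataset_list_path = os.path.join(workdir, "dataset_file_list.txt")

# Hypothetical gene names and DAB paths, for illustration only.
write_gene_map(gene_map_path, ["GENE_A", "GENE_B", "GENE_C"])
write_dataset_list(dataset_list_path, ["data/first.dab", "data/second.dab"])

with open(gene_map_path) as handle:
    gene_map_lines = handle.read().splitlines()
print(gene_map_lines)
```

The resulting files would then be passed to Data2DB with -i and -x respectively, as in the command above.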

Detailed Usage

package "Data2DB"
version "1.0"
purpose "Converts quantized DATs into compact database file collections"

section "Main"
option  "dataset"           x   "Input a set of dataset filenames"
                                string typestr="filename"
option  "network"           n   "Input (X)DSL Bayes net"
                                string  typestr="filename"  
option  "input"             i   "Input gene mapping"
                                string  typestr="filename"  
option  "dir_in"            d   "Data directory"
                                string  typestr="directory" default="."
option  "dir_out"           D   "Database directory"
                                string  typestr="directory" default="."

section "Database Features"
option  "files"             f   "Database file count"
                                int default="1000"
option  "block_files"       b   "Number of database files per block"
                                int default="-1"
option  "block_datasets"    B   "Number of datasets per block"
                                int default="-1"
option  "use_nibble"        N   "Use nibble for compact storage"
                                flag    off
option  "zeros"             Z   "Read zeroed node IDs/outputs from the given file"
                                string  typestr="filename"

section "Optional"
option  "buffer"            u   "Memory buffer disk writes"
                                flag    off
option  "memmap"            m   "Memory map input/output"
                                flag    off
option  "verbosity"         v   "Message verbosity"
                                int default="5"

Flag	Default	Type	Description
-x	None	Dataset file list	A simple one-column listing of paths to DAB files. Dataset order in the CDatabase will correspond to the order in this file. Either this option or the `-n` option must be specified.
-n	None	(X)DSL file	Naive Bayesian classifier for which the output database will be optimized. Dataset order in the output database will correspond to the Bayes net's node order, and the node IDs will be used to load input DAT/DABs from `-d`. Either this option or the `-x` option must be specified.
-i	stdin	Text file	Tab-delimited text file containing two columns, numerical gene IDs (one-based) and unique gene names (matching those in the input DAT/DAB files).
-d	.	Directory	Input directory containing DAT/DAB files with names corresponding to the given Bayes net node IDs.
-D	.	Directory	Output directory in which database files will be stored.
-f	1000	Integer	Number of separate database files to store in the output directory.
-b	-1	Integer	Number of output files (and hence genes) to process per block. -1 indicates that all output files should be created in a single pass.
-B	-1	Integer	Number of input files (datasets) to process per block. -1 indicates that all input files should be read into memory simultaneously.
-u	off	Flag	If on, buffer each database file in memory during modification and write it as a single unit on completion. This may speed up database construction on some disks/filesystems.
-m	off	Flag	If given, memory map the input files when possible. DAT and PCL inputs cannot be memmapped.
-N	off	Flag	If enabled, use a nibble (4 bits) to represent each element rather than the default byte (8 bits).
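
To illustrate what nibble storage means in general, the sketch below packs two 4-bit values into each byte, halving storage at the cost of limiting each element to 16 distinct values. This is a generic illustration and not Sleipnir's code: the choice of putting the first value in the low nibble, and the function names, are assumptions, and any reserved value Sleipnir may use for missing data is not modeled.

```python
def pack_nibbles(values):
    """Pack integers in 0..15 two per byte (first value in the low nibble)."""
    packed = bytearray((len(values) + 1) // 2)
    for i, value in enumerate(values):
        if not 0 <= value <= 15:
            raise ValueError("value does not fit in 4 bits")
        shift = 4 * (i % 2)  # even index: low nibble, odd index: high nibble
        packed[i // 2] |= value << shift
    return bytes(packed)


def unpack_nibbles(packed, count):
    """Recover `count` 4-bit values from bytes produced by pack_nibbles."""
    return [(packed[i // 2] >> (4 * (i % 2))) & 0x0F for i in range(count)]


# Hypothetical quantized values for five elements.
values = [3, 0, 15, 7, 1]
packed = pack_nibbles(values)

print(len(values), "values stored in", len(packed), "bytes")
print(unpack_nibbles(packed, len(values)))
```

Because each element must fit in 4 bits, this kind of storage only suits data quantized into at most 16 bins.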