Sleipnir: DBCombiner

Combines a set of DB files generated from different Sleipnir::CDatabase instances into one DB file.

Sometimes, for example because of limited disk space, it is not feasible to generate a Sleipnir::CDatabase covering all datasets on one machine or one partition. In that case, separate Sleipnir::CDatabase instances can be generated on different machines first and then joined into one CDatabase with DBCombiner. DBCombiner joins one DB file at a time, so the joining must be repeated for every DB file in the database.
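Because the joining is done one DB file at a time, it can help to group the files before calling DBCombiner. The following is a minimal sketch and not part of Sleipnir: it assumes each source database is a directory of .db files, and group_db_files is a hypothetical helper name. It collects the files that share a file name across the given directories, keeping the directory order, so that each group can be written out as one list file.

```python
from pathlib import Path


def group_db_files(database_dirs):
    """Map each DB file name (e.g. '00000004.db') to its paths, one per source database.

    Hypothetical helper: assumes every source database is a directory of *.db files.
    The paths in each group keep the order of database_dirs.
    """
    database_dirs = list(database_dirs)
    groups = {}
    for db_dir in database_dirs:
        for path in sorted(Path(db_dir).glob("*.db")):
            groups.setdefault(path.name, []).append(path)
    # Keep only file names found in every database; the others cannot be joined as-is.
    return {name: paths for name, paths in groups.items()
            if len(paths) == len(database_dirs)}
```

Each value of the returned dictionary corresponds to the content of one list file, and DBCombiner would then be run once per key.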

Usage

Basic Usage

 DBCombiner -i <genes.txt> -x <db_list.txt> -d <input_dir> -D <output_dir> [-s]

Combines the DB files listed in db_list.txt into one DB.

DBCombiner accepts DB files generated from different Sleipnir::CDatabase instances, as long as the same gene map was used. Only DB files covering the same genes may be combined. To ensure this, list only DB files with the same ID in the file name (see the sample lines from db_list.txt below). The datasets in the final joined DB appear in the order defined by db_list.txt.

The -s option further splits the combined Sleipnir::CDatabaselet into one DB file per gene. The -s option must be enabled for Seek coexpression integrations (SeekMiner, SeekServer).
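When DBCombiner is driven from a script, the command line can be assembled as an argument list. The sketch below simply mirrors the Basic Usage line above; combine_command and its parameter names are hypothetical, and the function only builds the arguments without running anything.

```python
def combine_command(genes_file, list_file, input_dir, output_dir, split=False):
    """Build the argument list for one DBCombiner call, following the Basic Usage line.

    Hypothetical helper: it only assembles the arguments and does not run the tool.
    """
    cmd = ["DBCombiner", "-i", genes_file, "-x", list_file,
           "-d", input_dir, "-D", output_dir]
    if split:
        cmd.append("-s")  # one gene per DB file, as needed for Seek
    return cmd
```

The returned list can be passed to subprocess.run on a machine where DBCombiner is installed.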

Sample lines from the genes.txt file:

 1    1
 2    10
 3    100
 4    1000
 5    10000
 6    100008589
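A gene map in this format can be checked before it is handed to DBCombiner. The sketch below is illustrative only: read_gene_map is a hypothetical name, gene names are assumed to contain no whitespace, and the requirement that IDs run consecutively from 1 is an assumption based on the "one-based" description rather than something this page states.

```python
def read_gene_map(lines):
    """Parse gene-map lines into {gene_id: gene_name}.

    Assumptions (not confirmed by the tool): IDs run 1, 2, 3, ... in file order,
    and gene names contain no whitespace.
    """
    gene_map = {}
    seen_names = set()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"expected two columns, got: {line!r}")
        gene_id, name = int(fields[0]), fields[1]
        if gene_id != len(gene_map) + 1:
            raise ValueError(f"gene ID {gene_id} is not the next one-based ID")
        if name in seen_names:
            raise ValueError(f"duplicate gene name: {name}")
        seen_names.add(name)
        gene_map[gene_id] = name
    return gene_map
```

For a file on disk, the function can be called as read_gene_map(open("genes.txt")), since iterating over a file yields its lines.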

Sample lines from the db_list.txt file:

 /x/y/database1/00000004.db
 /x/y/database2/00000004.db
 /x/y/database3/00000004.db

Note that database1, database2, and database3 are three Sleipnir::CDatabase instances generated for different datasets.

Note how we use the same ID 00000004 to ensure that the DB files cover the same genes.
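The same-ID rule can be verified before running the tool. This is a minimal sketch: shared_db_id is a hypothetical helper, and it treats the file name without its .db extension as the ID, as in the sample lines above.

```python
from pathlib import PurePosixPath


def shared_db_id(db_paths):
    """Return the ID shared by all DB file names, or raise ValueError if they differ."""
    ids = {PurePosixPath(p).stem for p in db_paths}
    if len(ids) != 1:
        raise ValueError(f"DB files must share one ID, found: {sorted(ids)}")
    return ids.pop()
```

Applied to the three sample lines, the function returns the string 00000004. A list that mixes 00000004.db with another ID raises an error instead.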

Detailed Usage

package "DBCombiner"
version "1.0"
purpose "Combines a list of DB files with the same gene content"

section "Mode"
option  "combine"           C   "Combine a set of DB's, each coming from a different dataset subset"
                                flag    off
option  "reorganize"        R   "Reorganize a set of DB's, such as from 21000 DB files to 1000 DB files, ie expanding/shrinking the number of genes a DB contains"
                                flag    off

section "Main"
option  "input"             i   "Input gene mapping"
                                string  typestr="filename"  yes 

section "Combine Mode"
option  "db"                x   "Input a set of databaselet filenames (including path)"
                                string typestr="filename"
option  "dir_out"           D   "Output database directory"
                                string  typestr="directory" default="."
option  "is_nibble"         N   "Whether the input DB is nibble type"
                                flag    off
option  "split"             s   "Split to one-gene per file"
                                flag    off

section "Reorganize Mode"
option  "dataset"           A   "Dataset-platform mapping file"
                                string typestr="filename"
option  "db_dir"            d   "Source DB collection directory"
                                string typestr="directory"
option  "src_db_num"        n   "Source DB number of files"
                                int
option  "dest_db_num"       b   "Destination DB number of files"
                                int
option  "dest_db_dir"       B   "Destination DB directory"
                                string typestr="directory"

Flag	Default	Type	Description
-i	None	Text file	Tab-delimited text file containing two columns, numerical gene IDs (one-based) and unique gene names (matching those in the input DAT/DAB files).
-d	None	Directory	Input directory containing *.db files
-D	.	Directory	Output directory in which database files will be stored.
-x	None	Text file	Input file containing a list of Sleipnir::CDatabaselet files to combine
-s	off	Flag	If enabled, split the combined Sleipnir::CDatabaselet into one gene per DB file
