Sleipnir
Encapsulates a simple indexless database allowing rapid per-gene access to values from many datasets.
#include <database.h>
Public Member Functions | |
bool | Open (const std::vector< std::string > &vecstrGenes, const std::string &strInputDirectory, const IBayesNet *pBayesNet, const std::string &strOutputDirectory, size_t iFiles, const map< string, size_t > &mapstriZeros) |
Construct a new database over the given genes from the given datasets and Bayes net. | |
bool | Open (const std::vector< std::string > &vecstrGenes, const std::vector< std::string > &vecstrDatasets, const std::string &strInputDirectory, const std::string &strOutputDirectory, size_t iFiles, const map< string, size_t > &mapstriZeros) |
CDatabase (bool isNibble) | |
bool | Reorganize (const char *, const size_t &) |
bool | GetGene (const string &, vector< unsigned char > &) const |
bool | GetGene (const size_t &, vector< unsigned char > &) const |
bool | Open (const std::string &strInputDirectory) |
Open an existing database from subfiles in the given directory. | |
bool | Get (size_t iOne, size_t iTwo, std::vector< unsigned char > &vecbData) const |
Retrieve data values from all datasets for a given gene pair. | |
bool | Get (size_t iGene, std::vector< unsigned char > &vecbData, bool fReplace=false) const |
Retrieve data values from all gene pairs over all datasets for a given gene. | |
bool | Get (size_t iGene, const std::vector< size_t > &veciGenes, std::vector< unsigned char > &vecbData, bool fReplace=false) const |
Retrieve data values from the indicated gene pairs over all datasets for a given gene. | |
size_t | GetGenes () const |
Returns the total number of genes in the database. | |
size_t | GetGene (const std::string &strGene) const |
Return the index of the given gene name, or -1 (converted to size_t, i.e. the largest representable value) if it is not included in the database. | |
const std::string & | GetGene (size_t iGene) const |
Returns the gene name at the given index. | |
size_t | GetDatasets () const |
Return the number of datasets stored in the database. | |
bool | Open (const string &, const vector< string > &, const size_t &, const size_t &) |
bool | Open (const char *, const vector< string > &, const size_t &, const size_t &) |
void | SetMemmap (bool fMemmap) |
Set memory mapping behavior when opening DAB files. | |
void | SetBlockOut (size_t iSize) |
Set output block size (number of subfiles to create at once). | |
void | SetBlockIn (size_t iSize) |
Set input block size (number of datasets to load at once). | |
void | SetBuffer (bool fBuffer) |
Set buffering behavior when creating new database subfiles. |
Encapsulates a simple indexless database allowing rapid per-gene access to values from many datasets.
CDatabase stores essentially the same data as IDataset; that is, a list of genes and, for each gene pair, zero or more discrete values drawn from many datasets. IDataset exposes this information so that values for many genes can be drawn rapidly from a single dataset. CDatabase exposes the same information so that values for one gene can be drawn rapidly from many datasets. A database can be constructed from a Bayes net and its accompanying data files (DABs/QUANTs), although this amounts to something like a very large matrix transposition, so care must be taken with blocking for large sets of data. CDatabase consumes very little memory, as data is organized on disk and read efficiently on an as-needed basis.
Database subfiles are of the form:
  4 byte unsigned int, size of header (non-data)
  4 byte unsigned int G, total number of genes
  4 byte unsigned int N, number of datasets
  4 byte unsigned int X = ~G/N, number of genes in subfile
  X null-terminated ASCII strings storing gene names
  X rows for genes g of the form:
    4-bit data values for pairs g,0 through g,G in dataset 0
    4-bit data values for pairs g,0 through g,G in dataset 1
    ...
    4-bit data values for pairs g,0 through g,G in dataset N
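As an illustration of the header fields listed above, the following is a minimal sketch of reading them from an in-memory copy of a subfile. It is not Sleipnir's own reader: `SubfileHeader`, `ReadU32LE`, and `ParseSubfileHeader` are hypothetical names, and the little-endian byte order is an assumption that the description above does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory view of a subfile header; not a Sleipnir type.
struct SubfileHeader {
    std::uint32_t headerSize = 0;        // size of header (non-data), in bytes
    std::uint32_t totalGenes = 0;        // G, total number of genes
    std::uint32_t datasets = 0;          // N, number of datasets
    std::vector<std::string> geneNames;  // the X gene names stored in this subfile
};

// Read a 4-byte unsigned int. ASSUMPTION: little-endian byte order.
static std::uint32_t ReadU32LE(const std::vector<unsigned char>& buf, std::size_t off) {
    assert(off + 4 <= buf.size());
    return static_cast<std::uint32_t>(buf[off]) |
           (static_cast<std::uint32_t>(buf[off + 1]) << 8) |
           (static_cast<std::uint32_t>(buf[off + 2]) << 16) |
           (static_cast<std::uint32_t>(buf[off + 3]) << 24);
}

// Parse the fields in the order listed in the layout. Returns false if the
// buffer is too short or a gene name lacks its null terminator.
bool ParseSubfileHeader(const std::vector<unsigned char>& buf, SubfileHeader& out) {
    if (buf.size() < 16) return false;
    out.headerSize = ReadU32LE(buf, 0);
    out.totalGenes = ReadU32LE(buf, 4);
    out.datasets = ReadU32LE(buf, 8);
    const std::uint32_t genesInSubfile = ReadU32LE(buf, 12);  // X

    out.geneNames.clear();
    std::size_t pos = 16;
    for (std::uint32_t i = 0; i < genesInSubfile; ++i) {
        std::string name;
        while (pos < buf.size() && buf[pos] != 0) {
            name.push_back(static_cast<char>(buf[pos++]));
        }
        if (pos >= buf.size()) return false;  // ran off the end: no terminator
        ++pos;                                // skip the null terminator
        out.geneNames.push_back(name);
    }
    return true;
}
```

The stored header size can then be used to skip directly to the first data row, without recomputing the length of the gene name strings.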
Definition at line 73 of file database.h.
bool Sleipnir::CDatabase::Get (size_t iOne, size_t iTwo, std::vector< unsigned char > &vecbData) const [inline]
Retrieve data values from all datasets for a given gene pair.
iOne | First gene index. |
iTwo | Second gene index. |
vecbData | Output vector containing the retrieved values, two per byte. |
Definition at line 179 of file database.h.
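Because the output vector holds two 4-bit values per byte, callers typically need to unpack it before use. The sketch below shows one way to do that. `UnpackNibbles` is a hypothetical helper, not part of Sleipnir, and it assumes the first value of each pair sits in the low nibble; the documentation does not specify the nibble order, so this should be checked against real data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Expand `count` 4-bit values from a packed byte vector into one value per
// element. ASSUMPTION: within each byte, the low nibble holds the earlier
// value and the high nibble the later one.
std::vector<unsigned char> UnpackNibbles(const std::vector<unsigned char>& packed,
                                         std::size_t count) {
    assert(count <= packed.size() * 2);
    std::vector<unsigned char> values;
    values.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const unsigned char byte = packed[i / 2];
        const unsigned char value = (i % 2 == 0)
            ? static_cast<unsigned char>(byte & 0x0F)   // low nibble
            : static_cast<unsigned char>(byte >> 4);    // high nibble
        values.push_back(value);
    }
    return values;
}
```

The explicit `count` argument matters when the number of values is odd, since the final byte then carries one unused nibble.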
bool Sleipnir::CDatabase::Get (size_t iGene, std::vector< unsigned char > &vecbData, bool fReplace=false) const [inline]
Retrieve data values from all gene pairs over all datasets for a given gene.
iGene | Gene index. |
vecbData | Output vector containing the retrieved values, two per byte. |
fReplace | If true, replace values in output vector rather than appending. |
Definition at line 206 of file database.h.
bool Sleipnir::CDatabase::Get (size_t iGene, const std::vector< size_t > &veciGenes, std::vector< unsigned char > &vecbData, bool fReplace=false) const [inline]
Retrieve data values from the indicated gene pairs over all datasets for a given gene.
iGene | First gene index. |
veciGenes | Second gene indices. |
vecbData | Output vector containing the retrieved values, two per byte. |
fReplace | If true, replace values in output vector rather than appending. |
Definition at line 236 of file database.h.
size_t Sleipnir::CDatabase::GetDatasets () const [inline]
Return the number of datasets stored in the database.
Definition at line 301 of file database.h.
size_t Sleipnir::CDatabase::GetGene (const std::string &strGene) const [inline]
Return the index of the given gene name, or -1 (converted to size_t, i.e. the largest representable value) if it is not included in the database.
strGene | Gene name to retrieve. |
Reimplemented from Sleipnir::CDatabaseImpl.
Definition at line 263 of file database.h.
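Since the return type is unsigned, the documented -1 cannot be detected with a `< 0` test; callers have to compare against the converted value instead. The following self-contained sketch illustrates the pattern on a plain vector of names. `kNotFound`, `FindGene`, and `NameAt` are hypothetical stand-ins, not Sleipnir functions.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical sentinel mirroring the documented "-1" result: converting -1
// to an unsigned type yields that type's maximum value.
constexpr std::size_t kNotFound = static_cast<std::size_t>(-1);
static_assert(kNotFound == std::numeric_limits<std::size_t>::max(),
              "-1 converts to the maximum size_t");

// Stand-in for a name-to-index lookup; returns kNotFound when absent.
std::size_t FindGene(const std::vector<std::string>& names, const std::string& gene) {
    for (std::size_t i = 0; i < names.size(); ++i) {
        if (names[i] == gene) return i;
    }
    return kNotFound;
}

// Stand-in for an index-to-name lookup; the index must be valid.
const std::string& NameAt(const std::vector<std::string>& names, std::size_t index) {
    assert(index != kNotFound && index < names.size());
    return names[index];
}
```

A caller would check `FindGene(names, gene) == kNotFound` before passing the result on as an index.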
const std::string & Sleipnir::CDatabase::GetGene (size_t iGene) const [inline]
Returns the gene name at the given index.
iGene | Index of gene name to return. |
Definition at line 284 of file database.h.
size_t Sleipnir::CDatabase::GetGenes () const [inline]
Returns the total number of genes in the database.
Definition at line 249 of file database.h.
bool Sleipnir::CDatabase::Open (const std::vector< std::string > &vecstrGenes, const std::string &strInputDirectory, const IBayesNet *pBayesNet, const std::string &strOutputDirectory, size_t iFiles, const map< string, size_t > &mapstriZeros)
Construct a new database over the given genes from the given datasets and Bayes net.
vecstrGenes | Gene names (and size) for the new database. |
strInputDirectory | Directory containing DAB datasets to be collected into the new database. |
pBayesNet | Bayes net with which the resulting database will be used (indicates the order in which the datasets are stored in the database). |
strOutputDirectory | Directory in which the new database files are generated. |
iFiles | Number of files across which the database should be constructed. |
Constructs a new database over the indicated G genes by creating the requested number of new files in the output directory and, for each gene g, collecting its data in row major order: values for pairs g,0 through g,G in dataset 0, then dataset 1, and so forth. Datasets are loaded from the indicated directory and ordered as specified in the given Bayes net; this allows data to be loaded very rapidly from the resulting database and immediately inserted into the Bayes net (or one with identical structure) for inference.
Note that for large collections of data, this can be an extremely time- and memory-intensive process: to construct a database row for some gene, every dataset must be inspected to collect (and discretize) all values for all pairs that include that gene. For organisms with large genomes, it is generally not feasible to hold more than a few datasets in memory simultaneously, but reloading every dataset for every gene wastes far too much time reading and rereading DAB files, even if they are memory mapped.
The solution is to use blocking: for gene block size b and dataset block size B, load the first B datasets. Then copy out values for the first b genes. Then load the next B datasets, copy b genes' values, and so forth. Then repeat the whole process for the next b genes, and so on until the entire given gene set has been covered. Correctly balancing the input block size (number of datasets to load) and output block size (number of database subfiles, and thus genes, to produce at once) can reduce the generation time of large databases from months to hours.
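The loop structure described above can be sketched on toy data as follows. This is an illustrative model under simplifying assumptions, not Sleipnir's implementation: each dataset is reduced to one value per gene (rather than one per gene pair), "loading" a dataset is modeled by incrementing a counter, and `BlockedResult` and `BuildBlocked` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: transposed rows plus a count of dataset loads.
struct BlockedResult {
    std::vector<std::vector<int>> rows;  // rows[g][d]: value for gene g in dataset d
    std::size_t datasetLoads;            // how many times a dataset was (re)loaded
};

// datasets[d][g] is the value for gene g in dataset d.
// b is the gene (output) block size, B the dataset (input) block size.
BlockedResult BuildBlocked(const std::vector<std::vector<int>>& datasets,
                           std::size_t genes, std::size_t b, std::size_t B) {
    assert(b > 0 && B > 0);
    BlockedResult result;
    result.rows.assign(genes, std::vector<int>(datasets.size(), 0));
    result.datasetLoads = 0;

    for (std::size_t g0 = 0; g0 < genes; g0 += b) {               // next b genes
        const std::size_t g1 = std::min(g0 + b, genes);
        for (std::size_t d0 = 0; d0 < datasets.size(); d0 += B) { // next B datasets
            const std::size_t d1 = std::min(d0 + B, datasets.size());
            result.datasetLoads += d1 - d0;  // model loading this dataset block
            for (std::size_t g = g0; g < g1; ++g) {
                for (std::size_t d = d0; d < d1; ++d) {
                    assert(g < datasets[d].size());
                    result.rows[g][d] = datasets[d][g];  // copy out this gene's value
                }
            }
        }
    }
    return result;
}
```

In this model every dataset is loaded once per gene block, so the total number of loads is the number of gene blocks times the number of datasets, while at most B datasets are resident at any time. A larger b reduces reloading at the cost of more output held open at once, and a larger B trades memory for fewer, larger load steps.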
bool Sleipnir::CDatabase::Open (const std::string &strInputDirectory)
Open an existing database from subfiles in the given directory.
strInputDirectory | Directory from which the database files are read. |
Definition at line 964 of file database.cpp.
References Sleipnir::CMeta::Basename(), Sleipnir::CMeta::Deextension(), and Sleipnir::CMeta::IsExtension().
void Sleipnir::CDatabase::SetBlockIn (size_t iSize) [inline]
Set input block size (number of datasets to load at once).
iSize | Number of datasets to load at once. |
Set the number of input DAB files (and thus the number of datasets) to process simultaneously when creating a new database. Defaults to -1 (which, for an unsigned size_t, is the largest representable value), indicating that all datasets should be loaded at once.
Definition at line 363 of file database.h.
void Sleipnir::CDatabase::SetBlockOut (size_t iSize) [inline]
Set output block size (number of subfiles to create at once).
iSize | Number of subfiles to create at once. |
Set the number of output files (and thus the number of genes) to process simultaneously when creating a new database from DAB files. Defaults to -1 (which, for an unsigned size_t, is the largest representable value), indicating that all genes should be processed in one pass.
Definition at line 342 of file database.h.
void Sleipnir::CDatabase::SetBuffer (bool fBuffer) [inline]
Set buffering behavior when creating new database subfiles.
fBuffer | Value to store for buffering behavior. |
If buffering is true, database subfiles will be constructed in memory and written to disk as a single unit at the end of each block. Default is false.
Definition at line 384 of file database.h.
void Sleipnir::CDatabase::SetMemmap (bool fMemmap) [inline]
Set memory mapping behavior when opening DAB files.
fMemmap | Value to store for memory mapping behavior. |
If memory mapping is true, DAB files will be memory mapped when a database is constructed using Open. Default is false.
Definition at line 320 of file database.h.