Encapsulates a simple indexless database allowing rapid per-gene access to values from many datasets.

#include <database.h>

Public Member Functions
bool	Open (const std::vector< std::string > &vecstrGenes, const std::string &strInputDirectory, const IBayesNet *pBayesNet, const std::string &strOutputDirectory, size_t iFiles, const map< string, size_t > &mapstriZeros)
	Construct a new database over the given genes from the given datasets and Bayes net.
bool	Open (const std::vector< std::string > &vecstrGenes, const std::vector< std::string > &vecstrDatasets, const std::string &strInputDirectory, const std::string &strOutputDirectory, size_t iFiles, const map< string, size_t > &mapstriZeros)
	CDatabase (bool isNibble)
bool	Reorganize (const char *, const size_t &)
bool	GetGene (const string &, vector< unsigned char > &) const
bool	GetGene (const size_t &, vector< unsigned char > &) const
bool	Open (const std::string &strInputDirectory)
	Open an existing database from subfiles in the given directory.
bool	Get (size_t iOne, size_t iTwo, std::vector< unsigned char > &vecbData) const
	Retrieve data values from all datasets for a given gene pair.
bool	Get (size_t iGene, std::vector< unsigned char > &vecbData, bool fReplace=false) const
	Retrieve data values from all gene pairs over all datasets for a given gene.
bool	Get (size_t iGene, const std::vector< size_t > &veciGenes, std::vector< unsigned char > &vecbData, bool fReplace=false) const
	Retrieve data values from the indicated gene pairs over all datasets for a given gene.
size_t	GetGenes () const
	Returns the total number of genes in the database.
size_t	GetGene (const std::string &strGene) const
	Return the index of the given gene name, or -1 if it is not included in the database.
const std::string &	GetGene (size_t iGene) const
	Returns the gene name at the given index.
size_t	GetDatasets () const
	Return the number of datasets stored in the database.
bool	Open (const string &, const vector< string > &, const size_t &, const size_t &)
bool	Open (const char *, const vector< string > &, const size_t &, const size_t &)
void	SetMemmap (bool fMemmap)
	Set memory mapping behavior when opening DAB files.
void	SetBlockOut (size_t iSize)
	Set output block size (number of subfiles to create at once).
void	SetBlockIn (size_t iSize)
	Set input block size (number of datasets to load at once).
void	SetBuffer (bool fBuffer)
	Set buffering behavior when creating new database subfiles.

Detailed Description

Encapsulates a simple indexless database allowing rapid per-gene access to values from many datasets.

CDatabase stores essentially the same data as IDataset; that is, a list of genes and, for each gene pair, zero or more discrete values drawn from many datasets. IDataset exposes this information so that values for many genes can be drawn rapidly from a single dataset. CDatabase exposes the same information so that values for one gene can be drawn rapidly from many datasets. A database can be constructed from a Bayes net and its accompanying data files (DABs/QUANTs), although this amounts to something like a very large matrix transposition, so care must be taken with blocking for large sets of data. CDatabase consumes very little memory, as data is organized on disk and read efficiently on an as-needed basis.

Remarks:

Data is stored in discretized form (and thus drawn from DAB/QUANT pairs) using only four bits per gene pair per dataset. This means that no dataset can be discretized into more than 15 bins, since one of the 16 possible values is reserved to indicate missing data. Data is spread across an arbitrary number of database subfiles, each containing all data associated with one or more genes. For G genes spread across F subfiles, the nth subfile will contain the roughly G/F genes n, n + F, n + 2F, and so forth. Each gene has one value associated with it per partner gene per dataset, and these are stored in row-major order: gene g's values are stored first for dataset 0, genes 0 through G - 1, then dataset 1, genes 0 through G - 1, and so forth. The layout of each subfile on disk is thus:

 4 byte unsigned int, size of header (non-data)
 4 byte unsigned int G, total number of genes
 4 byte unsigned int N, number of datasets
 4 byte unsigned int X = ~G/F, number of genes in subfile
 X null-terminated ASCII strings storing gene names
 X rows for genes g of the form:
   4-bit data values for pairs g,0 through g,G-1 in dataset 0
   4-bit data values for pairs g,0 through g,G-1 in dataset 1
   ...
   4-bit data values for pairs g,0 through g,G-1 in dataset N-1
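The index arithmetic implied by the layout above can be sketched as follows. This is a minimal illustration, not Sleipnir's actual code: it assumes genes are dealt round-robin across subfiles and that a gene's row is packed two values per byte with no padding between datasets. The names SLocation, LocateGene, and RowBytes are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type; not part of the Sleipnir API.
struct SLocation {
    std::size_t iSubfile; // which subfile holds the gene
    std::size_t iRow;     // row of the gene within that subfile
};

// Assumption: round-robin assignment, so gene i lives in
// subfile (i % iSubfiles) at row (i / iSubfiles).
inline SLocation LocateGene(std::size_t iGene, std::size_t iSubfiles) {
    assert(iSubfiles > 0);
    SLocation sRet = { iGene % iSubfiles, iGene / iSubfiles };
    return sRet;
}

// Assumption: one row holds iGenes values for each of iDatasets datasets,
// packed as two 4-bit values per byte and rounded up to a whole byte.
inline std::size_t RowBytes(std::size_t iGenes, std::size_t iDatasets) {
    return (iGenes * iDatasets + 1) / 2;
}
```

Under these assumptions, gene 7 in a three-subfile database would sit in subfile 1 at row 2, and a row for 5 genes over 3 datasets would occupy 8 bytes.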

It is possible for a CDatabase to be larger than the input DABs, even though it reduces 32-bit floating point values to 4-bit discrete values, since A) each data point is stored twice, once for pair g,h and once for pair h,g, and B) a value must be stored for every gene pair in every dataset, even if it's missing. These two facts guarantee very rapid database lookups from disk, but they can in the worst case roughly double the size of the data on disk relative to input DABs.

See also:: IBayesNet | CDat | CDataPair

Definition at line 73 of file database.h.

Member Function Documentation

bool Sleipnir::CDatabase::Get	(	size_t	iOne,
		size_t	iTwo,
		std::vector< unsigned char > &	vecbData
	)		const `[inline]`

Retrieve data values from all datasets for a given gene pair.

Parameters:

iOne	First gene index.
iTwo	Second gene index.
vecbData	Output vector containing the retrieved values, two per byte.

Returns:: True if data was retrieved successfully.

Remarks:: For efficiency, no bounds checking is performed; however, gene indices will be wrapped into the appropriate range using modulus, so something will be returned without a crash even for bad input. vecbData is automatically resized to the appropriate length. Note that each data value is returned using only four bits, with even numbered datasets in the high order bits.

Definition at line 179 of file database.h.
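Because values come back packed two per byte, callers typically need to expand them before use. The helper below is a hypothetical sketch (UnpackNibbles is not part of Sleipnir) that follows the packing described in the remarks: even-numbered values in the high-order four bits, odd-numbered values in the low-order four bits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: expand packed 4-bit values into one value per element.
// iValues is the number of values wanted, e.g. the number of datasets.
inline std::vector<unsigned char> UnpackNibbles(
        const std::vector<unsigned char>& vecbData, std::size_t iValues) {
    assert(iValues <= 2 * vecbData.size());
    std::vector<unsigned char> vecbRet(iValues);
    for (std::size_t i = 0; i < iValues; ++i) {
        const unsigned char b = vecbData[i / 2];
        // Even positions use the high nibble, odd positions the low nibble.
        vecbRet[i] = static_cast<unsigned char>((i % 2 == 0) ? (b >> 4) : (b & 0xF));
    }
    return vecbRet;
}
```

For example, the two bytes 0x1F and 0x20 unpacked as three values give 1, 15, and 2. Which of the 16 values marks missing data is not stated here, so the sketch does not interpret it.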

bool Sleipnir::CDatabase::Get	(	size_t	iGene,
		std::vector< unsigned char > &	vecbData,
		bool	fReplace = `false`
	)		const `[inline]`

Retrieve data values from all gene pairs over all datasets for a given gene.

Parameters:

iGene	Gene index.
vecbData	Output vector containing the retrieved values, two per byte.
fReplace	If true, replace values in output vector rather than appending.

Returns:: True if data was retrieved successfully.

Remarks:: For efficiency, no bounds checking is performed; however, the given gene index will be wrapped into the appropriate range using modulus, so something will be returned without a crash even for bad input. vecbData is automatically resized to the appropriate length. Note that each data value is returned using only four bits, with even numbered datasets in the high order bits. This is equivalent to calling Get with two gene indices repeatedly for iTwo from 0 to G - 1 and concatenating the results.

Definition at line 206 of file database.h.

bool Sleipnir::CDatabase::Get	(	size_t	iGene,
		const std::vector< size_t > &	veciGenes,
		std::vector< unsigned char > &	vecbData,
		bool	fReplace = `false`
	)		const `[inline]`

Retrieve data values from the indicated gene pairs over all datasets for a given gene.

Parameters:

iGene	First gene index.
veciGenes	Second gene indices.
vecbData	Output vector containing the retrieved values, two per byte.
fReplace	If true, replace values in output vector rather than appending.

Returns:: True if data was retrieved successfully.

Remarks:: For efficiency, no bounds checking is performed; however, the gene indices will be wrapped into the appropriate range using modulus, so something will be returned without a crash even for bad input. vecbData is automatically resized to the appropriate length. Note that each data value is returned using only four bits, with even numbered datasets in the high order bits. This is equivalent to calling Get with two gene indices repeatedly for iTwo in veciGenes and concatenating the results.

Definition at line 236 of file database.h.

size_t Sleipnir::CDatabase::GetDatasets ( ) const [inline]

Return the number of datasets stored in the database.

Returns:: Number of datasets in the database.

Remarks:: Number of datasets is constant across subfiles, so the number in the first subfile (if present) is actually returned.

Definition at line 301 of file database.h.

size_t Sleipnir::CDatabase::GetGene ( const std::string & strGene ) const [inline]

Return the index of the given gene name, or -1 if it is not included in the database.

Parameters:

strGene Gene name to retrieve.

Returns:: Index of the requested gene name, or -1 if it is not in the database.

Reimplemented from Sleipnir::CDatabaseImpl.

Definition at line 263 of file database.h.

const std::string& Sleipnir::CDatabase::GetGene ( size_t iGene ) const [inline]

Returns the gene name at the given index.

Parameters:

iGene Index of gene name to return.

Returns:: Gene name at the requested index.

Remarks:: For efficiency, no bounds checking is performed; however, the given gene index will be wrapped into the appropriate range using modulus, so something will be returned without a crash even for bad input.

See also:: GetGenes

Definition at line 284 of file database.h.

size_t Sleipnir::CDatabase::GetGenes ( ) const [inline]

Returns the total number of genes in the database.

Returns:: Total number of genes in the database.

Definition at line 249 of file database.h.

bool Sleipnir::CDatabase::Open	(	const std::vector< std::string > &	vecstrGenes,
		const std::string &	strInputDirectory,
		const IBayesNet *	pBayesNet,
		const std::string &	strOutputDirectory,
		size_t	iFiles,
		const map< string, size_t > &	mapstriZeros
	)

Construct a new database over the given genes from the given datasets and Bayes net.

Parameters:

vecstrGenes	Gene names (and size) for the new database.
strInputDirectory	Directory containing DAB datasets to be collected into the new database.
pBayesNet	Bayes net with which the resulting database will be used (indicates the order in which the datasets are stored in the database).
strOutputDirectory	Directory in which the new database files are generated.
iFiles	Number of files across which the database should be constructed.

Returns:: True if the database was constructed successfully.

Constructs a new database over the indicated G genes by creating the requested number of new files in the output directory and, for each gene g, collecting its data in row major order: values for pairs g,0 through g,G in dataset 0, then dataset 1, and so forth. Datasets are loaded from the indicated directory and ordered as specified in the given Bayes net; this allows data to be loaded very rapidly from the resulting database and immediately inserted into the Bayes net (or one with identical structure) for inference.

Note that for large collections of data, this can be an extremely time- and memory-intensive process. To construct a database row for some gene, every single dataset must be inspected to collect (and discretize) all values for all pairs including that gene. For organisms with large genomes, it's impossible to load more than a few datasets into memory simultaneously, but reloading every dataset for every gene wastes far too much time loading and reloading data from DAB files, even if they're memory mapped.

The solution is to use blocking: for gene block size b and dataset block size B, load the first B datasets. Then copy out values for the first b genes. Then load the next B datasets, copy b genes' values, and so forth. Then repeat the whole process for the next b genes, and so on until the entire given gene set has been covered. Correctly balancing the input block size (number of datasets to load) and output block size (number of database subfiles, and thus genes, to produce at once) can reduce the generation time of large databases from months to hours.
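The loop structure described above can be sketched as below. This is an illustrative model only, not the library's implementation: CountDatasetLoads is a hypothetical function that walks the same two nested block loops and counts how many individual dataset loads they cause, which is the cost that blocking is meant to reduce.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: count dataset loads for gene block size iBlockOut (b)
// and dataset block size iBlockIn (B).
inline std::size_t CountDatasetLoads(std::size_t iGenes, std::size_t iDatasets,
                                     std::size_t iBlockOut, std::size_t iBlockIn) {
    assert(iBlockOut > 0 && iBlockIn > 0);
    std::size_t iLoads = 0;
    // Outer loop: one pass per block of b genes.
    for (std::size_t iGene = 0; iGene < iGenes; iGene += iBlockOut) {
        // Inner loop: load B datasets at a time until all are covered.
        for (std::size_t iData = 0; iData < iDatasets; iData += iBlockIn) {
            const std::size_t iEnd = std::min(iData + iBlockIn, iDatasets);
            iLoads += iEnd - iData; // datasets [iData, iEnd) are loaded here
            // A real implementation would now copy out the values for
            // genes [iGene, iGene + iBlockOut) from the loaded datasets.
        }
    }
    return iLoads;
}
```

In this model every dataset is loaded once per gene block, so the total depends on b, while B bounds how many datasets are held in memory at once. With 20,000 genes and 100 datasets, a gene block of 1 gives 2,000,000 loads, whereas a gene block of 1,000 gives 2,000.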

Remarks:: iFiles should not be substantially larger than 1000 due to filesystem and file handle limits.

See also:: SetBlockIn | SetBlockOut

bool Sleipnir::CDatabase::Open ( const std::string & strInputDirectory )

Open an existing database from subfiles in the given directory.

Parameters:

strInputDirectory Directory from which the database files are read.

Returns:: True if the database was opened successfully.

Definition at line 964 of file database.cpp.

References Sleipnir::CMeta::Basename(), Sleipnir::CMeta::Deextension(), and Sleipnir::CMeta::IsExtension().

void Sleipnir::CDatabase::SetBlockIn ( size_t iSize ) [inline]

Set input block size (number of datasets to load at once).

Parameters:

iSize Number of datasets to load at once.

Set the number of input DAB files (and thus the number of datasets) to process simultaneously when creating a new database. Defaults to -1, indicating that all datasets should be loaded at once.

Remarks:: Optimal settings depend on genome size, number of datasets, physical memory of the host machine, and the output block size.

See also:: Open | SetBlockOut

Definition at line 363 of file database.h.

void Sleipnir::CDatabase::SetBlockOut ( size_t iSize ) [inline]

Set output block size (number of subfiles to create at once).

Parameters:

iSize Number of subfiles to create at once.

Set the number of output files (and thus the number of genes) to process simultaneously when creating a new database from DAB files. Defaults to -1, indicating that all genes should be processed in one pass.

Remarks:: Optimal settings depend on genome size, number of datasets, physical memory of the host machine, and the input block size.

See also:: Open | SetBlockIn

Definition at line 342 of file database.h.

void Sleipnir::CDatabase::SetBuffer ( bool fBuffer ) [inline]

Set buffering behavior when creating new database subfiles.

Parameters:

fBuffer Value to store for buffering behavior.

If buffering is true, database subfiles will be constructed in memory and written to disk as a single unit at the end of each block. Default is false.

Remarks:: Whether this improves performance has not been established; it may be OS-dependent, since some operating systems do a better job of write-buffering than others.

See also:: Open

Definition at line 384 of file database.h.

void Sleipnir::CDatabase::SetMemmap ( bool fMemmap ) [inline]

Set memory mapping behavior when opening DAB files.

Parameters:

fMemmap Value to store for memory mapping behavior.

If memory mapping is true, DAB files will be memory mapped when a database is constructed using Open. Default is false.

Definition at line 320 of file database.h.

The documentation for this class was generated from the following files:

src/database.h
src/database.cpp
