Sleipnir
Sleipnir::CDatasetCompact Class Reference

An implementation of IDataset optimized for compactly storing discrete data.

#include <dataset.h>

Inheritance diagram for Sleipnir::CDatasetCompact:
Sleipnir::CDatasetCompactImpl, Sleipnir::IDataset, Sleipnir::CDataImpl, Sleipnir::CDatasetCompactMap

Public Member Functions

bool Open (const CDataPair &Answers, const char *szDataDirectory, const IBayesNet *pBayesNet, bool fEverything=false)
 Construct a dataset corresponding to the given Bayes net using the provided answer file and data files from the given directory.
bool Open (const CDataPair &Answers, const char *szDataDirectory, const IBayesNet *pBayesNet, const CGenes &GenesInclude, const CGenes &GenesExclude, bool fEverything=false)
 Construct a dataset corresponding to the given Bayes net using the provided answer file and data files from the given directory.
bool Open (const std::vector< std::string > &vecstrDataFiles, bool fMemmap=false)
 Construct a dataset corresponding to the given files.
bool Open (std::istream &istm)
 Load a binary DAD dataset from the given binary stream.
bool Open (const CGenes &GenesInclude, const CGenes &GenesExclude, const CDataPair &Answers, const std::vector< std::string > &vecstrPCLs, size_t iSkip, const IMeasure *pMeasure, const std::vector< float > &vecdBinEdges)
 Constructs a dataset using the provided answer file and data matrices generated from the given PCLs and similarity measure.
bool Open (const CDataPair &Answers, const std::vector< std::string > &vecstrDataFiles, bool fEverything=false, bool fMemmap=false, size_t iSkip=2, bool fZScore=false)
 Construct a dataset using the provided answer file and data files.
bool FilterGenes (const char *szGenes, CDat::EFilter eFilter)
 Remove values from the dataset based on the given gene file and filter type.
void FilterAnswers ()
 Removes all data for gene pairs lacking a value in the answer (0th) data file.
void Randomize ()
 Randomizes the contents of all data files except the answer (0th) data file.
bool Open (const char *szDataDirectory, const IBayesNet *pBayesNet)
 Construct a dataset corresponding to the given Bayes net using files from the given directory.
bool Open (const char *szDataDirectory, const IBayesNet *pBayesNet, const CGenes &GenesInclude, const CGenes &GenesExclude)
 Construct a dataset corresponding to the given Bayes net using files from the given directory.
bool OpenGenes (const std::vector< std::string > &vecstrDataFiles)
 Open only the merged gene list from the given data files.
void Save (std::ostream &ostm, bool fBinary) const
 Save a dataset to the given stream in binary or tabular (human readable) form.
float GetContinuous (size_t iY, size_t iX, size_t iNode) const
 Return the continuous value at the requested position.
const std::string & GetGene (size_t iGene) const
 Returns the gene name at the requested index.
size_t GetGenes () const
 Returns the number of genes in the dataset.
bool IsExample (size_t iY, size_t iX) const
 Returns true if some data file can be accessed at the requested position.
void FilterGenes (const CGenes &Genes, CDat::EFilter eFilter)
 Remove values from the dataset based on the given gene set and filter type.
bool IsHidden (size_t iNode) const
 Returns true if the requested experimental node is hidden (does not correspond to a data file).
size_t GetDiscrete (size_t iY, size_t iX, size_t iNode) const
 Return the discretized value at the requested position.
const std::vector< std::string > & GetGeneNames () const
 Return a vector of all gene names in the dataset.
size_t GetExperiments () const
 Return the number of experimental nodes in the dataset.
size_t GetGene (const std::string &strGene) const
 Return the index of the given gene name, or -1 if it is not included in the dataset.
size_t GetBins (size_t iNode) const
 Return the number of discrete values in the requested experimental node; -1 if the node is hidden or continuous.
void Remove (size_t iY, size_t iX)
 Remove all data for the given dataset position.

Detailed Description

An implementation of IDataset optimized for compactly storing discrete data.

A compact dataset represents a collection of pre-discretized data files in a compact form. This can be stored independently of the original continuous data (usually DAB/QUANT file pairs) for rapid reloading (or memory mapping) without the overhead of repeated discretization. Such a file is referred to as a DAD file and can be stored in either binary or text (human readable) form. As text, it is a large tab-delimited table of the form:

 GENE1  GENE2   VALUE1  VALUE2  ... VALUEN
 GENE1  GENE3   VALUE1  VALUE2  ... VALUEN
 GENE2  GENE3   VALUE1  VALUE2  ... VALUEN

Like a DAT file, gene pair order is arbitrary, and duplicate gene pairs are not recommended. Missing values are indicated by blank cells, and all other values should be small integers (i.e. discretized values).
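
The text layout described above can be read one row at a time. The following is a minimal sketch, assuming plain tab-delimited rows with two gene names followed by integer cells; `DadRow`, `SplitTabs`, `ParseDadRow`, and `kBlankCell` are hypothetical names introduced for illustration and are not part of Sleipnir.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical marker for a blank (missing) cell in a parsed row.
const int kBlankCell = -1;

// Hypothetical record for one row of a text DAD table.
struct DadRow {
    std::string geneA;
    std::string geneB;
    std::vector<int> values;  // kBlankCell where the cell was blank
};

// Split a line on tabs, keeping empty fields so blank cells survive.
std::vector<std::string> SplitTabs(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, '\t'))
        fields.push_back(field);
    // std::getline drops an empty field after a trailing tab; restore it.
    if (!line.empty() && line[line.size() - 1] == '\t')
        fields.push_back("");
    return fields;
}

// Parse "GENE1<TAB>GENE2<TAB>V1<TAB>V2..."; returns false on malformed input.
bool ParseDadRow(const std::string& line, DadRow& row) {
    const std::vector<std::string> fields = SplitTabs(line);
    if (fields.size() < 2 || fields[0].empty() || fields[1].empty())
        return false;
    row.geneA = fields[0];
    row.geneB = fields[1];
    row.values.clear();
    for (std::size_t i = 2; i < fields.size(); ++i) {
        if (fields[i].empty()) {  // blank cell: missing value
            row.values.push_back(kBlankCell);
            continue;
        }
        for (std::size_t j = 0; j < fields[i].size(); ++j)
            if (fields[i][j] < '0' || fields[i][j] > '9')
                return false;  // only small non-negative integers expected
        row.values.push_back(static_cast<int>(std::stoul(fields[i])));
    }
    return true;
}
```

For example, parsing the row `GENE1<TAB>GENE2<TAB>0<TAB><TAB>2` with this sketch yields three values, the second of which is marked missing.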

Remarks:
Stores all loaded data in CCompactMatrix objects. Attempts to use with continuous Bayes nets or with QUANTless CDats will fail in one way or another.
See also:
CDat

Definition at line 454 of file dataset.h.


Member Function Documentation

void Sleipnir::CDatasetCompact::FilterAnswers ( )

Removes all data for gene pairs lacking a value in the answer (0th) data file.

See also:
Remove

Definition at line 386 of file datasetcompact.cpp.

References GetDiscrete(), GetGenes(), IsExample(), and Remove().
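
The filtering rule can be illustrated on a toy container. This is a sketch only: `ToyDataset` and `kNoValue` are hypothetical stand-ins that borrow the method names listed in the references above, and the dense storage and loop order are assumptions rather than the Sleipnir implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sentinel for a missing discrete value (-1 converted to size_t).
const std::size_t kNoValue = static_cast<std::size_t>(-1);

// Hypothetical in-memory stand-in for a dataset; node 0 plays the answer file.
class ToyDataset {
public:
    ToyDataset(std::size_t genes, std::size_t nodes)
        : m_genes(genes), m_nodes(nodes),
          m_values(genes * genes * nodes, kNoValue) {}

    std::size_t GetGenes() const { return m_genes; }

    std::size_t GetDiscrete(std::size_t iY, std::size_t iX, std::size_t iNode) const {
        return m_values[Index(iY, iX, iNode)];
    }

    void Set(std::size_t iY, std::size_t iX, std::size_t iNode, std::size_t value) {
        m_values[Index(iY, iX, iNode)] = value;
    }

    // A pair is a usable example if at least one node has a value for it.
    bool IsExample(std::size_t iY, std::size_t iX) const {
        for (std::size_t n = 0; n < m_nodes; ++n)
            if (GetDiscrete(iY, iX, n) != kNoValue)
                return true;
        return false;
    }

    // Remove all data for the given pair.
    void Remove(std::size_t iY, std::size_t iX) {
        for (std::size_t n = 0; n < m_nodes; ++n)
            Set(iY, iX, n, kNoValue);
    }

    // Drop every pair that has data but no value in the answer (0th) node.
    void FilterAnswers() {
        for (std::size_t i = 0; i < GetGenes(); ++i)
            for (std::size_t j = i + 1; j < GetGenes(); ++j)
                if (IsExample(i, j) && GetDiscrete(i, j, 0) == kNoValue)
                    Remove(i, j);
    }

private:
    std::size_t Index(std::size_t iY, std::size_t iX, std::size_t iNode) const {
        return (iY * m_genes + iX) * m_nodes + iNode;
    }

    std::size_t m_genes;
    std::size_t m_nodes;
    std::vector<std::size_t> m_values;
};
```

After `FilterAnswers()` in this sketch, a pair with a value only in a non-answer node is no longer reported by `IsExample`, while answered pairs keep all of their data.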

bool Sleipnir::CDatasetCompact::FilterGenes ( const char *  szGenes,
CDat::EFilter  eFilter 
)

Remove values from the dataset based on the given gene file and filter type.

Parameters:
szGenes: File from which gene names are loaded, one per line.
eFilter: Way in which to use the given genes to remove values.

Remove values and genes (by removing all incident edges) from the dataset based on one of several algorithms. For details, see CDat::EFilter.

Remarks:
Generally implemented using Remove; clears the filtered data.
See also:
CDat::FilterGenes

Definition at line 367 of file datasetcompact.cpp.

References Sleipnir::CGenes::Open().

Referenced by FilterGenes(), and Open().

void Sleipnir::CDatasetCompact::FilterGenes ( const CGenes &  Genes,
CDat::EFilter  eFilter 
) [inline, virtual]

Remove values from the dataset based on the given gene set and filter type.

Parameters:
Genes: Gene set used to filter the dataset.
eFilter: Way in which to use the given genes to remove values.

Remove values and genes (by removing all incident edges) from the dataset based on one of several algorithms. For details, see CDat::EFilter.

Remarks:
Generally implemented using Remove, so may not be supported by all implementations and may either mask or unload the filtered data.
See also:
CDat::FilterGenes

Implements Sleipnir::IDataset.

Definition at line 578 of file dataset.h.

References FilterGenes().

size_t Sleipnir::CDatasetCompact::GetBins ( size_t  iNode) const [inline, virtual]

Return the number of discrete values in the requested experimental node; -1 if the node is hidden or continuous.

Parameters:
iNode: Experimental node for which bin number should be returned.
Returns:
Number of discrete values taken by the given experimental node; -1 if the node is hidden or continuous.
See also:
GetDiscrete

Implements Sleipnir::IDataset.

Definition at line 602 of file dataset.h.

float Sleipnir::CDatasetCompact::GetContinuous ( size_t  iY,
size_t  iX,
size_t  iNode 
) const [inline, virtual]

Return the continuous value at the requested position.

Parameters:
iY: Data row.
iX: Data column.
iNode: Experimental node from which to retrieve the requested pair's value.
Returns:
Continuous value from the requested position and data file; not-a-number (NaN) if the value is missing.
Remarks:
Equivalent to using CDataPair::Get on the encapsulated data file with the appropriate indices. Behavior not defined when the corresponding data node is inherently discrete.
See also:
GetDiscrete

Implements Sleipnir::IDataset.

Definition at line 559 of file dataset.h.

References Sleipnir::CMeta::GetNaN().

size_t Sleipnir::CDatasetCompact::GetDiscrete ( size_t  iY,
size_t  iX,
size_t  iNode 
) const [inline, virtual]

Return the discretized value at the requested position.

Parameters:
iY: Data row.
iX: Data column.
iNode: Experimental node from which to retrieve the requested pair's value.
Returns:
Discretized value from the requested position and data file using that file's quantization information; -1 if the value is missing.
Remarks:
Equivalent to using CDataPair::Quantize and GetContinuous or CDataPair::Get on the encapsulated data file with the appropriate indices. Behavior not defined when no discretization information is available for the requested data node.

Implements Sleipnir::IDataset.

Definition at line 586 of file dataset.h.

Referenced by FilterAnswers().
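
Because the return type is size_t, the documented -1 arrives as the largest representable size_t value, so a test such as `value < 0` can never succeed. The sketch below shows one way a caller might guard against missing values; `kMissingValue`, `IsMissing`, and `CountBins` are hypothetical helpers, not Sleipnir functions.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// -1 converted to size_t wraps to the maximum value of the type.
const std::size_t kMissingValue = static_cast<std::size_t>(-1);

// Hypothetical helper: true if a discrete value is the "missing" sentinel.
bool IsMissing(std::size_t value) {
    return value == kMissingValue;
}

// Hypothetical helper: histogram of discrete values, skipping missing ones
// and anything outside the expected number of bins.
std::vector<std::size_t> CountBins(const std::vector<std::size_t>& values,
                                   std::size_t bins) {
    std::vector<std::size_t> counts(bins, 0);
    for (std::size_t i = 0; i < values.size(); ++i)
        if (!IsMissing(values[i]) && values[i] < bins)
            ++counts[values[i]];
    return counts;
}
```

The same comparison applies to the other size_t-returning members documented as returning -1, such as GetBins and the name-based GetGene.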

size_t Sleipnir::CDatasetCompact::GetExperiments ( ) const [inline, virtual]

Return the number of experimental nodes in the dataset.

Returns:
Number of experimental nodes in the dataset.
Remarks:
For most datasets (those not containing hidden nodes), this will be equal to the number of encapsulated data files.

Implements Sleipnir::IDataset.

Definition at line 594 of file dataset.h.

const std::string& Sleipnir::CDatasetCompact::GetGene ( size_t  iGene) const [inline, virtual]

Returns the gene name at the requested index.

Parameters:
iGene: Index of gene name to return.
Returns:
Gene name at the requested index.
Remarks:
For efficiency, no bounds checking is performed.
See also:
GetGenes

Implements Sleipnir::IDataset.

Definition at line 566 of file dataset.h.

Referenced by GetGene().

size_t Sleipnir::CDatasetCompact::GetGene ( const std::string &  strGene) const [inline, virtual]

Return the index of the given gene name, or -1 if it is not included in the dataset.

Parameters:
strGene: Gene name to retrieve.
Returns:
Index of the requested gene name, or -1 if it is not in the dataset.
See also:
GetGeneNames

Implements Sleipnir::IDataset.

Definition at line 598 of file dataset.h.

References GetGene().

const std::vector<std::string>& Sleipnir::CDatasetCompact::GetGeneNames ( ) const [inline, virtual]

Return a vector of all gene names in the dataset.

Returns:
Vector of gene names in the dataset.
See also:
GetGenes | GetGene

Implements Sleipnir::IDataset.

Definition at line 590 of file dataset.h.

size_t Sleipnir::CDatasetCompact::GetGenes ( ) const [inline, virtual]

Returns the number of genes in the dataset.

Returns:
Number of genes in the dataset.
Remarks:
Equal to the size of the union of all genes in the encapsulated data files.
See also:
GetGene

Implements Sleipnir::IDataset.

Definition at line 570 of file dataset.h.

Referenced by FilterAnswers(), and Sleipnir::CDatasetCompactMap::Open().

bool Sleipnir::CDatasetCompact::IsExample ( size_t  iY,
size_t  iX 
) const [inline, virtual]

Returns true if some data file can be accessed at the requested position.

Parameters:
iY: Data row.
iX: Data column.
Returns:
True if a data file can be accessed at the requested position.

A dataset position is a usable example if at least one data file can be accessed at that position; that is, if some data file provides a non-missing value for that gene pair. Implementations that filter pairs in some manner can also prevent particular positions from being usable examples.

Implements Sleipnir::IDataset.

Reimplemented in Sleipnir::CDatasetCompactMap.

Definition at line 574 of file dataset.h.

Referenced by FilterAnswers(), and Sleipnir::CDatasetCompactMap::Open().

bool Sleipnir::CDatasetCompact::IsHidden ( size_t  iNode) const [inline, virtual]

Returns true if the requested experimental node is hidden (does not correspond to a data file).

Parameters:
iNode: Experimental node to investigate.
Returns:
True if the requested experimental node is hidden.

Since a dataset can be constructed either directly on a collection of data files or by tying a model such as a Bayes net to data files, IDataset can determine which model nodes are hidden by testing whether a data file exists for them. If no such file exists, the node is hidden and, for example, can be treated specially during Bayesian learning.

Remarks:
Datasets constructed directly from data files will never have hidden nodes.

Implements Sleipnir::IDataset.

Definition at line 582 of file dataset.h.

bool Sleipnir::CDatasetCompact::Open ( const CDataPair &  Answers,
const char *  szDataDirectory,
const IBayesNet *  pBayesNet,
bool  fEverything = false 
)

Construct a dataset corresponding to the given Bayes net using the provided answer file and data files from the given directory.

Parameters:
Answers: Pre-loaded answer file which will become the first node of the dataset.
szDataDirectory: Directory from which data files are loaded.
pBayesNet: Bayes net whose nodes will correspond to files in the dataset.
fEverything: If true, load all data; if false, load only data for gene pairs with values in the given answer file.
Returns:
True if dataset was constructed successfully.

Creates a dataset with nodes corresponding to the given Bayes net structure; the given answer file is always inserted as the first (0th) data file, and thus corresponds to the first node in the Bayes net (generally the class node predicting functional relationships). Nodes for which a corresponding data file (i.e. one with the same name followed by an appropriate CDat extension) cannot be located are marked as hidden.

See also:
CDataset::Open

Definition at line 112 of file datasetcompact.cpp.

Referenced by Open().

bool Sleipnir::CDatasetCompact::Open ( const CDataPair &  Answers,
const char *  szDataDirectory,
const IBayesNet *  pBayesNet,
const CGenes &  GenesInclude,
const CGenes &  GenesExclude,
bool  fEverything = false 
)

Construct a dataset corresponding to the given Bayes net using the provided answer file and data files from the given directory.

Parameters:
Answers: Pre-loaded answer file which will become the first node of the dataset.
szDataDirectory: Directory from which data files are loaded.
pBayesNet: Bayes net whose nodes will correspond to files in the dataset.
GenesInclude: Data is filtered using FilterGenes with CDat::EFilterInclude and the given gene set (unless empty).
GenesExclude: Data is filtered using FilterGenes with CDat::EFilterExclude and the given gene set (unless empty).
fEverything: If true, load all data; if false, load only data for gene pairs with values in the given answer file.
Returns:
True if dataset was constructed successfully.

Creates a dataset with nodes corresponding to the given Bayes net structure; the given answer file is always inserted as the first (0th) data file, and thus corresponds to the first node in the Bayes net (generally the class node predicting functional relationships). Nodes for which a corresponding data file (i.e. one with the same name followed by an appropriate CDat extension) cannot be located are marked as hidden.

See also:
CDataset::Open

Definition at line 154 of file datasetcompact.cpp.

References Sleipnir::CDat::GetGene(), Sleipnir::CDat::GetGenes(), Sleipnir::CGenes::GetGenes(), Sleipnir::IBayesNet::GetNodes(), Sleipnir::IBayesNet::IsContinuous(), and Sleipnir::CDataPair::Open().

bool Sleipnir::CDatasetCompact::Open ( const std::vector< std::string > &  vecstrDataFiles,
bool  fMemmap = false 
)

Construct a dataset corresponding to the given files.

Parameters:
vecstrDataFiles: Vector of file paths to load.
fMemmap: If true, memory map data files while they are being discretized rather than loading them into memory.
Returns:
True if dataset was constructed successfully.

Creates a dataset with nodes corresponding to the given data files.

Definition at line 54 of file datasetcompact.cpp.

References Sleipnir::CDataPair::Open(), and OpenGenes().

bool Sleipnir::CDatasetCompact::Open ( std::istream &  istm)

Load a binary DAD dataset from the given binary stream.

Parameters:
istm: Stream from which dataset is loaded.
Returns:
True if dataset was loaded successfully.
Remarks:
Should be generated by Save; only used with binary DADs.

Definition at line 447 of file datasetcompact.cpp.

References Open().

bool Sleipnir::CDatasetCompact::Open ( const CGenes &  GenesInclude,
const CGenes &  GenesExclude,
const CDataPair &  Answers,
const std::vector< std::string > &  vecstrPCLs,
size_t  iSkip,
const IMeasure *  pMeasure,
const std::vector< float > &  vecdBinEdges 
)

Constructs a dataset using the provided answer file and data matrices generated from the given PCLs and similarity measure.

Parameters:
GenesInclude: Data is filtered using FilterGenes with CDat::EFilterInclude and the given gene set (unless empty).
GenesExclude: Data is filtered using FilterGenes with CDat::EFilterExclude and the given gene set (unless empty).
Answers: Pre-loaded answer file which will become the first node of the dataset.
vecstrPCLs: Vector of PCL file paths from which pairwise scores are calculated.
iSkip: The number of columns to skip between the ID and experiments in each PCL.
pMeasure: Similarity measure used to calculate pairwise scores between genes.
vecdBinEdges: Vector of values corresponding to discretization bin edges (the last of which is ignored) for all PCLs.
Returns:
True if the dataset was constructed successfully.

Constructs a dataset by loading each PCL, converting it to pairwise scores using the given similarity measure, and discretizing these scores using the given bin edges. The given answer file and the resulting matrices are collected together in order in the dataset.

Remarks:
The same number of skip columns and discretization bin edges will be used for all PCLs.
See also:
CPCL

Definition at line 525 of file datasetcompact.cpp.

References Sleipnir::CDat::ENormalizeZScore, Sleipnir::CPCL::Get(), Sleipnir::CPCL::GetExperiments(), Sleipnir::CDat::GetGene(), Sleipnir::CPCL::GetGene(), Sleipnir::CGenes::GetGene(), Sleipnir::CPCL::GetGeneNames(), Sleipnir::CDat::GetGenes(), Sleipnir::CPCL::GetGenes(), Sleipnir::CGenes::GetGenes(), Sleipnir::CGene::GetName(), Sleipnir::CMeta::GetNaN(), Sleipnir::CHalfMatrix< tType >::GetSize(), Sleipnir::CDataPair::GetValues(), Sleipnir::CHalfMatrix< tType >::Initialize(), Sleipnir::CGenes::IsGene(), Sleipnir::IMeasure::IsRank(), Sleipnir::IMeasure::Measure(), Sleipnir::CDat::Normalize(), Sleipnir::CDataPair::Open(), Sleipnir::CPCL::Open(), Sleipnir::CPCL::RankTransform(), Sleipnir::CHalfMatrix< tType >::Set(), and Sleipnir::CDataPair::SetQuants().
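
The discretization step can be sketched as follows. This is one plausible reading of "bin edges, the last of which is ignored", not a copy of the library's code: `QuantizeSketch` is a hypothetical helper, and the choices that a value equal to an edge falls into that edge's bin and that NaN maps to -1 are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: map a continuous score to a bin index.
// A value lands in the first bin whose edge is >= the value. Only the first
// size-1 edges are ever compared, so the last edge is ignored and every
// larger value falls into the final bin.
std::size_t QuantizeSketch(float value, const std::vector<float>& binEdges) {
    if (std::isnan(value) || binEdges.empty())
        return static_cast<std::size_t>(-1);  // missing value
    std::size_t bin = 0;
    while (bin + 1 < binEdges.size() && value > binEdges[bin])
        ++bin;
    return bin;
}
```

With edges {0, 0.5, 1} this sketch produces three bins: values at or below 0, values in (0, 0.5], and everything above 0.5, including values greater than the final edge.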

bool Sleipnir::CDatasetCompact::Open ( const CDataPair &  Answers,
const std::vector< std::string > &  vecstrDataFiles,
bool  fEverything = false,
bool  fMemmap = false,
size_t  iSkip = 2,
bool  fZScore = false 
)

Construct a dataset using the provided answer file and data files.

Parameters:
Answers: Pre-loaded answer file which will become the first node of the dataset.
vecstrDataFiles: Vector of file paths to load.
fEverything: If true, load all data; if false, load only data for gene pairs with values in the given answer file.
fMemmap: If true, memory map data files while they are being discretized rather than loading them into memory.
iSkip: If any of the given files is a PCL, the number of columns to skip between the ID and experiments.
fZScore: If true and any of the given files is a PCL, z-score similarity measures after pairwise calculation.
Returns:
True if dataset was constructed successfully.

Creates a dataset with nodes corresponding to the given answer and data files. The given data file names are each loaded using CDat::Open. The answer file is always inserted as the first (0th) data file.

Remarks:
The same number of skip columns and z-score setting will be used for all PCLs.

Definition at line 235 of file datasetcompact.cpp.

References Sleipnir::CDat::GetGene(), Sleipnir::CDat::GetGenes(), Sleipnir::CDataPair::IsContinuous(), Sleipnir::CDataPair::Open(), Sleipnir::CDat::OpenGenes(), and Sleipnir::CCompactMatrix::Set().
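
For readers unfamiliar with the fZScore option, the sketch below shows a generic z-score transform over a list of pairwise scores. It is an illustration of the standard formula only: `ZScoreInPlace` is a hypothetical helper, and treating NaN as missing and using the population standard deviation are assumptions, not statements about how Sleipnir normalizes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: replace each non-missing score s by (s - mean) / stdev.
// NaN entries are treated as missing and left untouched.
void ZScoreInPlace(std::vector<float>& scores) {
    double sum = 0.0;
    double sumSq = 0.0;
    std::size_t n = 0;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        if (std::isnan(scores[i]))
            continue;
        sum += scores[i];
        sumSq += static_cast<double>(scores[i]) * scores[i];
        ++n;
    }
    if (n == 0)
        return;  // nothing to normalize
    const double mean = sum / n;
    const double variance = sumSq / n - mean * mean;  // population variance
    const double stdev = variance > 0.0 ? std::sqrt(variance) : 0.0;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        if (std::isnan(scores[i]))
            continue;
        // Constant input has zero spread; map it to 0 rather than divide by 0.
        scores[i] = stdev > 0.0
                        ? static_cast<float>((scores[i] - mean) / stdev)
                        : 0.0f;
    }
}
```

After the transform, the non-missing scores have mean 0 and unit spread, which puts scores from different PCLs on a comparable scale before a shared set of bin edges is applied.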

bool Sleipnir::CDatasetCompact::Open ( const char *  szDataDirectory,
const IBayesNet *  pBayesNet 
) [inline]

Construct a dataset corresponding to the given Bayes net using files from the given directory.

Parameters:
szDataDirectory: Directory from which data files are loaded.
pBayesNet: Bayes net whose nodes will correspond to files in the dataset.
Returns:
True if dataset was constructed successfully.

Creates a dataset (without an answer file) with nodes corresponding to the given Bayes net structure. Nodes for which a corresponding data file (i.e. one with the same name followed by an appropriate CDat extension) cannot be located are marked as hidden.

Remarks:
Missing QUANT files or requests for continuous data from the Bayes net will result in an error.

Definition at line 491 of file dataset.h.

References Open().

bool Sleipnir::CDatasetCompact::Open ( const char *  szDataDirectory,
const IBayesNet *  pBayesNet,
const CGenes &  GenesInclude,
const CGenes &  GenesExclude 
) [inline]

Construct a dataset corresponding to the given Bayes net using files from the given directory.

Parameters:
szDataDirectory: Directory from which data files are loaded.
pBayesNet: Bayes net whose nodes will correspond to files in the dataset.
GenesInclude: Data is filtered using FilterGenes with CDat::EFilterInclude and the given gene set (unless empty).
GenesExclude: Data is filtered using FilterGenes with CDat::EFilterExclude and the given gene set (unless empty).
Returns:
True if dataset was constructed successfully.

Creates a dataset (without an answer file) with nodes corresponding to the given Bayes net structure. Nodes for which a corresponding data file (i.e. one with the same name followed by an appropriate CDat extension) cannot be located are marked as hidden.

Remarks:
Missing QUANT files or requests for continuous data from the Bayes net will result in an error.

Definition at line 521 of file dataset.h.

References Sleipnir::CDat::EFilterExclude, Sleipnir::CDat::EFilterInclude, and FilterGenes().

bool Sleipnir::CDatasetCompact::OpenGenes ( const std::vector< std::string > &  vecstrDataFiles) [inline]

Open only the merged gene list from the given data files.

Parameters:
vecstrDataFiles: Vector of file paths to load.
Returns:
True if gene lists were loaded successfully.

Provides a way to rapidly list the set of all genes present in a given collection of data files while avoiding the overhead of loading the data itself.

Remarks:
Attempting to access data in the dataset without also opening the files themselves won't do anything good.
See also:
CDat::OpenGenes

Reimplemented from Sleipnir::CDataImpl.

Definition at line 551 of file dataset.h.

Referenced by Open().
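
The idea of a merged gene list can be sketched independently of any file format. In the sketch below, `GeneIndex` and `MergeGeneLists` are hypothetical names, the per-file gene lists are assumed to be already in memory, and assigning indices in first-seen order is an assumption; the library may order genes differently.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical container: union of gene names with a name-to-index lookup.
struct GeneIndex {
    std::vector<std::string> names;             // index -> gene name
    std::map<std::string, std::size_t> lookup;  // gene name -> index

    // Add a gene if it is new; return its index either way.
    std::size_t Add(const std::string& gene) {
        std::map<std::string, std::size_t>::const_iterator it = lookup.find(gene);
        if (it != lookup.end())
            return it->second;
        const std::size_t index = names.size();
        names.push_back(gene);
        lookup[gene] = index;
        return index;
    }

    // Return the gene's index, or -1 (as size_t) if it is not present.
    std::size_t Find(const std::string& gene) const {
        std::map<std::string, std::size_t>::const_iterator it = lookup.find(gene);
        return it == lookup.end() ? static_cast<std::size_t>(-1) : it->second;
    }
};

// Merge several per-file gene lists into one union, in first-seen order.
GeneIndex MergeGeneLists(const std::vector<std::vector<std::string> >& lists) {
    GeneIndex merged;
    for (std::size_t i = 0; i < lists.size(); ++i)
        for (std::size_t j = 0; j < lists[i].size(); ++j)
            merged.Add(lists[i][j]);
    return merged;
}
```

Merging the lists {A, B} and {B, C} in this sketch gives three genes, and looking up a name that appears in neither list returns the -1 sentinel.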

void Sleipnir::CDatasetCompact::Randomize ( )

Randomizes the contents of all data files except the answer (0th) data file.

Remarks:
The first (0th) data node in the dataset is assumed to be an answer file and is left unchanged.
See also:
CCompactMatrix::Randomize

Definition at line 638 of file datasetcompact.cpp.

void Sleipnir::CDatasetCompact::Remove ( size_t  iY,
size_t  iX 
) [inline, virtual]

Remove all data for the given dataset position.

Parameters:
iY: Data row.
iX: Data column.

Unloads or masks data from all encapsulated files for the requested gene pair.

Remarks:
For efficiency, bounds checking is not performed; the given row and column should both be less than GetGenes. Not supported by all implementations.

Implements Sleipnir::IDataset.

Reimplemented in Sleipnir::CDatasetCompactMap.

Definition at line 606 of file dataset.h.

Referenced by FilterAnswers().

void Sleipnir::CDatasetCompact::Save ( std::ostream &  ostm,
bool  fBinary 
) const [inline, virtual]

Save a dataset to the given stream in binary or tabular (human readable) form.

Parameters:
ostm: Stream into which dataset is saved.
fBinary: If true, save the dataset as a binary file; if false, save it as a text-based tab-delimited file.
Remarks:
If fBinary is true, output stream must be binary.

Implements Sleipnir::IDataset.

Definition at line 555 of file dataset.h.


The documentation for this class was generated from the following files: