Sleipnir
Sleipnir::CDataset Class Reference

A simple implementation of IDataset directly loading unmodified CDats for each non-hidden data node.

#include <dataset.h>

Inheritance diagram for Sleipnir::CDataset:
Sleipnir::CDatasetImpl Sleipnir::IDataset

Public Member Functions

bool Open (const char *szAnswerFile, const std::vector< std::string > &vecstrDataFiles)
 Construct a dataset corresponding to the given answer file and data files.
bool Open (const std::vector< std::string > &vecstrDataFiles)
 Construct a dataset corresponding to the given files.
bool Open (const char *szAnswerFile, const char *szDataDirectory, const IBayesNet *pBayesNet)
 Construct a dataset corresponding to the given Bayes net using the provided answer file and data files from the given directory.
bool Open (const CDataPair &Answers, const char *szDataDirectory, const IBayesNet *pBayesNet)
 Construct a dataset corresponding to the given Bayes net using the provided answer file and data files from the given directory.
bool Open (const char *szDataDirectory, const IBayesNet *pBayesNet)
 Construct a dataset corresponding to the given Bayes net using files from the given directory.
bool OpenGenes (const std::vector< std::string > &vecstrDataFiles)
 Open only the merged gene list from the given data files.
size_t GetDiscrete (size_t iY, size_t iX, size_t iNode) const
 Return the discretized value at the requested position.
bool IsExample (size_t iY, size_t iX) const
 Returns true if some data file can be accessed at the requested position.
void Remove (size_t iY, size_t iX)
 Remove all data for the given dataset position.
float GetContinuous (size_t iY, size_t iX, size_t iNode) const
 Return the continuous value at the requested position.
void FilterGenes (const CGenes &Genes, CDat::EFilter eFilter)
 Remove values from the dataset based on the given gene set and filter type.
const std::vector< std::string > & GetGeneNames () const
 Return a vector of all gene names in the dataset.
bool IsHidden (size_t iNode) const
 Returns true if the requested experimental node is hidden (does not correspond to a data file).
const std::string & GetGene (size_t iGene) const
 Returns the gene name at the requested index.
size_t GetGenes () const
 Returns the number of genes in the dataset.
size_t GetExperiments () const
 Return the number of experimental nodes in the dataset.
size_t GetGene (const std::string &strGene) const
 Return the index of the given gene name, or -1 if it is not included in the dataset.
size_t GetBins (size_t iNode) const
 Return the number of discrete values in the requested experimental node; -1 if the node is hidden or continuous.
void Save (std::ostream &ostm, bool fBinary) const
 Save a dataset to the given stream in binary or tabular (human readable) form.

Detailed Description

A simple implementation of IDataset directly loading unmodified CDats for each non-hidden data node.

Remarks:
For any purpose not requiring continuous values, CDatasetCompact is more appropriate. CDistanceMatrix objects are used to store continuous data, CCompactMatrix objects for discrete data.

Definition at line 296 of file dataset.h.


Member Function Documentation

void Sleipnir::CDataset::FilterGenes ( const CGenes &  Genes,
CDat::EFilter  eFilter 
) [inline, virtual]

Remove values from the dataset based on the given gene set and filter type.

Parameters:
Genes  Gene set used to filter the dataset.
eFilter  Way in which to use the given genes to remove values.

Remove values and genes (by removing all incident edges) from the dataset based on one of several algorithms. For details, see CDat::EFilter.

Remarks:
Generally implemented using Remove, so may not be supported by all implementations and may either mask or unload the filtered data.
See also:
CDat::FilterGenes

Implements Sleipnir::IDataset.

Definition at line 391 of file dataset.h.

size_t Sleipnir::CDataset::GetBins ( size_t  iNode) const [inline, virtual]

Return the number of discrete values in the requested experimental node; -1 if the node is hidden or continuous.

Parameters:
iNode  Experimental node for which bin number should be returned.
Returns:
Number of discrete values taken by the given experimental node; -1 if the node is hidden or continuous.
See also:
GetDiscrete

Implements Sleipnir::IDataset.

Definition at line 419 of file dataset.h.

float Sleipnir::CDataset::GetContinuous ( size_t  iY,
size_t  iX,
size_t  iNode 
) const [inline, virtual]

Return the continuous value at the requested position.

Parameters:
iY  Data row.
iX  Data column.
iNode  Experimental node from which to retrieve the requested pair's value.
Returns:
Continuous value from the requested position and data file; not-a-number (NaN) if the value is missing.
Remarks:
Equivalent to using CDataPair::Get on the encapsulated data file with the appropriate indices. Behavior not defined when the corresponding data node is inherently discrete.
See also:
GetDiscrete

Implements Sleipnir::IDataset.

Definition at line 387 of file dataset.h.
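
Since missing values come back as NaN, and NaN never compares equal to anything (including itself), callers need an explicit NaN test rather than an equality check. The following is a minimal sketch of that pattern using only the standard library; `mean_present` and `check_mean_present` are hypothetical helpers, not part of Sleipnir, and the input vector stands in for values that would be read through GetContinuous.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not part of Sleipnir): averages the values that are
// present, skipping NaN entries. Returns NaN when every value is missing.
float mean_present(const std::vector<float>& values) {
    double sum = 0;
    std::size_t count = 0;
    for (float v : values) {
        if (std::isnan(v))
            continue;  // missing value, ignore it
        sum += v;
        ++count;
    }
    return count ? static_cast<float>(sum / count)
                 : std::numeric_limits<float>::quiet_NaN();
}

// Small self-check of the behavior described above.
void check_mean_present() {
    const float nan = std::numeric_limits<float>::quiet_NaN();
    assert(mean_present({1.0f, nan, 3.0f}) == 2.0f);
    assert(std::isnan(mean_present({nan})));
}
```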

size_t Sleipnir::CDataset::GetDiscrete ( size_t  iY,
size_t  iX,
size_t  iNode 
) const [virtual]

Return the discretized value at the requested position.

Parameters:
iY  Data row.
iX  Data column.
iNode  Experimental node from which to retrieve the requested pair's value.
Returns:
Discretized value from the requested position and data file using that file's quantization information; -1 if the value is missing.
Remarks:
Equivalent to using CDataPair::Quantize and GetContinuous or CDataPair::Get on the encapsulated data file with the appropriate indices. Behavior not defined when no discretization information is available for the requested data node.

Implements Sleipnir::IDataset.

Definition at line 521 of file dataset.cpp.
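
To make the relationship between continuous and discretized values concrete, here is a sketch of one common binning rule. It assumes ascending bin edges, places a value in the first bin whose edge is greater than or equal to it, and sends anything above the last edge to the last bin. Whether CDataPair::Quantize follows exactly this rule is not stated on this page, so `quantize`, `kMissing`, and `check_quantize` are hypothetical illustrations only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Sentinel for a missing value: -1 converted to size_t (an assumption that
// mirrors the documented "-1 if the value is missing").
const std::size_t kMissing = static_cast<std::size_t>(-1);

// Hypothetical binning rule, not Sleipnir's implementation.
// edges must be sorted in ascending order.
std::size_t quantize(float value, const std::vector<float>& edges) {
    if (std::isnan(value) || edges.empty())
        return kMissing;
    for (std::size_t i = 0; i < edges.size(); ++i)
        if (value <= edges[i])
            return i;
    return edges.size() - 1;  // values above the last edge fall in the last bin
}

// Small self-check of the rule described above.
void check_quantize() {
    const std::vector<float> edges = {0.5f, 1.5f, 2.5f};
    assert(quantize(1.0f, edges) == 1);
    assert(quantize(std::numeric_limits<float>::quiet_NaN(), edges) == kMissing);
}
```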

size_t Sleipnir::CDataset::GetExperiments ( ) const [inline, virtual]

Return the number of experimental nodes in the dataset.

Returns:
Number of experimental nodes in the dataset.
Remarks:
For most datasets (those not containing hidden nodes), this will be equal to the number of encapsulated data files.

Implements Sleipnir::IDataset.

Definition at line 411 of file dataset.h.

const std::string& Sleipnir::CDataset::GetGene ( size_t  iGene) const [inline, virtual]

Returns the gene name at the requested index.

Parameters:
iGene  Index of gene name to return.
Returns:
Gene name at the requested index.
Remarks:
For efficiency, no bounds checking is performed.
See also:
GetGenes

Implements Sleipnir::IDataset.

Definition at line 403 of file dataset.h.

Referenced by GetGene().

size_t Sleipnir::CDataset::GetGene ( const std::string &  strGene) const [inline, virtual]

Return the index of the given gene name, or -1 if it is not included in the dataset.

Parameters:
strGene  Gene name to retrieve.
Returns:
Index of the requested gene name, or -1 if it is not in the dataset.
See also:
GetGeneNames

Implements Sleipnir::IDataset.

Definition at line 415 of file dataset.h.

References GetGene().
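
Because the return type is size_t, the documented -1 arrives as the largest representable size_t value, so a test such as `< 0` can never succeed. The sketch below shows a comparison that does work, assuming the sentinel is `static_cast<size_t>(-1)`. `find_gene`, `kNotFound`, and `check_find_gene` are hypothetical stand-ins, not part of Sleipnir; the same comparison would apply to the -1 returned by GetBins.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed "not found" sentinel: -1 converted to size_t.
const std::size_t kNotFound = static_cast<std::size_t>(-1);

// Hypothetical stand-in for a lookup by gene name; this is a plain linear
// search, not Sleipnir's implementation.
std::size_t find_gene(const std::vector<std::string>& names,
                      const std::string& gene) {
    for (std::size_t i = 0; i < names.size(); ++i)
        if (names[i] == gene)
            return i;
    return kNotFound;
}

// Small self-check: compare against the sentinel, never against 0.
void check_find_gene() {
    const std::vector<std::string> names = {"YAL001C", "YAL002W"};
    assert(find_gene(names, "YAL002W") == 1);
    assert(find_gene(names, "YBR999X") == kNotFound);
}
```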

const std::vector<std::string>& Sleipnir::CDataset::GetGeneNames ( ) const [inline, virtual]

Return a vector of all gene names in the dataset.

Returns:
Vector of gene names in the dataset.
See also:
GetGenes | GetGene

Implements Sleipnir::IDataset.

Definition at line 395 of file dataset.h.

size_t Sleipnir::CDataset::GetGenes ( ) const [inline, virtual]

Returns the number of genes in the dataset.

Returns:
Number of genes in the dataset.
Remarks:
Equal to the size of the union of all genes in the encapsulated data files.
See also:
GetGene

Implements Sleipnir::IDataset.

Definition at line 407 of file dataset.h.

bool Sleipnir::CDataset::IsExample ( size_t  iY,
size_t  iX 
) const [virtual]

Returns true if some data file can be accessed at the requested position.

Parameters:
iY  Data row.
iX  Data column.
Returns:
True if a data file can be accessed at the requested position.

A dataset position is a usable example if at least one data file can be accessed at that position; that is, if some data file provides a non-missing value for that gene pair. Implementations that filter pairs in some manner can also prevent particular positions from being usable examples.

Implements Sleipnir::IDataset.

Definition at line 530 of file dataset.cpp.

References Sleipnir::CMeta::IsNaN().
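
The "at least one non-missing value" rule can be sketched in a few lines. The vector below stands in for the values that each data file holds for a single gene pair, with NaN marking a missing value; `is_example` and `check_is_example` are hypothetical helpers, not Sleipnir's implementation, and any additional filtering an implementation might apply is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// values_at_position[i] is the value that data file i holds for one gene
// pair; NaN marks a missing value. Hypothetical helper, not part of Sleipnir.
bool is_example(const std::vector<float>& values_at_position) {
    return std::any_of(values_at_position.begin(), values_at_position.end(),
                       [](float v) { return !std::isnan(v); });
}

// Small self-check of the rule described above.
void check_is_example() {
    const float nan = std::numeric_limits<float>::quiet_NaN();
    assert(is_example({nan, 0.3f}));
    assert(!is_example({nan, nan}));
}
```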

bool Sleipnir::CDataset::IsHidden ( size_t  iNode) const [inline, virtual]

Returns true if the requested experimental node is hidden (does not correspond to a data file).

Parameters:
iNode  Experimental node to investigate.
Returns:
True if the requested experimental node is hidden.

Since a dataset can be constructed either directly on a collection of data files or by tying a model such as a Bayes net to data files, IDataset can determine which model nodes are hidden by testing whether a data file exists for them. If no such file exists, the node is hidden and, for example, can be treated specially during Bayesian learning.

Remarks:
Datasets constructed directly from data files will never have hidden nodes.

Implements Sleipnir::IDataset.

Definition at line 399 of file dataset.h.

bool Sleipnir::CDataset::Open ( const char *  szAnswerFile,
const std::vector< std::string > &  vecstrDataFiles 
)

Construct a dataset corresponding to the given answer file and data files.

Parameters:
szAnswerFile  Answer file which will become the first node of the dataset.
vecstrDataFiles  Vector of file paths to load.
Returns:
True if dataset was constructed successfully.

Creates a dataset with nodes corresponding to the given data files; the given answer file is inserted as the first (0th) node. All files are assumed to be continuous.

Definition at line 450 of file dataset.cpp.

References Sleipnir::CDat::GetGene(), Sleipnir::CDat::GetGenes(), and Sleipnir::CDataPair::Open().

Referenced by Open().

bool Sleipnir::CDataset::Open ( const std::vector< std::string > &  vecstrDataFiles)

Construct a dataset corresponding to the given files.

Parameters:
vecstrDataFiles  Vector of file paths to load.
Returns:
True if dataset was constructed successfully.

Creates a dataset with nodes corresponding to the given data files. All files are assumed to be continuous.

Definition at line 494 of file dataset.cpp.

References Sleipnir::CDataPair::Open(), and OpenGenes().

bool Sleipnir::CDataset::Open ( const char *  szAnswerFile,
const char *  szDataDirectory,
const IBayesNet *  pBayesNet 
)

Construct a dataset corresponding to the given Bayes net using the provided answer file and data files from the given directory.

Parameters:
szAnswerFile  Answer file which will become the first node of the dataset.
szDataDirectory  Directory from which data files are loaded.
pBayesNet  Bayes net whose nodes will correspond to files in the dataset.
Returns:
True if dataset was constructed successfully.

Creates a dataset with nodes corresponding to the given Bayes net structure; the given answer file is always inserted as the first (0th) data file, and thus corresponds to the first node in the Bayes net (generally the class node predicting functional relationships). Data is loaded continuously or discretely as indicated by the Bayes net, and nodes for which a corresponding data file (i.e. one with the same name followed by an appropriate CDat extension) cannot be located are marked as hidden.

Remarks:
Each data file is loaded more or less as-is; continuous data files will be loaded directly into memory, and discrete files are pre-discretized and stored in compact matrices.

Definition at line 356 of file dataset.cpp.

References Sleipnir::IBayesNet::IsContinuous(), Sleipnir::CDataPair::Open(), and Open().
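
The node-to-file matching described above can be sketched as follows, for the case where an answer file is supplied. Node 0 is taken by the answer file, and every other node is hidden unless a file named after it with a recognized extension is present. `mark_hidden` and `check_mark_hidden` are hypothetical helpers, not Sleipnir's implementation, and the extension list used here (".dat", ".dab") is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch: returns one flag per node, true when the node has no
// matching data file. Node 0 is assumed to be filled by the answer file.
std::vector<bool> mark_hidden(const std::vector<std::string>& node_names,
                              const std::set<std::string>& files_in_directory,
                              const std::vector<std::string>& extensions) {
    std::vector<bool> hidden(node_names.size(), true);
    if (!hidden.empty())
        hidden[0] = false;  // answer file always occupies the first node
    for (std::size_t i = 1; i < node_names.size(); ++i) {
        for (const std::string& ext : extensions) {
            if (files_in_directory.count(node_names[i] + ext)) {
                hidden[i] = false;
                break;
            }
        }
    }
    return hidden;
}

// Small self-check with made-up node and file names.
void check_mark_hidden() {
    const std::vector<bool> hidden =
        mark_hidden({"FR", "coexpression", "interaction"},
                    {"coexpression.dab"}, {".dat", ".dab"});
    assert(!hidden[0] && !hidden[1] && hidden[2]);
}
```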

bool Sleipnir::CDataset::Open ( const CDataPair &  Answers,
const char *  szDataDirectory,
const IBayesNet *  pBayesNet 
) [inline]

Construct a dataset corresponding to the given Bayes net using the provided answer file and data files from the given directory.

Parameters:
Answers  Pre-loaded answer file which will become the first node of the dataset.
szDataDirectory  Directory from which data files are loaded.
pBayesNet  Bayes net whose nodes will correspond to files in the dataset.
Returns:
True if dataset was constructed successfully.

Creates a dataset with nodes corresponding to the given Bayes net structure; the given answer file is always inserted as the first (0th) data file, and thus corresponds to the first node in the Bayes net (generally the class node predicting functional relationships). Data is loaded continuously or discretely as indicated by the Bayes net, and nodes for which a corresponding data file (i.e. one with the same name followed by an appropriate CDat extension) cannot be located are marked as hidden.

Remarks:
Each data file is loaded more or less as-is; continuous data files will be loaded directly into memory, and discrete files are pre-discretized and stored in compact matrices.

Definition at line 329 of file dataset.h.

References Open().

bool Sleipnir::CDataset::Open ( const char *  szDataDirectory,
const IBayesNet *  pBayesNet 
) [inline]

Construct a dataset corresponding to the given Bayes net using files from the given directory.

Parameters:
szDataDirectory  Directory from which data files are loaded.
pBayesNet  Bayes net whose nodes will correspond to files in the dataset.
Returns:
True if dataset was constructed successfully.

Creates a dataset (without an answer file) with nodes corresponding to the given Bayes net structure. Data is loaded continuously or discretely as indicated by the Bayes net, and nodes for which a corresponding data file (i.e. one with the same name followed by an appropriate CDat extension) cannot be located are marked as hidden.

Remarks:
Each data file is loaded more or less as-is; continuous data files will be loaded directly into memory, and discrete files are pre-discretized and stored in compact matrices.

Definition at line 355 of file dataset.h.

References Open().

bool Sleipnir::CDataset::OpenGenes ( const std::vector< std::string > &  vecstrDataFiles) [inline]

Open only the merged gene list from the given data files.

Parameters:
vecstrDataFiles  Vector of file paths to load.
Returns:
True if gene lists were loaded successfully.

Provides a way to rapidly list the set of all genes present in a given collection of data files while avoiding the overhead of loading the data itself.

Remarks:
Attempting to access data in the dataset without also opening the files themselves will not produce useful results.
See also:
CDat::OpenGenes

Reimplemented from Sleipnir::CDataImpl.

Definition at line 379 of file dataset.h.

Referenced by Open().
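
The idea of a merged gene list can be illustrated as the union of the per-file gene lists with duplicates dropped. In the sketch below, `merge_gene_lists` and `check_merge_gene_lists` are hypothetical helpers, not part of Sleipnir, and keeping genes in first-seen order is an assumption; the actual ordering used by the library is not stated here.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch: merges per-file gene lists into one list without
// duplicates, keeping each gene at the position where it first appears.
std::vector<std::string> merge_gene_lists(
        const std::vector<std::vector<std::string>>& lists) {
    std::vector<std::string> merged;
    std::set<std::string> seen;
    for (const auto& list : lists)
        for (const auto& gene : list)
            if (seen.insert(gene).second)  // true only on first insertion
                merged.push_back(gene);
    return merged;
}

// Small self-check with made-up gene names.
void check_merge_gene_lists() {
    const std::vector<std::string> merged =
        merge_gene_lists({{"A", "B"}, {"B", "C"}});
    assert(merged.size() == 3);
    assert(merged[2] == "C");
}
```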

void Sleipnir::CDataset::Remove ( size_t  iY,
size_t  iX 
) [virtual]

Remove all data for the given dataset position.

Parameters:
iY  Data row.
iX  Data column.

Unloads or masks data from all encapsulated files for the requested gene pair.

Remarks:
For efficiency, bounds checking is not performed; the given row and column should both be less than GetGenes. Not supported by all implementations.

Implements Sleipnir::IDataset.

Definition at line 542 of file dataset.cpp.

References Sleipnir::CMeta::GetNaN().
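
One way to picture masking is to overwrite the value at the given position with NaN in every encapsulated file. The sketch below does that on a toy structure; `ToyDataset` and `check_remove` are hypothetical, the nested-vector storage is chosen only for readability, and none of this is Sleipnir's implementation. As in the remark above, no bounds checking is performed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical toy dataset: files[f][y][x] is the value that file f holds
// for the gene pair (y, x). Not Sleipnir's storage layout.
struct ToyDataset {
    std::vector<std::vector<std::vector<float>>> files;

    // Masks the pair (y, x) in every file by writing NaN; no bounds checks.
    void remove(std::size_t y, std::size_t x) {
        for (auto& file : files)
            file[y][x] = std::numeric_limits<float>::quiet_NaN();
    }
};

// Small self-check: the removed pair is NaN in every file, others are intact.
void check_remove() {
    ToyDataset data;
    data.files.assign(2, std::vector<std::vector<float>>(
                             2, std::vector<float>(2, 1.0f)));
    data.remove(0, 1);
    assert(std::isnan(data.files[0][0][1]));
    assert(std::isnan(data.files[1][0][1]));
    assert(data.files[0][1][0] == 1.0f);
}
```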

void Sleipnir::CDataset::Save ( std::ostream &  ostm,
bool  fBinary 
) const [inline, virtual]

Save a dataset to the given stream in binary or tabular (human readable) form.

Parameters:
ostm  Stream into which dataset is saved.
fBinary  If true, save the dataset as a binary file; if false, save it as a text-based tab-delimited file.
Remarks:
If fBinary is true, output stream must be binary.

Implements Sleipnir::IDataset.

Definition at line 423 of file dataset.h.
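
The requirement that the stream be binary when fBinary is true is easy to overlook when the stream is opened elsewhere. A small sketch of keeping the two in step: `save_mode` and `check_save_mode` are hypothetical helpers, not part of Sleipnir, that derive the open mode of a std::ofstream from the same flag later passed to Save.

```cpp
#include <cassert>
#include <ios>

// Hypothetical helper: builds the open mode for an output file stream so
// that it is binary exactly when the dataset will be saved in binary form.
std::ios::openmode save_mode(bool binary) {
    std::ios::openmode mode = std::ios::out | std::ios::trunc;
    if (binary)
        mode |= std::ios::binary;
    return mode;
}

// Small self-check: the binary bit follows the flag.
void check_save_mode() {
    assert((save_mode(true) & std::ios::binary) == std::ios::binary);
    assert((save_mode(false) & std::ios::binary) != std::ios::binary);
}
```

The resulting mode would be passed to the std::ofstream constructor, and the same boolean would then be given to Save as fBinary.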


The documentation for this class was generated from the following files: