Sleipnir
Sleipnir::IDataset Class Reference

An IDataset abstracts a collection of individual datasets, usually CDats, using various continuous and/or discrete encodings.

#include <dataset.h>

Inheritance diagram for Sleipnir::IDataset (image not shown); derived classes:
Sleipnir::CDataFilter, Sleipnir::CDataMask, Sleipnir::CDataset, Sleipnir::CDatasetCompact, Sleipnir::CDataSubset, and Sleipnir::CDatasetCompactMap.

Public Member Functions

virtual bool IsHidden (size_t iNode) const =0
 Returns true if the requested experimental node is hidden (does not correspond to a data file).
virtual size_t GetDiscrete (size_t iY, size_t iX, size_t iNode) const =0
 Return the discretized value at the requested position.
virtual float GetContinuous (size_t iY, size_t iX, size_t iNode) const =0
 Return the continuous value at the requested position.
virtual const std::string & GetGene (size_t iGene) const =0
 Returns the gene name at the requested index.
virtual size_t GetGenes () const =0
 Returns the number of genes in the dataset.
virtual bool IsExample (size_t iY, size_t iX) const =0
 Returns true if some data file can be accessed at the requested position.
virtual const std::vector< std::string > & GetGeneNames () const =0
 Return a vector of all gene names in the dataset.
virtual size_t GetExperiments () const =0
 Return the number of experimental nodes in the dataset.
virtual size_t GetGene (const std::string &strGene) const =0
 Return the index of the given gene name, or -1 if it is not included in the dataset.
virtual size_t GetBins (size_t iNode) const =0
 Return the number of discrete values in the requested experimental node; -1 if the node is hidden or continuous.
virtual void Remove (size_t iY, size_t iX)=0
 Remove all data for the given dataset position.
virtual void FilterGenes (const CGenes &Genes, CDat::EFilter eFilter)=0
 Remove values from the dataset based on the given gene set and filter type.
virtual void Save (std::ostream &ostm, bool fBinary) const =0
 Save a dataset to the given stream in binary or tabular (human readable) form.

Detailed Description

An IDataset abstracts a collection of individual datasets, usually CDats, using various continuous and/or discrete encodings.

An IDataset is intended to manage a collection of individual datasets, usually CDats. This is often used for integration of many datasets in a model such as a Bayes net or SVM, and as such, IDatasets can be used to learn or evaluate these models. Although most datasets will be backed by discretized CDats with no hidden data (e.g. CDatasetCompact), the IDataset interface also allows for hidden nodes (see IsHidden) and continuous values (see GetContinuous).

The IDataset interface merges the gene lists from all contained data files into a single gene list, which it exposes through GetGenes/GetGene/GetGeneNames/etc. Gene indices are similarly normalized; requesting gene pair i,j will "mean" the same thing in each encapsulated dataset. Missing values will be filled in as necessary for data files not containing information for the requested pair. QUANT files associated with non-continuous data files will be loaded automatically.
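
The merging of gene lists described above can be pictured with a small sketch. This is not Sleipnir code: kMissing, MergedGenes, and merge_gene_lists are hypothetical names, and the sketch assumes only that each data file carries its own ordered list of gene names.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sentinel for "gene not present in this file".
constexpr std::size_t kMissing = static_cast<std::size_t>(-1);

// Hypothetical result type, not part of Sleipnir.
struct MergedGenes {
    std::vector<std::string> names;                // merged gene list
    std::vector<std::vector<std::size_t>> toFile;  // toFile[f][i]: position of merged gene i in file f
};

inline MergedGenes merge_gene_lists(const std::vector<std::vector<std::string>>& files) {
    MergedGenes merged;
    std::unordered_map<std::string, std::size_t> index;  // gene name -> merged index

    // First pass: give each distinct gene name one merged index, in first-seen order.
    for (const auto& genes : files)
        for (const auto& gene : genes)
            if (index.emplace(gene, merged.names.size()).second)
                merged.names.push_back(gene);

    // Second pass: record where each merged gene lives in each file.
    for (const auto& genes : files) {
        std::vector<std::size_t> positions(merged.names.size(), kMissing);
        for (std::size_t j = 0; j < genes.size(); ++j)
            positions[index[genes[j]]] = j;
        merged.toFile.push_back(positions);
    }
    return merged;
}
```

With such a mapping, a request for merged pair i,j can be translated into each file's own indices, and any file where either gene maps to kMissing contributes a missing value for that pair.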

Remarks:
The IDataset interface is something of a mess, since it evolved over time from something meant to match data files with Bayes nets to something meant to generically load lots of data. You're usually best off using CDatasetCompact and/or CDataFilter directly.
See also:
CDat | CPCLSet | CSVM | IBayesNet

Definition at line 64 of file dataset.h.


Member Function Documentation

virtual void Sleipnir::IDataset::FilterGenes ( const CGenes &  Genes,
CDat::EFilter  eFilter 
) [pure virtual]

Remove values from the dataset based on the given gene set and filter type.

Parameters:
Genes  Gene set used to filter the dataset.
eFilter  Way in which to use the given genes to remove values.

Remove values and genes (by removing all incident edges) from the dataset based on one of several algorithms. For details, see CDat::EFilter.

Remarks:
Generally implemented using Remove, so may not be supported by all implementations and may either mask or unload the filtered data.
See also:
CDat::FilterGenes

Implemented in Sleipnir::CDataSubset, Sleipnir::CDataFilter, Sleipnir::CDataMask, Sleipnir::CDatasetCompact, and Sleipnir::CDataset.

virtual size_t Sleipnir::IDataset::GetBins ( size_t  iNode) const [pure virtual]

Return the number of discrete values in the requested experimental node; -1 if the node is hidden or continuous.

Parameters:
iNode  Experimental node for which bin number should be returned.
Returns:
Number of discrete values taken by the given experimental node; -1 if the node is hidden or continuous.
See also:
GetDiscrete

Implemented in Sleipnir::CDataSubset, Sleipnir::CDataFilter, Sleipnir::CDataMask, Sleipnir::CDatasetCompact, and Sleipnir::CDataset.

Referenced by Sleipnir::CTrie< tType >::CTrie(), and Sleipnir::CBayesNetSmile::Open().

virtual float Sleipnir::IDataset::GetContinuous ( size_t  iY,
size_t  iX,
size_t  iNode 
) const [pure virtual]

Return the continuous value at the requested position.

Parameters:
iY  Data row.
iX  Data column.
iNode  Experimental node from which to retrieve the requested pair's value.
Returns:
Continuous value from the requested position and data file; not-a-number (NaN) if the value is missing.
Remarks:
Equivalent to using CDataPair::Get on the encapsulated data file with the appropriate indices. Behavior not defined when the corresponding data node is inherently discrete.
See also:
GetDiscrete

Implemented in Sleipnir::CDataSubset, Sleipnir::CDataFilter, Sleipnir::CDataMask, Sleipnir::CDatasetCompact, and Sleipnir::CDataset.
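
Since missing continuous values are reported as NaN, callers generally need to test for NaN before doing arithmetic. The following is a minimal sketch rather than library code: mean_present is a hypothetical helper, and its input stands for the values one might collect by calling GetContinuous for a single gene pair across several experimental nodes.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: average the non-missing values for one gene pair.
// Returns NaN when every value is missing.
inline float mean_present(const std::vector<float>& values) {
    float sum = 0.0f;
    std::size_t count = 0;
    for (float value : values) {
        if (std::isnan(value))
            continue;  // missing value, skip it
        sum += value;
        ++count;
    }
    return count ? sum / static_cast<float>(count)
                 : std::numeric_limits<float>::quiet_NaN();
}
```

Comparing a value to NaN with == is always false, so std::isnan is the reliable test here.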

virtual size_t Sleipnir::IDataset::GetDiscrete ( size_t  iY,
size_t  iX,
size_t  iNode 
) const [pure virtual]

Return the discretized value at the requested position.

Parameters:
iY  Data row.
iX  Data column.
iNode  Experimental node from which to retrieve the requested pair's value.
Returns:
Discretized value from the requested position and data file using that file's quantization information; -1 if the value is missing.
Remarks:
Equivalent to using CDataPair::Quantize and GetContinuous or CDataPair::Get on the encapsulated data file with the appropriate indices. Behavior not defined when no discretization information is available for the requested data node.

Implemented in Sleipnir::CDataSubset, Sleipnir::CDataFilter, Sleipnir::CDataMask, Sleipnir::CDatasetCompact, and Sleipnir::CDataset.

Referenced by Sleipnir::CTrie< tType >::CTrie(), and Sleipnir::CBayesNetFN::Learn().
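
To illustrate what discretizing a continuous value against quantization information can look like, here is a hedged sketch. It is not the CDataPair::Quantize implementation: quantize and kMissing are hypothetical names, and the sketch assumes that the quantization information is a sorted list of upper bin edges, with values above the last edge falling into the last bin.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sentinel mirroring the documented "-1 if the value is missing".
constexpr std::size_t kMissing = static_cast<std::size_t>(-1);

// Assumption: edges holds sorted upper bin boundaries, one per bin.
inline std::size_t quantize(float value, const std::vector<float>& edges) {
    if (std::isnan(value) || edges.empty())
        return kMissing;  // missing value or no discretization available
    for (std::size_t i = 0; i < edges.size(); ++i)
        if (value <= edges[i])
            return i;  // first bin whose upper edge covers the value
    return edges.size() - 1;  // above every edge: clamp to the last bin
}
```

Under these assumptions the number of bins equals the number of edges, which is the quantity GetBins would report for such a node.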

virtual size_t Sleipnir::IDataset::GetExperiments ( ) const [pure virtual]

Return the number of experimental nodes in the dataset.

Returns:
Number of experimental nodes in the dataset.
Remarks:
For most datasets (those not containing hidden nodes), this will be equal to the number of encapsulated data files.

Implemented in Sleipnir::CDataSubset, Sleipnir::CDataFilter, Sleipnir::CDataMask, Sleipnir::CDatasetCompact, and Sleipnir::CDataset.

Referenced by Sleipnir::CTrie< tType >::CTrie(), and Sleipnir::CBayesNetSmile::Open().

virtual const std::string& Sleipnir::IDataset::GetGene ( size_t  iGene) const [pure virtual]

Returns the gene name at the requested index.

Parameters:
iGene  Index of gene name to return.
Returns:
Gene name at the requested index.
Remarks:
For efficiency, no bounds checking is performed.
See also:
GetGenes

Implemented in Sleipnir::CDataSubset, Sleipnir::CDataFilter, Sleipnir::CDataMask, Sleipnir::CDatasetCompact, and Sleipnir::CDataset.

virtual size_t Sleipnir::IDataset::GetGene ( const std::string &  strGene) const [pure virtual]

Return the index of the given gene name, or -1 if it is not included in the dataset.

Parameters:
strGene  Gene name to retrieve.
Returns:
Index of the requested gene name, or -1 if it is not in the dataset.
See also:
GetGeneNames

Implemented in Sleipnir::CDataSubset, Sleipnir::CDataFilter, Sleipnir::CDataMask, Sleipnir::CDatasetCompact, and Sleipnir::CDataset.
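
Because the return type is size_t, the documented -1 is the largest representable size_t value, and a test such as "index < 0" can never be true. The sketch below shows one way to check for it; find_gene, has_gene, and kNotFound are hypothetical names that only imitate the documented contract and are not part of the library.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// The documented "-1", expressed in the return type.
constexpr std::size_t kNotFound = static_cast<std::size_t>(-1);

// Hypothetical lookup imitating the documented contract:
// index of the name, or -1 (as size_t) if it is absent.
inline std::size_t find_gene(const std::vector<std::string>& names,
                             const std::string& strGene) {
    for (std::size_t i = 0; i < names.size(); ++i)
        if (names[i] == strGene)
            return i;
    return kNotFound;
}

// Compare against the sentinel explicitly instead of testing for a negative value.
inline bool has_gene(const std::vector<std::string>& names,
                     const std::string& strGene) {
    return find_gene(names, strGene) != kNotFound;
}
```

The same comparison applies to the other size_t-returning members that document -1, such as GetBins and GetDiscrete.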

virtual const std::vector<std::string>& Sleipnir::IDataset::GetGeneNames ( ) const [pure virtual]

Return a vector of all gene names in the dataset.

Returns:
Vector of gene names in the dataset.
See also:
GetGenes | GetGene

Implemented in Sleipnir::CDataSubset, Sleipnir::CDataFilter, Sleipnir::CDataMask, Sleipnir::CDatasetCompact, and Sleipnir::CDataset.

virtual size_t Sleipnir::IDataset::GetGenes ( ) const [pure virtual]

Returns the number of genes in the dataset.

Returns:
Number of genes in the dataset.
Remarks:
Equal to the size of the union of all genes in the encapsulated data files.
See also:
GetGene

Implemented in Sleipnir::CDataSubset, Sleipnir::CDataFilter, Sleipnir::CDataMask, Sleipnir::CDatasetCompact, and Sleipnir::CDataset.

Referenced by Sleipnir::CDataMask::Attach(), Sleipnir::CTrie< tType >::CTrie(), and Sleipnir::CBayesNetFN::Learn().

virtual bool Sleipnir::IDataset::IsExample ( size_t  iY,
size_t  iX 
) const [pure virtual]

Returns true if some data file can be accessed at the requested position.

Parameters:
iY  Data row.
iX  Data column.
Returns:
True if a data file can be accessed at the requested position.

A dataset position is a usable example if at least one data file can be accessed at that position; that is, if some data file provides a non-missing value for that gene pair. Implementations that filter pairs in some manner can also prevent particular positions from being usable examples.

Implemented in Sleipnir::CDataSubset, Sleipnir::CDataFilter, Sleipnir::CDataMask, Sleipnir::CDatasetCompactMap, Sleipnir::CDatasetCompact, and Sleipnir::CDataset.

Referenced by Sleipnir::CDataMask::Attach(), Sleipnir::CTrie< tType >::CTrie(), Sleipnir::CDataFilter::IsExample(), and Sleipnir::CBayesNetFN::Learn().
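
A typical use of IsExample is to walk over gene pairs and skip positions with no usable data. The sketch below shows that loop against a stand-in interface: DatasetLike and ToyDataset are hypothetical types that mirror only the two documented signatures they need, and visiting each unordered pair once (iY < iX) is an assumption about how pairs are laid out.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in exposing only the two calls used below.
struct DatasetLike {
    virtual ~DatasetLike() = default;
    virtual std::size_t GetGenes() const = 0;
    virtual bool IsExample(std::size_t iY, std::size_t iX) const = 0;
};

// Toy implementation backed by a symmetric boolean matrix.
struct ToyDataset : DatasetLike {
    explicit ToyDataset(std::size_t genes)
        : usable(genes, std::vector<bool>(genes, false)) {}

    void Set(std::size_t i, std::size_t j) {
        usable[i][j] = true;
        usable[j][i] = true;
    }
    std::size_t GetGenes() const override { return usable.size(); }
    bool IsExample(std::size_t iY, std::size_t iX) const override {
        return usable[iY][iX];
    }

    std::vector<std::vector<bool>> usable;
};

// Count usable examples, visiting each unordered gene pair once.
inline std::size_t count_examples(const DatasetLike& data) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < data.GetGenes(); ++i)
        for (std::size_t j = i + 1; j < data.GetGenes(); ++j)
            if (data.IsExample(i, j))
                ++count;
    return count;
}
```

In real code the loop body would usually go on to read values with GetDiscrete or GetContinuous for each experimental node.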

virtual bool Sleipnir::IDataset::IsHidden ( size_t  iNode) const [pure virtual]

Returns true if the requested experimental node is hidden (does not correspond to a data file).

Parameters:
iNode  Experimental node to investigate.
Returns:
True if the requested experimental node is hidden.

Since a dataset can be constructed either directly on a collection of data files or by tying a model such as a Bayes net to data files, IDataset can determine which model nodes are hidden by testing whether a data file exists for them. If no such file exists, the node is hidden and, for example, can be treated specially during Bayesian learning.

Remarks:
Datasets constructed directly from data files will never have hidden nodes.

Implemented in Sleipnir::CDataSubset, Sleipnir::CDataFilter, Sleipnir::CDataMask, Sleipnir::CDatasetCompact, and Sleipnir::CDataset.

Referenced by Sleipnir::CTrie< tType >::CTrie().

virtual void Sleipnir::IDataset::Remove ( size_t  iY,
size_t  iX 
) [pure virtual]

Remove all data for the given dataset position.

Parameters:
iY  Data row.
iX  Data column.

Unloads or masks data from all encapsulated files for the requested gene pair.

Remarks:
For efficiency, bounds checking is not performed; the given row and column should both be less than the value returned by GetGenes. Not supported by all implementations.

Implemented in Sleipnir::CDataSubset, Sleipnir::CDataFilter, Sleipnir::CDataMask, Sleipnir::CDatasetCompactMap, Sleipnir::CDatasetCompact, and Sleipnir::CDataset.

virtual void Sleipnir::IDataset::Save ( std::ostream &  ostm,
bool  fBinary 
) const [pure virtual]

Save a dataset to the given stream in binary or tabular (human readable) form.

Parameters:
ostm  Stream into which dataset is saved.
fBinary  If true, save the dataset as a binary file; if false, save it as a text-based tab-delimited file.
Remarks:
If fBinary is true, output stream must be binary.

Implemented in Sleipnir::CDataFilter, Sleipnir::CDataMask, Sleipnir::CDatasetCompact, and Sleipnir::CDataset.
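
The remark about binary streams concerns how the caller opens the stream before passing it to Save. A minimal sketch, assuming the output goes to a file stream: save_mode is a hypothetical helper, not part of Sleipnir, that picks the standard open mode to match fBinary.

```cpp
#include <ios>

// Hypothetical helper: choose an open mode consistent with the fBinary flag
// that will later be passed to Save.
inline std::ios_base::openmode save_mode(bool fBinary) {
    std::ios_base::openmode mode = std::ios_base::out | std::ios_base::trunc;
    if (fBinary)
        mode |= std::ios_base::binary;  // avoid newline translation in binary output
    return mode;
}
```

A caller could then construct a std::ofstream with save_mode(fBinary) as its mode argument and hand that stream to Save with the same flag.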


The documentation for this class was generated from the following file: