Sleipnir
An implementation of IDataset optimized for compactly storing discrete data.
#include <dataset.h>
Public Member Functions | |
bool | Open (const CDataPair &Answers, const char *szDataDirectory, const IBayesNet *pBayesNet, bool fEverything=false) |
Construct a dataset corresponding to the given Bayes net using the provided answer file and data files from the given directory. | |
bool | Open (const CDataPair &Answers, const char *szDataDirectory, const IBayesNet *pBayesNet, const CGenes &GenesInclude, const CGenes &GenesExclude, bool fEverything=false) |
Construct a dataset corresponding to the given Bayes net using the provided answer file and data files from the given directory. | |
bool | Open (const std::vector< std::string > &vecstrDataFiles, bool fMemmap=false) |
Construct a dataset corresponding to the given files. | |
bool | Open (std::istream &istm) |
Load a binary DAD dataset from the given binary stream. | |
bool | Open (const CGenes &GenesInclude, const CGenes &GenesExclude, const CDataPair &Answers, const std::vector< std::string > &vecstrPCLs, size_t iSkip, const IMeasure *pMeasure, const std::vector< float > &vecdBinEdges) |
Constructs a dataset using the provided answer file and data matrices generated from the given PCLs and similarity measure. | |
bool | Open (const CDataPair &Answers, const std::vector< std::string > &vecstrDataFiles, bool fEverything=false, bool fMemmap=false, size_t iSkip=2, bool fZScore=false) |
Construct a dataset using the provided answer file and data files. | |
bool | FilterGenes (const char *szGenes, CDat::EFilter eFilter) |
Remove values from the dataset based on the given gene file and filter type. | |
void | FilterAnswers () |
Removes all data for gene pairs lacking a value in the answer (0th) data file. | |
void | Randomize () |
Randomizes the contents of all data files except the answer (0th) data file. | |
bool | Open (const char *szDataDirectory, const IBayesNet *pBayesNet) |
Construct a dataset corresponding to the given Bayes net using files from the given directory. | |
bool | Open (const char *szDataDirectory, const IBayesNet *pBayesNet, const CGenes &GenesInclude, const CGenes &GenesExclude) |
Construct a dataset corresponding to the given Bayes net using files from the given directory. | |
bool | OpenGenes (const std::vector< std::string > &vecstrDataFiles) |
Open only the merged gene list from the given data files. | |
void | Save (std::ostream &ostm, bool fBinary) const |
Save a dataset to the given stream in binary or tabular (human readable) form. | |
float | GetContinuous (size_t iY, size_t iX, size_t iNode) const |
Return the continuous value at the requested position. | |
const std::string & | GetGene (size_t iGene) const |
Returns the gene name at the requested index. | |
size_t | GetGenes () const |
Returns the number of genes in the dataset. | |
bool | IsExample (size_t iY, size_t iX) const |
Returns true if some data file can be accessed at the requested position. | |
void | FilterGenes (const CGenes &Genes, CDat::EFilter eFilter) |
Remove values from the dataset based on the given gene set and filter type. | |
bool | IsHidden (size_t iNode) const |
Returns true if the requested experimental node is hidden (does not correspond to a data file). | |
size_t | GetDiscrete (size_t iY, size_t iX, size_t iNode) const |
Return the discretized value at the requested position. | |
const std::vector< std::string > & | GetGeneNames () const |
Return a vector of all gene names in the dataset. | |
size_t | GetExperiments () const |
Return the number of experimental nodes in the dataset. | |
size_t | GetGene (const std::string &strGene) const |
Return the index of the given gene name, or -1 if it is not included in the dataset. | |
size_t | GetBins (size_t iNode) const |
Return the number of discrete values in the requested experimental node; -1 if the node is hidden or continuous. | |
void | Remove (size_t iY, size_t iX) |
Remove all data for the given dataset position. |
An implementation of IDataset optimized for compactly storing discrete data.
A compact dataset represents a collection of pre-discretized data files in a compact form. This can be stored independently of the original continuous data (usually DAB/QUANT file pairs) for rapid reloading (or memory mapping) without the overhead of repeated discretization. Such a file is referred to as a DAD file and can be stored in either binary or text (human readable) form. As text, it is a large tab-delimited table of the form:
GENE1	GENE2	VALUE1	VALUE2	...	VALUEN
GENE1	GENE3	VALUE1	VALUE2	...	VALUEN
GENE2	GENE3	VALUE1	VALUE2	...	VALUEN
As in a DAT file, gene pair order is arbitrary, and duplicate gene pairs are not recommended. Missing values are indicated by blank cells, and all other values should be small integers (i.e., discretized values).
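The text layout described above can be read with standard C++ alone. The following is a minimal sketch, not Sleipnir's own parser: `SDadRow`, `SplitTabs`, `ParseDadLine`, and the `kMissing` sentinel are hypothetical names introduced for illustration, and the code assumes one gene pair per line with tab-delimited cells.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sentinel for a blank (missing) cell; not a Sleipnir constant.
const int kMissing = -1;

// Hypothetical holder for one row of a text DAD file.
struct SDadRow {
    std::string strGeneOne;
    std::string strGeneTwo;
    std::vector<int> veciValues; // one entry per data file, in column order
};

// Split a line on tabs, keeping empty cells so blanks stay in position.
std::vector<std::string> SplitTabs(const std::string& strLine) {
    std::vector<std::string> vecstrCells;
    std::size_t iStart = 0;
    for (;;) {
        const std::size_t iTab = strLine.find('\t', iStart);
        if (iTab == std::string::npos) {
            vecstrCells.push_back(strLine.substr(iStart));
            break;
        }
        vecstrCells.push_back(strLine.substr(iStart, iTab - iStart));
        iStart = iTab + 1;
    }
    return vecstrCells;
}

// Parse "GENE1<tab>GENE2<tab>VALUE1<tab>...<tab>VALUEN".
// Returns false if either gene name is absent.
bool ParseDadLine(const std::string& strLine, SDadRow& Row) {
    const std::vector<std::string> vecstrCells = SplitTabs(strLine);
    if (vecstrCells.size() < 2 || vecstrCells[0].empty() || vecstrCells[1].empty())
        return false;
    Row.strGeneOne = vecstrCells[0];
    Row.strGeneTwo = vecstrCells[1];
    Row.veciValues.clear();
    for (std::size_t i = 2; i < vecstrCells.size(); ++i)
        Row.veciValues.push_back(vecstrCells[i].empty() ? kMissing
                                                        : std::stoi(vecstrCells[i]));
    return true;
}
```

Calling `ParseDadLine("GENE1\tGENE2\t0\t\t2", Row)` would fill `Row.veciValues` with 0, `kMissing`, and 2. Note that `std::stoi` throws on a non-integer cell, so a production reader would want to catch that case.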
void Sleipnir::CDatasetCompact::FilterAnswers | ( | ) |
Removes all data for gene pairs lacking a value in the answer (0th) data file.
Definition at line 386 of file datasetcompact.cpp.
References GetDiscrete(), GetGenes(), IsExample(), and Remove().
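The four members referenced here suggest a simple loop over gene pairs. The sketch below illustrates that idea on a toy class rather than on CDatasetCompact itself: `CToyDataset`, its `Set` method, and the `kMissing` marker are hypothetical, and treating a `size_t` value of -1 as "missing" is an assumption made for this example.

```cpp
#include <cstddef>
#include <vector>

// Assumed missing-value marker (size_t "-1"); hypothetical, for illustration only.
const std::size_t kMissing = static_cast<std::size_t>(-1);

// Hypothetical stand-in for a compact dataset; not the Sleipnir class.
class CToyDataset {
public:
    CToyDataset(std::size_t iGenes, std::size_t iNodes)
        : m_iGenes(iGenes), m_iNodes(iNodes),
          m_veciData(iGenes * iGenes * iNodes, kMissing) {}

    std::size_t GetGenes() const { return m_iGenes; }

    std::size_t GetDiscrete(std::size_t iY, std::size_t iX, std::size_t iNode) const {
        return m_veciData[Index(iY, iX, iNode)];
    }

    // Store a value symmetrically, since gene pairs are unordered.
    void Set(std::size_t iY, std::size_t iX, std::size_t iNode, std::size_t iValue) {
        m_veciData[Index(iY, iX, iNode)] = iValue;
        m_veciData[Index(iX, iY, iNode)] = iValue;
    }

    // A pair is a usable example if any node holds a non-missing value for it.
    bool IsExample(std::size_t iY, std::size_t iX) const {
        for (std::size_t iNode = 0; iNode < m_iNodes; ++iNode)
            if (GetDiscrete(iY, iX, iNode) != kMissing)
                return true;
        return false;
    }

    // Mask the pair in every node.
    void Remove(std::size_t iY, std::size_t iX) {
        for (std::size_t iNode = 0; iNode < m_iNodes; ++iNode)
            Set(iY, iX, iNode, kMissing);
    }

    // Drop every pair that has data but no value in the answer (0th) node.
    void FilterAnswers() {
        for (std::size_t i = 0; i < GetGenes(); ++i)
            for (std::size_t j = i + 1; j < GetGenes(); ++j)
                if (IsExample(i, j) && GetDiscrete(i, j, 0) == kMissing)
                    Remove(i, j);
    }

private:
    std::size_t Index(std::size_t iY, std::size_t iX, std::size_t iNode) const {
        return (iY * m_iGenes + iX) * m_iNodes + iNode;
    }

    std::size_t m_iGenes;
    std::size_t m_iNodes;
    std::vector<std::size_t> m_veciData;
};
```

After `FilterAnswers()`, a pair with values only in nodes 1 and above is no longer an example, while a pair that also has an answer value keeps all of its data.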
bool Sleipnir::CDatasetCompact::FilterGenes | ( | const char * | szGenes, |
CDat::EFilter | eFilter | ||
) |
Remove values from the dataset based on the given gene file and filter type.
szGenes | File from which gene names are loaded, one per line. |
eFilter | Way in which to use the given genes to remove values. |
Remove values and genes (by removing all incident edges) from the dataset based on one of several algorithms. For details, see CDat::EFilter.
Definition at line 367 of file datasetcompact.cpp.
References Sleipnir::CGenes::Open().
Referenced by FilterGenes(), and Open().
void Sleipnir::CDatasetCompact::FilterGenes | ( | const CGenes & | Genes, |
CDat::EFilter | eFilter | ||
) | [inline, virtual] |
Remove values from the dataset based on the given gene set and filter type.
Genes | Gene set used to filter the dataset. |
eFilter | Way in which to use the given genes to remove values. |
Remove values and genes (by removing all incident edges) from the dataset based on one of several algorithms. For details, see CDat::EFilter.
Implements Sleipnir::IDataset.
Definition at line 578 of file dataset.h.
References FilterGenes().
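To make "removing all incident edges" concrete, here is a small sketch of include and exclude filtering on a plain edge list. It is not the Sleipnir implementation: `EToyFilter`, `TToyEdge`, and `FilterEdges` are hypothetical, and the semantics are an assumption (include keeps only edges whose genes are both in the set; exclude drops any edge touching a gene in the set). CDat::EFilter remains the authoritative description and defines further filter types not shown here.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical filter modes, loosely modeled on include/exclude filtering.
enum EToyFilter { EToyFilterInclude, EToyFilterExclude };

// Hypothetical edge type: an unordered gene pair.
typedef std::pair<std::string, std::string> TToyEdge;

// Return the edges that survive filtering against the given gene set.
std::vector<TToyEdge> FilterEdges(const std::vector<TToyEdge>& vecEdges,
                                  const std::set<std::string>& setstrGenes,
                                  EToyFilter eFilter) {
    std::vector<TToyEdge> vecKept;
    for (const TToyEdge& Edge : vecEdges) {
        const bool fOne = setstrGenes.count(Edge.first) != 0;
        const bool fTwo = setstrGenes.count(Edge.second) != 0;
        // Assumed semantics: include needs both genes in the set,
        // exclude needs neither gene in the set.
        const bool fKeep = (eFilter == EToyFilterInclude) ? (fOne && fTwo)
                                                          : (!fOne && !fTwo);
        if (fKeep)
            vecKept.push_back(Edge);
    }
    return vecKept;
}
```

Under these assumptions, a gene absent from an include set loses every edge it participates in, which is how filtering values also removes genes.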
size_t Sleipnir::CDatasetCompact::GetBins | ( | size_t | iNode | ) | const [inline, virtual] |
Return the number of discrete values in the requested experimental node; -1 if the node is hidden or continuous.
iNode | Experimental node for which bin number should be returned. |
Implements Sleipnir::IDataset.
float Sleipnir::CDatasetCompact::GetContinuous | ( | size_t | iY, |
size_t | iX, | ||
size_t | iNode | ||
) | const [inline, virtual] |
Return the continuous value at the requested position.
iY | Data row. |
iX | Data column. |
iNode | Experimental node from which to retrieve the requested pair's value. |
Implements Sleipnir::IDataset.
Definition at line 559 of file dataset.h.
References Sleipnir::CMeta::GetNaN().
size_t Sleipnir::CDatasetCompact::GetDiscrete | ( | size_t | iY, |
size_t | iX, | ||
size_t | iNode | ||
) | const [inline, virtual] |
Return the discretized value at the requested position.
iY | Data row. |
iX | Data column. |
iNode | Experimental node from which to retrieve the requested pair's value. |
Implements Sleipnir::IDataset.
Definition at line 586 of file dataset.h.
Referenced by FilterAnswers().
size_t Sleipnir::CDatasetCompact::GetExperiments | ( | ) | const [inline, virtual] |
Return the number of experimental nodes in the dataset.
Implements Sleipnir::IDataset.
const std::string& Sleipnir::CDatasetCompact::GetGene | ( | size_t | iGene | ) | const [inline, virtual] |
Returns the gene name at the requested index.
iGene | Index of gene name to return. |
Implements Sleipnir::IDataset.
Definition at line 566 of file dataset.h.
Referenced by GetGene().
size_t Sleipnir::CDatasetCompact::GetGene | ( | const std::string & | strGene | ) | const [inline, virtual] |
Return the index of the given gene name, or -1 if it is not included in the dataset.
strGene | Gene name to retrieve. |
Implements Sleipnir::IDataset.
Definition at line 598 of file dataset.h.
References GetGene().
const std::vector<std::string>& Sleipnir::CDatasetCompact::GetGeneNames | ( | ) | const [inline, virtual] |
Return a vector of all gene names in the dataset.
Implements Sleipnir::IDataset.
size_t Sleipnir::CDatasetCompact::GetGenes | ( | ) | const [inline, virtual] |
Returns the number of genes in the dataset.
Implements Sleipnir::IDataset.
Definition at line 570 of file dataset.h.
Referenced by FilterAnswers(), and Sleipnir::CDatasetCompactMap::Open().
bool Sleipnir::CDatasetCompact::IsExample | ( | size_t | iY, |
size_t | iX | ||
) | const [inline, virtual] |
Returns true if some data file can be accessed at the requested position.
iY | Data row. |
iX | Data column. |
A dataset position is a usable example if at least one data file can be accessed at that position; that is, if some data file provides a non-missing value for that gene pair. Implementations that filter pairs in some manner can also prevent particular positions from being usable examples.
Implements Sleipnir::IDataset.
Reimplemented in Sleipnir::CDatasetCompactMap.
Definition at line 574 of file dataset.h.
Referenced by FilterAnswers(), and Sleipnir::CDatasetCompactMap::Open().
bool Sleipnir::CDatasetCompact::IsHidden | ( | size_t | iNode | ) | const [inline, virtual] |
Returns true if the requested experimental node is hidden (does not correspond to a data file).
iNode | Experimental node to investigate. |
Since a dataset can be constructed either directly on a collection of data files or by tying a model such as a Bayes net to data files, IDataset can determine which model nodes are hidden by testing whether a data file exists for them. If no such file exists, the node is hidden and, for example, can be treated specially during Bayesian learning.
Implements Sleipnir::IDataset.
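The hidden-node rule can be illustrated without the library: a node is hidden when no file named after it is available. In this sketch `MarkHiddenNodes` is a hypothetical helper, the set of file names stands in for a directory listing, and the extensions to try are supplied by the caller rather than taken from CDat.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Returns one flag per node: true when no file named "<node><extension>"
// is present for any of the given extensions (i.e. the node is hidden).
std::vector<bool> MarkHiddenNodes(const std::vector<std::string>& vecstrNodes,
                                  const std::set<std::string>& setstrFiles,
                                  const std::vector<std::string>& vecstrExtensions) {
    std::vector<bool> vecfHidden(vecstrNodes.size(), true);
    for (std::size_t i = 0; i < vecstrNodes.size(); ++i)
        for (const std::string& strExtension : vecstrExtensions)
            if (setstrFiles.count(vecstrNodes[i] + strExtension) != 0) {
                vecfHidden[i] = false; // a matching data file exists
                break;
            }
    return vecfHidden;
}
```

For nodes named "FR", "coexpression", and "latent" with only "coexpression.dab" on disk, the first and last nodes would be flagged as hidden.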
bool Sleipnir::CDatasetCompact::Open | ( | const CDataPair & | Answers, |
const char * | szDataDirectory, | ||
const IBayesNet * | pBayesNet, | ||
bool | fEverything = false | ||
) |
Construct a dataset corresponding to the given Bayes net using the provided answer file and data files from the given directory.
Answers | Pre-loaded answer file which will become the first node of the dataset. |
szDataDirectory | Directory from which data files are loaded. |
pBayesNet | Bayes net whose nodes will correspond to files in the dataset. |
fEverything | If true, load all data; if false, load only data for gene pairs with values in the given answer file. |
Creates a dataset with nodes corresponding to the given Bayes net structure; the given answer file is always inserted as the first (0th) data file, and thus corresponds to the first node in the Bayes net (generally the class node predicting functional relationships). Nodes for which a corresponding data file (i.e. one with the same name followed by an appropriate CDat extension) cannot be located are marked as hidden.
Definition at line 112 of file datasetcompact.cpp.
Referenced by Open().
bool Sleipnir::CDatasetCompact::Open | ( | const CDataPair & | Answers, |
const char * | szDataDirectory, | ||
const IBayesNet * | pBayesNet, | ||
const CGenes & | GenesInclude, | ||
const CGenes & | GenesExclude, | ||
bool | fEverything = false | ||
) |
Construct a dataset corresponding to the given Bayes net using the provided answer file and data files from the given directory.
Answers | Pre-loaded answer file which will become the first node of the dataset. |
szDataDirectory | Directory from which data files are loaded. |
pBayesNet | Bayes net whose nodes will correspond to files in the dataset. |
GenesInclude | Data is filtered using FilterGenes with CDat::EFilterInclude and the given gene set (unless empty). |
GenesExclude | Data is filtered using FilterGenes with CDat::EFilterExclude and the given gene set (unless empty). |
fEverything | If true, load all data; if false, load only data for gene pairs with values in the given answer file. |
Creates a dataset with nodes corresponding to the given Bayes net structure; the given answer file is always inserted as the first (0th) data file, and thus corresponds to the first node in the Bayes net (generally the class node predicting functional relationships). Nodes for which a corresponding data file (i.e. one with the same name followed by an appropriate CDat extension) cannot be located are marked as hidden.
Definition at line 154 of file datasetcompact.cpp.
References Sleipnir::CDat::GetGene(), Sleipnir::CDat::GetGenes(), Sleipnir::CGenes::GetGenes(), Sleipnir::IBayesNet::GetNodes(), Sleipnir::IBayesNet::IsContinuous(), and Sleipnir::CDataPair::Open().
bool Sleipnir::CDatasetCompact::Open | ( | const std::vector< std::string > & | vecstrDataFiles, |
bool | fMemmap = false | ||
) |
Construct a dataset corresponding to the given files.
vecstrDataFiles | Vector of file paths to load. |
fMemmap | If true, memory map data files while they are being discretized rather than loading them into memory. |
Creates a dataset with nodes corresponding to the given data files.
Definition at line 54 of file datasetcompact.cpp.
References Sleipnir::CDataPair::Open(), and OpenGenes().
bool Sleipnir::CDatasetCompact::Open | ( | std::istream & | istm | ) |
Load a binary DAD dataset from the given binary stream.
istm | Stream from which dataset is loaded. |
Definition at line 447 of file datasetcompact.cpp.
References Open().
bool Sleipnir::CDatasetCompact::Open | ( | const CGenes & | GenesInclude, |
const CGenes & | GenesExclude, | ||
const CDataPair & | Answers, | ||
const std::vector< std::string > & | vecstrPCLs, | ||
size_t | iSkip, | ||
const IMeasure * | pMeasure, | ||
const std::vector< float > & | vecdBinEdges | ||
) |
Constructs a dataset using the provided answer file and data matrices generated from the given PCLs and similarity measure.
GenesInclude | Data is filtered using FilterGenes with CDat::EFilterInclude and the given gene set (unless empty). |
GenesExclude | Data is filtered using FilterGenes with CDat::EFilterExclude and the given gene set (unless empty). |
Answers | Pre-loaded answer file which will become the first node of the dataset. |
vecstrPCLs | Vector of PCL file paths from which pairwise scores are calculated. |
iSkip | The number of columns to skip between the ID and experiments in each PCL. |
pMeasure | Similarity measure used to calculate pairwise scores between genes. |
vecdBinEdges | Vector of values corresponding to discretization bin edges (the last of which is ignored) for all PCLs. |
Constructs a dataset by loading each PCL, converting it to pairwise scores using the given similarity measure, and discretizing these scores using the given bin edges. The given answer file and the resulting matrices are collected together in order in the dataset.
Definition at line 525 of file datasetcompact.cpp.
References Sleipnir::CDat::ENormalizeZScore, Sleipnir::CPCL::Get(), Sleipnir::CPCL::GetExperiments(), Sleipnir::CDat::GetGene(), Sleipnir::CPCL::GetGene(), Sleipnir::CGenes::GetGene(), Sleipnir::CPCL::GetGeneNames(), Sleipnir::CDat::GetGenes(), Sleipnir::CPCL::GetGenes(), Sleipnir::CGenes::GetGenes(), Sleipnir::CGene::GetName(), Sleipnir::CMeta::GetNaN(), Sleipnir::CHalfMatrix< tType >::GetSize(), Sleipnir::CDataPair::GetValues(), Sleipnir::CHalfMatrix< tType >::Initialize(), Sleipnir::CGenes::IsGene(), Sleipnir::IMeasure::IsRank(), Sleipnir::IMeasure::Measure(), Sleipnir::CDat::Normalize(), Sleipnir::CDataPair::Open(), Sleipnir::CPCL::Open(), Sleipnir::CPCL::RankTransform(), Sleipnir::CHalfMatrix< tType >::Set(), and Sleipnir::CDataPair::SetQuants().
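The discretization step, including the remark that the last bin edge is ignored, can be sketched as follows. This is one plausible reading rather than the library's code: `QuantizeSketch` and `kMissingBin` are hypothetical names, a value is assigned to the first bin whose edge is greater than or equal to it, and anything beyond the second-to-last edge falls into the final bin regardless of the last edge's value.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed marker for a missing (NaN) score; hypothetical, for illustration only.
const std::size_t kMissingBin = static_cast<std::size_t>(-1);

// Map a continuous score to a bin index using ascending bin edges.
// The last edge is never compared against: it only marks the final bin.
std::size_t QuantizeSketch(float dValue, const std::vector<float>& vecdBinEdges) {
    assert(!vecdBinEdges.empty());
    if (std::isnan(dValue))
        return kMissingBin;
    const std::size_t iLast = vecdBinEdges.size() - 1;
    for (std::size_t i = 0; i < iLast; ++i)
        if (dValue <= vecdBinEdges[i])
            return i;
    return iLast;
}
```

With edges -0.5, 0.5, and 1.0, a score of 0.0 lands in bin 1, and both 0.75 and 5.0 land in bin 2, which shows why the value of the final edge does not matter.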
bool Sleipnir::CDatasetCompact::Open | ( | const CDataPair & | Answers, |
const std::vector< std::string > & | vecstrDataFiles, | ||
bool | fEverything = false, | ||
bool | fMemmap = false, | ||
size_t | iSkip = 2, | ||
bool | fZScore = false | ||
) |
Construct a dataset using the provided answer file and data files.
Answers | Pre-loaded answer file which will become the first node of the dataset. |
vecstrDataFiles | Vector of file paths to load. |
fEverything | If true, load all data; if false, load only data for gene pairs with values in the given answer file. |
fMemmap | If true, memory map data files while they are being discretized rather than loading them into memory. |
iSkip | If any of the given files is a PCL, the number of columns to skip between the ID and experiments. |
fZScore | If true and any of the given files is a PCL, z-score similarity measures after pairwise calculation. |
Creates a dataset with nodes corresponding to the given answer and data files. The given data file names are each loaded using CDat::Open. The answer file is always inserted as the first (0th) data file.
Definition at line 235 of file datasetcompact.cpp.
References Sleipnir::CDat::GetGene(), Sleipnir::CDat::GetGenes(), Sleipnir::CDataPair::IsContinuous(), Sleipnir::CDataPair::Open(), Sleipnir::CDat::OpenGenes(), and Sleipnir::CCompactMatrix::Set().
bool Sleipnir::CDatasetCompact::Open | ( | const char * | szDataDirectory, |
const IBayesNet * | pBayesNet | ||
) | [inline] |
Construct a dataset corresponding to the given Bayes net using files from the given directory.
szDataDirectory | Directory from which data files are loaded. |
pBayesNet | Bayes net whose nodes will correspond to files in the dataset. |
Creates a dataset (without an answer file) with nodes corresponding to the given Bayes net structure. Nodes for which a corresponding data file (i.e. one with the same name followed by an appropriate CDat extension) cannot be located are marked as hidden.
Definition at line 491 of file dataset.h.
References Open().
bool Sleipnir::CDatasetCompact::Open | ( | const char * | szDataDirectory, |
const IBayesNet * | pBayesNet, | ||
const CGenes & | GenesInclude, | ||
const CGenes & | GenesExclude | ||
) | [inline] |
Construct a dataset corresponding to the given Bayes net using files from the given directory.
szDataDirectory | Directory from which data files are loaded. |
pBayesNet | Bayes net whose nodes will correspond to files in the dataset. |
GenesInclude | Data is filtered using FilterGenes with CDat::EFilterInclude and the given gene set (unless empty). |
GenesExclude | Data is filtered using FilterGenes with CDat::EFilterExclude and the given gene set (unless empty). |
Creates a dataset (without an answer file) with nodes corresponding to the given Bayes net structure. Nodes for which a corresponding data file (i.e. one with the same name followed by an appropriate CDat extension) cannot be located are marked as hidden.
Definition at line 521 of file dataset.h.
References Sleipnir::CDat::EFilterExclude, Sleipnir::CDat::EFilterInclude, and FilterGenes().
bool Sleipnir::CDatasetCompact::OpenGenes | ( | const std::vector< std::string > & | vecstrDataFiles | ) | [inline] |
Open only the merged gene list from the given data files.
vecstrDataFiles | Vector of file paths to load. |
Provides a way to rapidly list the set of all genes present in a given collection of data files while avoiding the overhead of loading the data itself.
Reimplemented from Sleipnir::CDataImpl.
Definition at line 551 of file dataset.h.
Referenced by Open().
void Sleipnir::CDatasetCompact::Randomize | ( | ) |
Randomizes the contents of all data files except the answer (0th) data file.
Definition at line 638 of file datasetcompact.cpp.
void Sleipnir::CDatasetCompact::Remove | ( | size_t | iY, |
size_t | iX | ||
) | [inline, virtual] |
Remove all data for the given dataset position.
iY | Data row. |
iX | Data column. |
Unloads or masks data from all encapsulated files for the requested gene pair.
Implements Sleipnir::IDataset.
Reimplemented in Sleipnir::CDatasetCompactMap.
Definition at line 606 of file dataset.h.
Referenced by FilterAnswers().
void Sleipnir::CDatasetCompact::Save | ( | std::ostream & | ostm, |
bool | fBinary | ||
) | const [inline, virtual] |
Save a dataset to the given stream in binary or tabular (human readable) form.
ostm | Stream into which dataset is saved. |
fBinary | If true, save the dataset as a binary file; if false, save it as a text-based tab-delimited file. |
Implements Sleipnir::IDataset.