Sleipnir
Public Types | Public Member Functions
Sleipnir::CDat Class Reference

Stores a continuously valued half matrix paired with a list of names for matrix elements. More...

#include <dat.h>

Inheritance diagram for Sleipnir::CDat:
Sleipnir::CDatImpl Sleipnir::CDataPairImpl Sleipnir::CDataPair

Public Types

enum  EFilter {
  EFilterInclude = 0, EFilterTerm = EFilterInclude + 1, EFilterExclude = EFilterTerm + 1, EFilterPixie = EFilterExclude + 1,
  EFilterEdge = EFilterPixie + 1, EFilterHefalmp = EFilterEdge + 1, EFilterIncludePos = EFilterHefalmp +1, EFilterExEdge = EFilterIncludePos +1
}
 Ways in which nodes/edges can be removed to filter a CDat. More...
enum  EFormat {
  EFormatBinary = 0, EFormatText = EFormatBinary + 1, EFormatPCL = EFormatText + 1, EFormatSparse = EFormatPCL + 1,
  EFormatQdab = EFormatSparse + 1
}
 Ways in which a CDat can be persisted to/from disk. More...
enum  ENormalize {
  ENormalizeNone = 0, ENormalizeMinMax = ENormalizeNone + 1, ENormalizeMinMaxNPone = ENormalizeMinMax + 1, ENormalizeZScore = ENormalizeMinMaxNPone + 1,
  ENormalizeSigmoid = ENormalizeZScore + 1, ENormalizeNormCDF = ENormalizeSigmoid + 1, ENormalizePCC = ENormalizeNormCDF + 1
}
 Ways in which a CDat can have its edge values normalized. More...

Public Member Functions

bool Open (const char *szFile, bool fMemmap=false, size_t iSkip=2, bool fZScore=false, bool fDuplicates=false, bool fSeek=false)
 Open a CDat stored in the given file, guessing the format from the file's extension.
bool Open (std::istream &istm, EFormat eFormat=EFormatBinary, float dDefault=HUGE_VAL, bool fDuplicates=false, size_t iSkip=2, bool fZScore=false, bool fSeek=false)
 Open a CDat stored in the given stream with the given format, processing missing values and duplicates as specified.
bool Open (const CSlim &Slim)
 Construct a CDat from the given ontology slim.
bool Open (const CSlim &SlimPositives, const CSlim &SlimNonnegatives)
 Construct a CDat from the given ontology slims.
bool Open (const std::vector< std::string > &vecstrGenes, bool fClear=true, const char *szFile=NULL)
 Construct a new CDat with the given gene names.
bool Open (const std::vector< std::string > &vecstrGenes, const CDistanceMatrix &MatValues)
 Construct a new CDat with the given gene names and values.
bool Open (const std::vector< CGenes * > &vecpPositives, const std::vector< CGenes * > &vecpNonnegatives, float dPValue, const CGenome &Genome, bool fIncident=false)
 Construct a CDat from the given gene sets.
bool Open (const CDat &DatKnown, const std::vector< CGenes * > &vecpOther, const CGenome &Genome, bool fKnownNegatives, bool fIncident=false)
 Construct a CDat from the given known gene relationships and gene sets.
bool Open (const CPCL &PCL, const IMeasure *pMeasure, bool fMeasureMemory)
 Construct a new CDat backed by the given PCL and similarity measure.
bool Open (const CDat &Dat)
 Construct a copy of the given CDat.
bool OpenGenes (std::istream &istm, bool fBinary, bool fPCL=false)
 Open the genes from a CDat stored in the stream with the given format.
bool OpenGenes (const char *szFile, size_t iSkip=2)
 Open the genes from a CDat stored in the given file, guessing the format from the file's extension.
void Save (std::ostream &ostm, EFormat eFormat=EFormatBinary) const
 Save a CDat to the given stream in the requested format.
void Save (const char *szFile) const
 Save a CDat to the given file, guessing the format from the file's extension.
void SaveDOT (std::ostream &ostm, float dCutoff=HUGE_VAL, const CGenome *pGenome=NULL, bool fUnlabeled=false, bool fHashes=true, const std::vector< float > *pvecdColors=NULL, const std::vector< float > *pvecdBorders=NULL) const
 Generate a DOT-formatted graph from the CDat, suitable for processing with AT&T's Graphviz software.
void SaveGDF (std::ostream &ostm, float dCutoff=HUGE_VAL) const
 Generate a GDF-formatted graph from the CDat, suitable for processing with the GUESS graph exploration system.
void SaveNET (std::ostream &ostm, float dCutoff=HUGE_VAL) const
 Generate a NET-formatted graph from the CDat.
void SaveMATISSE (std::ostream &ostm, float dCutoff=HUGE_VAL, const CGenome *pGenome=NULL) const
 Generate a MATISSE-formatted graph from the CDat, suitable for processing with the MATISSE software.
void Invert ()
 Replace each finite value in the CDat with one minus that value.
void Rank ()
 Replace each finite value in the CDat with its integer rank.
bool FilterGenes (const char *szGenes, EFilter eFilter, size_t iLimit=-1)
 Remove edges from the CDat based on the given gene file and filter type.
void FilterGenes (const CGenes &Genes, EFilter eFilter, size_t iLimit=-1, float dEdgeAggressiveness=0.5, bool fAbsolute=false, const std::vector< float > *pvecdWeights=NULL)
 Remove edges from the CDat based on the given gene set and filter type.
void NormalizeQuantiles (size_t iQuantiles)
float * GetRowSeek (const string &strGene)
float * GetRowSeek (const size_t &i)
size_t GetGeneIndex (const string &strGene) const
void AveStd (double &a, double &b, size_t &c)
void Clear (float dValue)
bool AddGene (const std::string &strGene)
bool AddGenes (const std::vector< std::string > &vecstrGenes)
void Normalize (ENormalize eNormalize)
 Normalize each finite value in the CDat by a specific function.
size_t GetGene (const std::string &strGene) const
 Return the index of the given gene name, or -1 if it is not included in the CDat.
float * GetFullRow (const size_t &iY)
float & Get (size_t iY, size_t iX) const
 Return the value at the requested CDat position.
size_t GetGenes () const
 Returns the number of elements (genes) in the CDat.
const CDistanceMatrixGet () const
 Returns the symmetric matrix containing the CDat's values.
CDistanceMatrixGet ()
 Returns the symmetric matrix containing the CDat's values.
bool Set (size_t iY, size_t iX, float dValue)
 Set the value at the requested CDat position.
std::string GetGene (size_t iGene) const
 Returns the gene name at the given CDat position.
const std::vector< std::string > & GetGeneNames () const
 Returns the vector of gene names associated with this CDat.
void Set (size_t iY, const float *adValues)
 Set an entire row of CDat values efficiently.
const float * Get (size_t iY) const
 Get an entire row of CDat values efficiently.
float * Get (size_t iY)
 Get an entire row of CDat values efficiently.
void SetGene (size_t iGene, const std::string &strGene)
 Set the gene name at the given index.
void Randomize ()
 Randomizes the CDat's values by iterated swapping.

Detailed Description

Stores a continuously valued half matrix paired with a list of names for matrix elements.

Conceptually, a CDat stores a list of weighted pairs; this is equivalent to a weighted undirected graph with labeled nodes, or a symmetric matrix with labels for each matrix element. CDat entries are stored as continuous values, although they can be discretized in various ways. CDats can be constructed in several ways, read from disk, persisted to disk in multiple file formats, or calculated from existing gene sets, microarray data, or gold standards. In practice, a CDat is simply a continuously valued symmetric matrix (in which zero or more values may be missing) paired with a list of element names (assumed to be genes), but this data structure is sufficiently flexible to represent nearly any biological dataset.

CDats can be loaded (by Open) and/or stored (by Save) from/to disk in the following formats:

See also:
CDataPair | CHalfMatrix

Definition at line 75 of file dat.h.


Member Enumeration Documentation

Ways in which nodes/edges can be removed to filter a CDat.

See also:
FilterGenes
Enumerator:
EFilterInclude 

Remove any edge including a node outside the given set.

EFilterTerm 

Remove any positive edge including a node outside the given set.

EFilterExclude 

Remove any edge including a node in the given set.

EFilterPixie 

Perform a bioPIXIE query using the given set and remove any edge not in the resulting subgraph.

EFilterEdge 

Remove any edge not including a node in the given set.

EFilterHefalmp 

Perform a HEFalMp query using the given set and remove any edge not in the resulting subgraph.

EFilterIncludePos 

Remove any positive edge including a node outside the given set.

EFilterExEdge 

Remove edges which both gene is in the given set.

Definition at line 84 of file dat.h.

Ways in which a CDat can be persisted to/from disk.

See also:
Open | Save
Enumerator:
EFormatBinary 

Binary format listing null-terminated element name strings followed by floating point values.

EFormatText 

Text format listing element name pairs followed by numerical value strings.

EFormatPCL 

PCL file from which pairwise scores are calculated using some similarity measure.

EFormatSparse 

Binary format listing null-terminated element name strings followed by index/value pairs.

EFormatQdab 

Binary format listing null-terminated element name strings followed by bits representing the quantized bins.

Definition at line 135 of file dat.h.

Ways in which a CDat can have its edge values normalized.

See also:
Normalize
Enumerator:
ENormalizeMinMax 

Linearly transform the minimum score to 0 and the maximum to 1.

ENormalizeMinMaxNPone 

Linearly transform the minimum score to -1 and the maximum to 1.

ENormalizeZScore 

Z-score all edges (subtract mean, divide by standard deviation).

ENormalizeSigmoid 

Sigmoid transform scores to the range [0, 1].

Definition at line 172 of file dat.h.


Member Function Documentation

bool Sleipnir::CDat::FilterGenes ( const char *  szGenes,
EFilter  eFilter,
size_t  iLimit = -1 
)

Remove edges from the CDat based on the given gene file and filter type.

Parameters:
szGenesFile from which gene names are loaded, one per line.
eFilterWay in which to use the given genes to remove edges.
iLimitFor EFilterPixie and EFilterHefalmp, the maximum number of genes to retain.
Returns:
True if the filter was executed successfully.

Remove edges and nodes (by removing all incident edges) from the CDat based on one of several algorithms. For details, see EFilter.

Remarks:
EFilterTerm and to some degree EFilterEdge don't make a lot of sense for CDats that do not represent gold standards.

Definition at line 1450 of file dat.cpp.

References Sleipnir::CGenes::Open().

void Sleipnir::CDat::FilterGenes ( const CGenes Genes,
EFilter  eFilter,
size_t  iLimit = -1,
float  dEdgeAggressiveness = 0.5,
bool  fAbsolute = false,
const std::vector< float > *  pvecdWeights = NULL 
)

Remove edges from the CDat based on the given gene set and filter type.

Parameters:
GenesGene set used to filter the CDat.
eFilterWay in which to use the given genes to remove edges.
iLimitFor EFilterPixie and EFilterHefalmp, the maximum number of genes to retain.
dEdgeAggressivenessFor EFilterPixie and EFilterHefalmp, higher values result in more aggressive edge trimming. NaN completely skips edge trimming.

Remove edges and nodes (by removing all incident edges) from the CDat based on one of several algorithms. For details, see EFilter.

Remarks:
EFilterTerm and to some degree EFilterEdge don't make a lot of sense for CDats that do not represent gold standards.

Definition at line 1487 of file dat.cpp.

References EFilterEdge, EFilterExclude, EFilterExEdge, EFilterHefalmp, EFilterInclude, EFilterIncludePos, EFilterPixie, EFilterTerm, Get(), GetGene(), Sleipnir::CGenes::GetGene(), GetGenes(), Sleipnir::CGenes::GetGenes(), Sleipnir::CGene::GetName(), Sleipnir::CMeta::GetNaN(), and Set().

float& Sleipnir::CDat::Get ( size_t  iY,
size_t  iX 
) const [inline]

Return the value at the requested CDat position.

Parameters:
iYCDat row.
iXCDat column.
Returns:
Value at the requested CDat position.
Remarks:
For efficiency, no bounds checking is performed. The given row and column must be smaller than GetGenes. As a symmetric matrix, the value at position XY will always equal the value at position YX.
See also:
Set

Reimplemented from Sleipnir::CDatImpl.

Definition at line 347 of file dat.h.

References Get().

Referenced by Sleipnir::CPCL::Distance(), Sleipnir::CDatFilter::Get(), Sleipnir::CPCL::Impute(), Sleipnir::CDataFilter::IsExample(), Open(), Sleipnir::CPCL::Open(), and Sleipnir::CStatistics::WilcoxonRankSum().

const CDistanceMatrix& Sleipnir::CDat::Get ( ) const [inline]

Returns the symmetric matrix containing the CDat's values.

Returns:
Symmetric matrix containing the CDat's values.

Definition at line 373 of file dat.h.

Referenced by FilterGenes(), Get(), Invert(), Open(), Sleipnir::CDataPair::Quantize(), Randomize(), Rank(), Sleipnir::CDataPair::Save(), SaveDOT(), SaveGDF(), SaveMATISSE(), and SaveNET().

Returns the symmetric matrix containing the CDat's values.

Returns:
Symmetric matrix containing the CDat's values.

Definition at line 384 of file dat.h.

const float* Sleipnir::CDat::Get ( size_t  iY) const [inline]

Get an entire row of CDat values efficiently.

Parameters:
iYCDat row.
Returns:
Retrieved values.
Remarks:
For efficiency, no bounds checking is performed. The given row must be smaller than GetGenes and the returned array will have length exactly (size - iY - 1).
See also:
Set

Definition at line 484 of file dat.h.

References Sleipnir::CHalfMatrix< tType >::Get().

float* Sleipnir::CDat::Get ( size_t  iY) [inline]

Get an entire row of CDat values efficiently.

Parameters:
iYCDat row.
Returns:
Retrieved values.
Remarks:
For efficiency, no bounds checking is performed. The given row must be smaller than GetGenes and the returned array will have length exactly (size - iY - 1).
See also:
Set

Definition at line 505 of file dat.h.

References Sleipnir::CHalfMatrix< tType >::Get().

size_t Sleipnir::CDat::GetGene ( const std::string &  strGene) const [inline]

Return the index of the given gene name, or -1 if it is not included in the CDat.

Parameters:
strGeneGene name to retrieve.
Returns:
Index of the requested gene name, or -1 if it is not in the CDat.
See also:
GetGeneNames

Reimplemented from Sleipnir::CDatImpl.

Definition at line 318 of file dat.h.

Referenced by Sleipnir::CDataFilter::Attach(), FilterGenes(), Sleipnir::CDatFilter::GetGene(), GetGene(), Open(), Sleipnir::CDataset::Open(), Sleipnir::CDatasetCompact::Open(), SaveDOT(), SaveGDF(), SaveMATISSE(), SaveNET(), and Sleipnir::CStatistics::WilcoxonRankSum().

std::string Sleipnir::CDat::GetGene ( size_t  iGene) const [inline]

Returns the gene name at the given CDat position.

Parameters:
iGeneIndex of gene name to return.
Returns:
Gene name at the requested index.
Remarks:
For efficiency, no bounds checking is performed. The given index must be smaller than GetGenes.

Reimplemented from Sleipnir::CDatImpl.

Definition at line 428 of file dat.h.

References GetGene().

const std::vector<std::string>& Sleipnir::CDat::GetGeneNames ( ) const [inline]

Returns the vector of gene names associated with this CDat.

Returns:
Vector of this CDat's gene names.
Remarks:
Returned vector size will be identical to GetGenes.

Reimplemented from Sleipnir::CDatImpl.

Definition at line 442 of file dat.h.

Referenced by Open(), and Sleipnir::CPCL::Open().

size_t Sleipnir::CDat::GetGenes ( ) const [inline]

Returns the number of elements (genes) in the CDat.

Returns:
Number of elements (genes) in the CDat.
Remarks:
Since a symmetric matrix must be square, the number of rows equals the number of columns and is thus referred to as the number of elements (genes).

Reimplemented from Sleipnir::CDatImpl.

Definition at line 362 of file dat.h.

Referenced by Sleipnir::CPCL::Distance(), FilterGenes(), Sleipnir::CPCL::Impute(), Invert(), Open(), Sleipnir::CPCL::Open(), Sleipnir::CDataset::Open(), Sleipnir::CDatasetCompact::Open(), Randomize(), Rank(), Sleipnir::CDataPair::Save(), SaveDOT(), SaveGDF(), SaveMATISSE(), SaveNET(), and Sleipnir::CStatistics::WilcoxonRankSum().

Replace each finite value in the CDat with one minus that value.

Remarks:
Most useful in combination with CDat::Normalize ( true ).

Definition at line 1418 of file dat.cpp.

References Get(), GetGenes(), Sleipnir::CMeta::IsNaN(), and Set().

void Sleipnir::CDat::Normalize ( ENormalize  eNormalize) [inline]

Normalize each finite value in the CDat by a specific function.

Parameters:
eNormalizeMethod by which scores are normalized.
Remarks:
Values are left unchanged if ( dMax == dMin ) or ( dStd == 0 ).
See also:
ENormalize | Invert

Definition at line 279 of file dat.h.

References ENormalizeMinMax, ENormalizeMinMaxNPone, and ENormalizeZScore.

Referenced by Sleipnir::CPCL::Distance(), and Sleipnir::CDatasetCompact::Open().

bool Sleipnir::CDat::Open ( const char *  szFile,
bool  fMemmap = false,
size_t  iSkip = 2,
bool  fZScore = false,
bool  fDuplicates = false,
bool  fSeek = false 
)

Open a CDat stored in the given file, guessing the format from the file's extension.

Parameters:
szFileFilename from which CDat is loaded.
fMemmapIf true, memory map file rather than allocating memory and copying its contents.
iSkipIf the given file is a PCL, the number of columns to skip between the ID and experiments.
fZScoreIf true and the given file is a PCL, z-score similarity measures after pairwise calculation.
fDuplicatesIf true, allow duplicates (DAT format only), ignoring all but the last value for each gene pair.
Returns:
True if CDat was successfully opened.

Opens a CDat from the given file, guessing the file type from its extension: DAT, DAB, DAS, or PCL.

Remarks:
A memory mapped CDat cannot be modified; doing so will generally cause a crash. fMemmap is ignored if the given file is not a DAB, and iSkip and fZScore are ignored if the given file is not a PCL. PCLs are always converted to pairwise scores using CMeasurePearNorm. If the extension is not recognized, DAB format is assumed.
See also:
Save | CPCL

Definition at line 1220 of file dat.cpp.

References EFormatBinary, EFormatPCL, EFormatText, and Sleipnir::CMeta::MapRead().

Referenced by Sleipnir::CPCL::Distance(), Sleipnir::CPCL::Impute(), Sleipnir::CDataPair::Open(), Open(), and Sleipnir::CPCL::Open().

bool Sleipnir::CDat::Open ( std::istream &  istm,
EFormat  eFormat = EFormatBinary,
float  dDefault = HUGE_VAL,
bool  fDuplicates = false,
size_t  iSkip = 2,
bool  fZScore = false,
bool  fSeek = false 
)

Open a CDat stored in the given stream with the given format, processing missing values and duplicates as specified.

Parameters:
istmStream from which CDat is loaded.
eFormatFormat in which the stream should be parsed.
dDefaultDefault value inserted for missing pairs (DAT format only).
fDuplicatesIf true, allow duplicates (DAT format only), ignoring all but the last value for each gene pair.
iSkipIf the given stream contains a PCL, the number of columns to skip between the ID and experiments.
fZScoreIf true and the given stream contains a PCL, z-score similarity measures after pairwise calculation.
fSeekIf true, read by seeking in the file, particularly useful if reading a few values, since there is no need to read the entire file. (for binary format only)
Returns:
True if CDat was successfully opened.

Opens a CDat from the given stream with the given format.

Remarks:
dDefault and fDuplicates are ignored for non-DAT formats; iSkip and fZScore are ignored for non-PCL formats. Specifying the format incorrectly will generally cause Bad Things.
See also:
Save | CPCL

Definition at line 492 of file dat.cpp.

References EFormatPCL, EFormatQdab, EFormatSparse, and EFormatText.

bool Sleipnir::CDat::Open ( const CSlim Slim)

Construct a CDat from the given ontology slim.

Parameters:
SlimSet of ontology terms from which to generate a CDat.
Returns:
True if CDat was generated successfully.

Generates a CDat over all genes within the given slim. Gene pairs coannotated to at least one slim term are given a value of 1; all other gene pairs are given a value of 0. This is useful for rapidly generating gold standards from functional catalogs.

See also:
IOntology

Reimplemented in Sleipnir::CDataPair.

Definition at line 129 of file dat.cpp.

References Get(), Sleipnir::CSlim::GetGenes(), Sleipnir::CSlim::GetSlim(), Sleipnir::CSlim::GetSlims(), Sleipnir::CMeta::IsNaN(), Open(), and Set().

bool Sleipnir::CDat::Open ( const CSlim SlimPositives,
const CSlim SlimNonnegatives 
)

Construct a CDat from the given ontology slims.

Parameters:
SlimPositivesSet of ontology terms from which to generate related gene pairs.
SlimNonnegativesSet of ontology terms from which to generate agnostic gene pairs (neither related nor unrelated).
Returns:
True if CDat was generated successfully.

Generates a CDat over all genes within the given slims. Gene pairs coannotated to at least one positive slim term are given a value of 1, other gene pairs coannotated to at least one nonnegative slim term are given no value (NaN), and all other gene pairs are given a value of 0. This is useful for rapidly generating gold standards from functional catalogs.

See also:
IOntology

Definition at line 176 of file dat.cpp.

References Get(), Sleipnir::CSlim::GetGenes(), Sleipnir::CSlim::GetSlim(), Sleipnir::CSlim::GetSlims(), Sleipnir::CMeta::IsNaN(), Open(), and Set().

bool Sleipnir::CDat::Open ( const std::vector< std::string > &  vecstrGenes,
bool  fClear = true,
const char *  szFile = NULL 
)

Construct a new CDat with the given gene names.

Parameters:
vecstrGenesGene names and size to associate with the CDat.
fClearIf true, set each value of the new CDat to missing (NaN).
szFileIf non-null, use the given file as memory-mapped backing for the new CDat.
Returns:
True if CDat was generated successfully.

Generates a CDat over the given genes (and of the same size), optionally initializing it to contain no values or backing it with a memory-mapped file rather than physical memory.

Definition at line 1095 of file dat.cpp.

References GetGene(), GetGenes(), Sleipnir::CMeta::GetNaN(), Sleipnir::CHalfMatrix< float >::GetSpace(), Sleipnir::CHalfMatrix< tType >::Initialize(), Sleipnir::CMeta::MapWrite(), and Set().

bool Sleipnir::CDat::Open ( const std::vector< std::string > &  vecstrGenes,
const CDistanceMatrix MatScores 
)

Construct a new CDat with the given gene names and values.

Parameters:
vecstrGenesGene names to associate with the CDat.
MatScoresValues to associate with the CDat.
Returns:
True if CDat was generated successfully.

Generates a CDat over the given genes (and of the same size) and associate it with the given existing pairwise values.

Remarks:
vecstrGenes must be the same size as MatScores. MatScores will not be copied; the new CDat will thus not consume a substantial amount of memory, but MatScores must not be disposed of before the current CDat.

Reimplemented in Sleipnir::CDataPair.

Definition at line 1157 of file dat.cpp.

References Sleipnir::CHalfMatrix< tType >::GetSize(), and Sleipnir::CHalfMatrix< tType >::Initialize().

bool Sleipnir::CDat::Open ( const std::vector< CGenes * > &  vecpPositives,
const std::vector< CGenes * > &  vecpNonnegatives,
float  dPValue,
const CGenome Genome,
bool  fIncident = false 
)

Construct a CDat from the given gene sets.

Parameters:
vecpPositivesSet of gene sets from which to generate related gene pairs.
vecpNonnegativesSet of ontology terms from which to generate agnostic gene pairs (neither related nor unrelated), possibly empty.
dPValueHypergeometric p-value of overlap below which nonnegative gene set pairs are considered agnostic rather than negative.
GenomeGenome containing all genes of interest.
fIncidentIf true, only allow negative pairs incident to at least one agnostic gene set. Otherwise, negative gene pairs can include any non-co-annotated genes.
Returns:
True if CDat was generated successfully.

Generates a CDat over all genes in the given genome in which:

  • Any gene pair coannotated to at least one positive gene set is assigned value 1.
  • Any other gene pair coannotated to at least one nonnegative gene set is assigned no value (NaN).
  • Any other gene pair annotated to nonnegative gene sets with hypergeometric p-value of overlap less than the given cutoff is assigned no value (NaN).
  • Any other gene pair is assigned value 0. This is useful for rapidly generating gold standards from gene sets of interest: any gene pair known to participate in some function of interest will be marked as related, any gene pair known to participate in two unrelated functions will be marked as unrelated, and other gene pairs will be left unmarked.

Definition at line 367 of file dat.cpp.

References Get(), GetGene(), Sleipnir::CGenes::GetGene(), Sleipnir::CGenome::GetGeneNames(), GetGenes(), Sleipnir::CGenes::GetGenes(), Sleipnir::CGene::GetName(), Sleipnir::CMeta::GetNaN(), Sleipnir::CStatistics::HypergeometricCDF(), Sleipnir::CGenes::IsGene(), Sleipnir::CMeta::IsNaN(), Open(), and Set().

bool Sleipnir::CDat::Open ( const CDat DatKnown,
const std::vector< CGenes * > &  vecpOther,
const CGenome Genome,
bool  fKnownNegatives,
bool  fIncident = false 
)

Construct a CDat from the given known gene relationships and gene sets.

Parameters:
DatKnownKnown pairwise scores, either positive or negative as indicated.
vecpOtherGene sets, either positive or nonnegative as indicated (possibly empty).
GenomeGenome containing all genes of interest.
fKnownNegativesIf true, DatKnown contains known negative gene pairs (0 scores); if false, it contains known related gene pairs (1 scores). In the former case, positives are generated from pairs coannotated to the given gene sets; in the latter, negatives are generated from pairs not coannotated to the given gene sets.
fIncidentIf true, only allow negative pairs incident to at least one agnostic gene set. Otherwise, negative gene pairs can include any non-co-annotated genes.
Returns:
True if CDat was generated successfully.

Constructs a CDat by copying either the positive (1) or negative (0) values from DatKnown and calculating negatives or positives from vecpOther. In either case, values are calculated for all genes in the given genome.

  • If fKnownNegatives is true, negative (0) gene pairs are first copied from DatKnown. Then any other gene pair coannotated to at least one of the given gene sets is given a positive (1) score. Remaining gene pairs are given no value (NaN).
  • If fKnownNegatives is false, positive (1) gene pairs are first copied from DatKnown. Then any other gene pair coannotated to at least one of the given gene sets is given a missing (NaN) value. Remaining gene pairs are given a negative (0) score.

Definition at line 253 of file dat.cpp.

References Get(), GetGene(), Sleipnir::CGenome::GetGeneNames(), GetGenes(), Sleipnir::CMeta::GetNaN(), Sleipnir::CMeta::IsNaN(), Open(), and Set().

bool Sleipnir::CDat::Open ( const CPCL PCL,
const IMeasure pMeasure,
bool  fMeasureMemory 
)

Construct a new CDat backed by the given PCL and similarity measure.

Parameters:
PCLPCL from which CDat genes and pairwise values are drawn.
pMeasureSimilarity measure used to calculate pairwise scores between genes.
fMeasureMemoryIf true, the CDat is responsible for freeing the given similarity measure.
Returns:
True if CDat was generated successfully.

This opens a new CDat without precalculated similarity scores; the CDat retains a reference to the given PCL and similarity measure and calculates pairwise scores as needed. This can greatly reduce the amount of memory required by the CDat, but it can increase runtime when specific pairwise values are requested repeatedly.

Remarks:
fMeasureMemory should be false if a static or stack-based similarity measure object is used; it can be set to true for cloned/new allocated similarity measure objects that should be cleaned up with the CDat. The given PCL is not copied and should thus not be destroyed before the current CDat.

Definition at line 562 of file dat.cpp.

bool Sleipnir::CDat::Open ( const CDat Dat)

Construct a copy of the given CDat.

Parameters:
DatData to be copied.
Returns:
True if the copy was successful.

Reimplemented in Sleipnir::CDataPair.

Definition at line 321 of file dat.cpp.

References Get(), GetGeneNames(), GetGenes(), and Open().

bool Sleipnir::CDat::OpenGenes ( std::istream &  istm,
bool  fBinary,
bool  fPCL = false 
)

Open the genes from a CDat stored in the stream with the given format.

Parameters:
istmStream from which genes are loaded.
fBinaryIf true, assume DAB format.
fPCLIf true, assume PCL format.
Returns:
True if genes were successfully loaded.

Like Open, OpenGenes will load a CDat from the stream in the requested format, but it will load only the CDat's size and gene list from the stream. This is useful for rapidly obtaining gene lists and counts from large CDats without incurring the memory/time penalty of loading or memory mapping the whole file.

Remarks:
Attempting to access a CDat's data after opening only its genes is a poor idea.
See also:
CPCL

Reimplemented from Sleipnir::CDatImpl.

Definition at line 899 of file dat.cpp.

Referenced by Sleipnir::CDatasetCompact::Open(), and OpenGenes().

bool Sleipnir::CDat::OpenGenes ( const char *  szFile,
size_t  iSkip = 2 
)

Open the genes from a CDat stored in the given file, guessing the format from the file's extension.

Parameters:
szFileFilename from which CDat genes are loaded.
iSkipIf the given file is a PCL, the number of columns to skip between the ID and experiments.
Returns:
True if genes were successfully loaded.

Like Open, OpenGenes will guess the appropriate file format from the given filename's extension, but it will load only a CDat's size and gene list from the file. This is useful for rapidly obtaining gene lists and counts from large CDats without incurring the memory/time penalty of loading or memory mapping the whole file.

Remarks:
Attempting to access a CDat's data after opening only its genes is a poor idea. If the extension is not recognized, DAB format is assumed.

Definition at line 857 of file dat.cpp.

References EFormatPCL, EFormatText, and OpenGenes().

Replace each finite value in the CDat with its integer rank.

Remarks:
This can be costly in memory/processing time for large CDats.

Definition at line 1909 of file dat.cpp.

References Get(), GetGenes(), Sleipnir::CMeta::IsNaN(), and Set().

void Sleipnir::CDat::Save ( std::ostream &  ostm,
EFormat  eFormat = EFormatBinary 
) const

Save a CDat to the given stream in the requested format.

Parameters:
ostmStream into which CDat is saved.
eFormatFormat in which the CDat should be saved.
Remarks:
The binary flag of the stream must match the intended format (text for DAT, binary for DAB/DAS). CDats cannot be saved to PCLs, only loaded from them.
See also:
Open

Definition at line 1010 of file dat.cpp.

References EFormatSparse, and EFormatText.

Referenced by Save().

void Sleipnir::CDat::Save ( const char *  szFile) const

Save a CDat to the given file, guessing the format from the file's extension.

Parameters:
szFileFilename into which CDat is saved.

Save a CDat to the given file, guessing the format (DAT, DAB, or DAS) from the extension. If null, the CDat will be saved as a DAT to standard output.

Remarks:
CDats cannot be saved to PCLs, only loaded from them. If the extension is not recognized, DAB format is assumed.
See also:
Open

Reimplemented in Sleipnir::CDataPair.

Definition at line 974 of file dat.cpp.

References EFormatPCL, EFormatText, and Save().

void Sleipnir::CDat::SaveDOT ( std::ostream &  ostm,
float  dCutoff = HUGE_VAL,
const CGenome pGenome = NULL,
bool  fUnlabeled = false,
bool  fHashes = true,
const std::vector< float > *  pvecdColors = NULL,
const std::vector< float > *  pvecdBorders = NULL 
) const

Generate a DOT-formatted graph from the CDat, suitable for processing with AT&T's Graphviz software.

Parameters:
ostmStream into which DOT is saved.
dCutoffIf finite, edge weights below this cutoff are not included in the DOT.
pGenomeIf non-null, name synonyms from this genome are used to label gene nodes.
fUnlabeledIf true, do not label gene nodes and use only minimal unique identifiers when generating the DOT.
fHashesIf true, include # marks in hexadecimal color strings within the DOT. This is moronic, but some DOT parsers (*coughboostcough*) will randomly crash if # characters are included in the DOT.
pvecdColorsIf non-null, contains weights between 0 and 1 interpolating node colors between cyan and yellow.
pvecdBordersIf non-null, contains border widths in pixels for each node.

SaveDOT is one of the most useful methods for visualizing the weighted graph implicit in a CDat, particularly in combination with FilterGenes and EFilterPixie/EFilterHefalmp. Calling SaveDOT will generate a DOT graph file containing (optionally) each node and edge in the CDat, with edges colored by weight (scaled from green for the minimum weight to red at the maximum). Nodes can optionally be colored or given varying border widths based on external information, or labeled with different gene names (synonyms) than used internally by the CDat. DOT files can be visualized by a variety of software, notably the Graphviz package from AT&T.

Remarks:
If given, pvecdColors and pvecdBorders must be of the same size as the CDat.
See also:
SaveGDF | SaveNET | SaveMATISSE

Definition at line 1700 of file dat.cpp.

References Sleipnir::CColor::c_Cyan, Sleipnir::CColor::c_White, Sleipnir::CColor::c_Yellow, Sleipnir::CMeta::Filename(), Get(), GetGene(), Sleipnir::CGenome::GetGene(), GetGenes(), Sleipnir::CGene::GetName(), Sleipnir::CGene::GetSynonym(), Sleipnir::CGene::GetSynonyms(), Sleipnir::CColor::Interpolate(), Sleipnir::CMeta::IsNaN(), and Sleipnir::CColor::ToRGB().

void Sleipnir::CDat::SaveGDF ( std::ostream &  ostm,
float  dCutoff = HUGE_VAL 
) const

Generate a GDF-formatted graph from the CDat, suitable for processing with the GUESS graph exploration system.

Parameters:
ostmStream into which GDF is saved.
dCutoffIf finite, edge weights below this cutoff are not included in the GDF.

Calling SaveGDF will generate a GDF graph file containing (optionally) each node and edge in the CDat. Nodes are labeled minimally, and edges are not given any inherent color information (although this can be added later). GDFs are visualizable using the GUESS graph exploration software, although it doesn't deal well with large or highly connected graphs.

Remarks:
Edge weights are scaled by 10x in order to fall roughly within GDF's expected range.
See also:
SaveDOT | SaveNET | SaveMATISSE

Definition at line 1797 of file dat.cpp.

References Sleipnir::CMeta::Filename(), Get(), GetGene(), GetGenes(), and Sleipnir::CMeta::IsNaN().

void Sleipnir::CDat::SaveMATISSE ( std::ostream &  ostm,
float  dCutoff = HUGE_VAL,
const CGenome pGenome = NULL 
) const

Generate a MATISSE-formatted graph from the CDat, suitable for processing with the MATISSE software.

Parameters:
ostmStream into which MATISSE file is saved.
dCutoffIf finite, edge weights below this cutoff are not included in the MATISSE file.
pGenomeIf non-null, name synonyms from this genome are used to label gene nodes.

Calling SaveMATISSE will generate a MATISSE graph file containing (optionally) each node and edge in the CDat. Nodes are labeled either minimally or using the given genome's names/sysnonyms, and edges are not given any inherent color information (although this can be added later). MATISSE files are visualizable using the MATISSE software, although it doesn't deal well with large or highly connected graphs.

See also:
SaveDOT | SaveGDF | SaveNET

Definition at line 1873 of file dat.cpp.

References Get(), GetGene(), Sleipnir::CGenome::GetGene(), GetGenes(), Sleipnir::CGene::GetGloss(), Sleipnir::CGene::GetSynonym(), Sleipnir::CGene::GetSynonyms(), and Sleipnir::CMeta::IsNaN().

void Sleipnir::CDat::SaveNET ( std::ostream &  ostm,
float  dCutoff = HUGE_VAL 
) const

Generate a NET-formatted graph from the CDat.

Parameters:
ostmStream into which NET is saved.
dCutoffIf finite, edge weights below this cutoff are not included in the NET.

Calling SaveNET will generate a NET graph file containing (optionally) each node and edge in the CDat. Nodes are labeled minimally, and edges are not given any inherent color information (although this can be added later). I honestly don't remember at all what software uses the NET format, and "net"'s impossible to Google these days (thanks, Microsoft...)

See also:
SaveDOT | SaveGDF | SaveMATISSE

Definition at line 1836 of file dat.cpp.

References Get(), GetGene(), GetGenes(), and Sleipnir::CMeta::IsNaN().

bool Sleipnir::CDat::Set ( size_t  iY,
size_t  iX,
float  dValue 
) [inline]

Set the value at the requested CDat position.

Parameters:
iYCDat row.
iXCDat column.
dValueValue to store.
Returns:
True if the value was stored successfully.
Remarks:
For efficiency, no bounds checking is performed. The given row and column must be smaller than GetGenes.
See also:
Get

Reimplemented from Sleipnir::CDatImpl.

Definition at line 411 of file dat.h.

Referenced by Sleipnir::CPCL::Distance(), FilterGenes(), Sleipnir::CPCL::Impute(), Invert(), Open(), Randomize(), and Rank().

void Sleipnir::CDat::Set ( size_t  iY,
const float *  adValues 
) [inline]

Set an entire row of CDat values efficiently.

Parameters:
iYCDat row.
adValuesValues to store.
Remarks:
For efficiency, no bounds checking is performed. The given row must be smaller than GetGenes, and the given array must be non-null and have length exactly (size - iY - 1).
See also:
Get

Reimplemented from Sleipnir::CDatImpl.

Definition at line 463 of file dat.h.

References Sleipnir::CHalfMatrix< tType >::Set().

void Sleipnir::CDat::SetGene ( size_t  iGene,
const std::string &  strGene 
) [inline]

Set the gene name at the given index.

Parameters:
iGeneIndex of gene name to modify.
strGeneGene name to store at the requested index.
Remarks:
For efficiency, no bounds checking is performed. The given index must be smaller than GetGenes.
See also:
GetGene

Definition at line 525 of file dat.h.

References Sleipnir::CPCL::SetGene().


The documentation for this class was generated from the following files: