Sleipnir::CSeekDataset Class Reference

Representation of a microarray dataset that is used by Seek.

#include <seekdataset.h>

Public Types

enum  DistanceMeasure { CORRELATION = 0, Z_SCORE = CORRELATION + 1 }
 Distance measure (see main section for descriptions)

Public Member Functions

 CSeekDataset ()
 ~CSeekDataset ()
bool ReadDatasetAverageStdev (const string &)
 Read the *.sinfo file.
bool ReadGeneAverage (const string &)
 Read the gene average correlation file *.gavg.
bool ReadGeneVariance (const string &)
 Read the gene variance file *.gvar.
bool ReadGenePresence (const string &)
 Read the gene presence file *.gpres.
bool InitializeGeneMap ()
 Initialize the genome presence map.
bool InitializeQuery (const vector< utype > &)
 Initialize the query presence map.
bool InitializeQueryBlock (const vector< utype > &)
 Initialize a presence map for a block of queries.
bool DeleteQuery ()
 Delete the query.
bool DeleteQueryBlock ()
 Delete query block.
bool InitializeDataMatrix (utype **, const vector< float > &, const utype &, const utype &, const bool=true, const bool=false, const bool=false, const enum DistanceMeasure=Z_SCORE, const float cutoff=-1.0 *CMeta::GetNaN(), const bool=false, gsl_rng *rand=NULL)
 Initialize the gene-gene correlation matrix.
bool Copy (CSeekDataset *)
 Copy the contents of a given dataset.
utype ** GetDataMatrix ()
 Get the gene-gene correlation matrix.
unsigned char ** GetMatrix ()
 Get the gene-gene correlation matrix.
CSeekIntIntMap * GetGeneMap ()
 Get the genome presence map.
CSeekIntIntMap * GetDBMap ()
 Get the query-block presence map.
CSeekIntIntMap * GetQueryMap ()
 Get the query presence map.
const vector< utype > & GetQuery () const
 Get the query genes.
const vector< utype > & GetQueryIndex () const
 Get the query gene indices.
float GetGeneVariance (const utype &) const
 Get the gene expression variance vector.
float GetGeneAverage (const utype &) const
 Get the gene average correlation vector.
float GetDatasetAverage () const
 Get the mean of the global gene-gene Pearson distribution.
float GetDatasetStdev () const
 Get the standard deviation of the global gene-gene Pearson distribution.
utype GetNumGenes () const
 Get the genome size.
bool InitializeCVWeight (const utype &)
 Initialize the weight of the dataset.
bool SetCVWeight (const utype &, const float &)
 Set the score for a particular cross-validation.
float GetCVWeight (const utype &)
 Get the score for a particular cross-validation.
const vector< float > & GetCVWeight () const
 Get all the cross-validation scores.
float GetDatasetSumWeight ()
 Get the dataset weight.
void SetPlatform (CSeekPlatform &)
 Set the platform.
CSeekPlatform & GetPlatform () const
 Get the platform.

Detailed Description

Representation of a microarray dataset that is used by Seek.

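A CSeekDataset encapsulates the per-dataset information that Seek needs: the gene presence map, the per-gene average and variance, the mean and standard deviation of the dataset's global Pearson distribution, the query and query-block presence maps, the discretized gene-gene correlation matrix, the cross-validation weights, and the platform.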

The word correlation refers to the standardized z-scores of Pearson correlations, which are derived in a 2-step process. First, the Pearson correlation is Fisher-transformed:

\[f(x,y)=\frac{1}{2}\ln\frac{1+p(x,y)}{1-p(x,y)}\]

where $p(x,y)$ is the Pearson correlation and $f(x,y)$ is the Fisher-transformed score. Second, the transformed score is standardized:

\[z(x,y)=\frac{f(x,y) - \bar{f}}{\sigma_{f}}\]

where $z(x,y)$ is the z-score, $\bar{f}$ is the mean of $f$, and $\sigma_{f}$ is the standard deviation of $f$.

From here on, correlation always refers to the above z-score definition.
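The two-step definition above can be sketched in a few lines of C++. This is only an illustration, not Sleipnir code: the helper names FisherTransform and ZScores are hypothetical, and the sketch assumes that the mean and standard deviation are taken over the same set of Fisher-transformed values that is passed in (population standard deviation). Seek itself reads the dataset-wide mean and standard deviation from the *.sinfo file.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Fisher transform of a Pearson correlation p, valid for -1 < p < 1.
inline double FisherTransform(double p) {
    return 0.5 * std::log((1.0 + p) / (1.0 - p));
}

// Hypothetical helper: Fisher-transform each Pearson correlation, then
// standardize with the mean and (population) standard deviation of the
// transformed values. Returns all zeros if the values have no spread.
std::vector<double> ZScores(const std::vector<double>& pearson) {
    const std::size_t n = pearson.size();
    std::vector<double> z(n, 0.0);
    if (n == 0) return z;

    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        z[i] = FisherTransform(pearson[i]);
        sum += z[i];
    }
    const double mean = sum / static_cast<double>(n);

    double ss = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        ss += (z[i] - mean) * (z[i] - mean);
    }
    const double sd = std::sqrt(ss / static_cast<double>(n));

    for (std::size_t i = 0; i < n; ++i) {
        z[i] = (sd > 0.0) ? (z[i] - mean) / sd : 0.0;
    }
    return z;
}
```

With Pearson inputs that are symmetric around zero, the Fisher mean is zero and the z-scores are simply the transformed values divided by their spread.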

Member Enumeration Documentation

enum Sleipnir::CSeekDataset::DistanceMeasure

Distance measure (see main section for descriptions)

CORRELATION  Pearson correlations
Z_SCORE  Z-score of Pearson correlations

Member Function Documentation

bool Sleipnir::CSeekDataset::Copy ( CSeekDataset *  src)

Copy the contents of a given dataset.

src  A given dataset

bool Sleipnir::CSeekDataset::DeleteQuery ( )

Delete the query.

Resets all query-related data, such as dataset weight, query presence map, etc.

Definition at line 240 of file seekdataset.cpp.

bool Sleipnir::CSeekDataset::DeleteQueryBlock ( )

Delete the query block.

Resets all query-block related data.

Definition at line 255 of file seekdataset.cpp.

float Sleipnir::CSeekDataset::GetCVWeight ( const utype &  i)

Get the score for a particular cross-validation.

i  The index

const vector< float > & Sleipnir::CSeekDataset::GetCVWeight ( ) const

Get all the cross-validation scores.

A vector of cross-validation scores

utype ** Sleipnir::CSeekDataset::GetDataMatrix ( )

Get the gene-gene correlation matrix.

A two-dimensional array of type utype. Note that the correlation has been scaled to an integer range from 0 to 640. See CSeekDataset::InitializeDataMatrix.

float Sleipnir::CSeekDataset::GetDatasetAverage ( ) const

Get the mean of the global gene-gene Pearson distribution.

The mean Pearson for the dataset

float Sleipnir::CSeekDataset::GetDatasetStdev ( ) const

Get the standard deviation of the global gene-gene Pearson distribution.

The standard deviation of the Pearson distribution

float Sleipnir::CSeekDataset::GetDatasetSumWeight ( )

Get the dataset weight.

The dataset weight

CSeekIntIntMap * Sleipnir::CSeekDataset::GetDBMap ( )

Get the query-block presence map.

The query-block presence map

float Sleipnir::CSeekDataset::GetGeneAverage ( const utype &  i) const

Get the gene average correlation vector.

The average correlation vector

CSeekIntIntMap * Sleipnir::CSeekDataset::GetGeneMap ( )

Get the genome presence map.

The genome presence map

float Sleipnir::CSeekDataset::GetGeneVariance ( const utype &  i) const

Get the gene expression variance vector.

The variance vector

unsigned char ** Sleipnir::CSeekDataset::GetMatrix ( )

Get the gene-gene correlation matrix.

A two-dimensional array of type unsigned char**.

utype Sleipnir::CSeekDataset::GetNumGenes ( ) const

Get the genome size.

The genome size

CSeekPlatform & Sleipnir::CSeekDataset::GetPlatform ( ) const

Get the platform.

The platform of this dataset

const vector< utype > & Sleipnir::CSeekDataset::GetQuery ( ) const

Get the query genes.

A vector of queries

const vector< utype > & Sleipnir::CSeekDataset::GetQueryIndex ( ) const

Get the query gene indices.

A vector of query gene indices

CSeekIntIntMap * Sleipnir::CSeekDataset::GetQueryMap ( )

Get the query presence map.

The query presence map

bool Sleipnir::CSeekDataset::InitializeCVWeight ( const utype &  i)

Initialize the weight of the dataset.

i  The number of cross-validations

Initializes the total dataset weight, and the score of the individual cross-validation (CV) runs.

bool Sleipnir::CSeekDataset::InitializeDataMatrix ( utype **  ,
const vector< float > &  ,
const utype &  ,
const utype &  ,
const bool  = true,
const bool  = false,
const bool  = false,
const enum  DistanceMeasure = Z_SCORE,
const float  cutoff = -1.0*CMeta::GetNaN(),
const bool  = false,
gsl_rng *  rand = NULL 
)

Initialize the gene-gene correlation matrix.

rD  A two-dimensional array storing the discretized gene-gene correlations
quant  The discretization function
iRows  The number of rows for the correlation matrix
iColumns  The number of columns for the correlation matrix
bSubtractAvg  If true, subtract the dataset average from the correlation
bNormPlatform  If true, subtract the platform average from the correlation and divide by the platform standard deviation
logit  If true, apply the logit transform on correlations
dist_measure  Distance measure: z-score or correlations
cutoff  Apply a hard cutoff on correlations
bRandom  If true, shuffle the correlation vector
rand  The random generator for the shuffling operation above

The discretized correlation in the matrix rD is bounded between 0 and 255 (the limit of unsigned char). The parameter quant specifies how a correlation is discretized. For example, if quant has 5 bins:
 [0, 1, 2, 3, 4]
Then if a correlation is 2.5, the discretized value would be 2.
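
The binning example above can be sketched as follows. This is a minimal illustration and not the Sleipnir implementation: the function name Discretize is hypothetical, and the sketch assumes that quant is sorted in ascending order and that each entry is the lower edge of its bin, which is the reading that maps 2.5 to bin 2. The library's own quantization rule may treat bin edges differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: return the index of the last bin edge that is
// less than or equal to value. Values below the first edge fall into
// bin 0, values above the last edge fall into the last bin.
// Assumes quant is sorted in ascending order.
std::size_t Discretize(float value, const std::vector<float>& quant) {
    std::vector<float>::const_iterator it =
        std::upper_bound(quant.begin(), quant.end(), value);
    if (it == quant.begin()) {
        return 0;  // below the first edge (or quant is empty)
    }
    return static_cast<std::size_t>(it - quant.begin()) - 1;
}
```

Under these assumptions, a value that lies exactly on an edge is assigned to the bin that starts at that edge.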

bool Sleipnir::CSeekDataset::InitializeGeneMap ( )

Initialize the genome presence map.

Indicates which genes of the genome are present in the dataset.

bool Sleipnir::CSeekDataset::InitializeQuery ( const vector< utype > &  query)

Initialize the query presence map.

query  The query genes

Indicates which query genes are present in the dataset.

Definition at line 190 of file seekdataset.cpp.

bool Sleipnir::CSeekDataset::InitializeQueryBlock ( const vector< utype > &  queryBlock)

Initialize a presence map for a block of queries.

queryBlock  A vector of queries

Flattens all the queries into one vector that contains only the unique query genes, then constructs a presence map based on this vector.
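
The flatten-then-map idea described above can be sketched as follows. This is an illustration under stated assumptions, not the Sleipnir code: FlattenUnique and PresenceMap are hypothetical helpers, the utype alias is a stand-in for the library's own typedef, and the presence map is modeled as a plain vector of flags rather than a CSeekIntIntMap.

```cpp
#include <cstddef>
#include <set>
#include <vector>

typedef unsigned int utype;  // stand-in for Sleipnir's utype (assumption)

// Hypothetical helper: merge several queries into one vector that keeps
// each gene once, in order of first appearance.
std::vector<utype> FlattenUnique(
        const std::vector<std::vector<utype> >& queries) {
    std::set<utype> seen;
    std::vector<utype> flat;
    for (std::size_t q = 0; q < queries.size(); ++q) {
        for (std::size_t g = 0; g < queries[q].size(); ++g) {
            if (seen.insert(queries[q][g]).second) {
                flat.push_back(queries[q][g]);
            }
        }
    }
    return flat;
}

// Hypothetical helper: mark which gene indices in [0, numGenes) occur in
// the flattened vector. Gene indices outside that range are ignored.
std::vector<bool> PresenceMap(const std::vector<utype>& genes,
                              std::size_t numGenes) {
    std::vector<bool> present(numGenes, false);
    for (std::size_t i = 0; i < genes.size(); ++i) {
        if (genes[i] < numGenes) {
            present[genes[i]] = true;
        }
    }
    return present;
}
```

Building the map from the deduplicated vector means a gene shared by several queries in the block is recorded only once.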

Definition at line 154 of file seekdataset.cpp.

bool Sleipnir::CSeekDataset::ReadDatasetAverageStdev ( const string &  strFileName)

Read the *.sinfo file.

strFileName  The file name

The *.sinfo file contains the mean and the standard deviation of the global gene-gene Pearson distribution for this dataset.

bool Sleipnir::CSeekDataset::ReadGeneAverage ( const string &  strFileName)

Read the gene average correlation file *.gavg.

strFileName  The file name

The *.gavg file is an array that stores the average correlation of each gene.

Definition at line 119 of file seekdataset.cpp.

bool Sleipnir::CSeekDataset::ReadGenePresence ( const string &  strFileName)

Read the gene presence file *.gpres.

strFileName  The file name

The *.gpres file is a two-valued array that records the presence (1) or absence (0) of each gene.

Definition at line 123 of file seekdataset.cpp.

bool Sleipnir::CSeekDataset::ReadGeneVariance ( const string &  strFileName)

Read the gene variance file *.gvar.

strFileName  The file name

The *.gvar file is an array that stores the expression variance of each gene.

Definition at line 127 of file seekdataset.cpp.

bool Sleipnir::CSeekDataset::SetCVWeight ( const utype &  i,
const float &  f 
)

Set the score for a particular cross-validation.

i  The index
f  The validation score

void Sleipnir::CSeekDataset::SetPlatform ( CSeekPlatform &  cp)

Set the platform.

cp  The platform

