Sleipnir
Sleipnir::CSeekCentral Class Reference

A suite of search algorithms that are supported by Seek.

#include <seekcentral.h>

Public Types

enum  SearchMode {
  CV = 0, EQUAL = 1, USE_WEIGHT = 2, CV_CUSTOM = 3,
  ORDER_STATISTICS = 4, AVERAGE_Z = 5
}
 Search modes (see the Detailed Description section)

Public Member Functions

 CSeekCentral ()
 Constructor.
 ~CSeekCentral ()
 Destructor.
bool Initialize (const vector< CSeekDBSetting * > &vecDBSetting, const char *search_dset, const char *query, const char *output_dir, const utype buffer=20, const bool to_output_text=false, const bool bOutputWeightComponent=false, const bool bSimulateWeight=false, const enum CSeekDataset::DistanceMeasure dist_measure=CSeekDataset::Z_SCORE, const bool bSubtractAvg=true, const bool bNormPlatform=false, const bool bLogit=false, const float fCutOff=-9999, const float fPercentQueryRequired=0, const float fPercentGenomeRequired=0, const bool bSquareZ=false, const bool bRandom=false, const int iNumRandom=10, gsl_rng *rand=NULL, const bool useNibble=false, const int numThreads=8)
 Initialize function.
bool Initialize (const vector< CSeekDBSetting * > &vecDBSetting, const utype buffer=20, const bool to_output_text=false, const bool bOutputWeightComponent=false, const bool bSimulateWeight=false, const enum CSeekDataset::DistanceMeasure dist_measure=CSeekDataset::Z_SCORE, const bool bSubtractAvg=true, const bool bNormPlatform=false, const bool bLogit=false, const float fCutOff=-9999, const float fPercentQueryRequired=0, const float fPercentGenomeRequired=0, const bool bSquareZ=false, const bool bRandom=false, const int iNumRandom=10, gsl_rng *rand=NULL, const bool useNibble=false, const int numThreads=8)
 Initialize function.
bool Initialize (const string &output_dir, const string &query, const string &search_dset, CSeekCentral *src, const int iClient, const float query_min_required=0, const float genome_min_required=0, const enum CSeekDataset::DistanceMeasure=CSeekDataset::Z_SCORE, const bool bSubtractGeneAvg=true, const bool bNormPlatform=false)
 Initialize function.
bool CVSearch (gsl_rng *, const CSeekQuery::PartitionMode &, const utype &, const float &)
 Run Seek with the cross-validated dataset weighting.
bool CVCustomSearch (const vector< vector< string > > &, gsl_rng *, const CSeekQuery::PartitionMode &, const utype &, const float &)
 Run Seek with the custom dataset weighting.
bool EqualWeightSearch ()
 Run Seek with the equal dataset weighting.
bool WeightSearch (const vector< vector< float > > &)
 Run Seek with the user-given dataset weights.
bool VarianceWeightSearch ()
 Run Seek with the variance weighted search.
bool AverageWeightSearch ()
 Run Seek with the SPELL search.
bool OrderStatistics ()
 Run Seek with the order statistics dataset weighting algorithm.
const vector< vector< AResultFloat > > & GetAllResult () const
 Get the final gene-ranking for all the queries.
const vector< CSeekQuery > & GetAllQuery () const
 Get all the queries.
const vector< vector< float > > & GetAllWeight () const
 Get the dataset weight vector for all the queries.
utype GetGene (const string &strGene) const
 Get the gene-map ID for a given gene-name.
string GetGene (const utype &geneID) const
 Get the gene-name for a given gene-map ID.
bool Destruct ()
 Destruct this search instance.
int GetMaxGenomeCoverage ()
 Get the maximum genome coverage among the datasets in the compendium.

Detailed Description

A suite of search algorithms that are supported by Seek.

The Seek search algorithms perform the coexpression search of the user's query genes in a large compendium of microarray datasets. The output of the search algorithms is a ranking of genes based on their gene score, which is determined by the overall weighted coexpression to the query genes.

One of the first steps in a search is to weight the datasets so that informative datasets are prioritized. With the weights generated, the final gene-score is given by:

\[FS(g, Q)=\alpha\sum_{d \in D}{w_d \cdot s_d(g, Q)}\]

where $D$ is the set of datasets, $w_d$ is the weight of dataset $d$, $s_d(g, Q)$ is the score of gene $g$ to the query $Q$ in dataset $d$, and $\alpha$ is the normalization constant.
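
The weighted sum above can be illustrated with a minimal sketch. FinalScore is a hypothetical helper, not part of Sleipnir, and the choice of $\alpha = 1 / \sum_d w_d$ is an assumption made here so that the result is a weighted average; the source does not state how $\alpha$ is computed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Sleipnir): combines the per-dataset scores
// s_d(g, Q) of one gene into FS(g, Q) = alpha * sum_d w_d * s_d(g, Q).
// Assumption: alpha = 1 / sum_d w_d, which makes the result a weighted average.
float FinalScore(const std::vector<float> &weights,
                 const std::vector<float> &scores) {
    float weightedSum = 0.0f;
    float weightTotal = 0.0f;
    for (std::size_t d = 0; d < weights.size() && d < scores.size(); ++d) {
        weightedSum += weights[d] * scores[d];
        weightTotal += weights[d];
    }
    // If no dataset carries any weight, return 0 instead of dividing by zero.
    if (weightTotal == 0.0f) return 0.0f;
    return weightedSum / weightTotal;
}
```

A dataset with weight zero contributes nothing to the sum, which is how uninformative datasets drop out of the final score in this sketch.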

The dataset weighting algorithms currently supported in Seek are those enumerated by SearchMode.

CSeekCentral can handle multiple queries at a time, but the search parameters must remain the same for all queries.

Definition at line 81 of file seekcentral.h.


Member Enumeration Documentation

enum Sleipnir::CSeekCentral::SearchMode

Search modes (see the Detailed Description section)

Enumerator:
CV 

Cross-validated weighting

EQUAL 

Equal weighting

USE_WEIGHT 

User-supplied weights

CV_CUSTOM 

Cross-validated weighting, but instead of using the query genes to cross-validate, use the user-supplied gene-sets to validate each query partition

ORDER_STATISTICS 

MEM algorithm

AVERAGE_Z 

Average z-scores between query genes (SPELL algorithm)

Definition at line 88 of file seekcentral.h.


Member Function Documentation

bool Sleipnir::CSeekCentral::AverageWeightSearch ( )

Run Seek with the SPELL search.

Remarks:
Assumes that the CSeekCentral::Initialize() has been called.

Definition at line 1431 of file seekcentral.cpp.

References AVERAGE_Z.

bool Sleipnir::CSeekCentral::CVCustomSearch ( const vector< vector< string > > &  newGoldStd,
gsl_rng *  rnd,
const CSeekQuery::PartitionMode &  PART_M,
const utype &  FOLD,
const float &  RATE 
)

Run Seek with the custom dataset weighting.

Same as CVSearch(), except that the weighting is based not on the coexpression of the query genes but on the similarity of the query genes to a custom gold-standard gene-set.

Parameters:
newGoldStd: The gold-standard gene-set that is used for weighting datasets
rnd: The random number generator
PART_M: Query partition mode
FOLD: Number of partitions to generate from the query
RATE: The weighting parameter p
Remarks:
The random number generator is used for partitioning the query.
Assumes that the CSeekCentral::Initialize() has been called.

Definition at line 1413 of file seekcentral.cpp.

References CV_CUSTOM.

bool Sleipnir::CSeekCentral::CVSearch ( gsl_rng *  rnd,
const CSeekQuery::PartitionMode &  PART_M,
const utype &  FOLD,
const float &  RATE 
)

Run Seek with the cross-validated dataset weighting.

Parameters:
rnd: The random number generator
PART_M: Query partition mode
FOLD: Number of partitions to generate from the query
RATE: The weighting parameter p
Remarks:
The random number generator is used for partitioning the query.
Assumes that the CSeekCentral::Initialize() has been called.
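
As a rough illustration of how a random number generator can split a query into FOLD partitions, consider the sketch below. It is not Seek's implementation: PartitionQuery is a hypothetical helper, std::mt19937 is swapped in for gsl_rng to keep the example self-contained, and the shuffle-then-deal scheme is an assumption that does not correspond to any particular CSeekQuery::PartitionMode.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper (not part of Sleipnir): shuffles the query genes and
// deals them round-robin into `fold` partitions of near-equal size.
// Assumption: std::mt19937 stands in for the gsl_rng used by CVSearch().
std::vector<std::vector<std::string> > PartitionQuery(
        std::vector<std::string> genes, std::size_t fold, std::mt19937 &rng) {
    std::vector<std::vector<std::string> > parts;
    if (fold == 0 || genes.empty()) return parts;
    // Never create more partitions than there are query genes.
    fold = std::min(fold, genes.size());
    std::shuffle(genes.begin(), genes.end(), rng);
    parts.resize(fold);
    for (std::size_t i = 0; i < genes.size(); ++i) {
        parts[i % fold].push_back(genes[i]);
    }
    return parts;
}
```

Each query gene ends up in exactly one partition, so a cross-validation loop could hold out one partition at a time and score a dataset by how well the remaining genes recover it.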

Definition at line 1404 of file seekcentral.cpp.

References CV.

bool Sleipnir::CSeekCentral::Destruct ( )

Destruct this search instance.

Returns:
True if successful.

Definition at line 1463 of file seekcentral.cpp.

bool Sleipnir::CSeekCentral::EqualWeightSearch ( )

Run Seek with the equal dataset weighting.

Remarks:
Assumes that the CSeekCentral::Initialize() has been called.

Definition at line 1399 of file seekcentral.cpp.

References EQUAL.

const vector< CSeekQuery > & Sleipnir::CSeekCentral::GetAllQuery ( ) const

Get all the queries.

Returns:
A vector of queries.

Definition at line 1478 of file seekcentral.cpp.

const vector< vector< AResultFloat > > & Sleipnir::CSeekCentral::GetAllResult ( ) const

Get the final gene-ranking for all the queries.

Returns:
A two-dimensional array that stores the gene-rankings
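
To make the notion of a gene-ranking concrete, the sketch below orders genes by score for a single query. ScoredGene and RankGenes are hypothetical names introduced for illustration; ScoredGene is not the layout of AResultFloat, and ranking the highest score first is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one ranking entry; NOT the layout of AResultFloat.
struct ScoredGene {
    std::size_t geneID;  // gene-map ID
    float score;         // final gene score for one query
};

// Returns the genes ordered by score, highest first (assumed ordering).
// std::stable_sort keeps the input order for genes with equal scores.
std::vector<ScoredGene> RankGenes(std::vector<ScoredGene> genes) {
    std::stable_sort(genes.begin(), genes.end(),
                     [](const ScoredGene &a, const ScoredGene &b) {
                         return a.score > b.score;
                     });
    return genes;
}
```

Applying such a ranking once per query yields one ranked list per query, which matches the two-dimensional shape of the returned structure.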

Definition at line 1474 of file seekcentral.cpp.

const vector< vector< float > > & Sleipnir::CSeekCentral::GetAllWeight ( ) const

Get the dataset weight vector for all the queries.

Returns:
A two-dimensional float array that stores the weights
Remarks:
The first dimension is the query. The second dimension is the dataset.

Definition at line 1491 of file seekcentral.cpp.

utype Sleipnir::CSeekCentral::GetGene ( const string &  strGene) const

Get the gene-map ID for a given gene-name.

Parameters:
strGene: The gene-name as a string
Returns:
The gene-map ID

Definition at line 1482 of file seekcentral.cpp.

References Sleipnir::CSeekTools::GetNaN().

string Sleipnir::CSeekCentral::GetGene ( const utype &  geneID) const

Get the gene-name for a given gene-map ID.

Parameters:
geneID: The gene-map ID
Returns:
The gene-name as a string

Definition at line 1487 of file seekcentral.cpp.

bool Sleipnir::CSeekCentral::Initialize ( const vector< CSeekDBSetting * > &  vecDBSetting,
const char *  search_dset,
const char *  query,
const char *  output_dir,
const utype  buffer = 20,
const bool  to_output_text = false,
const bool  bOutputWeightComponent = false,
const bool  bSimulateWeight = false,
const enum CSeekDataset::DistanceMeasure  dist_measure = CSeekDataset::Z_SCORE,
const bool  bSubtractAvg = true,
const bool  bNormPlatform = false,
const bool  bLogit = false,
const float  fCutOff = -9999,
const float  fPercentQueryRequired = 0,
const float  fPercentGenomeRequired = 0,
const bool  bSquareZ = false,
const bool  bRandom = false,
const int  iNumRandom = 10,
gsl_rng *  rand = NULL,
const bool  useNibble = false,
const int  numThreads = 8 
)

Initialize function.

Performs the following operations:

  • Read the search parameters
  • Read the gene mapping gene_map.txt
  • Read a list of queries
  • Read the dataset mapping and the search datasets
  • Read the CDatabaselets (i.e., the gene-gene correlations for the query genes)
Parameters:
gene: The gene mapping file name, gene_map.txt
quant: The quant file name
dset: The dataset mapping file name, dataset_platform.txt
search_dset: The file which contains the dataset names to be used for the search
query: The query file name
platform: The platform directory, which contains the platform correlation averages and standard deviations
db: The CDatabaselet directory, which contains the gene-centric compendium-wide correlations, *.db files
prep: The Prep directory, which contains the gene correlation average *.gavg, and the gene presence *.gpres
gvar: The gene variance directory, which contains the *.gvar files
sinfo: The sinfo directory, which contains the *.sinfo files
num_db: The total number of CDatabaselet files
buffer: The number of query genes to store in memory
output_dir: The output directory
to_output_text: If true, output the gene-ranking in textual format
bOutputWeightComponent: If true, output the dataset weight components (i.e., the scores of the cross-validations)
bSimulateWeight: If true, use simulated weight as dataset weight
dist_measure: Distance measure, either CORRELATION or Z_SCORE
bSubtractAvg: If true, subtract the average z-score on a per-gene basis
bNormPlatform: If true, subtract the platform gene average, divide by platform gene standard deviation
bLogit: If true, apply the logit transformation on the correlations
fCutOff: Cutoff for the correlation values
fPercentQueryRequired: The fraction of the query genes required to be present in a dataset in order to consider the dataset for integration
bSquareZ: If true, square the correlations
bRandom: If true, shuffle the correlation vector
iNumRandom: The number of random simulations to perform per query
rand: The random number generator
useNibble: Defaults to false
Remarks:
The word correlation refers to the z-scored, standardized Pearson correlation.
The parameters bSubtractAvg, bNormPlatform, bLogit, and bSquareZ are options to transform the correlation values.
The bSimulateWeight option is for equal weighting or order statistics where the final gene ranking is not derived from a weighted integration of datasets. In this case, if the user still wants to see the contribution of each dataset, the simulated weight is computed from the distance of a dataset's coexpression ranking to the final gene ranking.
This function is designed to be used by SeekMiner.
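
Two of the transformation options named in the remarks can be read as simple arithmetic on a single correlation value. The sketch below is only an interpretation of the parameter descriptions: both functions are hypothetical helpers, not Sleipnir code, and the handling of a zero standard deviation is an assumption. The source does not specify the order in which the options are applied, so none is implied here.

```cpp
// Hypothetical helpers (not part of Sleipnir) that mirror the descriptions of
// bSubtractAvg and bNormPlatform for one correlation value.

// bSubtractAvg: subtract the gene's average z-score.
float SubtractGeneAverage(float z, float geneAvg) {
    return z - geneAvg;
}

// bNormPlatform: subtract the platform gene average, then divide by the
// platform gene standard deviation.
// Assumption: a zero standard deviation maps to 0 to avoid division by zero.
float NormalizeByPlatform(float z, float platformAvg, float platformStdev) {
    if (platformStdev == 0.0f) return 0.0f;
    return (z - platformAvg) / platformStdev;
}
```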

Definition at line 655 of file seekcentral.cpp.

References Sleipnir::CSeekTools::ReadMultipleQueries().

bool Sleipnir::CSeekCentral::Initialize ( const vector< CSeekDBSetting * > &  vecDBSetting,
const utype  buffer = 20,
const bool  to_output_text = false,
const bool  bOutputWeightComponent = false,
const bool  bSimulateWeight = false,
const enum CSeekDataset::DistanceMeasure  dist_measure = CSeekDataset::Z_SCORE,
const bool  bSubtractAvg = true,
const bool  bNormPlatform = false,
const bool  bLogit = false,
const float  fCutOff = -9999,
const float  fPercentQueryRequired = 0,
const float  fPercentGenomeRequired = 0,
const bool  bSquareZ = false,
const bool  bRandom = false,
const int  iNumRandom = 10,
gsl_rng *  rand = NULL,
const bool  useNibble = false,
const int  numThreads = 8 
)

Initialize function.

Loads everything except the query, the search datasets, and the output directory.

Parameters:
gene: The gene mapping file name, gene_map.txt
quant: The quant file name
dset: The dataset mapping file name, dataset_platform.txt
platform: The platform directory, which contains the platform correlation average and standard deviation
db: The CDatabaselet directory, which contains the gene-centric compendium-wide correlations, *.db files
prep: The Prep directory, which contains the gene correlation average *.gavg, and the gene presence *.gpres. Divided by datasets.
gvar: The gene variance directory, which contains the *.gvar files
sinfo: The sinfo directory, which contains the *.sinfo files
num_db: The total number of CDatabaselet files
buffer: The number of query genes to store in memory
to_output_text: If true, output the gene-ranking in textual format
bOutputWeightComponent: If true, output the dataset weight components (i.e., the scores of the cross-validations)
bSimulateWeight: If true, use simulated weight as dataset weight
dist_measure: Distance measure, either CORRELATION or Z_SCORE
bSubtractAvg: If true, subtract the average z-score on a per-gene basis
bNormPlatform: If true, subtract the platform gene average, divide by platform gene standard deviation
bLogit: If true, apply the logit transformation on the correlations
fCutOff: Cutoff for the correlation values
fPercentQueryRequired: The fraction of the query genes required to be present in a dataset
bSquareZ: If true, square the correlations
bRandom: If true, shuffle the correlation vector
iNumRandom: The number of random simulations to perform per query
rand: The random number generator
useNibble: Defaults to false
Remarks:
The word correlation refers to the z-scored, standardized Pearson correlation.
The parameters bSubtractAvg, bNormPlatform, bLogit, and bSquareZ are options to transform the correlation values.
The bSimulateWeight option is for equal weighting or order statistics where the final gene ranking is not derived from a weighted integration of datasets. In this case, if the user still wants to see the contribution of each dataset, the simulated weight is computed from the distance of a dataset's coexpression ranking to the final gene ranking.
This function is designed to be used by SeekMiner.

Definition at line 521 of file seekcentral.cpp.

References Sleipnir::CSeekDataset::CORRELATION, Sleipnir::CSeekTools::LoadDatabase(), Sleipnir::CSeekTools::ReadListTwoColumns(), Sleipnir::CSeekTools::ReadPlatforms(), and Sleipnir::CSeekTools::ReadQuantFile().

bool Sleipnir::CSeekCentral::Initialize ( const string &  output_dir,
const string &  query,
const string &  search_dset,
CSeekCentral *  src,
const int  iClient,
const float  query_min_required = 0,
const float  genome_min_required = 0,
const enum CSeekDataset::DistanceMeasure  eDistMeasure = CSeekDataset::Z_SCORE,
const bool  bSubtractGeneAvg = true,
const bool  bNormPlatform = false 
)

Initialize function.

Prepares Seek to be used in a client-server environment

Parameters:
output_dir: The output directory
query: The query file name
search_dset: The file that contains the names of the datasets to be used for the search
src: The CSeekCentral instance from which some settings are copied to this instance
iClient: The client's socket connection
query_min_required: The minimum number of query genes required to be present in a dataset
eDistMeasure: Distance measure, either CORRELATION or Z_SCORE
bSubtractGeneAvg: If true, subtract the average z-score on a per-gene basis
bNormPlatform: If true, subtract the platform gene average, divide by platform gene standard deviation
Remarks:
This function is designed to be used by SeekServer.
The parameters bSubtractGeneAvg and bNormPlatform are options to transform the correlation values.
Assumes that the CDatabaselets have been read, and the *.gvar, *.sinfo files have been loaded.
Assumes that the dataset and gene mapping files have been read.

Definition at line 237 of file seekcentral.cpp.

References Sleipnir::CSeekTools::LoadDatabase(), and Sleipnir::CMeta::Tokenize().

bool Sleipnir::CSeekCentral::OrderStatistics ( )

Run Seek with the order statistics dataset weighting algorithm.

Remarks:
Assumes that the CSeekCentral::Initialize() has been called.

Definition at line 1426 of file seekcentral.cpp.

References ORDER_STATISTICS.

bool Sleipnir::CSeekCentral::VarianceWeightSearch ( )

Run Seek with the variance weighted search.

Same as CSeekCentral::WeightSearch(), except that the user-given weights are the query gene expression variances.

Remarks:
Assumes that the CSeekCentral::Initialize() has been called.

Definition at line 1436 of file seekcentral.cpp.

References Sleipnir::CSeekQuery::GetQuery(), Sleipnir::CSeekTools::InitVector(), Sleipnir::CMeta::IsNaN(), and WeightSearch().

bool Sleipnir::CSeekCentral::WeightSearch ( const vector< vector< float > > &  weights)

Run Seek with the user-given dataset weights.

Parameters:
weights: A two-dimensional array that stores the user-given weights
Remarks:
The two-dimensional array weights is Q by D, where Q is the number of queries and D is the number of datasets. weights[i][j] stores the weight of dataset j for query i.
Assumes that the CSeekCentral::Initialize() has been called.
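
The Q by D layout described above can be sketched as follows. MakeUniformWeights is a hypothetical helper, not part of Sleipnir, and giving every dataset the same weight 1/D is only an illustrative choice of values.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Sleipnir): builds a Q-by-D matrix in which
// weights[i][j] is the weight of dataset j for query i.
// Illustrative choice: every dataset receives the same weight 1/D.
std::vector<std::vector<float> > MakeUniformWeights(std::size_t numQueries,
                                                    std::size_t numDatasets) {
    // With zero datasets each row is empty, so the value below is never used.
    const float w = (numDatasets == 0)
                        ? 0.0f
                        : 1.0f / static_cast<float>(numDatasets);
    return std::vector<std::vector<float> >(
        numQueries, std::vector<float>(numDatasets, w));
}
```

A matrix with this shape has the type expected by the weights parameter: one row per query, one column per dataset.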

Definition at line 1421 of file seekcentral.cpp.

References USE_WEIGHT.

Referenced by VarianceWeightSearch().


The documentation for this class was generated from the following files:

  • seekcentral.h
  • seekcentral.cpp