Sleipnir
Public Member Functions
Sleipnir::CCoalesceCluster Class Reference

Manages a single converging regulatory module for CCoalesce. More...

#include <coalescecluster.h>

Inheritance diagram for Sleipnir::CCoalesceCluster:
Sleipnir::CCoalesceClusterImpl

Public Member Functions

bool Initialize (const CPCL &PCL, CCoalesceCluster &Pot, const std::vector< SCoalesceDataset > &vecsDatasets, std::set< std::pair< size_t, size_t > > &setpriiSeeds, const std::vector< float > &vecdSeed, size_t iPairs, float dPValue, float dProbability, size_t iThreads)
 Randomly initializes a new cluster from the given PCL by selecting a pair of correlated genes and a surrounding seed of additional genes.
void Subtract (CPCL &PCL, const CCoalesceCluster &Pot) const
 Subtract the average expression value for each condition in the cluster from each gene in the cluster.
void Subtract (CCoalesceGeneScores &GeneScores) const
 Subtract the average frequency of each motif in the cluster from the score for that motif in each gene in the cluster.
bool SelectConditions (const CPCL &PCL, const CCoalesceCluster &Pot, size_t iThreads, float dPValue, float dZScore)
 Performs feature selection to include significant expression conditions in a converging cluster.
bool SelectMotifs (const CCoalesceGroupHistograms &HistsCluster, const CCoalesceGroupHistograms &HistsPot, float dPValue, float dZScore, size_t iMaxMotifs, size_t iThreads, const CCoalesceMotifLibrary *pMotifs=NULL)
 Performs feature selection to include significant sequence motifs in a converging cluster.
bool SelectGenes (const CPCL &PCL, const CCoalesceGeneScores &GeneScores, const CCoalesceGroupHistograms &HistsCluster, const CCoalesceGroupHistograms &HistsPot, size_t iMinimum, size_t iThreads, CCoalesceCluster &Pot, float dProbability, const CCoalesceMotifLibrary *pMotifs=NULL)
 Add and remove (im)probable genes to a converging cluster using Bayesian integration based on the currently selected expression conditions and sequence motifs.
void CalculateHistograms (const CCoalesceGeneScores &GeneScores, CCoalesceGroupHistograms &HistogramsCluster, CCoalesceGroupHistograms *pHistogramsPot) const
 Updates the given motif score histograms based on the genes in the current cluster using the provided per-gene motif scores.
size_t Open (const std::string &strPCL, size_t iSkip, const CPCL &PCL, CCoalesceMotifLibrary *pMotifs=NULL)
 Opens a cluster based on the given PCL file and, if present, accompanying motif file.
size_t Open (std::istream &istm, const CPCL &PCL, CCoalesceMotifLibrary *pMotifs=NULL)
 Opens a cluster based on the textual representation (including genes, conditions, and motifs) in the given input stream.
bool Open (const CHierarchy &Hierarchy, const std::vector< CCoalesceCluster > &vecClusters, const std::vector< std::string > &vecstrClusters, float dFraction, float dCutoff, size_t iCutoff, CCoalesceMotifLibrary *pMotifs=NULL)
 Creates a new cluster by merging the one or more clusters in the given hierarchy.
bool Save (const std::string &strDirectory, size_t iID, const CPCL &PCL, const CCoalesceMotifLibrary *pMotifs=NULL) const
 Saves cluster in a pair of PCL and motif text files in the given directory.
void Save (std::ostream &ostm, size_t iID, const CPCL &PCL, const CCoalesceMotifLibrary *pMotifs=NULL, float dCutoffPWMs=0, float dPenaltyGap=0, float dPenaltyMismatch=0, bool fNoRCs=false) const
 Saves a textual representation of the cluster (including genes, conditions, and motifs) to the given output stream.
float GetSimilarity (const CCoalesceCluster &Cluster, size_t iGenes, size_t iDatasets) const
 Calculates a similarity score between the given and current clusters.
void Snapshot (const CCoalesceGeneScores &GeneScores, CCoalesceGroupHistograms &Histograms)
 Records the state of the cluster between convergence iterations.
bool LabelMotifs (const CCoalesceMotifLibrary &Motifs, SMotifMatch::EType eMatchType, float dPenaltyGap, float dPenaltyMismatch, float dPValue)
 Labels all significant motifs in the cluster with any known TF motifs that match below the given threshhold.
bool IsConverged ()
 Returns true if the cluster has converged to a previously visited state.
bool IsEmpty () const
 Returns true if the cluster contains no genes or no conditions.
void SetGenes (size_t iGenes)
 Replaces the cluster's genes with the requested number of initial genes.
const std::set< size_t > & GetGenes () const
 Returns the set of genes significant in this cluster.
const std::set< size_t > & GetDatasets () const
 Returns the set of datasets significant in this cluster.
const std::set< SMotifMatch > & GetMotifs () const
 Returns the set of motifs significant in this cluster.
bool IsGene (size_t iGene) const
 Returns true if the given gene is significant in this cluster.
bool IsDataset (size_t iDataset) const
 Returns true if the given dataset is significant in this cluster.
void RemoveGenes (const std::vector< size_t > &veciGenes)
 Removes the given genes from the cluster.

Detailed Description

Manages a single converging regulatory module for CCoalesce.

A COALESCE cluster represents a regulatory module consisting of genes, conditions where they're coexpressed, and putative regulatory motifs. The cluster object is responsible for both the data necessary to represent such a module (read and written using Open and Save) and the methods used to generate the module (primarily Initialize, SelectConditions, SelectMotifs, and SelecGenes). A variety of optimizations make the algorithm both more efficient and more complex, including discretization of various scores and change tracking between iterations.

Remarks:
A cluster can be generated during COALESCE execution, but it can also be read and written independently. Clusters generally need to be associated with the CPCL object from which they were first created, and may also interact with a CCoalesceMotifLibrary if they contain predicted motifs.
See also:
CCoalesce

Definition at line 48 of file coalescecluster.h.


Member Function Documentation

void Sleipnir::CCoalesceCluster::CalculateHistograms ( const CCoalesceGeneScores GeneScores,
CCoalesceGroupHistograms HistogramsCluster,
CCoalesceGroupHistograms pHistogramsPot 
) const

Updates the given motif score histograms based on the genes in the current cluster using the provided per-gene motif scores.

Parameters:
GeneScoresPer-gene motif scores from which score histograms are calculated.
HistogramsClusterMotif score histograms based on genes in the current cluster.
pHistogramsPotIf non-null, motif score histograms based on genes not in the current cluster.

COALESCE finds significant motifs for each cluster by determining which motifs have statistically different distributions in the cluster versus the genomic background. To do this, one histogram of per-gene frequencies is constructed per motif; these histograms are in turn based on the frequencies with which each motif appears in each gene. For example, suppose we have three genes G1 through G3 and three motifs M1 through M3. Based each gene's sequence, we determine that the motifs have the following frequencies:

 Motif G1 G2 G3
 M1    1  1  2
 M2    2  0  2
 M3    1  2  0

If only genes G1 and G2 are in the cluster, we build a frequency histogram as follows:

 Motif 0  1  2
 M1    0  2  0
 M2    1  0  1
 M3    0  1  1

That is, in the cluster, there are zero genes within which M1 occurs zero times, two in which it occurs once, zero in which it occurs twice, and so forth. Each row (i.e. each motif's total histogram) must sum to the number of genes in the cluster, and the resulting inverse histograms for genes not in the cluster (i.e. G3) is:

 Motif 0  1  2
 M1    0  0  1
 M2    0  0  1
 M3    1  0  0

In COALESCE, additional complexity arises since motif frequencies are continuous (and must thus be discretized in the histograms) and are calculated on a per-subsequence-type basis (giving rise to multiple histograms per cluster).

See also:
Snapshot

Definition at line 266 of file coalescecluster.cpp.

References GetGenes().

Referenced by Sleipnir::CCoalesce::Cluster().

const std::set<size_t>& Sleipnir::CCoalesceCluster::GetDatasets ( ) const [inline]

Returns the set of datasets significant in this cluster.

Returns:
Set of datasets included in this cluster.

Definition at line 143 of file coalescecluster.h.

Referenced by Sleipnir::CCoalesce::Cluster().

const std::set<size_t>& Sleipnir::CCoalesceCluster::GetGenes ( ) const [inline]

Returns the set of genes significant in this cluster.

Returns:
Set of genes included in this cluster.

Reimplemented from Sleipnir::CCoalesceClusterImpl.

Definition at line 132 of file coalescecluster.h.

Referenced by CalculateHistograms(), Sleipnir::CCoalesce::Cluster(), GetSimilarity(), Save(), SelectConditions(), SelectGenes(), Snapshot(), and Subtract().

const std::set<SMotifMatch>& Sleipnir::CCoalesceCluster::GetMotifs ( ) const [inline]

Returns the set of motifs significant in this cluster.

Returns:
Set of motifs included in this cluster.

Definition at line 154 of file coalescecluster.h.

Referenced by Sleipnir::CCoalesce::Cluster(), and Save().

float Sleipnir::CCoalesceCluster::GetSimilarity ( const CCoalesceCluster Cluster,
size_t  iGenes,
size_t  iDatasets 
) const

Calculates a similarity score between the given and current clusters.

Parameters:
ClusterCluster to be compared to the current cluster.
iGenesTotal number of available genes.
iDatasetsTotal number of available datasets.
Returns:
Similarity score for the two clusters (minimum zero, increasing indicates greater similarity).
Remarks:
Used by COALESCE to postprocess modules. Generally returns fraction of shared genes proportional to the smaller of the two clusters; Jaccard index can also work well.

Definition at line 1517 of file coalescecluster.cpp.

References GetGenes(), and IsGene().

bool Sleipnir::CCoalesceCluster::Initialize ( const CPCL PCL,
CCoalesceCluster Pot,
const std::vector< SCoalesceDataset > &  vecsDatasets,
std::set< std::pair< size_t, size_t > > &  setpriiSeeds,
const std::vector< float > &  vecdSeed,
size_t  iPairs,
float  dPValue,
float  dProbability,
size_t  iThreads 
)

Randomly initializes a new cluster from the given PCL by selecting a pair of correlated genes and a surrounding seed of additional genes.

Parameters:
PCLExpression data from which cluster is seeded.
PotCluster initialized to the inverse of the current cluster, i.e. all genes not in the new cluster.
vecsDatasetsVector of dataset block structure to be used by the new cluster for subsequent selection of significant conditions.
setpriiSeedsSet of previously failed seed pairs to be excluded for cluster initialization.
vecdSeedIf non-empty, expression vector to be used as a cluster seed.
iPairsMaximum number of gene pairs to be sampled for seed pair discovery.
dPValueP-value threshhold for significant correlation.
dProbabilityPrior probability of gene inclusion.
iThreadsMaximum number of simultaneous threads for gene pair sampling.
Returns:
True if the cluster was successfully initialized; false if no appropriate seed could be found.
Exceptions:
<exceptionclass> Description of criteria for throwing this exception.

A new cluster is initialized by selecting the most highly correlated gene pair from a random sample of the input PCL that is below the significance threshhold. The expression centroid of this pair is then calculated, and all other genes significantly correlated with this centroid are subsequently added. Genes not selected for inclusion are added to the inverse cluster instead. All significance tests are appropriately Bonferroni corrected for multiple hypothesis testing.

Remarks:
Initial seed pair is the most highly correlated and significant gene pair sampled from the input PCL; other genes significantly correlated with the resulting centroid are added subsequently to initialize the cluster. The provided dataset blocks are used for all subsequent condition significance tests and covariance calculations in which the cluster is involved.

Definition at line 89 of file coalescecluster.cpp.

References Sleipnir::CPCL::ENormalizeColumn, Sleipnir::CPCL::GetGenes(), Sleipnir::CPCL::Normalize(), and Sleipnir::CPCL::Open().

Referenced by Sleipnir::CCoalesce::Cluster().

Returns true if the cluster has converged to a previously visited state.

Returns:
True if the cluster is in a previously visited state.
Remarks:
States are a combination of genes, conditions, and motifs, which are hashed into a single integer. This has the potential to cause false collisions, which will at worst prematurely terminate cluster refinement.

Definition at line 93 of file coalescecluster.h.

Referenced by Sleipnir::CCoalesce::Cluster().

bool Sleipnir::CCoalesceCluster::IsDataset ( size_t  iDataset) const [inline]

Returns true if the given dataset is significant in this cluster.

Parameters:
iDatasetDataset index to be tested for inclusion.
Returns:
True if the given dataset is included in this cluster.

Definition at line 182 of file coalescecluster.h.

bool Sleipnir::CCoalesceCluster::IsEmpty ( ) const [inline]

Returns true if the cluster contains no genes or no conditions.

Returns:
True if the cluster contains no genes or no conditions.

Definition at line 104 of file coalescecluster.h.

Referenced by Sleipnir::CCoalesce::Cluster().

bool Sleipnir::CCoalesceCluster::IsGene ( size_t  iGene) const [inline]

Returns true if the given gene is significant in this cluster.

Parameters:
iGeneGene index to be tested for inclusion.
Returns:
True if the given gene is included in this cluster.

Reimplemented from Sleipnir::CCoalesceClusterImpl.

Definition at line 168 of file coalescecluster.h.

Referenced by GetSimilarity(), and Save().

bool Sleipnir::CCoalesceCluster::LabelMotifs ( const CCoalesceMotifLibrary Motifs,
SMotifMatch::EType  eMatchType,
float  dPenaltyGap,
float  dPenaltyMismatch,
float  dPValue 
)

Labels all significant motifs in the cluster with any known TF motifs that match below the given threshhold.

Parameters:
MotifsMotif library to be used for cluster and known motifs.
eMatchTypeType of match to be performed: correlation, rmse, etc.
dPenaltyGapAlignment score penalty for gaps.
dPenaltyMismatchAlignment score penalty for mismatches.
dPValueP-value (or other score) threshhold below which known TFs must match.
Returns:
True if the labeling process succeeded (possibly with no matching labels), false otherwise.

Labels any significant motifs in the cluster with zero or more matching known TF binding motifs. These known labels are scored and will be read and written with subsequence Open/Save calls that include this cluster's motifs.

Remarks:
For each significant motif in the cluster, scans any known TF motifs in Motifs for matches based on one of several PWM matching algorithms (usually correlation p-value).
See also:
CCoalesceMotifLibrary::GetKnown

Definition at line 1596 of file coalescecluster.cpp.

References Sleipnir::CCoalesceMotifLibrary::GetKnowns().

size_t Sleipnir::CCoalesceCluster::Open ( const std::string &  strPCL,
size_t  iSkip,
const CPCL PCL,
CCoalesceMotifLibrary pMotifs = NULL 
)

Opens a cluster based on the given PCL file and, if present, accompanying motif file.

Parameters:
strPCLDescription of parameter strPCL.
iSkipNumber of feature columns to skip between the gene IDs and first experimental column.
PCLPCL with which gene and condition labels are associated.
pMotifsIf non-null, motif library with which cluster motifs are created.
Returns:
Number of successfully read cluster conditions, or -1 on failure.
Remarks:
Input PCL should be in the format generated by Save, i.e. only genes included in the bicluster, all conditions, included conditions sorted to the left and tagged with an initial *. For a given PCL file <filename>.pcl, motifs are read from <filename>_motifs.txt.
See also:
Save

Definition at line 1201 of file coalescecluster.cpp.

References Sleipnir::CPCL::GetExperiment(), Sleipnir::CPCL::GetExperiments(), Sleipnir::CPCL::GetExtension(), Sleipnir::CPCL::GetGene(), Sleipnir::CPCL::GetGenes(), and Sleipnir::CPCL::Open().

size_t Sleipnir::CCoalesceCluster::Open ( std::istream &  istm,
const CPCL PCL,
CCoalesceMotifLibrary pMotifs = NULL 
)

Opens a cluster based on the textual representation (including genes, conditions, and motifs) in the given input stream.

Parameters:
istmInput stream from which cluster is read.
PCLPCL with which gene and condition labels are associated.
pMotifsIf non-null, motif library with which cluster motifs are created.
Returns:
True if cluster was opened successfully, false otherwise.
Remarks:
Opening can fail if the given stream contains gene or cluster IDs not present in the given PCL. Motifs will be ignored if pMotifs is null.
See also:
Save

Definition at line 1268 of file coalescecluster.cpp.

References Sleipnir::CPCL::GetExperiment(), Sleipnir::CPCL::GetGene(), Sleipnir::CCoalesceMotifLibrary::Open(), and Sleipnir::CMeta::Tokenize().

bool Sleipnir::CCoalesceCluster::Open ( const CHierarchy Hierarchy,
const std::vector< CCoalesceCluster > &  vecClusters,
const std::vector< std::string > &  vecstrClusters,
float  dFraction,
float  dCutoff,
size_t  iCutoff,
CCoalesceMotifLibrary pMotifs = NULL 
)

Creates a new cluster by merging the one or more clusters in the given hierarchy.

Parameters:
HierarchyHierarchy of cluster indices to be merged into the newly opened cluster.
vecClustersVector of preexisting clusters, a superset of those in the given hierarchy.
vecstrClustersVector of preexisting cluster IDs.
dFractionMinimum fraction of input cluster in which a gene must appear to be included in the output cluster.
dCutoffEdit distance threshhold beyond which merged motif alignments will be discarded.
iCutoffMaximum number of input motifs which will be merged exactly; larger numbers of motifs will be merged heuristically.
pMotifsIf non-null, motif library by which input and output motifs are managed.
Returns:
True of the new cluster was created successfully, false otherwise.
Remarks:
Currently only genes are filtered by dFraction; all conditions and motifs in any input cluster will be included in the output cluster.
See also:
GetSimilarity | CClustHierarchical

Definition at line 1347 of file coalescecluster.cpp.

void Sleipnir::CCoalesceCluster::RemoveGenes ( const std::vector< size_t > &  veciGenes) [inline]

Removes the given genes from the cluster.

Parameters:
veciGenesVector of gene IDs to be removed.

Definition at line 193 of file coalescecluster.h.

bool Sleipnir::CCoalesceCluster::Save ( const std::string &  strDirectory,
size_t  iID,
const CPCL PCL,
const CCoalesceMotifLibrary pMotifs = NULL 
) const

Saves cluster in a pair of PCL and motif text files in the given directory.

Parameters:
strDirectoryDirectory in which cluster files are generated.
iIDInteger ID assigned to the cluster.
PCLPCL from which gene and condition labels are read.
pMotifsIf non-null, motif library with which cluster motifs are formatted.
Returns:
True if the cluster was saved successfully, false otherwise.

Saves the cluster as a pair of text files (PCL and motifs) in the given directory. Filenames are randomly generated in the format c<id>_<rand>.pcl and c<id>_<rand>_motifs.txt. The PCL file will contain only genes included in the bicluster along with all conditions; conditions in the bicluster will be sorted to the left and tagged with an initial *. Motif files are of the format:

 <type1>    <subtype1>  <score1>    <tf1>:<match1>|<tf2>:<match2>|...|<tfy>:<matchx>
 <pst1>
 <pwm1>
 <type2>    <subtype2>  <score2>    <tf1>:<match1>|<tf2>:<match2>|...|<tfy>:<matchx>
 <pst2>
 <pwm2>
 ...
 <typez>    <subtypez>  <scorez>    <tf1>:<match1>|<tf2>:<match2>|...|<tfy>:<matchx>
 <pstz>
 <pwmz>
See also:
Open

Definition at line 1033 of file coalescecluster.cpp.

References Sleipnir::CPCL::GetExperiment(), Sleipnir::CPCL::GetExperiments(), Sleipnir::CPCL::GetExtension(), Sleipnir::CPCL::GetGenes(), IsGene(), Sleipnir::CPCL::MaskGene(), Sleipnir::CPCL::Open(), Sleipnir::CPCL::Save(), and Sleipnir::CPCL::SetExperiment().

Referenced by Sleipnir::CCoalesce::Cluster().

void Sleipnir::CCoalesceCluster::Save ( std::ostream &  ostm,
size_t  iID,
const CPCL PCL,
const CCoalesceMotifLibrary pMotifs = NULL,
float  dCutoffPWMs = 0,
float  dPenaltyGap = 0,
float  dPenaltyMismatch = 0,
bool  fNoRCs = false 
) const

Saves a textual representation of the cluster (including genes, conditions, and motifs) to the given output stream.

Parameters:
ostmOutput stream to which cluster description is saved.
iIDInteger ID assigned to the cluster.
PCLPCL from which gene and condition labels are read.
pMotifsIf non-null, motif library with which cluster motifs are formatted.
dCutoffPWMsMinimum information threshhold (in bits) for a PWM to be saved.
dPenaltyGapAlignment score penalty for gaps.
dPenaltyMismatchAlignment score penalty for mismatches.
fNoRCsIf true, resolve the given motif into a single strand without reverse complements before saving PWM.

Saves cluster in the following tab-delimited textual format:

 Cluster    <id>
 Genes  <gene1> <gene2> ... <genew>
 Conditions <cond1> <cond2> ... <condx>
 Motifs
 <type1>    <subtype1>  <score1>    <tf1>:<match1>|<tf2>:<match2>|...|<tfy>:<matchy>
 <pst1>
 <pwm1>
 <type2>    <subtype2>  <score2>    <tf1>:<match1>|<tf2>:<match2>|...|<tfy>:<matchy>
 <pst2>
 <pwm2>
 ...
 <typez>    <subtypez>  <scorez>    <tf1>:<match1>|<tf2>:<match2>|...|<tfy>:<matchy>
 <pstz>
 <pwmz>
See also:
Open

Definition at line 1151 of file coalescecluster.cpp.

References Sleipnir::CPCL::GetExperiment(), Sleipnir::CPCL::GetGene(), GetGenes(), and GetMotifs().

bool Sleipnir::CCoalesceCluster::SelectConditions ( const CPCL PCL,
const CCoalesceCluster Pot,
size_t  iThreads,
float  dPValue,
float  dZScore 
)

Performs feature selection to include significant expression conditions in a converging cluster.

Parameters:
PCLExpression dataset from which significant datasets are selected.
PotInverse of current cluster used for genomic background calculations.
iThreadsMaximum number of simultaneous threads for condition significance calculations.
dPValueP-value threshhold for condition significance.
dZScoreZ-score effect size threshhold for condition significance.
Returns:
True if zero or more conditions were selected successfully, false otherwise.

Selects zero or more conditions in which the cluster's current gene set is differentially expressed. That is, in each selected condition, the average expression of genes in the cluster must differ from the genomic background with at least the given significance and effect size threshholds. If dataset blocks were given at cluster initialization time, all conditions in the block are added (or not) simultaneously using a multivariate significance test.

See also:
SelectMotifs | SelectGenes

Definition at line 489 of file coalescecluster.cpp.

References GetGenes().

Referenced by Sleipnir::CCoalesce::Cluster().

bool Sleipnir::CCoalesceCluster::SelectGenes ( const CPCL PCL,
const CCoalesceGeneScores GeneScores,
const CCoalesceGroupHistograms HistsCluster,
const CCoalesceGroupHistograms HistsPot,
size_t  iMinimum,
size_t  iThreads,
CCoalesceCluster Pot,
float  dProbability,
const CCoalesceMotifLibrary pMotifs = NULL 
)

Add and remove (im)probable genes to a converging cluster using Bayesian integration based on the currently selected expression conditions and sequence motifs.

Parameters:
PCLPCL from which genes are selected.
GeneScoresPer-gene motif scores used to determine each gene's probability based on motif frequencies.
HistsClusterPrecalculated histogram of motif frequencies using genes in the cluster.
HistsPotPrecalculated histogram of motif frequencies using genes not in the cluster.
iMinimumMinimum number of genes for which data must be present for a given motif for it to contribute to gene probabilities.
iThreadsMaximum number of simultaneous threads for gene probability calculations.
PotInverse of genes in the cluster; used to determine genomic background probabilities.
dProbabilityProbability threshhold above which genes are included in the cluster.
pMotifsIf non-null, motif library managing significant sequence motifs.
Returns:
True if zero or more genes were included in the cluster, false otherwise.

Adds and removes genes to/from a converging cluster based on the currently selected significant expression conditions and sequence motifs. The probability of gene membership in the cluster given some data, P(g in C|D), is calculated using Bayes rule as P(D|g in C)P(g in C)/(P(D|g in C) + P(D|g notin C)). All data features are assumed to be independent, such that P(D|g in C) is a product over each significant condition/motif of A) the probability of the gene's expression value being drawn from the cluster's distribution of expression values or B) the probability of its frequency for some motif being drawn from the cluster's distribution of frequencies for that motif. Distributions are assumed to be normal, and frequencies are Laplace smoothed to avoid zeros. The prior P(g in C) is used to stabilize cluster convergence, such that it is equal to 1 if a gene was already included in the cluster in a previous iteration and equal to dProbability if the gene was not previously included.

Remarks:
A prior is also applied to the two main data types, expression conditions and sequence motifs. If X and M are the sets of all possible conditions and motifs, respectively, and CX and CM are the significant conditions and motifs in the cluster, then expression probabilities have prior |CX|/|C| and motifs have prior |CM|/|M|.
See also:
SelectConditions | SelectMotifs

Definition at line 755 of file coalescecluster.cpp.

References GetGenes(), and Sleipnir::CPCL::GetGenes().

Referenced by Sleipnir::CCoalesce::Cluster().

bool Sleipnir::CCoalesceCluster::SelectMotifs ( const CCoalesceGroupHistograms HistsCluster,
const CCoalesceGroupHistograms HistsPot,
float  dPValue,
float  dZScore,
size_t  iMaxMotifs,
size_t  iThreads,
const CCoalesceMotifLibrary pMotifs = NULL 
)

Performs feature selection to include significant sequence motifs in a converging cluster.

Parameters:
HistsClusterPrecalculated histogram of motif frequencies using genes in the cluster.
HistsPotPrecalculated histogram of motif frequencies using genes not in the cluster.
dPValueP-value threshhold for motif significance.
dZScoreZ-score effect size threshhold for motif significance.
iMaxMotifsMaximum number of motifs associated with a cluster; if more motifs are present, no new selection is performed.
iThreadsMaximum number of simultaneous threads for motif significance calculations.
pMotifsIf non-null, motif library with which motifs are managed.
Returns:
True if zero or more significant motifs were selected; false otherwise.

Selects zero or more motifs differentially over- or under-enriched in the genes currently in a converging cluster. Motifs are selected on a per-sequence-subtype basis, so a motif may be enriched, for example, in an upstream but not downstream flank. All motifs managed by the given library are tested for significance, including all k-mers, reverse complements, and any currently constructed PSTs. Significance testing is performed by z-scoring the within versus without frequency histograms, and all p-values are Bonferroni corrected.

Remarks:
No motif selection can be performed if pMotifs is null; this will not generate an error, but only expression biclustering (i.e. conditions and genes) will be performed.
See also:
CalculateHistograms | SelectConditions | SelectGenes

Definition at line 588 of file coalescecluster.cpp.

Referenced by Sleipnir::CCoalesce::Cluster().

void Sleipnir::CCoalesceCluster::SetGenes ( size_t  iGenes) [inline]

Replaces the cluster's genes with the requested number of initial genes.

Parameters:
iGenesNumber of genes to be included in the cluster.
Remarks:
Sets the cluster's gene set to 0 through iGenes - 1.

Definition at line 118 of file coalescecluster.h.

Referenced by Sleipnir::CCoalesce::Cluster().

Records the state of the cluster between convergence iterations.

Parameters:
GeneScoresPer-gene sequence scores used to snapshot motif histograms.
HistogramsMotif histograms to be updated using the current gene scores.
Remarks:
Should be called at the start of each COALESCE iteration, after new histograms are calculated but before anything else.
See also:
IsConverged

Definition at line 1553 of file coalescecluster.cpp.

References GetGenes().

Referenced by Sleipnir::CCoalesce::Cluster().

void Sleipnir::CCoalesceCluster::Subtract ( CPCL PCL,
const CCoalesceCluster Pot 
) const

Subtract the average expression value for each condition in the cluster from each gene in the cluster.

Parameters:
PCLExpression matrix from which cluster averages are subtracted.
PotInverse of genes in the cluster; used to determine the difference of the cluster's average from the existing per-condition average.
Remarks:
This effectively masks the average effect of each condition in the cluster from the contained genes' expression values so that the cluster won't be re-found later. Actually subtracts the difference between the cluster average and the overall average from each condition, since the overall average need not be zero.

Definition at line 301 of file coalescecluster.cpp.

References Sleipnir::CPCL::Get(), GetGenes(), and Sleipnir::CMeta::IsNaN().

Referenced by Sleipnir::CCoalesce::Cluster().

Subtract the average frequency of each motif in the cluster from the score for that motif in each gene in the cluster.

Parameters:
GeneScoresPer-gene motif frequency scores from which the cluster averages are subtracted.
Remarks:
This effectively masks the average effect of each motif in the cluster from the contained genes' sequences so that they won't be re-found later.

Definition at line 330 of file coalescecluster.cpp.

References GetGenes().


The documentation for this class was generated from the following files: