Manages a single converging regulatory module for CCoalesce. More...

#include <coalescecluster.h>

Inheritance diagram for Sleipnir::CCoalesceCluster:

Public Member Functions
bool	Initialize (const CPCL &PCL, CCoalesceCluster &Pot, const std::vector< SCoalesceDataset > &vecsDatasets, std::set< std::pair< size_t, size_t > > &setpriiSeeds, const std::vector< float > &vecdSeed, size_t iPairs, float dPValue, float dProbability, size_t iThreads)
	Randomly initializes a new cluster from the given PCL by selecting a pair of correlated genes and a surrounding seed of additional genes.
void	Subtract (CPCL &PCL, const CCoalesceCluster &Pot) const
	Subtract the average expression value for each condition in the cluster from each gene in the cluster.
void	Subtract (CCoalesceGeneScores &GeneScores) const
	Subtract the average frequency of each motif in the cluster from the score for that motif in each gene in the cluster.
bool	SelectConditions (const CPCL &PCL, const CCoalesceCluster &Pot, size_t iThreads, float dPValue, float dZScore)
	Performs feature selection to include significant expression conditions in a converging cluster.
bool	SelectMotifs (const CCoalesceGroupHistograms &HistsCluster, const CCoalesceGroupHistograms &HistsPot, float dPValue, float dZScore, size_t iMaxMotifs, size_t iThreads, const CCoalesceMotifLibrary *pMotifs=NULL)
	Performs feature selection to include significant sequence motifs in a converging cluster.
bool	SelectGenes (const CPCL &PCL, const CCoalesceGeneScores &GeneScores, const CCoalesceGroupHistograms &HistsCluster, const CCoalesceGroupHistograms &HistsPot, size_t iMinimum, size_t iThreads, CCoalesceCluster &Pot, float dProbability, const CCoalesceMotifLibrary *pMotifs=NULL)
	Add and remove (im)probable genes to a converging cluster using Bayesian integration based on the currently selected expression conditions and sequence motifs.
void	CalculateHistograms (const CCoalesceGeneScores &GeneScores, CCoalesceGroupHistograms &HistogramsCluster, CCoalesceGroupHistograms *pHistogramsPot) const
	Updates the given motif score histograms based on the genes in the current cluster using the provided per-gene motif scores.
size_t	Open (const std::string &strPCL, size_t iSkip, const CPCL &PCL, CCoalesceMotifLibrary *pMotifs=NULL)
	Opens a cluster based on the given PCL file and, if present, accompanying motif file.
size_t	Open (std::istream &istm, const CPCL &PCL, CCoalesceMotifLibrary *pMotifs=NULL)
	Opens a cluster based on the textual representation (including genes, conditions, and motifs) in the given input stream.
bool	Open (const CHierarchy &Hierarchy, const std::vector< CCoalesceCluster > &vecClusters, const std::vector< std::string > &vecstrClusters, float dFraction, float dCutoff, size_t iCutoff, CCoalesceMotifLibrary *pMotifs=NULL)
	Creates a new cluster by merging the one or more clusters in the given hierarchy.
bool	Save (const std::string &strDirectory, size_t iID, const CPCL &PCL, const CCoalesceMotifLibrary *pMotifs=NULL) const
	Saves cluster in a pair of PCL and motif text files in the given directory.
void	Save (std::ostream &ostm, size_t iID, const CPCL &PCL, const CCoalesceMotifLibrary *pMotifs=NULL, float dCutoffPWMs=0, float dPenaltyGap=0, float dPenaltyMismatch=0, bool fNoRCs=false) const
	Saves a textual representation of the cluster (including genes, conditions, and motifs) to the given output stream.
float	GetSimilarity (const CCoalesceCluster &Cluster, size_t iGenes, size_t iDatasets) const
	Calculates a similarity score between the given and current clusters.
void	Snapshot (const CCoalesceGeneScores &GeneScores, CCoalesceGroupHistograms &Histograms)
	Records the state of the cluster between convergence iterations.
bool	LabelMotifs (const CCoalesceMotifLibrary &Motifs, SMotifMatch::EType eMatchType, float dPenaltyGap, float dPenaltyMismatch, float dPValue)
	Labels all significant motifs in the cluster with any known TF motifs that match below the given threshhold.
bool	IsConverged ()
	Returns true if the cluster has converged to a previously visited state.
bool	IsEmpty () const
	Returns true if the cluster contains no genes or no conditions.
void	SetGenes (size_t iGenes)
	Replaces the cluster's genes with the requested number of initial genes.
const std::set< size_t > &	GetGenes () const
	Returns the set of genes significant in this cluster.
const std::set< size_t > &	GetDatasets () const
	Returns the set of datasets significant in this cluster.
const std::set< SMotifMatch > &	GetMotifs () const
	Returns the set of motifs significant in this cluster.
bool	IsGene (size_t iGene) const
	Returns true if the given gene is significant in this cluster.
bool	IsDataset (size_t iDataset) const
	Returns true if the given dataset is significant in this cluster.
void	RemoveGenes (const std::vector< size_t > &veciGenes)
	Removes the given genes from the cluster.

Detailed Description

Manages a single converging regulatory module for CCoalesce.

A COALESCE cluster represents a regulatory module consisting of genes, conditions where they're coexpressed, and putative regulatory motifs. The cluster object is responsible for both the data necessary to represent such a module (read and written using Open and Save) and the methods used to generate the module (primarily Initialize, SelectConditions, SelectMotifs, and SelecGenes). A variety of optimizations make the algorithm both more efficient and more complex, including discretization of various scores and change tracking between iterations.

Remarks:: A cluster can be generated during COALESCE execution, but it can also be read and written independently. Clusters generally need to be associated with the CPCL object from which they were first created, and may also interact with a CCoalesceMotifLibrary if they contain predicted motifs.

See also:: CCoalesce

Definition at line 48 of file coalescecluster.h.

Member Function Documentation

void Sleipnir::CCoalesceCluster::CalculateHistograms	(	const CCoalesceGeneScores &	GeneScores,
		CCoalesceGroupHistograms &	HistogramsCluster,
		CCoalesceGroupHistograms *	pHistogramsPot
	)		const

Updates the given motif score histograms based on the genes in the current cluster using the provided per-gene motif scores.

Parameters:

GeneScores	Per-gene motif scores from which score histograms are calculated.
HistogramsCluster	Motif score histograms based on genes in the current cluster.
pHistogramsPot	If non-null, motif score histograms based on genes not in the current cluster.

COALESCE finds significant motifs for each cluster by determining which motifs have statistically different distributions in the cluster versus the genomic background. To do this, one histogram of per-gene frequencies is constructed per motif; these histograms are in turn based on the frequencies with which each motif appears in each gene. For example, suppose we have three genes G1 through G3 and three motifs M1 through M3. Based each gene's sequence, we determine that the motifs have the following frequencies:

 Motif G1 G2 G3
 M1    1  1  2
 M2    2  0  2
 M3    1  2  0

If only genes G1 and G2 are in the cluster, we build a frequency histogram as follows:

 Motif 0  1  2
 M1    0  2  0
 M2    1  0  1
 M3    0  1  1

That is, in the cluster, there are zero genes within which M1 occurs zero times, two in which it occurs once, zero in which it occurs twice, and so forth. Each row (i.e. each motif's total histogram) must sum to the number of genes in the cluster, and the resulting inverse histograms for genes not in the cluster (i.e. G3) is:

 Motif 0  1  2
 M1    0  0  1
 M2    0  0  1
 M3    1  0  0

In COALESCE, additional complexity arises since motif frequencies are continuous (and must thus be discretized in the histograms) and are calculated on a per-subsequence-type basis (giving rise to multiple histograms per cluster).

See also:: Snapshot

Definition at line 266 of file coalescecluster.cpp.

References GetGenes().

Referenced by Sleipnir::CCoalesce::Cluster().

const std::set<size_t>& Sleipnir::CCoalesceCluster::GetDatasets ( ) const [inline]

Returns the set of datasets significant in this cluster.

Returns:: Set of datasets included in this cluster.

Definition at line 143 of file coalescecluster.h.

Referenced by Sleipnir::CCoalesce::Cluster().

const std::set<size_t>& Sleipnir::CCoalesceCluster::GetGenes ( ) const [inline]

Returns the set of genes significant in this cluster.

Returns:: Set of genes included in this cluster.

Reimplemented from Sleipnir::CCoalesceClusterImpl.

Definition at line 132 of file coalescecluster.h.

Referenced by CalculateHistograms(), Sleipnir::CCoalesce::Cluster(), GetSimilarity(), Save(), SelectConditions(), SelectGenes(), Snapshot(), and Subtract().

const std::set<SMotifMatch>& Sleipnir::CCoalesceCluster::GetMotifs ( ) const [inline]

Returns the set of motifs significant in this cluster.

Returns:: Set of motifs included in this cluster.

Definition at line 154 of file coalescecluster.h.

Referenced by Sleipnir::CCoalesce::Cluster(), and Save().

float Sleipnir::CCoalesceCluster::GetSimilarity	(	const CCoalesceCluster &	Cluster,
		size_t	iGenes,
		size_t	iDatasets
	)		const

Calculates a similarity score between the given and current clusters.

Parameters:

Cluster	Cluster to be compared to the current cluster.
iGenes	Total number of available genes.
iDatasets	Total number of available datasets.

Returns:: Similarity score for the two clusters (minimum zero, increasing indicates greater similarity).

Remarks:: Used by COALESCE to postprocess modules. Generally returns fraction of shared genes proportional to the smaller of the two clusters; Jaccard index can also work well.

Definition at line 1517 of file coalescecluster.cpp.

References GetGenes(), and IsGene().

bool Sleipnir::CCoalesceCluster::Initialize	(	const CPCL &	PCL,
		CCoalesceCluster &	Pot,
		const std::vector< SCoalesceDataset > &	vecsDatasets,
		std::set< std::pair< size_t, size_t > > &	setpriiSeeds,
		const std::vector< float > &	vecdSeed,
		size_t	iPairs,
		float	dPValue,
		float	dProbability,
		size_t	iThreads
	)

Randomly initializes a new cluster from the given PCL by selecting a pair of correlated genes and a surrounding seed of additional genes.

Parameters:

PCL	Expression data from which cluster is seeded.
Pot	Cluster initialized to the inverse of the current cluster, i.e. all genes not in the new cluster.
vecsDatasets	Vector of dataset block structure to be used by the new cluster for subsequent selection of significant conditions.
setpriiSeeds	Set of previously failed seed pairs to be excluded for cluster initialization.
vecdSeed	If non-empty, expression vector to be used as a cluster seed.
iPairs	Maximum number of gene pairs to be sampled for seed pair discovery.
dPValue	P-value threshhold for significant correlation.
dProbability	Prior probability of gene inclusion.
iThreads	Maximum number of simultaneous threads for gene pair sampling.

Returns:: True if the cluster was successfully initialized; false if no appropriate seed could be found.

Exceptions:

<exception class> Description of criteria for throwing this exception.

A new cluster is initialized by selecting the most highly correlated gene pair from a random sample of the input PCL that is below the significance threshhold. The expression centroid of this pair is then calculated, and all other genes significantly correlated with this centroid are subsequently added. Genes not selected for inclusion are added to the inverse cluster instead. All significance tests are appropriately Bonferroni corrected for multiple hypothesis testing.

Remarks:: Initial seed pair is the most highly correlated and significant gene pair sampled from the input PCL; other genes significantly correlated with the resulting centroid are added subsequently to initialize the cluster. The provided dataset blocks are used for all subsequent condition significance tests and covariance calculations in which the cluster is involved.

Definition at line 89 of file coalescecluster.cpp.

References Sleipnir::CPCL::ENormalizeColumn, Sleipnir::CPCL::GetGenes(), Sleipnir::CPCL::Normalize(), and Sleipnir::CPCL::Open().

Referenced by Sleipnir::CCoalesce::Cluster().

bool Sleipnir::CCoalesceCluster::IsConverged ( ) [inline]

Returns true if the cluster has converged to a previously visited state.

Returns:: True if the cluster is in a previously visited state.

Remarks:: States are a combination of genes, conditions, and motifs, which are hashed into a single integer. This has the potential to cause false collisions, which will at worst prematurely terminate cluster refinement.

Definition at line 93 of file coalescecluster.h.

Referenced by Sleipnir::CCoalesce::Cluster().

bool Sleipnir::CCoalesceCluster::IsDataset ( size_t iDataset ) const [inline]

Returns true if the given dataset is significant in this cluster.

Parameters:

iDataset Dataset index to be tested for inclusion.

Returns:: True if the given dataset is included in this cluster.

Definition at line 182 of file coalescecluster.h.

bool Sleipnir::CCoalesceCluster::IsEmpty ( ) const [inline]

Returns true if the cluster contains no genes or no conditions.

Returns:: True if the cluster contains no genes or no conditions.

Definition at line 104 of file coalescecluster.h.

Referenced by Sleipnir::CCoalesce::Cluster().

bool Sleipnir::CCoalesceCluster::IsGene ( size_t iGene ) const [inline]

Returns true if the given gene is significant in this cluster.

Parameters:

iGene Gene index to be tested for inclusion.

Returns:: True if the given gene is included in this cluster.

Reimplemented from Sleipnir::CCoalesceClusterImpl.

Definition at line 168 of file coalescecluster.h.

Referenced by GetSimilarity(), and Save().

bool Sleipnir::CCoalesceCluster::LabelMotifs	(	const CCoalesceMotifLibrary &	Motifs,
		SMotifMatch::EType	eMatchType,
		float	dPenaltyGap,
		float	dPenaltyMismatch,
		float	dPValue
	)

Labels all significant motifs in the cluster with any known TF motifs that match below the given threshhold.

Parameters:

Motifs	Motif library to be used for cluster and known motifs.
eMatchType	Type of match to be performed: correlation, rmse, etc.
dPenaltyGap	Alignment score penalty for gaps.
dPenaltyMismatch	Alignment score penalty for mismatches.
dPValue	P-value (or other score) threshhold below which known TFs must match.

Returns:: True if the labeling process succeeded (possibly with no matching labels), false otherwise.

Labels any significant motifs in the cluster with zero or more matching known TF binding motifs. These known labels are scored and will be read and written with subsequence Open/Save calls that include this cluster's motifs.

Remarks:: For each significant motif in the cluster, scans any known TF motifs in Motifs for matches based on one of several PWM matching algorithms (usually correlation p-value).

See also:: CCoalesceMotifLibrary::GetKnown

Definition at line 1596 of file coalescecluster.cpp.

References Sleipnir::CCoalesceMotifLibrary::GetKnowns().

size_t Sleipnir::CCoalesceCluster::Open	(	const std::string &	strPCL,
		size_t	iSkip,
		const CPCL &	PCL,
		CCoalesceMotifLibrary *	pMotifs = `NULL`
	)

Opens a cluster based on the given PCL file and, if present, accompanying motif file.

Parameters:

strPCL	Description of parameter strPCL.
iSkip	Number of feature columns to skip between the gene IDs and first experimental column.
PCL	PCL with which gene and condition labels are associated.
pMotifs	If non-null, motif library with which cluster motifs are created.

Returns:: Number of successfully read cluster conditions, or -1 on failure.

Remarks:: Input PCL should be in the format generated by Save, i.e. only genes included in the bicluster, all conditions, included conditions sorted to the left and tagged with an initial *. For a given PCL file <filename>.pcl, motifs are read from <filename>_motifs.txt.

See also:: Save

Definition at line 1201 of file coalescecluster.cpp.

References Sleipnir::CPCL::GetExperiment(), Sleipnir::CPCL::GetExperiments(), Sleipnir::CPCL::GetExtension(), Sleipnir::CPCL::GetGene(), Sleipnir::CPCL::GetGenes(), and Sleipnir::CPCL::Open().

size_t Sleipnir::CCoalesceCluster::Open	(	std::istream &	istm,
		const CPCL &	PCL,
		CCoalesceMotifLibrary *	pMotifs = `NULL`
	)

Opens a cluster based on the textual representation (including genes, conditions, and motifs) in the given input stream.

Parameters:

istm	Input stream from which cluster is read.
PCL	PCL with which gene and condition labels are associated.
pMotifs	If non-null, motif library with which cluster motifs are created.

Returns:: True if cluster was opened successfully, false otherwise.

Remarks:: Opening can fail if the given stream contains gene or cluster IDs not present in the given PCL. Motifs will be ignored if pMotifs is null.

See also:: Save

Definition at line 1268 of file coalescecluster.cpp.

References Sleipnir::CPCL::GetExperiment(), Sleipnir::CPCL::GetGene(), Sleipnir::CCoalesceMotifLibrary::Open(), and Sleipnir::CMeta::Tokenize().

bool Sleipnir::CCoalesceCluster::Open	(	const CHierarchy &	Hierarchy,
		const std::vector< CCoalesceCluster > &	vecClusters,
		const std::vector< std::string > &	vecstrClusters,
		float	dFraction,
		float	dCutoff,
		size_t	iCutoff,
		CCoalesceMotifLibrary *	pMotifs = `NULL`
	)

Creates a new cluster by merging the one or more clusters in the given hierarchy.

Parameters:

Hierarchy	Hierarchy of cluster indices to be merged into the newly opened cluster.
vecClusters	Vector of preexisting clusters, a superset of those in the given hierarchy.
vecstrClusters	Vector of preexisting cluster IDs.
dFraction	Minimum fraction of input cluster in which a gene must appear to be included in the output cluster.
dCutoff	Edit distance threshhold beyond which merged motif alignments will be discarded.
iCutoff	Maximum number of input motifs which will be merged exactly; larger numbers of motifs will be merged heuristically.
pMotifs	If non-null, motif library by which input and output motifs are managed.

Returns:: True of the new cluster was created successfully, false otherwise.

Remarks:: Currently only genes are filtered by dFraction; all conditions and motifs in any input cluster will be included in the output cluster.

See also:: GetSimilarity | CClustHierarchical

Definition at line 1347 of file coalescecluster.cpp.

void Sleipnir::CCoalesceCluster::RemoveGenes ( const std::vector< size_t > & veciGenes ) [inline]

Removes the given genes from the cluster.

Parameters:

veciGenes Vector of gene IDs to be removed.

Definition at line 193 of file coalescecluster.h.

bool Sleipnir::CCoalesceCluster::Save	(	const std::string &	strDirectory,
		size_t	iID,
		const CPCL &	PCL,
		const CCoalesceMotifLibrary *	pMotifs = `NULL`
	)		const

Saves cluster in a pair of PCL and motif text files in the given directory.

Parameters:

strDirectory	Directory in which cluster files are generated.
iID	Integer ID assigned to the cluster.
PCL	PCL from which gene and condition labels are read.
pMotifs	If non-null, motif library with which cluster motifs are formatted.

Returns:: True if the cluster was saved successfully, false otherwise.

Saves the cluster as a pair of text files (PCL and motifs) in the given directory. Filenames are randomly generated in the format c<id>_<rand>.pcl and c<id>_<rand>_motifs.txt. The PCL file will contain only genes included in the bicluster along with all conditions; conditions in the bicluster will be sorted to the left and tagged with an initial *. Motif files are of the format:

 <type1>    <subtype1>  <score1>    <tf1>:<match1>|<tf2>:<match2>|...|<tfy>:<matchx>
 <pst1>
 <pwm1>
 <type2>    <subtype2>  <score2>    <tf1>:<match1>|<tf2>:<match2>|...|<tfy>:<matchx>
 <pst2>
 <pwm2>
 ...
 <typez>    <subtypez>  <scorez>    <tf1>:<match1>|<tf2>:<match2>|...|<tfy>:<matchx>
 <pstz>
 <pwmz>

See also:: Open

Definition at line 1033 of file coalescecluster.cpp.

References Sleipnir::CPCL::GetExperiment(), Sleipnir::CPCL::GetExperiments(), Sleipnir::CPCL::GetExtension(), Sleipnir::CPCL::GetGenes(), IsGene(), Sleipnir::CPCL::MaskGene(), Sleipnir::CPCL::Open(), Sleipnir::CPCL::Save(), and Sleipnir::CPCL::SetExperiment().

Referenced by Sleipnir::CCoalesce::Cluster().

void Sleipnir::CCoalesceCluster::Save	(	std::ostream &	ostm,
		size_t	iID,
		const CPCL &	PCL,
		const CCoalesceMotifLibrary *	pMotifs = `NULL`,
		float	dCutoffPWMs = `0`,
		float	dPenaltyGap = `0`,
		float	dPenaltyMismatch = `0`,
		bool	fNoRCs = `false`
	)		const

Saves a textual representation of the cluster (including genes, conditions, and motifs) to the given output stream.

Parameters:

ostm	Output stream to which cluster description is saved.
iID	Integer ID assigned to the cluster.
PCL	PCL from which gene and condition labels are read.
pMotifs	If non-null, motif library with which cluster motifs are formatted.
dCutoffPWMs	Minimum information threshhold (in bits) for a PWM to be saved.
dPenaltyGap	Alignment score penalty for gaps.
dPenaltyMismatch	Alignment score penalty for mismatches.
fNoRCs	If true, resolve the given motif into a single strand without reverse complements before saving PWM.

Saves cluster in the following tab-delimited textual format:

 Cluster    <id>
 Genes  <gene1> <gene2> ... <genew>
 Conditions <cond1> <cond2> ... <condx>
 Motifs
 <type1>    <subtype1>  <score1>    <tf1>:<match1>|<tf2>:<match2>|...|<tfy>:<matchy>
 <pst1>
 <pwm1>
 <type2>    <subtype2>  <score2>    <tf1>:<match1>|<tf2>:<match2>|...|<tfy>:<matchy>
 <pst2>
 <pwm2>
 ...
 <typez>    <subtypez>  <scorez>    <tf1>:<match1>|<tf2>:<match2>|...|<tfy>:<matchy>
 <pstz>
 <pwmz>

See also:: Open

Definition at line 1151 of file coalescecluster.cpp.

References Sleipnir::CPCL::GetExperiment(), Sleipnir::CPCL::GetGene(), GetGenes(), and GetMotifs().

bool Sleipnir::CCoalesceCluster::SelectConditions	(	const CPCL &	PCL,
		const CCoalesceCluster &	Pot,
		size_t	iThreads,
		float	dPValue,
		float	dZScore
	)

Performs feature selection to include significant expression conditions in a converging cluster.

Parameters:

PCL	Expression dataset from which significant datasets are selected.
Pot	Inverse of current cluster used for genomic background calculations.
iThreads	Maximum number of simultaneous threads for condition significance calculations.
dPValue	P-value threshhold for condition significance.
dZScore	Z-score effect size threshhold for condition significance.

Returns:: True if zero or more conditions were selected successfully, false otherwise.

Selects zero or more conditions in which the cluster's current gene set is differentially expressed. That is, in each selected condition, the average expression of genes in the cluster must differ from the genomic background with at least the given significance and effect size threshholds. If dataset blocks were given at cluster initialization time, all conditions in the block are added (or not) simultaneously using a multivariate significance test.

See also:: SelectMotifs | SelectGenes

Definition at line 489 of file coalescecluster.cpp.

References GetGenes().

Referenced by Sleipnir::CCoalesce::Cluster().

bool Sleipnir::CCoalesceCluster::SelectGenes	(	const CPCL &	PCL,
		const CCoalesceGeneScores &	GeneScores,
		const CCoalesceGroupHistograms &	HistsCluster,
		const CCoalesceGroupHistograms &	HistsPot,
		size_t	iMinimum,
		size_t	iThreads,
		CCoalesceCluster &	Pot,
		float	dProbability,
		const CCoalesceMotifLibrary *	pMotifs = `NULL`
	)

Add and remove (im)probable genes to a converging cluster using Bayesian integration based on the currently selected expression conditions and sequence motifs.

Parameters:

PCL	PCL from which genes are selected.
GeneScores	Per-gene motif scores used to determine each gene's probability based on motif frequencies.
HistsCluster	Precalculated histogram of motif frequencies using genes in the cluster.
HistsPot	Precalculated histogram of motif frequencies using genes not in the cluster.
iMinimum	Minimum number of genes for which data must be present for a given motif for it to contribute to gene probabilities.
iThreads	Maximum number of simultaneous threads for gene probability calculations.
Pot	Inverse of genes in the cluster; used to determine genomic background probabilities.
dProbability	Probability threshhold above which genes are included in the cluster.
pMotifs	If non-null, motif library managing significant sequence motifs.

Returns:: True if zero or more genes were included in the cluster, false otherwise.

Adds and removes genes to/from a converging cluster based on the currently selected significant expression conditions and sequence motifs. The probability of gene membership in the cluster given some data, P(g in C|D), is calculated using Bayes rule as P(D|g in C)P(g in C)/(P(D|g in C) + P(D|g notin C)). All data features are assumed to be independent, such that P(D|g in C) is a product over each significant condition/motif of A) the probability of the gene's expression value being drawn from the cluster's distribution of expression values or B) the probability of its frequency for some motif being drawn from the cluster's distribution of frequencies for that motif. Distributions are assumed to be normal, and frequencies are Laplace smoothed to avoid zeros. The prior P(g in C) is used to stabilize cluster convergence, such that it is equal to 1 if a gene was already included in the cluster in a previous iteration and equal to dProbability if the gene was not previously included.

Remarks:: A prior is also applied to the two main data types, expression conditions and sequence motifs. If X and M are the sets of all possible conditions and motifs, respectively, and CX and CM are the significant conditions and motifs in the cluster, then expression probabilities have prior |CX|/|C| and motifs have prior |CM|/|M|.

See also:: SelectConditions | SelectMotifs

Definition at line 755 of file coalescecluster.cpp.

References GetGenes(), and Sleipnir::CPCL::GetGenes().

Referenced by Sleipnir::CCoalesce::Cluster().

bool Sleipnir::CCoalesceCluster::SelectMotifs	(	const CCoalesceGroupHistograms &	HistsCluster,
		const CCoalesceGroupHistograms &	HistsPot,
		float	dPValue,
		float	dZScore,
		size_t	iMaxMotifs,
		size_t	iThreads,
		const CCoalesceMotifLibrary *	pMotifs = `NULL`
	)

Performs feature selection to include significant sequence motifs in a converging cluster.

Parameters:

HistsCluster	Precalculated histogram of motif frequencies using genes in the cluster.
HistsPot	Precalculated histogram of motif frequencies using genes not in the cluster.
dPValue	P-value threshhold for motif significance.
dZScore	Z-score effect size threshhold for motif significance.
iMaxMotifs	Maximum number of motifs associated with a cluster; if more motifs are present, no new selection is performed.
iThreads	Maximum number of simultaneous threads for motif significance calculations.
pMotifs	If non-null, motif library with which motifs are managed.

Returns:: True if zero or more significant motifs were selected; false otherwise.

Selects zero or more motifs differentially over- or under-enriched in the genes currently in a converging cluster. Motifs are selected on a per-sequence-subtype basis, so a motif may be enriched, for example, in an upstream but not downstream flank. All motifs managed by the given library are tested for significance, including all k-mers, reverse complements, and any currently constructed PSTs. Significance testing is performed by z-scoring the within versus without frequency histograms, and all p-values are Bonferroni corrected.

Remarks:: No motif selection can be performed if pMotifs is null; this will not generate an error, but only expression biclustering (i.e. conditions and genes) will be performed.

See also:: CalculateHistograms | SelectConditions | SelectGenes

Definition at line 588 of file coalescecluster.cpp.

Referenced by Sleipnir::CCoalesce::Cluster().

void Sleipnir::CCoalesceCluster::SetGenes ( size_t iGenes ) [inline]

Replaces the cluster's genes with the requested number of initial genes.

Parameters:

iGenes Number of genes to be included in the cluster.

Remarks:: Sets the cluster's gene set to 0 through iGenes - 1.

Definition at line 118 of file coalescecluster.h.

Referenced by Sleipnir::CCoalesce::Cluster().

void Sleipnir::CCoalesceCluster::Snapshot	(	const CCoalesceGeneScores &	GeneScores,
		CCoalesceGroupHistograms &	Histograms
	)

Records the state of the cluster between convergence iterations.

Parameters:

GeneScores	Per-gene sequence scores used to snapshot motif histograms.
Histograms	Motif histograms to be updated using the current gene scores.

Remarks:: Should be called at the start of each COALESCE iteration, after new histograms are calculated but before anything else.

See also:: IsConverged

Definition at line 1553 of file coalescecluster.cpp.

References GetGenes().

Referenced by Sleipnir::CCoalesce::Cluster().

void Sleipnir::CCoalesceCluster::Subtract	(	CPCL &	PCL,
		const CCoalesceCluster &	Pot
	)		const

Subtract the average expression value for each condition in the cluster from each gene in the cluster.

Parameters:

PCL	Expression matrix from which cluster averages are subtracted.
Pot	Inverse of genes in the cluster; used to determine the difference of the cluster's average from the existing per-condition average.

Remarks:: This effectively masks the average effect of each condition in the cluster from the contained genes' expression values so that the cluster won't be re-found later. Actually subtracts the difference between the cluster average and the overall average from each condition, since the overall average need not be zero.

Definition at line 301 of file coalescecluster.cpp.

References Sleipnir::CPCL::Get(), GetGenes(), and Sleipnir::CMeta::IsNaN().

Referenced by Sleipnir::CCoalesce::Cluster().

void Sleipnir::CCoalesceCluster::Subtract ( CCoalesceGeneScores & GeneScores ) const

Subtract the average frequency of each motif in the cluster from the score for that motif in each gene in the cluster.

Parameters:

GeneScores Per-gene motif frequency scores from which the cluster averages are subtracted.

Remarks:: This effectively masks the average effect of each motif in the cluster from the contained genes' sequences so that they won't be re-found later.

Definition at line 330 of file coalescecluster.cpp.

References GetGenes().

The documentation for this class was generated from the following files:

src/coalescecluster.h
src/coalescecluster.cpp

Public Member Functions

Detailed Description

Member Function Documentation