Sleipnir
Performs regulatory module prediction (gene expression biclustering plus de novo sequence motif discovery) using the COALESCE algorithm of Huttenhower et al. 2009.
#include <coalesce.h>
Public Member Functions | |
bool | Cluster (const CPCL &PCL, const CFASTA &FASTA, std::vector< CCoalesceCluster > &vecClusters) |
Executes the COALESCE regulatory module prediction algorithm on the given gene expression (and, optionally, sequence) data. | |
void | SetSeed (const CPCL &PCL) |
Explicitly sets the expression profile used to seed the first module. | |
void | SetPValueCorrelation (float dPValue) |
Sets the correlation p-value threshold for genes to be included in a cluster during initialization. | |
float | GetPValueCorrelation () const |
Returns the correlation p-value threshold for genes to be included in a cluster during initialization. | |
void | SetBins (size_t iBins) |
Sets the number of discretization bins used for calculating motif frequency histograms. | |
size_t | GetBins () const |
Returns the number of discretization bins used for calculating motif frequency histograms. | |
float | GetZScoreCondition () const |
Returns the z-score effect size threshold for including significant expression conditions in a cluster. | |
void | SetZScoreCondition (float dZScore) |
Sets the z-score effect size threshold for including significant expression conditions in a cluster. | |
float | GetPValueCondition () const |
Returns the p-value threshold for including significant expression conditions in a cluster. | |
void | SetPValueCondition (float dPValue) |
Sets the p-value threshold for including significant expression conditions in a cluster. | |
float | GetZScoreMotif () const |
Returns the z-score effect size threshold for including significant sequence motifs in a cluster. | |
void | SetZScoreMotif (float dZScore) |
Sets the z-score effect size threshold for including significant sequence motifs in a cluster. | |
float | GetPValueMotif () const |
Returns the p-value threshold for including significant sequence motifs in a cluster. | |
void | SetPValueMotif (float dPValue) |
Sets the p-value threshold for including significant sequence motifs in a cluster. | |
float | GetProbabilityGene () const |
Returns the probability threshold for including genes in a cluster. | |
void | SetProbabilityGene (float dProbability) |
Sets the probability threshold for including genes in a cluster. | |
bool | IsDirectoryIntermediate () const |
Returns true if a module output directory has been set. | |
const std::string & | GetDirectoryIntermediate () const |
Returns the output directory for predicted modules. | |
void | SetDirectoryIntermediate (const std::string &strDirectoryIntermediate) |
Sets the output directory for predicted modules. | |
void | SetMotifs (CCoalesceMotifLibrary &Motifs) |
Sets the motif library used to manage gene sequences and motifs. | |
const CCoalesceMotifLibrary * | GetMotifs () const |
Returns the motif library used to manage gene sequences and motifs. | |
size_t | GetK () const |
Returns the length of k-mer motifs. | |
void | SetK (size_t iK) |
Sets the length of k-mer motifs. | |
size_t | GetBasesPerMatch () const |
Returns the granularity in base pairs with which motif frequency histograms are calculated. | |
void | SetBasesPerMatch (size_t iBasesPerMatch) |
Sets the granularity in base pairs with which motif frequency histograms are calculated. | |
float | GetPValueMerge () const |
Returns the p-value threshold at which motifs are merged to build PSTs. | |
void | SetPValueMerge (float dPValue) |
Sets the p-value threshold at which motifs are merged to build PSTs. | |
float | GetCutoffMerge () const |
Returns the edit distance threshold at which motifs are merged to build PSTs. | |
void | SetCutoffMerge (float dCutoff) |
Sets the edit distance threshold at which motifs are merged to build PSTs. | |
size_t | GetSizeMinimum () const |
Returns the minimum number of genes that must be present in a successful module. | |
void | SetSizeMinimum (size_t iSizeGenes) |
Sets the minimum number of genes that must be present in a successful module. | |
size_t | GetSizeMaximum () const |
Returns the maximum number of motifs that may be associated with a converging module. | |
void | SetSizeMaximum (size_t iSizeMotifs) |
Sets the maximum number of motifs that may be associated with a converging module. | |
size_t | GetSizeMerge () const |
Returns the maximum number of motifs that are considered for merging into PSTs during module convergence. | |
void | SetSizeMerge (size_t iSizeMerge) |
Sets the maximum number of motifs that are considered for merging into PSTs during module convergence. | |
void | ClearDatasets () |
Removes all currently set dataset blocks. | |
bool | AddDataset (const std::set< size_t > &setiDataset) |
Adds a block of conditions known to form a non-independent dataset. | |
void | SetNumberCorrelation (size_t iPairs) |
Sets the maximum number of gene pairs subsampled for seed pair discovery during module initialization. | |
size_t | GetNumberCorrelation () const |
Returns the maximum number of gene pairs subsampled for seed pair discovery during module initialization. | |
void | SetThreads (size_t iThreads) |
Sets the maximum number of simultaneous threads used for clustering. | |
size_t | GetThreads () const |
Returns the maximum number of simultaneous threads used for clustering. | |
void | AddWiggle (const CFASTA &FASTA) |
Adds a wiggle track of supporting data to be used to weight sequence information. | |
void | ClearWiggles () |
Removes all currently active wiggle tracks. | |
void | AddOutputIntermediate (std::ostream &ostm) |
Adds an output stream to which module information is printed after convergence. | |
void | RemoveOutputIntermediate (std::ostream &ostm) |
Removes an output stream to which module information was printed after convergence. | |
void | ClearOutputIntermediate () |
Removes all currently active intermediate output streams. | |
void | SetNormalize (bool fNormalize) |
Sets the normalization behavior for automatically detected single channel expression conditions. | |
bool | GetNormalize () const |
Returns true if automatic detection and normalization of single channel expression data is enabled. | |
void | ClearSeed () |
Removes any currently set seed expression profile. |
Performs regulatory module prediction (gene expression biclustering plus de novo sequence motif discovery) using the COALESCE algorithm of Huttenhower et al. 2009.
The COALESCE algorithm consumes gene expression data and, optionally, DNA sequences to predict regulatory modules. Each module consists of an expression bicluster (a subset of genes and conditions) and putative regulatory motifs. COALESCE predicts modules serially, seeding each module with a small number of correlated genes. It then iterates between feature selection and Bayesian integration of the selected features to determine which genes should be in the module.
Feature selection chooses expression conditions in which the cluster's genes are differentially expressed (i.e. significantly different from the genomic background) and sequence motifs over- or under-enriched in sequences associated with the cluster's genes (also relative to the genomic background). Bayesian integration assumes that these features are independent (although prior knowledge of non-independent datasets can be provided and used to incorporate covariance information) and calculates the probability with which each gene in the genome is included in the developing module. These two steps are iterated until the module has converged, at which point its average values (expression and motif frequencies) are subtracted from its genes' data, and COALESCE continues with the next module.
A variety of options and data can be used to modify this procedure, both at the level of the algorithm itself (e.g. the probability threshold at which genes are included in a module) and at the level of implementation optimizations (e.g. the granularity with which motif frequencies are discretized).
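The alternation between feature selection and Bayesian integration can be illustrated with a toy, expression-only sketch. This is not the Sleipnir implementation: `ToyModule`, `GrowToyModule`, the flat prior, and the shared-variance Gaussian likelihoods are assumptions made for this example, and motif selection, dataset covariance, and the final subtraction step are omitted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Toy expression matrix: rows are genes, columns are conditions.
using Matrix = std::vector<std::vector<double>>;

// Hypothetical result type: a bicluster of gene and condition indices.
struct ToyModule {
    std::set<std::size_t> genes;
    std::set<std::size_t> conditions;
};

// Hypothetical helper (not part of Sleipnir): grows one module from a seed.
ToyModule GrowToyModule(const Matrix& data, std::set<std::size_t> seed,
                        double zThreshold, double probThreshold,
                        std::size_t maxIterations = 50) {
    assert(!data.empty() && !seed.empty());
    const std::size_t nGenes = data.size(), nConds = data[0].size();

    // Genomic background: per-condition mean and standard deviation.
    std::vector<double> bgMean(nConds, 0.0), bgSd(nConds, 0.0);
    for (std::size_t c = 0; c < nConds; ++c) {
        for (std::size_t g = 0; g < nGenes; ++g) bgMean[c] += data[g][c];
        bgMean[c] /= static_cast<double>(nGenes);
        for (std::size_t g = 0; g < nGenes; ++g) {
            const double d = data[g][c] - bgMean[c];
            bgSd[c] += d * d;
        }
        bgSd[c] = std::sqrt(bgSd[c] / static_cast<double>(nGenes));
    }

    ToyModule module;
    module.genes = std::move(seed);
    for (std::size_t iter = 0;
         iter < maxIterations && !module.genes.empty(); ++iter) {
        // Step 1, feature selection: keep conditions whose in-module mean
        // differs from the background by at least zThreshold (effect size).
        std::set<std::size_t> conds;
        std::vector<double> inMean(nConds, 0.0);
        for (std::size_t c = 0; c < nConds; ++c) {
            if (bgSd[c] == 0.0) continue;  // constant condition, uninformative
            for (std::size_t g : module.genes) inMean[c] += data[g][c];
            inMean[c] /= static_cast<double>(module.genes.size());
            if (std::fabs(inMean[c] - bgMean[c]) / bgSd[c] >= zThreshold)
                conds.insert(c);
        }

        // Step 2, naive Bayesian integration: treat selected conditions as
        // independent and sum their Gaussian log-likelihood ratios
        // (assumption: flat prior, background variance shared by both classes).
        std::set<std::size_t> next;
        for (std::size_t g = 0; g < nGenes; ++g) {
            double logOdds = 0.0;
            for (std::size_t c : conds) {
                const double dIn = data[g][c] - inMean[c];
                const double dBg = data[g][c] - bgMean[c];
                logOdds += (dBg * dBg - dIn * dIn) / (2.0 * bgSd[c] * bgSd[c]);
            }
            const double posterior = 1.0 / (1.0 + std::exp(-logOdds));
            if (posterior > probThreshold) next.insert(g);
        }

        // Converged when neither the genes nor the conditions changed.
        const bool converged =
            next == module.genes && conds == module.conditions;
        module.genes = std::move(next);
        module.conditions = std::move(conds);
        if (converged) break;
    }
    return module;
}
```

In this sketch the two thresholds play roles loosely analogous to the values set through SetZScoreCondition() and SetProbabilityGene(); the real algorithm additionally applies p-value thresholds and motif features.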
Definition at line 61 of file coalesce.h.
bool Sleipnir::CCoalesce::AddDataset | ( | const std::set< size_t > & | setiDataset | ) | [inline] |
Adds a block of conditions known to form a non-independent dataset.
setiDataset | Set of condition indices forming a dataset. |
Adds a dataset block to subsequent executions of COALESCE. A dataset block consists of two or more expression conditions known to be non-independent, e.g. multiple conditions belonging to the same time course. Such dataset blocks are treated as units for inclusion in/exclusion from predicted modules, and their covariance is determined and incorporated into significance calculations for differential expression.
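As a small illustration of what a dataset block looks like in code, the sketch below builds the `std::set<size_t>` of condition indices for a time course stored in adjacent PCL columns. `MakeContiguousBlock` is a hypothetical helper written for this example and is not part of Sleipnir; the assumption that the conditions are contiguous is also only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical helper (not part of Sleipnir): collects the contiguous
// condition indices [first, first + count) into one dataset block.
std::set<std::size_t> MakeContiguousBlock(std::size_t first,
                                          std::size_t count) {
    // Per the description above, a block has two or more conditions.
    assert(count >= 2);
    std::set<std::size_t> block;
    for (std::size_t i = 0; i < count; ++i)
        block.insert(first + i);
    return block;
}
```

A block built this way has the type that AddDataset() accepts, so a five-point time course starting at condition 3 would be passed as the result of `MakeContiguousBlock(3, 5)`.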
Definition at line 573 of file coalesce.h.
void Sleipnir::CCoalesce::AddOutputIntermediate | ( | std::ostream & | ostm | ) | [inline] |
Adds an output stream to which module information is printed after convergence.
ostm | Output stream to which each module will be printed after it converges. |
Definition at line 695 of file coalesce.h.
void Sleipnir::CCoalesce::AddWiggle | ( | const CFASTA & | FASTA | ) | [inline] |
Adds a wiggle track of supporting data to be used to weight sequence information.
FASTA | FASTA file containing pseudo-wiggle-track formatted per-base weights for gene sequences. |
Adds a wiggle track of supporting information used to weight gene sequence positions during COALESCE clustering. A wiggle track as used by COALESCE is not precisely in the wiggle track format as defined by the ENCODE project; instead, it is a FASTA file in which sequence base pairs have been replaced by per-base-pair scores, one floating point value per line. In COALESCE, one or more wiggle tracks can be used to weight the individual base pairs used to determine motif occurrence and frequencies. Lower weights (down to zero) will downweight the base pairs at those positions (and thus the effective frequencies of any motifs that occur there), and higher weights will upweight them. In the absence of wiggle tracks, the default weight of all base pairs is one.
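The effect of per-base weights on motif frequencies can be sketched as follows. This is a minimal illustration, not COALESCE's actual weighting scheme: `WeightedKmerCounts` is a hypothetical helper, and the choice to weight each k-mer occurrence by the mean of its per-base weights is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not part of Sleipnir): weighted k-mer counts for one
// sequence. Assumption: an occurrence contributes the mean of the weights of
// the k bases it covers, so all-ones weights reproduce plain counting.
std::map<std::string, double> WeightedKmerCounts(
    const std::string& seq, const std::vector<double>& weights,
    std::size_t k) {
    assert(k > 0 && weights.size() == seq.size());
    std::map<std::string, double> counts;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < k; ++j) sum += weights[i + j];
        counts[seq.substr(i, k)] += sum / static_cast<double>(k);
    }
    return counts;
}
```

Under this assumption, a weight of zero over a region removes the motif occurrences there from the effective frequency, while a uniform weight of one leaves the counts unchanged, matching the default behavior described above.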
Definition at line 666 of file coalesce.h.
void Sleipnir::CCoalesce::ClearDatasets | ( | ) | [inline] |
Removes all currently set dataset blocks.
Definition at line 547 of file coalesce.h.
void Sleipnir::CCoalesce::ClearOutputIntermediate | ( | ) | [inline] |
Removes all currently active intermediate output streams.
Definition at line 725 of file coalesce.h.
void Sleipnir::CCoalesce::ClearSeed | ( | ) | [inline] |
Removes any currently set seed expression profile.
Definition at line 768 of file coalesce.h.
void Sleipnir::CCoalesce::ClearWiggles | ( | ) | [inline] |
Removes all currently active wiggle tracks.
Definition at line 677 of file coalesce.h.
bool Sleipnir::CCoalesce::Cluster | ( | const CPCL & | PCL, |
const CFASTA & | FASTA, | ||
std::vector< CCoalesceCluster > & | vecClusters | ||
) |
Executes the COALESCE regulatory module prediction algorithm on the given gene expression (and, optionally, sequence) data.
PCL | PCL file containing genes and expression values with which clustering is performed. |
FASTA | FASTA file (possibly empty) containing gene sequences used for motif prediction during clustering. |
vecClusters | Output vector of regulatory modules predicted by COALESCE. |
Executes the COALESCE algorithm on the given data, predicting zero or more regulatory modules (expression biclusters plus putative sequence motifs). Each predicted module consists of one or more genes, one or more conditions of the given PCL in which those genes are coregulated, and zero or more sequence motifs over- or under-enriched (and thus potentially causal) in the module's genes. For more details, see CCoalesce and Huttenhower et al. 2009.
Definition at line 539 of file coalesce.cpp.
References Sleipnir::CCoalesceCluster::CalculateHistograms(), GetBasesPerMatch(), GetBins(), GetCutoffMerge(), Sleipnir::CCoalesceCluster::GetDatasets(), GetDirectoryIntermediate(), Sleipnir::CPCL::GetExperiment(), Sleipnir::CPCL::GetExperiments(), Sleipnir::CPCL::GetGene(), Sleipnir::CCoalesceCluster::GetGenes(), Sleipnir::CPCL::GetGenes(), GetK(), Sleipnir::CCoalesceCluster::GetMotifs(), GetMotifs(), GetNormalize(), GetNumberCorrelation(), GetProbabilityGene(), GetPValueCondition(), GetPValueCorrelation(), GetPValueMerge(), GetPValueMotif(), GetSizeMaximum(), GetSizeMerge(), GetSizeMinimum(), GetThreads(), GetZScoreCondition(), GetZScoreMotif(), Sleipnir::CCoalesceCluster::Initialize(), Sleipnir::CCoalesceCluster::IsConverged(), IsDirectoryIntermediate(), Sleipnir::CCoalesceCluster::IsEmpty(), Sleipnir::CPCL::Open(), Sleipnir::CCoalesceCluster::Save(), Sleipnir::CCoalesceCluster::SelectConditions(), Sleipnir::CCoalesceCluster::SelectGenes(), Sleipnir::CCoalesceCluster::SelectMotifs(), Sleipnir::CCoalesceCluster::SetGenes(), Sleipnir::CCoalesceCluster::Snapshot(), and Sleipnir::CCoalesceCluster::Subtract().
size_t Sleipnir::CCoalesce::GetBasesPerMatch | ( | ) | const [inline] |
Returns the granularity in base pairs with which motif frequency histograms are calculated.
Definition at line 373 of file coalesce.h.
Referenced by Cluster().
size_t Sleipnir::CCoalesce::GetBins | ( | ) | const [inline] |
Returns the number of discretization bins used for calculating motif frequency histograms.
Definition at line 119 of file coalesce.h.
Referenced by Cluster().
float Sleipnir::CCoalesce::GetCutoffMerge | ( | ) | const [inline] |
Returns the edit distance threshold at which motifs are merged to build PSTs.
Definition at line 432 of file coalesce.h.
Referenced by Cluster().
const std::string& Sleipnir::CCoalesce::GetDirectoryIntermediate | ( | ) | const [inline] |
Returns the output directory for predicted modules.
Definition at line 287 of file coalesce.h.
Referenced by Cluster(), and IsDirectoryIntermediate().
size_t Sleipnir::CCoalesce::GetK | ( | ) | const [inline] |
Returns the length of k-mer motifs.
Definition at line 345 of file coalesce.h.
Referenced by Cluster().
const CCoalesceMotifLibrary* Sleipnir::CCoalesce::GetMotifs | ( | ) | const [inline] |
Returns the motif library used to manage gene sequences and motifs.
Definition at line 331 of file coalesce.h.
Referenced by Cluster().
bool Sleipnir::CCoalesce::GetNormalize | ( | ) | const [inline] |
Returns true if automatic detection and normalization of single channel expression data is enabled.
Definition at line 757 of file coalesce.h.
Referenced by Cluster().
size_t Sleipnir::CCoalesce::GetNumberCorrelation | ( | ) | const [inline] |
Returns the maximum number of gene pairs subsampled for seed pair discovery during module initialization.
Definition at line 611 of file coalesce.h.
Referenced by Cluster().
float Sleipnir::CCoalesce::GetProbabilityGene | ( | ) | const [inline] |
Returns the probability threshold for including genes in a cluster.
Definition at line 245 of file coalesce.h.
Referenced by Cluster().
float Sleipnir::CCoalesce::GetPValueCondition | ( | ) | const [inline] |
Returns the p-value threshold for including significant expression conditions in a cluster.
Definition at line 161 of file coalesce.h.
Referenced by Cluster().
float Sleipnir::CCoalesce::GetPValueCorrelation | ( | ) | const [inline] |
Returns the correlation p-value threshold for genes to be included in a cluster during initialization.
Definition at line 91 of file coalesce.h.
Referenced by Cluster().
float Sleipnir::CCoalesce::GetPValueMerge | ( | ) | const [inline] |
Returns the p-value threshold at which motifs are merged to build PSTs.
Definition at line 404 of file coalesce.h.
Referenced by Cluster().
float Sleipnir::CCoalesce::GetPValueMotif | ( | ) | const [inline] |
Returns the p-value threshold for including significant sequence motifs in a cluster.
Definition at line 217 of file coalesce.h.
Referenced by Cluster().
size_t Sleipnir::CCoalesce::GetSizeMaximum | ( | ) | const [inline] |
Returns the maximum number of motifs that may be associated with a converging module.
Definition at line 488 of file coalesce.h.
Referenced by Cluster().
size_t Sleipnir::CCoalesce::GetSizeMerge | ( | ) | const [inline] |
Returns the maximum number of motifs that are considered for merging into PSTs during module convergence.
Definition at line 519 of file coalesce.h.
Referenced by Cluster().
size_t Sleipnir::CCoalesce::GetSizeMinimum | ( | ) | const [inline] |
Returns the minimum number of genes that must be present in a successful module.
Definition at line 460 of file coalesce.h.
Referenced by Cluster().
size_t Sleipnir::CCoalesce::GetThreads | ( | ) | const [inline] |
Returns the maximum number of simultaneous threads used for clustering.
Definition at line 639 of file coalesce.h.
Referenced by Cluster().
float Sleipnir::CCoalesce::GetZScoreCondition | ( | ) | const [inline] |
Returns the z-score effect size threshold for including significant expression conditions in a cluster.
Definition at line 133 of file coalesce.h.
Referenced by Cluster().
float Sleipnir::CCoalesce::GetZScoreMotif | ( | ) | const [inline] |
Returns the z-score effect size threshold for including significant sequence motifs in a cluster.
Definition at line 189 of file coalesce.h.
Referenced by Cluster().
bool Sleipnir::CCoalesce::IsDirectoryIntermediate | ( | ) | const [inline] |
Returns true if a module output directory has been set.
Definition at line 273 of file coalesce.h.
References GetDirectoryIntermediate().
Referenced by Cluster().
void Sleipnir::CCoalesce::RemoveOutputIntermediate | ( | std::ostream & | ostm | ) | [inline] |
Removes an output stream to which module information was printed after convergence.
ostm | Output stream to which modules were to be printed. |
Definition at line 712 of file coalesce.h.
void Sleipnir::CCoalesce::SetBasesPerMatch | ( | size_t | iBasesPerMatch | ) | [inline] |
Sets the granularity in base pairs with which motif frequency histograms are calculated.
iBasesPerMatch | Number of base pairs per match used to calculate motif frequency histograms. |
Definition at line 390 of file coalesce.h.
void Sleipnir::CCoalesce::SetBins | ( | size_t | iBins | ) | [inline] |
Sets the number of discretization bins used for calculating motif frequency histograms.
iBins | Number of bins used to discretize motif frequencies. |
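To make the role of the bin count concrete, the sketch below discretizes a list of motif frequencies into a histogram. It is an illustration only: `FrequencyHistogram` is a hypothetical helper, and equal-width bins over a caller-supplied range are an assumption, not a description of how Sleipnir places its bin edges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Sleipnir): histogram of frequencies over
// `bins` equal-width bins spanning [0, maxFreq].
std::vector<std::size_t> FrequencyHistogram(const std::vector<double>& freqs,
                                            std::size_t bins,
                                            double maxFreq) {
    assert(bins > 0 && maxFreq > 0.0);
    std::vector<std::size_t> hist(bins, 0);
    for (double f : freqs) {
        // Clamp out-of-range values into the first or last bin.
        const double clamped = std::min(std::max(f, 0.0), maxFreq);
        // The top edge maps into the last bin rather than one past it.
        const std::size_t bin = std::min(
            bins - 1, static_cast<std::size_t>(
                          clamped / maxFreq * static_cast<double>(bins)));
        ++hist[bin];
    }
    return hist;
}
```

Fewer bins give coarser but cheaper histograms, which is the kind of trade-off the class description refers to as an implementation optimization.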
Definition at line 105 of file coalesce.h.
void Sleipnir::CCoalesce::SetCutoffMerge | ( | float | dCutoff | ) | [inline] |
Sets the edit distance threshold at which motifs are merged to build PSTs.
dCutoff | Edit distance threshold at which motifs are merged to build PSTs. |
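For readers unfamiliar with edit distance thresholds, the following sketch computes the classic unit-cost Levenshtein distance between two motif strings and compares it to a cutoff. It is not the distance Sleipnir uses: the `float` cutoff suggests a weighted distance, which this example does not attempt, and `EditDistance`, `ShouldMerge`, and the "merge when distance is at most the cutoff" rule are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Unit-cost Levenshtein distance using two rolling rows of the DP table.
std::size_t EditDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t substitute =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({substitute, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Hypothetical merge test (assumption: merge when distance <= cutoff).
bool ShouldMerge(const std::string& a, const std::string& b, float cutoff) {
    assert(cutoff >= 0.0f);
    return static_cast<float>(EditDistance(a, b)) <= cutoff;
}
```

With this rule, a higher cutoff merges more dissimilar motifs, and a cutoff of zero merges only identical ones.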
Definition at line 446 of file coalesce.h.
void Sleipnir::CCoalesce::SetDirectoryIntermediate | ( | const std::string & | strDirectoryIntermediate | ) | [inline] |
Sets the output directory for predicted modules.
strDirectoryIntermediate | Output directory in which predicted modules are saved. |
Definition at line 301 of file coalesce.h.
void Sleipnir::CCoalesce::SetK | ( | size_t | iK | ) | [inline] |
Sets the length of k-mer motifs.
iK | K-mer length of predicted motifs; also used as building blocks for more complex motifs. |
Definition at line 359 of file coalesce.h.
void Sleipnir::CCoalesce::SetMotifs | ( | CCoalesceMotifLibrary & | Motifs | ) | [inline] |
Sets the motif library used to manage gene sequences and motifs.
Motifs | Motif library used to manage gene sequences and motifs during clustering. |
Definition at line 315 of file coalesce.h.
void Sleipnir::CCoalesce::SetNormalize | ( | bool | fNormalize | ) | [inline] |
Sets the normalization behavior for automatically detected single channel expression conditions.
fNormalize | If true, single channel conditions are detected and normalized; otherwise, they are left unchanged. |
Definition at line 743 of file coalesce.h.
void Sleipnir::CCoalesce::SetNumberCorrelation | ( | size_t | iPairs | ) | [inline] |
Sets the maximum number of gene pairs subsampled for seed pair discovery during module initialization.
iPairs | Maximum number of gene pairs subsampled for module seeding. |
Definition at line 597 of file coalesce.h.
void Sleipnir::CCoalesce::SetProbabilityGene | ( | float | dProbability | ) | [inline] |
Sets the probability threshold for including genes in a cluster.
dProbability | Probability threshold for inclusion of genes in a cluster. |
Definition at line 259 of file coalesce.h.
void Sleipnir::CCoalesce::SetPValueCondition | ( | float | dPValue | ) | [inline] |
Sets the p-value threshold for including significant expression conditions in a cluster.
dPValue | P-value threshold for inclusion of expression conditions in a cluster. |
Definition at line 175 of file coalesce.h.
void Sleipnir::CCoalesce::SetPValueCorrelation | ( | float | dPValue | ) | [inline] |
Sets the correlation p-value threshold for genes to be included in a cluster during initialization.
dPValue | Correlation p-value threshold for gene inclusion during module initialization. |
Definition at line 77 of file coalesce.h.
void Sleipnir::CCoalesce::SetPValueMerge | ( | float | dPValue | ) | [inline] |
Sets the p-value threshold at which motifs are merged to build PSTs.
dPValue | P-value threshold at which motifs are merged to build PSTs. |
Definition at line 418 of file coalesce.h.
void Sleipnir::CCoalesce::SetPValueMotif | ( | float | dPValue | ) | [inline] |
Sets the p-value threshold for including significant sequence motifs in a cluster.
dPValue | P-value threshold for inclusion of motifs in a cluster. |
Definition at line 231 of file coalesce.h.
void Sleipnir::CCoalesce::SetSeed | ( | const CPCL & | PCL | ) |
Explicitly sets the expression profile used to seed the first module.
PCL | PCL from which the seed expression profile is read. |
Forces the first module to be seeded with the given expression profile rather than a randomly chosen significantly correlated gene pair.
Definition at line 313 of file coalesce.cpp.
References Sleipnir::CPCL::Get(), and Sleipnir::CPCL::GetExperiments().
void Sleipnir::CCoalesce::SetSizeMaximum | ( | size_t | iSizeMotifs | ) | [inline] |
Sets the maximum number of motifs that may be associated with a converging module.
iSizeMotifs | Maximum number of motifs associated with a converging module. |
Definition at line 505 of file coalesce.h.
void Sleipnir::CCoalesce::SetSizeMerge | ( | size_t | iSizeMerge | ) | [inline] |
Sets the maximum number of motifs that are considered for merging into PSTs during module convergence.
iSizeMerge | Maximum number of motifs considered for PST construction during module convergence. |
Definition at line 536 of file coalesce.h.
void Sleipnir::CCoalesce::SetSizeMinimum | ( | size_t | iSizeGenes | ) | [inline] |
Sets the minimum number of genes that must be present in a successful module.
iSizeGenes | Minimum number of genes present in a successful module. |
Definition at line 474 of file coalesce.h.
void Sleipnir::CCoalesce::SetThreads | ( | size_t | iThreads | ) | [inline] |
Sets the maximum number of simultaneous threads used for clustering.
iThreads | Maximum number of simultaneous threads used during clustering. |
Definition at line 625 of file coalesce.h.
void Sleipnir::CCoalesce::SetZScoreCondition | ( | float | dZScore | ) | [inline] |
Sets the z-score effect size threshold for including significant expression conditions in a cluster.
dZScore | Z-score threshold for inclusion of expression conditions in a cluster. |
Definition at line 147 of file coalesce.h.
void Sleipnir::CCoalesce::SetZScoreMotif | ( | float | dZScore | ) | [inline] |
Sets the z-score effect size threshold for including significant sequence motifs in a cluster.
dZScore | Z-score threshold for inclusion of motifs in a cluster. |
Definition at line 203 of file coalesce.h.