Sleipnir
Sleipnir::CCoalesceMotifLibrary Class Reference

Manages a set of kmer, reverse complement, and probabilistic suffix tree motifs for CCoalesce.

#include <coalescemotifs.h>

Inheritance diagram for Sleipnir::CCoalesceMotifLibrary:
Sleipnir::CCoalesceMotifLibraryImpl

Public Member Functions

 CCoalesceMotifLibrary (size_t iK)
 Initializes a new motif library based on kmers of the given length.
float GetMatch (const std::string &strSequence, uint32_t iMotif, size_t iOffset, SCoalesceModifierCache &sModifiers) const
 Calculates the length-normalized match strength of the given motif against the appropriate number of characters in the input sequence at the requested offset.
uint32_t Open (const std::string &strMotif)
 Returns a motif ID constructed from the given string representation.
bool OpenKnown (std::istream &istm)
 Opens a set of known TF motifs in the given text file input stream.
std::string GetPWM (uint32_t iMotif, float dCutoffPWMs, float dPenaltyGap, float dPenaltyMismatch, bool fNoRCs) const
 Returns a string encoding of the requested motif ID's PWM, with appropriate reverse complement resolution and low-information motif removal.
bool Simplify (uint32_t iMotif) const
 Simplifies the given PST motif ID.
bool GetKnown (uint32_t iMotif, SMotifMatch::EType eMatchType, float dPenaltyGap, float dPenaltyMismatch, std::vector< std::pair< std::string, float > > &vecprstrdKnown, float dPValue=1) const
 Retrieves all known TF motifs matching a given motif beyond a given threshold.
size_t GetKnowns () const
 Returns the number of known TF motifs.
std::string GetMotif (uint32_t iMotif) const
 Returns the string representation of the given motif ID.
uint32_t Merge (uint32_t iOne, uint32_t iTwo, float dCutoff, bool fAllowDuplicates)
 Returns a motif ID representing the merger of the two input motifs, which can be of any type.
uint32_t RemoveRCs (uint32_t iMotif, float dPenaltyGap, float dPenaltyMismatch)
 Returns a motif ID corresponding to the given ID with only one strand of reverse complements retained.
float Align (uint32_t iOne, uint32_t iTwo, float dCutoff)
 Returns an alignment edit distance score for the given two motif IDs.
size_t GetMotifs () const
 Returns the number of motifs currently being managed by the library.
size_t GetK () const
 Returns the underlying kmer length of the library.
bool GetMatches (const std::string &strKMer, std::vector< uint32_t > &veciMotifs) const
 Calculates the kmer and RC motifs that match a given string of length k.
void SetPenaltyGap (float dPenalty)
 Sets the alignment score penalty for gaps (insertions or deletions).
float GetPenaltyGap () const
 Gets the alignment score penalty for gaps (insertions or deletions).
void SetPenaltyMismatch (float dPenalty)
 Sets the alignment score penalty for mismatches.
float GetPenaltyMismatch () const
 Gets the alignment score penalty for mismatches.
const CPST * GetPST (uint32_t iMotif) const
 Returns the CPST corresponding to the given motif ID.

Static Public Member Functions

static bool Open (std::istream &istm, std::vector< SMotifMatch > &vecsMotifs, CCoalesceMotifLibrary *pMotifs=NULL)
 Retrieves a set of motifs from the given input text stream.
static std::string GetReverseComplement (const std::string &strKMer)
 Returns the reverse complement of the given sequence.

Detailed Description

Manages a set of kmer, reverse complement, and probabilistic suffix tree motifs for CCoalesce.

A motif library for some small integer k consists of all kmers of length k, all reverse complement pairs (RCs) over those kmers, and zero or more probabilistic suffix trees (PSTs) formed at runtime by merging kmers, RCs, and other PSTs. Each motif is represented by an atomic ID, which can be converted to and from a string representation by the library or matched against any given string. The library also provides services for merging kmers/RCs/PSTs by non-gapped edit distance comparisons. All such motifs are candidates for (under)enrichment in clusters found by CCoalesce.

Remarks:
PSTs in the library can be of any depth, regardless of k; merging overlapping kmers, for example, may form a valid PST of length greater than k.
See also:
CCoalesce

Definition at line 50 of file coalescemotifs.h.


Constructor & Destructor Documentation

Sleipnir::CCoalesceMotifLibrary::CCoalesceMotifLibrary ( size_t  iK) [inline]

Initializes a new motif library based on kmers of the given length.

Parameters:
iK  Length of kmers underlying the motif library.

Definition at line 76 of file coalescemotifs.h.


Member Function Documentation

float Sleipnir::CCoalesceMotifLibrary::Align ( uint32_t  iOne,
uint32_t  iTwo,
float  dCutoff 
) [inline]

Returns an alignment edit distance score for the given two motif IDs.

Parameters:
iOne  First motif ID to be aligned.
iTwo  Second motif ID to be aligned.
dCutoff  Edit distance threshold beyond which alignment will be discarded.
Returns:
Minimum edit distance for alignment of the two given motif IDs, or dCutoff if no better alignment is found.

Aligns the two given motifs and returns the minimum edit distance. K-mer motifs are aligned as simple strings. Reverse complement motifs are aligned in all four possible arrangements and the minimum edit distance returned. PSTs are similarly aligned in all possible configurations and the best scoring alignment returned.

Definition at line 272 of file coalescemotifs.h.

References GetMotif(), and GetPST().
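
The simplest case described above, aligning two k-mer motifs as plain strings, can be sketched as follows. This is a minimal illustration and not Sleipnir's implementation: the function name AlignUngappedSketch and the exact scoring rule (mismatch penalty per mismatched overlapping position, gap penalty per overhanging position) are assumptions based on the remark that alignments are ungapped and gaps are only incurred at the ends.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper, not part of Sleipnir: slides strTwo along strOne
// without internal gaps and returns the lowest score found, or dCutoff if
// no placement scores below it.
inline float AlignUngappedSketch(const std::string& strOne,
                                 const std::string& strTwo,
                                 float dPenaltyGap, float dPenaltyMismatch,
                                 float dCutoff) {
    const int iOne = static_cast<int>(strOne.size());
    const int iTwo = static_cast<int>(strTwo.size());
    float dBest = dCutoff;

    // iShift is the index in strOne at which strTwo[0] is placed.
    for (int iShift = -(iTwo - 1); iShift <= iOne - 1; ++iShift) {
        const int iBegin = std::max(0, iShift);
        const int iEnd = std::min(iOne, iShift + iTwo);
        const int iOverlap = iEnd - iBegin;

        // Count mismatches over the overlapping positions only.
        int iMismatches = 0;
        for (int i = iBegin; i < iEnd; ++i)
            if (strOne[i] != strTwo[i - iShift])
                ++iMismatches;

        // Every position outside the overlap is an overhang (assumed to be
        // charged the gap penalty).
        const int iOverhang = (iOne - iOverlap) + (iTwo - iOverlap);
        const float dScore =
            dPenaltyGap * iOverhang + dPenaltyMismatch * iMismatches;
        dBest = std::min(dBest, dScore);
    }
    return dBest;
}
```

Under these assumptions, a reverse complement motif could be handled by calling the helper on each of the four strand combinations and keeping the minimum, which mirrors the description above.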

size_t Sleipnir::CCoalesceMotifLibrary::GetK ( ) const [inline]

Returns the underlying kmer length of the library.

Returns:
Length of kmers underlying the motif library.

Definition at line 330 of file coalescemotifs.h.

bool Sleipnir::CCoalesceMotifLibrary::GetKnown ( uint32_t  iMotif,
SMotifMatch::EType  eMatchType,
float  dPenaltyGap,
float  dPenaltyMismatch,
std::vector< std::pair< std::string, float > > &  vecprstrdKnown,
float  dPValue = 1 
) const

Retrieves all known TF motifs matching a given motif beyond a given threshold.

Parameters:
iMotif  Motif ID to be matched against known TF motifs.
eMatchType  Type of match to be performed: correlation, rmse, etc.
dPenaltyGap  Alignment score penalty for gaps.
dPenaltyMismatch  Alignment score penalty for mismatches.
vecprstrdKnown  Output vector pairing known TF IDs with their match scores, which must be below dPValue.
dPValue  P-value (or other score) threshold below which known TFs must match.
Returns:
True if the retrieval succeeded (possibly with no matches), false otherwise.

Retrieves all known motifs matching a given novel motif below a given threshold. This is usually a Bonferroni-corrected p-value of correlation between the known and novel motif PWMs, but other measures can be used. For known motifs with multiple known PWMs, only the best matching PWM is used.

See also:
OpenKnown

Definition at line 705 of file coalescemotifs.cpp.

References Sleipnir::CFullMatrix< tType >::Get(), Sleipnir::CFullMatrix< tType >::GetColumns(), and Sleipnir::CFullMatrix< tType >::GetRows().

size_t Sleipnir::CCoalesceMotifLibrary::GetKnowns ( ) const [inline]

Returns the number of known TF motifs.

Returns:
Number of known TF motifs.
See also:
OpenKnown | GetKnown

Definition at line 98 of file coalescemotifs.h.

Referenced by Sleipnir::CCoalesceCluster::LabelMotifs().

float Sleipnir::CCoalesceMotifLibrary::GetMatch ( const std::string &  strSequence,
uint32_t  iMotif,
size_t  iOffset,
SCoalesceModifierCache &  sModifiers 
) const

Calculates the length-normalized match strength of the given motif against the appropriate number of characters in the input sequence at the requested offset.

Parameters:
strSequence  Sequence against which motif is matched.
iMotif  ID of motif to be matched.
iOffset  Zero-based offset within strSequence at which the match is performed.
sModifiers  A modifier cache containing any prior weights to be incorporated into the match.
Returns:
Length-normalized strength of motif match against the given sequence and offset.
Remarks:
iMotif must represent a valid motif for the current library, and iOffset must fall within strSequence, although motifs extending from a valid iOffset past the end of the sequence will be handled appropriately. Only PST motifs are currently supported, as there should never be any need to match non-PST motifs at runtime, but support for kmers and RCs could be added in a straightforward manner.
See also:
GetMatches

Definition at line 178 of file coalescemotifs.cpp.

References Sleipnir::CPST::GetDepth(), Sleipnir::CPST::GetMatch(), Sleipnir::CMeta::GetNaN(), and Sleipnir::CMeta::IsNaN().

bool Sleipnir::CCoalesceMotifLibrary::GetMatches ( const std::string &  strKMer,
std::vector< uint32_t > &  veciMotifs 
) const [inline]

Calculates the kmer and RC motifs that match a given string of length k.

Parameters:
strKMer  Kmer to be matched.
veciMotifs  List to which matching kmer/RC motif IDs are appended.
Returns:
True if match was successful (even with no hits); false otherwise.
Remarks:
Input strings containing non-canonical bases (letters outside of the alphabet {A, C, G, T}) will be ignored. Only kmer and RC motif IDs will be matched, regardless of the PSTs currently managed by the library.
See also:
GetMatch

Definition at line 355 of file coalescemotifs.h.

std::string Sleipnir::CCoalesceMotifLibrary::GetMotif ( uint32_t  iMotif) const [inline]

Returns the string representation of the given motif ID.

Parameters:
iMotif  Motif ID to be returned as a string.
Returns:
String representation of the requested motif.
Remarks:
Kmers are represented as strings of k characters; RCs are represented as two kmers delimited by a pipe (|); PSTs are represented as described in CPST.
See also:
CPST

Reimplemented from Sleipnir::CCoalesceMotifLibraryImpl.

Definition at line 119 of file coalescemotifs.h.

Referenced by Align(), and Merge().

size_t Sleipnir::CCoalesceMotifLibrary::GetMotifs ( ) const [inline]

Returns the number of motifs currently being managed by the library.

Returns:
Number of motifs (kmers, RCs, and PSTs) currently managed by the library.
Remarks:
For a given k, there will always be exactly 4^k kmers. For odd k, there will be exactly (4^k)/2 RCs; for even k, (4^k)/2 - (4^(k/2))/2. The number of PSTs will vary over the lifetime of the library as they are (optionally) created by merging existing motifs.

Definition at line 318 of file coalescemotifs.h.
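
The counts in the remark above follow from the fact that a k-mer equal to its own reverse complement (a palindrome) cannot form a pair, and such k-mers only exist for even k, where there are 4^(k/2) of them. A small sketch of the arithmetic, using hypothetical helper names that are not part of Sleipnir:

```cpp
#include <cstddef>
#include <cstdint>

// Number of distinct k-mers over {A, C, G, T}: 4^k. Valid for iK < 32.
inline std::uint64_t CountKMersSketch(std::size_t iK) {
    return std::uint64_t(1) << (2 * iK);
}

// Number of reverse complement pairs over those k-mers.
inline std::uint64_t CountRCsSketch(std::size_t iK) {
    const std::uint64_t iKMers = CountKMersSketch(iK);
    if (iK % 2)
        return iKMers / 2;  // odd k: no palindromes, every k-mer pairs up
    // Even k: the 4^(k/2) palindromic k-mers are their own reverse
    // complement and are excluded before pairing.
    const std::uint64_t iPalindromes = CountKMersSketch(iK / 2);
    return (iKMers - iPalindromes) / 2;
}
```

For k = 7 this gives 16384 k-mers and 8192 RCs; for k = 6 it gives 4096 k-mers and 2016 RCs.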

float Sleipnir::CCoalesceMotifLibrary::GetPenaltyGap ( ) const [inline]

Gets the alignment score penalty for gaps (insertions or deletions).

Returns:
Alignment score penalty for gaps.
Remarks:
Alignments are internally ungapped, so gap penalties are only incurred at the ends (i.e. by overhangs).
See also:
SetPenaltyGap | GetPenaltyMismatch | Merge

Definition at line 400 of file coalescemotifs.h.

float Sleipnir::CCoalesceMotifLibrary::GetPenaltyMismatch ( ) const [inline]

Gets the alignment score penalty for mismatches.

Returns:
Alignment score penalty for mismatches.
See also:
SetPenaltyMismatch | GetPenaltyGap | Merge

Definition at line 428 of file coalescemotifs.h.

const CPST* Sleipnir::CCoalesceMotifLibrary::GetPST ( uint32_t  iMotif) const [inline]

Returns the CPST corresponding to the given motif ID.

Parameters:
iMotif  Motif ID of PST to retrieve.
Returns:
CPST corresponding to the given motif ID, or null if none.

Reimplemented from Sleipnir::CCoalesceMotifLibraryImpl.

Definition at line 442 of file coalescemotifs.h.

Referenced by Align(), Merge(), and RemoveRCs().

string Sleipnir::CCoalesceMotifLibrary::GetPWM ( uint32_t  iMotif,
float  dCutoffPWMs,
float  dPenaltyGap,
float  dPenaltyMismatch,
bool  fNoRCs 
) const

Returns a string encoding of the requested motif ID's PWM, with appropriate reverse complement resolution and low-information motif removal.

Parameters:
iMotif  Motif ID to be encoded.
dCutoffPWMs  Minimum information threshold (in bits) for a PWM to be returned.
dPenaltyGap  Alignment score penalty for gaps.
dPenaltyMismatch  Alignment score penalty for mismatches.
fNoRCs  If true, resolve the given motif into a single strand without reverse complements before generating PWM.
Returns:
String encoding of the requested motif (tab delimited, one base per line, one position per column), or an empty string if the given thresholds are not met.
See also:
RemoveRCs

Definition at line 553 of file coalescemotifs.cpp.

References Sleipnir::CFullMatrix< tType >::Get(), Sleipnir::CFullMatrix< tType >::GetColumns(), and Sleipnir::CFullMatrix< tType >::GetRows().
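
To make the dCutoffPWMs parameter concrete, the sketch below computes the information content of a PWM in bits using the conventional definition against a uniform background (2 bits minus the Shannon entropy at each position, summed over positions). Whether GetPWM uses exactly this measure, or applies it per position or to the whole motif, is not stated here, so treat both the formula and the helper name PWMInformationBitsSketch as assumptions.

```cpp
#include <array>
#include <cmath>
#include <vector>

// Hypothetical helper, not part of Sleipnir. Each row holds the
// probabilities of A, C, G, T at one motif position.
inline float PWMInformationBitsSketch(
    const std::vector<std::array<float, 4> >& vecPWM) {
    float dBits = 0;
    for (const auto& adPosition : vecPWM) {
        // Shannon entropy of this position; zero entries contribute nothing.
        float dEntropy = 0;
        for (float dP : adPosition)
            if (dP > 0)
                dEntropy -= dP * std::log2(dP);
        // A uniform background over four bases carries 2 bits.
        dBits += 2 - dEntropy;
    }
    return dBits;
}
```

With this definition, a fully conserved position contributes 2 bits and a uniform position contributes 0, so a low total indicates a low-information motif of the kind the cutoff is meant to remove.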

static std::string Sleipnir::CCoalesceMotifLibrary::GetReverseComplement ( const std::string &  strKMer) [inline, static]

Returns the reverse complement of the given sequence.

Parameters:
strKMer  K-mer sequence to be reverse complemented.
Returns:
Reverse complement of the given sequence.

Reimplemented from Sleipnir::CCoalesceMotifLibraryImpl.

Definition at line 65 of file coalescemotifs.h.

Referenced by Sleipnir::CPST::RemoveRCs().
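
For readers unfamiliar with the operation, a reverse complement reverses the sequence and swaps A with T and C with G. The standalone sketch below illustrates this; the function names are hypothetical, and passing non-canonical characters through unchanged is an assumption rather than documented Sleipnir behavior.

```cpp
#include <algorithm>
#include <string>

// Complement of a single base. Characters outside {A, C, G, T} are
// returned unchanged (an assumption made for this sketch).
inline char ComplementBaseSketch(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return c;
    }
}

// Reverse the sequence, then complement every base.
inline std::string ReverseComplementSketch(const std::string& strKMer) {
    std::string strRet(strKMer.rbegin(), strKMer.rend());
    std::transform(strRet.begin(), strRet.end(), strRet.begin(),
                   ComplementBaseSketch);
    return strRet;
}
```

For example, the reverse complement of GATA is TATC, while ACGT is its own reverse complement.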

uint32_t Sleipnir::CCoalesceMotifLibrary::Merge ( uint32_t  iOne,
uint32_t  iTwo,
float  dCutoff,
bool  fAllowDuplicates 
) [inline]

Returns a motif ID representing the merger of the two input motifs, which can be of any type.

Parameters:
iOne  ID of first motif to be merged.
iTwo  ID of second motif to be merged.
dCutoff  Maximum edit distance threshold for successful merging.
fAllowDuplicates  If true, duplicate merges will be handled correctly and an ID returned; otherwise -1 is returned in such cases.
Returns:
-1 if the two motifs cannot be merged or have already been merged; the ID of the merged motif otherwise, which will always be a PST.
Remarks:
If the two input motifs are successfully merged, the resulting motif will always be a newly created PST to be managed by the library. Minimum edit distances between the two input motifs are calculated using standard ungapped alignments and the current scoring penalties, with the minimum of all possible alignments used to score e.g. two input PSTs.
See also:
SetPenaltyGap | SetPenaltyMismatch

Definition at line 153 of file coalescemotifs.h.

References GetMotif(), and GetPST().

bool Sleipnir::CCoalesceMotifLibrary::Open ( std::istream &  istm,
std::vector< SMotifMatch > &  vecsMotifs,
CCoalesceMotifLibrary *  pMotifs = NULL 
) [static]

Retrieves a set of motifs from the given input text stream.

Parameters:
istm  Input text stream from which motifs are loaded.
vecsMotifs  Output set of motifs loaded from the given stream.
pMotifs  If non-null, motif library used to construct motifs from the given stream.
Returns:
True if motifs were successfully loaded, false otherwise.

Opens motifs in the given input stream, starting at its current position and stopping once non-motif data is encountered.

Remarks:
If pMotifs is null, motifs in the given stream will be skipped but not saved.
See also:
CCoalesceCluster::Open

Definition at line 118 of file coalescemotifs.cpp.

Referenced by Sleipnir::CCoalesceCluster::Open().

uint32_t Sleipnir::CCoalesceMotifLibrary::Open ( const std::string &  strMotif)

Returns a motif ID constructed from the given string representation.

Parameters:
strMotif  String representation of the desired motif ID.
Returns:
Motif ID corresponding to the given string representation.
See also:
GetMotif

Definition at line 265 of file coalescemotifs.cpp.

References Sleipnir::CPST::Open().

bool Sleipnir::CCoalesceMotifLibrary::OpenKnown ( std::istream &  istm)

Opens a set of known TF motifs in the given text file input stream.

Parameters:
istm  Input stream from which known TF motifs are read.
Returns:
True if known motifs were opened successfully.

Opens a set of known TF consensus binding sequences stored as PWMs in a text file. Each line of the file should be tab-delimited, with the first column containing an arbitrary TF ID and the remaining 4n columns containing PWM entries for the n bases of the TF's motif. TF to PWM mappings can be one-to-many, i.e. a TF can have multiple known consensus binding sequences on different lines. PWMs are stored as continuously valued per-base probabilities in ACGT order, such that one TF line might be: GATA 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0.

See also:
GetKnown | GetKnowns

Definition at line 651 of file coalescemotifs.cpp.

References Sleipnir::CMeta::Tokenize().
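
The line format described above can be illustrated with a small standalone parser. This is a sketch under stated assumptions, not the code in coalescemotifs.cpp: the struct SKnownSketch and both function names are hypothetical, and the validation rules (non-empty ID, a non-zero multiple of four numeric columns) are inferred from the format description.

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for one parsed line.
struct SKnownSketch {
    std::string strID;                          // arbitrary TF ID (column 1)
    std::vector<std::array<float, 4> > vecPWM;  // one row per position, ACGT
};

// Parses "ID<TAB>p1<TAB>p2..." with 4n probability columns.
inline bool ParseKnownLineSketch(const std::string& strLine,
                                 SKnownSketch& sKnown) {
    std::istringstream issm(strLine);
    std::string strToken;
    if (!std::getline(issm, strToken, '\t') || strToken.empty())
        return false;
    sKnown.strID = strToken;
    sKnown.vecPWM.clear();

    std::vector<float> vecdValues;
    while (std::getline(issm, strToken, '\t')) {
        try {
            std::size_t iUsed = 0;
            const float d = std::stof(strToken, &iUsed);
            if (iUsed != strToken.size())
                return false;  // trailing non-numeric characters
            vecdValues.push_back(d);
        } catch (const std::exception&) {
            return false;      // not a number at all
        }
    }
    if (vecdValues.empty() || (vecdValues.size() % 4))
        return false;
    for (std::size_t i = 0; i < vecdValues.size(); i += 4) {
        const std::array<float, 4> adRow = {{vecdValues[i], vecdValues[i + 1],
                                             vecdValues[i + 2],
                                             vecdValues[i + 3]}};
        sKnown.vecPWM.push_back(adRow);
    }
    return true;
}

// Most probable base at each position (ties resolved in ACGT order).
inline std::string ConsensusSketch(const SKnownSketch& sKnown) {
    static const char acBases[] = "ACGT";
    std::string strRet;
    for (const auto& adRow : sKnown.vecPWM) {
        std::size_t iBest = 0;
        for (std::size_t i = 1; i < 4; ++i)
            if (adRow[i] > adRow[iBest])
                iBest = i;
        strRet += acBases[iBest];
    }
    return strRet;
}
```

Applied to the example line above, the parser yields the ID GATA with four positions, and the consensus of that PWM is again GATA, which confirms the ACGT column order.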

uint32_t Sleipnir::CCoalesceMotifLibrary::RemoveRCs ( uint32_t  iMotif,
float  dPenaltyGap,
float  dPenaltyMismatch 
) [inline]

Returns a motif ID corresponding to the given ID with only one strand of reverse complements retained.

Parameters:
iMotif  Motif ID from which reverse complements should be removed.
dPenaltyGap  Alignment score penalty for gaps.
dPenaltyMismatch  Alignment score penalty for mismatches.
Returns:
Motif ID corresponding to a single strand of the given ID.

Generates a motif ID corresponding to a single strand of the given ID's motif. For k-mer motif IDs, this does nothing. For reverse complement IDs, a k-mer ID is returned corresponding to one of the two strands. For PST IDs, a new PST is constructed using CPST::RemoveRCs.

See also:
CPST::RemoveRCs

Definition at line 238 of file coalescemotifs.h.

References GetPST().

void Sleipnir::CCoalesceMotifLibrary::SetPenaltyGap ( float  dPenalty) [inline]

Sets the alignment score penalty for gaps (insertions or deletions).

Parameters:
dPenalty  Alignment score penalty for gaps.
Remarks:
Alignments are internally ungapped, so gap penalties are only incurred at the ends (i.e. by overhangs).
See also:
GetPenaltyGap | SetPenaltyMismatch | Merge

Definition at line 383 of file coalescemotifs.h.

void Sleipnir::CCoalesceMotifLibrary::SetPenaltyMismatch ( float  dPenalty) [inline]

Sets the alignment score penalty for mismatches.

Parameters:
dPenalty  Alignment score penalty for mismatches.
See also:
GetPenaltyMismatch | SetPenaltyGap | Merge

Definition at line 414 of file coalescemotifs.h.

bool Sleipnir::CCoalesceMotifLibrary::Simplify ( uint32_t  iMotif) const

Simplifies the given PST motif ID.

Parameters:
iMotif  ID of the PST motif to be simplified.
Returns:
True if the given motif ID represents a PST and has been successfully simplified.
See also:
CPST::Simplify

Definition at line 626 of file coalescemotifs.cpp.


The documentation for this class was generated from the following files: