Sleipnir
Manages a set of kmer, reverse complement, and probabilistic suffix tree motifs for CCoalesce.
#include <coalescemotifs.h>
Public Member Functions | |
CCoalesceMotifLibrary (size_t iK) | |
Initializes a new motif library based on kmers of the given length. | |
float | GetMatch (const std::string &strSequence, uint32_t iMotif, size_t iOffset, SCoalesceModifierCache &sModifiers) const |
Calculates the length-normalized match strength of the given motif against the appropriate number of characters in the input sequence at the requested offset. | |
uint32_t | Open (const std::string &strMotif) |
Returns a motif ID constructed from the given string representation. | |
bool | OpenKnown (std::istream &istm) |
Opens a set of known TF motifs in the given text file input stream. | |
std::string | GetPWM (uint32_t iMotif, float dCutoffPWMs, float dPenaltyGap, float dPenaltyMismatch, bool fNoRCs) const |
Returns a string encoding of the requested motif ID's PWM, with appropriate reverse complement resolution and low-information motif removal. | |
bool | Simplify (uint32_t iMotif) const |
Simplifies the given PST motif ID. | |
bool | GetKnown (uint32_t iMotif, SMotifMatch::EType eMatchType, float dPenaltyGap, float dPenaltyMismatch, std::vector< std::pair< std::string, float > > &vecprstrdKnown, float dPValue=1) const |
Retrieves all known TF motifs matching a given motif beyond a given threshold. | |
size_t | GetKnowns () const |
Returns the number of known TF motifs. | |
std::string | GetMotif (uint32_t iMotif) const |
Returns the string representation of the given motif ID. | |
uint32_t | Merge (uint32_t iOne, uint32_t iTwo, float dCutoff, bool fAllowDuplicates) |
Returns a motif ID representing the merger of the two input motifs, which can be of any type. | |
uint32_t | RemoveRCs (uint32_t iMotif, float dPenaltyGap, float dPenaltyMismatch) |
Returns a motif ID corresponding to the given ID with only one strand of reverse complements retained. | |
float | Align (uint32_t iOne, uint32_t iTwo, float dCutoff) |
Returns an alignment edit distance score for the given two motif IDs. | |
size_t | GetMotifs () const |
Returns the number of motifs currently being managed by the library. | |
size_t | GetK () const |
Returns the underlying kmer length of the library. | |
bool | GetMatches (const std::string &strKMer, std::vector< uint32_t > &veciMotifs) const |
Calculates the kmer and RC motifs that match a given string of length k. | |
void | SetPenaltyGap (float dPenalty) |
Sets the alignment score penalty for gaps (insertions or deletions). | |
float | GetPenaltyGap () const |
Gets the alignment score penalty for gaps (insertions or deletions). | |
void | SetPenaltyMismatch (float dPenalty) |
Sets the alignment score penalty for mismatches. | |
float | GetPenaltyMismatch () const |
Gets the alignment score penalty for mismatches. | |
const CPST * | GetPST (uint32_t iMotif) const |
Returns the CPST corresponding to the given motif ID. | |
Static Public Member Functions | |
static bool | Open (std::istream &istm, std::vector< SMotifMatch > &vecsMotifs, CCoalesceMotifLibrary *pMotifs=NULL) |
Retrieves a set of motifs from the given input text stream. | |
static std::string | GetReverseComplement (const std::string &strKMer) |
Returns the reverse complement of the given sequence. |
A motif library for some small integer k consists of all kmers of length k, all reverse complement pairs (RCs) over those kmers, and zero or more probabilistic suffix trees (PSTs) formed at runtime by merging kmers, RCs, and other PSTs. Each motif is represented by an atomic ID, which can be converted to and from a string representation by the library or matched against any given string. The library also provides services for merging kmers/RCs/PSTs by non-gapped edit distance comparisons. All such motifs are candidates for (under)enrichment in clusters found by CCoalesce.
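To give a sense of how the fixed part of such a library grows with k, the following minimal sketch counts the kmers and reverse complement pairs over the alphabet {A, C, G, T}. `SLibrarySize` and `CountFixedMotifs` are hypothetical names introduced for illustration and are not part of Sleipnir. Whether the library assigns separate RC motifs to kmers that equal their own reverse complement is not stated here, so the sketch reports those separately.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of Sleipnir: sizes of the fixed (non-PST)
// part of a motif library over {A, C, G, T} for a given k.
struct SLibrarySize {
    uint64_t m_iKMers;        // all strings of length k: 4^k
    uint64_t m_iPalindromes;  // kmers equal to their own reverse complement
    uint64_t m_iRCPairs;      // unordered pairs {kmer, RC(kmer)} of distinct strands
};

inline SLibrarySize CountFixedMotifs(size_t iK) {
    assert(iK > 0 && iK < 32);  // keeps 4^k within a 64-bit integer

    SLibrarySize sRet;
    sRet.m_iKMers = uint64_t(1) << (2 * iK);
    // A kmer can equal its own reverse complement only when k is even, in
    // which case its first k/2 bases determine the rest: 4^(k/2) = 2^k.
    sRet.m_iPalindromes = (iK % 2) ? 0 : (uint64_t(1) << iK);
    sRet.m_iRCPairs = (sRet.m_iKMers - sRet.m_iPalindromes) / 2;
    return sRet;
}
```

PSTs are created at runtime by merging, so their number is not fixed by k and is not covered by this count.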
Definition at line 50 of file coalescemotifs.h.
Sleipnir::CCoalesceMotifLibrary::CCoalesceMotifLibrary | ( | size_t | iK | ) | [inline] |
Initializes a new motif library based on kmers of the given length.
iK | Length of kmers underlying the motif library. |
Definition at line 76 of file coalescemotifs.h.
float Sleipnir::CCoalesceMotifLibrary::Align | ( | uint32_t | iOne, |
uint32_t | iTwo, | ||
float | dCutoff | ||
) | [inline] |
Returns an alignment edit distance score for the given two motif IDs.
iOne | First motif ID to be aligned. |
iTwo | Second motif ID to be aligned. |
dCutoff | Edit distance threshold beyond which alignment will be discarded. |
Aligns the two given motifs and returns the minimum edit distance. K-mer motifs are aligned as simple strings. Reverse complement motifs are aligned in all four possible arrangements and the minimum edit distance returned. PSTs are similarly aligned in all possible configurations and the best scoring alignment returned.
Definition at line 272 of file coalescemotifs.h.
References GetMotif(), and GetPST().
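The string and reverse complement cases described above can be sketched as follows. This is an illustrative stand-in and not the Sleipnir implementation: it assumes a standard dynamic-programming edit distance weighted by gap and mismatch penalties, whereas the library's own alignment may differ (the class description mentions non-gapped comparisons). `WeightedEditDistance` and `MinOverStrands` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: edit distance between two kmers in which each
// insertion or deletion costs dPenaltyGap and each substitution costs
// dPenaltyMismatch. Uses two rolling rows of the DP table.
inline float WeightedEditDistance(const std::string& strOne, const std::string& strTwo,
                                  float dPenaltyGap, float dPenaltyMismatch) {
    assert(dPenaltyGap >= 0 && dPenaltyMismatch >= 0);

    const size_t iN = strOne.size(), iM = strTwo.size();
    std::vector<float> vecdPrev(iM + 1), vecdCur(iM + 1);
    for (size_t j = 0; j <= iM; ++j)
        vecdPrev[j] = j * dPenaltyGap;
    for (size_t i = 1; i <= iN; ++i) {
        vecdCur[0] = i * dPenaltyGap;
        for (size_t j = 1; j <= iM; ++j) {
            const float dSub = vecdPrev[j - 1] +
                (strOne[i - 1] == strTwo[j - 1] ? 0.0f : dPenaltyMismatch);
            const float dGap = std::min(vecdPrev[j], vecdCur[j - 1]) + dPenaltyGap;
            vecdCur[j] = std::min(dSub, dGap);
        }
        vecdPrev.swap(vecdCur);
    }
    return vecdPrev[iM];
}

// Hypothetical sketch: for two reverse complement motifs, each given as
// both of its strands, try all four strand arrangements and keep the minimum.
inline float MinOverStrands(const std::string& strOne, const std::string& strOneRC,
                            const std::string& strTwo, const std::string& strTwoRC,
                            float dPenaltyGap, float dPenaltyMismatch) {
    return std::min(
        std::min(WeightedEditDistance(strOne, strTwo, dPenaltyGap, dPenaltyMismatch),
                 WeightedEditDistance(strOne, strTwoRC, dPenaltyGap, dPenaltyMismatch)),
        std::min(WeightedEditDistance(strOneRC, strTwo, dPenaltyGap, dPenaltyMismatch),
                 WeightedEditDistance(strOneRC, strTwoRC, dPenaltyGap, dPenaltyMismatch)));
}
```

A caller applying dCutoff would compare the returned score against the threshold and discard alignments that exceed it.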
size_t Sleipnir::CCoalesceMotifLibrary::GetK | ( | ) | const [inline] |
Returns the underlying kmer length of the library.
Definition at line 330 of file coalescemotifs.h.
bool Sleipnir::CCoalesceMotifLibrary::GetKnown | ( | uint32_t | iMotif, |
SMotifMatch::EType | eMatchType, | ||
float | dPenaltyGap, | ||
float | dPenaltyMismatch, | ||
std::vector< std::pair< std::string, float > > & | vecprstrdKnown, | ||
float | dPValue = 1 | ||
) | const |
Retrieves all known TF motifs matching a given motif beyond a given threshold.
iMotif | Motif ID to be matched against known TF motifs. |
eMatchType | Type of match to be performed: correlation, rmse, etc. |
dPenaltyGap | Alignment score penalty for gaps. |
dPenaltyMismatch | Alignment score penalty for mismatches. |
vecprstrdKnown | Output vector pairing known TF IDs with their match scores, which must be below dPValue. |
dPValue | P-value (or other score) threshold below which known TFs must match. |
Retrieves all known motifs matching a given novel motif below a given threshold. This is usually a Bonferroni-corrected p-value of correlation between the known and novel motif PWMs, but other measures can be used. For known motifs with multiple known PWMs, only the best matching PWM is used.
Definition at line 705 of file coalescemotifs.cpp.
References Sleipnir::CFullMatrix< tType >::Get(), Sleipnir::CFullMatrix< tType >::GetColumns(), and Sleipnir::CFullMatrix< tType >::GetRows().
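The selection rule described above (keep only the best matching PWM per known TF, then apply the threshold) can be sketched in isolation. This is a hypothetical illustration, not the Sleipnir code: `BestKnownSketch` is an invented name, scores are assumed to be "lower is better" values such as p-values, and the strict comparison against dPValue is an assumption.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: given one (TF ID, score) pair per known PWM, keep the
// lowest score for each TF ID and return the TFs scoring below dPValue.
inline std::vector<std::pair<std::string, float> > BestKnownSketch(
    const std::vector<std::pair<std::string, float> >& vecprstrdScores, float dPValue) {
    std::map<std::string, float> mapstrdBest;
    for (const auto& prstrdScore : vecprstrdScores) {
        assert(prstrdScore.second == prstrdScore.second);  // NaN scores are not handled
        auto iterBest = mapstrdBest.find(prstrdScore.first);
        if (iterBest == mapstrdBest.end())
            mapstrdBest[prstrdScore.first] = prstrdScore.second;
        else if (prstrdScore.second < iterBest->second)
            iterBest->second = prstrdScore.second;
    }

    // Results come out ordered by TF ID because std::map is sorted by key.
    std::vector<std::pair<std::string, float> > vecprstrdRet;
    for (const auto& prstrdBest : mapstrdBest)
        if (prstrdBest.second < dPValue)
            vecprstrdRet.emplace_back(prstrdBest.first, prstrdBest.second);
    return vecprstrdRet;
}
```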
size_t Sleipnir::CCoalesceMotifLibrary::GetKnowns | ( | ) | const [inline] |
Returns the number of known TF motifs.
Definition at line 98 of file coalescemotifs.h.
Referenced by Sleipnir::CCoalesceCluster::LabelMotifs().
float Sleipnir::CCoalesceMotifLibrary::GetMatch | ( | const std::string & | strSequence, |
uint32_t | iMotif, | ||
size_t | iOffset, | ||
SCoalesceModifierCache & | sModifiers | ||
) | const |
Calculates the length-normalized match strength of the given motif against the appropriate number of characters in the input sequence at the requested offset.
strSequence | Sequence against which motif is matched. |
iMotif | ID of motif to be matched. |
iOffset | Zero-based offset within strSequence at which the match is performed. |
sModifiers | A modifier cache containing any prior weights to be incorporated into the match. |
Definition at line 178 of file coalescemotifs.cpp.
References Sleipnir::CPST::GetDepth(), Sleipnir::CPST::GetMatch(), Sleipnir::CMeta::GetNaN(), and Sleipnir::CMeta::IsNaN().
bool Sleipnir::CCoalesceMotifLibrary::GetMatches | ( | const std::string & | strKMer, |
std::vector< uint32_t > & | veciMotifs | ||
) | const [inline] |
Calculates the kmer and RC motifs that match a given string of length k.
strKMer | Kmer to be matched. |
veciMotifs | List to which matching kmer/RC motif IDs are appended. |
Definition at line 355 of file coalescemotifs.h.
std::string Sleipnir::CCoalesceMotifLibrary::GetMotif | ( | uint32_t | iMotif | ) | const [inline] |
Returns the string representation of the given motif ID.
iMotif | Motif ID to be returned as a string. |
Reimplemented from Sleipnir::CCoalesceMotifLibraryImpl.
Definition at line 119 of file coalescemotifs.h.
size_t Sleipnir::CCoalesceMotifLibrary::GetMotifs | ( | ) | const [inline] |
Returns the number of motifs currently being managed by the library.
Definition at line 318 of file coalescemotifs.h.
float Sleipnir::CCoalesceMotifLibrary::GetPenaltyGap | ( | ) | const [inline] |
Gets the alignment score penalty for gaps (insertions or deletions).
Definition at line 400 of file coalescemotifs.h.
float Sleipnir::CCoalesceMotifLibrary::GetPenaltyMismatch | ( | ) | const [inline] |
Gets the alignment score penalty for mismatches.
Definition at line 428 of file coalescemotifs.h.
const CPST* Sleipnir::CCoalesceMotifLibrary::GetPST | ( | uint32_t | iMotif | ) | const [inline] |
Returns the CPST corresponding to the given motif ID.
iMotif | Motif ID of PST to retrieve. |
Reimplemented from Sleipnir::CCoalesceMotifLibraryImpl.
Definition at line 442 of file coalescemotifs.h.
Referenced by Align(), Merge(), and RemoveRCs().
string Sleipnir::CCoalesceMotifLibrary::GetPWM | ( | uint32_t | iMotif, |
float | dCutoffPWMs, | ||
float | dPenaltyGap, | ||
float | dPenaltyMismatch, | ||
bool | fNoRCs | ||
) | const |
Returns a string encoding of the requested motif ID's PWM, with appropriate reverse complement resolution and low-information motif removal.
iMotif | Motif ID to be encoded. |
dCutoffPWMs | Minimum information threshold (in bits) for a PWM to be returned. |
dPenaltyGap | Alignment score penalty for gaps. |
dPenaltyMismatch | Alignment score penalty for mismatches. |
fNoRCs | If true, resolve the given motif into a single strand without reverse complements before generating PWM. |
Definition at line 553 of file coalescemotifs.cpp.
References Sleipnir::CFullMatrix< tType >::Get(), Sleipnir::CFullMatrix< tType >::GetColumns(), and Sleipnir::CFullMatrix< tType >::GetRows().
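As a rough illustration of an information threshold in bits, the sketch below computes the Shannon information content of a PWM against a uniform background and compares it to a cutoff. This is an assumption about what "information" means here; the measure actually used by GetPWM is not specified on this page and may differ (for example, it could be normalized per position). `InformationBits` and `PassesCutoffSketch` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch: total information content of a PWM in bits, where each
// column holds probabilities in ACGT order and the background is uniform.
// A fully conserved column contributes 2 bits; a uniform column contributes 0.
inline float InformationBits(const std::vector<std::array<float, 4> >& vecadPWM) {
    float dRet = 0;
    for (const auto& adColumn : vecadPWM) {
        float dColumn = 2;  // log2(4), the maximum for four bases
        for (float dP : adColumn) {
            assert(dP >= 0 && dP <= 1);
            if (dP > 0)
                dColumn += dP * std::log2(dP);  // subtracts the column entropy
        }
        dRet += dColumn;
    }
    return dRet;
}

// Hypothetical sketch of applying a minimum information cutoff.
inline bool PassesCutoffSketch(const std::vector<std::array<float, 4> >& vecadPWM,
                               float dCutoffPWMs) {
    return InformationBits(vecadPWM) >= dCutoffPWMs;
}
```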
static std::string Sleipnir::CCoalesceMotifLibrary::GetReverseComplement | ( | const std::string & | strKMer | ) | [inline, static] |
Returns the reverse complement of the given sequence.
strKMer | K-mer sequence to be reverse complemented. |
Reimplemented from Sleipnir::CCoalesceMotifLibraryImpl.
Definition at line 65 of file coalescemotifs.h.
Referenced by Sleipnir::CPST::RemoveRCs().
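For readers unfamiliar with the operation, a reverse complement reverses the sequence and swaps A with T and C with G. The following is a minimal standalone sketch, not the Sleipnir implementation; `ComplementBase` and `ReverseComplementSketch` are hypothetical names, and mapping unrecognized characters to 'N' is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for illustration: complement of a single base.
// Characters other than upper-case A, C, G, T map to 'N' (an assumption).
inline char ComplementBase(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return 'N';
    }
}

// Walks the input from its last base to its first, appending complements.
inline std::string ReverseComplementSketch(const std::string& strKMer) {
    std::string strRet;
    strRet.reserve(strKMer.size());
    for (std::string::const_reverse_iterator iter = strKMer.rbegin();
         iter != strKMer.rend(); ++iter)
        strRet.push_back(ComplementBase(*iter));
    assert(strRet.size() == strKMer.size());
    return strRet;
}
```

Under this sketch, GATA maps to TATC, and a sequence such as ACGT is its own reverse complement.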
uint32_t Sleipnir::CCoalesceMotifLibrary::Merge | ( | uint32_t | iOne, |
uint32_t | iTwo, | ||
float | dCutoff, | ||
bool | fAllowDuplicates | ||
) | [inline] |
Returns a motif ID representing the merger of the two input motifs, which can be of any type.
iOne | ID of first motif to be merged. |
iTwo | ID of second motif to be merged. |
dCutoff | Maximum edit distance threshold for successful merging. |
fAllowDuplicates | If true, duplicate merges will be handled correctly and an ID returned; otherwise -1 is returned in such cases. |
Definition at line 153 of file coalescemotifs.h.
References GetMotif(), and GetPST().
bool Sleipnir::CCoalesceMotifLibrary::Open | ( | std::istream & | istm, |
std::vector< SMotifMatch > & | vecsMotifs, | ||
CCoalesceMotifLibrary * | pMotifs = NULL | ||
) | [static] |
Retrieves a set of motifs from the given input text stream.
istm | Input text stream from which motifs are loaded. |
vecsMotifs | Output set of motifs loaded from the given stream. |
pMotifs | If non-null, motif library used to construct motifs from the given stream. |
Opens motifs in the given input stream, starting at its current position and stopping once non-motif data is encountered.
Definition at line 118 of file coalescemotifs.cpp.
Referenced by Sleipnir::CCoalesceCluster::Open().
uint32_t Sleipnir::CCoalesceMotifLibrary::Open | ( | const std::string & | strMotif | ) |
Returns a motif ID constructed from the given string representation.
strMotif | String representation of the desired motif ID. |
Definition at line 265 of file coalescemotifs.cpp.
References Sleipnir::CPST::Open().
bool Sleipnir::CCoalesceMotifLibrary::OpenKnown | ( | std::istream & | istm | ) |
Opens a set of known TF motifs in the given text file input stream.
istm | Input stream from which known TF motifs are read. |
Opens a set of known TF consensus binding sequences stored as PWMs in a text file. Each line of the file should be tab-delimited, with the first column containing an arbitrary TF ID and the remaining 4n columns containing PWM entries for the n bases of the TF's motif. TF to PWM mappings can be one-to-many, i.e. a single TF ID can appear on multiple lines, each with a different known consensus binding sequence. PWMs are stored as continuously valued per-base probabilities in ACGT order, such that one TF line might be:
GATA 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0
Definition at line 651 of file coalescemotifs.cpp.
References Sleipnir::CMeta::Tokenize().
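The line format described above can be read with a few lines of standard C++. The sketch below is a hypothetical illustration rather than the Sleipnir parser (which uses CMeta::Tokenize); `SKnownPWMSketch`, `ParseKnownLine`, and `ConsensusSketch` are invented names, and rejecting lines whose value count is not a positive multiple of four is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for one line of a known-TF file.
struct SKnownPWMSketch {
    std::string m_strID;                                // arbitrary TF ID
    std::vector<std::array<float, 4> > m_vecadColumns;  // per base, ACGT order
};

// Parses "ID<TAB>v1<TAB>v2..." into an ID and n columns of four values.
// Returns false on a missing ID, a non-numeric value, or a bad value count.
inline bool ParseKnownLine(const std::string& strLine, SKnownPWMSketch& sOut) {
    std::istringstream issm(strLine);
    if (!std::getline(issm, sOut.m_strID, '\t') || sOut.m_strID.empty())
        return false;

    std::vector<float> vecdValues;
    std::string strToken;
    while (std::getline(issm, strToken, '\t')) {
        std::istringstream issmValue(strToken);
        float d;
        if (!(issmValue >> d))
            return false;
        vecdValues.push_back(d);
    }
    if (vecdValues.empty() || (vecdValues.size() % 4))
        return false;

    sOut.m_vecadColumns.clear();
    for (size_t i = 0; i < vecdValues.size(); i += 4) {
        std::array<float, 4> adColumn;
        for (size_t j = 0; j < 4; ++j)
            adColumn[j] = vecdValues[i + j];
        sOut.m_vecadColumns.push_back(adColumn);
    }
    assert(sOut.m_vecadColumns.size() * 4 == vecdValues.size());
    return true;
}

// Most probable base at each position; ties resolve to the earlier base.
inline std::string ConsensusSketch(const SKnownPWMSketch& sPWM) {
    static const char c_acBases[] = "ACGT";
    std::string strRet;
    for (const auto& adColumn : sPWM.m_vecadColumns) {
        size_t iMax = 0;
        for (size_t j = 1; j < 4; ++j)
            if (adColumn[j] > adColumn[iMax])
                iMax = j;
        strRet.push_back(c_acBases[iMax]);
    }
    return strRet;
}
```

Applied to the example line (with tabs as delimiters), this yields four columns whose consensus is GATA, matching the TF ID used in the example.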
uint32_t Sleipnir::CCoalesceMotifLibrary::RemoveRCs | ( | uint32_t | iMotif, |
float | dPenaltyGap, | ||
float | dPenaltyMismatch | ||
) | [inline] |
Returns a motif ID corresponding to the given ID with only one strand of reverse complements retained.
iMotif | Motif ID from which reverse complements should be removed. |
dPenaltyGap | Alignment score penalty for gaps. |
dPenaltyMismatch | Alignment score penalty for mismatches. |
Generates a motif ID corresponding to a single strand of the given ID's motif. For k-mer motif IDs, this does nothing. For reverse complement IDs, a k-mer ID is returned corresponding to one of the two strands. For PST IDs, a new PST is constructed using CPST::RemoveRCs.
Definition at line 238 of file coalescemotifs.h.
References GetPST().
void Sleipnir::CCoalesceMotifLibrary::SetPenaltyGap | ( | float | dPenalty | ) | [inline] |
Sets the alignment score penalty for gaps (insertions or deletions).
dPenalty | Alignment score penalty for gaps. |
Definition at line 383 of file coalescemotifs.h.
void Sleipnir::CCoalesceMotifLibrary::SetPenaltyMismatch | ( | float | dPenalty | ) | [inline] |
Sets the alignment score penalty for mismatches.
dPenalty | Alignment score penalty for mismatches. |
Definition at line 414 of file coalescemotifs.h.
bool Sleipnir::CCoalesceMotifLibrary::Simplify | ( | uint32_t | iMotif | ) | const |
Simplifies the given PST motif ID.
iMotif | ID of the PST motif to be simplified. |
Definition at line 626 of file coalescemotifs.cpp.