Sleipnir
Public Member Functions
Sleipnir::CPST Class Reference

Represents a probabilistic suffix tree (PST) containing zero or more overlapping strings in a weighted manner. More...

#include <pst.h>

Inheritance diagram for Sleipnir::CPST:
Sleipnir::CPSTImpl

Public Member Functions

 CPST (size_t iArity)
 Initializes a new PST over an alphabet of the given size.
void RemoveRCs (float dPenaltyGap, float dPenaltyMismatch, CPST &PSTOut) const
 Constructs a new PST containing only a single strand of the current PST by finding the best alignment of contained sequences and their reverse complements.
void Add (const std::string &strSequence, const CPST &PST, int iOffset)
 Add the given string and strings from the given PST to the current PST, with an optional character offset between the two.
void Add (const std::string &strOne, const std::string &strTwo, int iOffset)
 Add the given strings to the current PST, with an optional character offset between them.
void Add (const std::string &strSequence, size_t iOffset=0)
 Add the given string to the current PST, with an optional character offset relative to the PST's root.
void Add (const CPST &PST, size_t iOffset=0)
 Add the strings from the given PST to the current PST, with an optional character offset relative to the current PST's root.
float GetMatch (const std::string &strTarget, size_t iOffset=0) const
 Returns the match score of the PST against the given string, with an optional offset.
size_t GetDepth () const
 Returns the maximum length over all strings in the PST.
std::string GetMotif () const
 Returns a string representation of the PST.
float Align (const std::string &strSequence, float dPenaltyGap, float dPenaltyMismatch, float dCutoff, int &iOffset) const
 Returns the best alignment score of the given string against all strings in the PST, using the requested penalties, maximum score, and offset.
float Align (const CPST &PST, float dPenaltyGap, float dPenaltyMismatch, float dCutoff, int &iOffset) const
 Returns the best alignment score of strings in the given PST against strings in the current PST, using the requested penalties, maximum score, and offset.
bool Open (const std::string &strPST)
 Adds the given PST encoded in string form to the current PST.
bool GetPWM (CFullMatrix< uint16_t > &MatPWM, const char *szSymbols) const
 Retrieves a PWM approximately equivalent to the current PST.
size_t Integrate () const
 Counts the number of possible discrete strings encoded by the PST.
bool Simplify ()
 Simplifies the current PST by removing all subtrees occurring at relatively low frequency.

Detailed Description

Represents a probabilistic suffix tree (PST) containing zero or more overlapping strings in a weighted manner.

A probabilistic suffix tree (PST) represents a set of zero or more weighted strings in an efficient manner. For example, a PST might contain the strings:

 2:AACCGG
 1:AACCT
 1:AACT

indicating that the set contains AACCGG with probability 0.5, AACCT with probability 0.25, and AACT with probability 0.25. A PST encodes such strings as an overlapping tree of characters, which is thus compact in memory and can be compared efficiently against an input string to count the matches of any string in the tree.

Remarks:
CPST is a relatively naive and inflexible (but efficient) implementation of PSTs that should only be used for fixed alphabets containing a small number of characters. It is currently intended for use mainly with CCoalesce.
See also:
CCoalesceMotifLibrary

Definition at line 55 of file pst.h.


Constructor & Destructor Documentation

Sleipnir::CPST::CPST ( size_t  iArity) [inline]

Initializes a new PST over an alphabet of the given size.

Parameters:
iArityMaximum number of unique characters in strings contained in the new PST.

Definition at line 64 of file pst.h.


Member Function Documentation

void Sleipnir::CPST::Add ( const std::string &  strSequence,
const CPST PST,
int  iOffset 
) [inline]

Add the given string and strings from the given PST to the current PST, with an optional character offset between the two.

Parameters:
strSequenceFixed string to be added to the current PST.
PSTPST containing strings to be added to the current PST.
iOffsetOptional character offset between the given string and PST; if positive, the PST is added first, and if negative, the fixed string.

Definition at line 83 of file pst.h.

Referenced by Add(), and RemoveRCs().

void Sleipnir::CPST::Add ( const std::string &  strOne,
const std::string &  strTwo,
int  iOffset 
) [inline]

Add the given strings to the current PST, with an optional character offset between them.

Parameters:
strOneFirst string to be added to the current PST.
strTwoSecond string to be added to the current PST.
iOffsetOptional character offset between the given strings; if positive, strOne is added first, and if negative, strTwo.

Definition at line 106 of file pst.h.

References Add().

void Sleipnir::CPST::Add ( const std::string &  strSequence,
size_t  iOffset = 0 
) [inline]

Add the given string to the current PST, with an optional character offset relative to the PST's root.

Parameters:
strSequenceFixed string to be added to the current PST.
iOffsetOptional character offset between the given string; if nonzero, iOffset characters along each branch of the current PST are skipped before strSequence is inserted.

Definition at line 127 of file pst.h.

References GetDepth().

void Sleipnir::CPST::Add ( const CPST PST,
size_t  iOffset = 0 
) [inline]

Add the strings from the given PST to the current PST, with an optional character offset relative to the current PST's root.

Parameters:
PSTPST containng strings to be added to the current PST.
iOffsetOptional character offset applied to the given PST; if nonzero, iOffset characters along each branch of the current PST are skipped before strings from the given PST are added.

Definition at line 145 of file pst.h.

References GetDepth().

float Sleipnir::CPST::Align ( const std::string &  strSequence,
float  dPenaltyGap,
float  dPenaltyMismatch,
float  dCutoff,
int &  iOffset 
) const [inline]

Returns the best alignment score of the given string against all strings in the PST, using the requested penalties, maximum score, and offset.

Parameters:
strSequenceString to be aligned with the current PST.
dPenaltyGapAlignment score penalty for gaps.
dPenaltyMismatchAlignment score penalty for mismatches.
dCutoffMaximum score before an alignment is considered inviable; used for optimization.
iOffsetIf nonzero, the number of characters to skip within the input string before scoring the alignment.
Returns:
Minimum alignment score of the given string with strings in the current PST.
Remarks:
Alignments are internally ungapped, so gap penalties are only incurred at the ends (i.e. by overhangs).

Definition at line 237 of file pst.h.

References GetDepth().

Referenced by Align(), and RemoveRCs().

float Sleipnir::CPST::Align ( const CPST PST,
float  dPenaltyGap,
float  dPenaltyMismatch,
float  dCutoff,
int &  iOffset 
) const [inline]

Returns the best alignment score of strings in the given PST against strings in the current PST, using the requested penalties, maximum score, and offset.

Parameters:
PSTPST to be aligned with the current PST.
dPenaltyGapAlignment score penalty for gaps.
dPenaltyMismatchAlignment score penalty for mismatches.
dCutoffMaximum score before an alignment is considered inviable; used for optimization.
iOffsetIf nonzero, the number of characters to skip within the input PST before scoring the alignment.
Returns:
Minimum alignment score of the given PST with strings in the current PST.
Remarks:
Alignments are internally ungapped, so gap penalties are only incurred at the ends (i.e. by overhangs).

Definition at line 272 of file pst.h.

References Align(), and GetDepth().

size_t Sleipnir::CPST::GetDepth ( ) const [inline]

Returns the maximum length over all strings in the PST.

Returns:
Maximum length over all strings in the PST.

Definition at line 183 of file pst.h.

Referenced by Add(), Align(), Sleipnir::CCoalesceMotifLibrary::GetMatch(), GetMatch(), and GetPWM().

float Sleipnir::CPST::GetMatch ( const std::string &  strTarget,
size_t  iOffset = 0 
) const [inline]

Returns the match score of the PST against the given string, with an optional offset.

Parameters:
strTargetString against which PST is matched.
iOffsetIf nonzero, the number of characters to skip within the target string before scoring the match.
Returns:
Length-normalized probabilistic match score.
Remarks:
Score is equal to zero if no strings in the PST match; otherwise, it is the maximum probability over all matching strings normalized by the probability of matching by chance, (1/arity)^length.

Definition at line 168 of file pst.h.

References GetDepth().

Referenced by Sleipnir::CCoalesceMotifLibrary::GetMatch().

std::string Sleipnir::CPST::GetMotif ( ) const [inline]

Returns a string representation of the PST.

Returns:
String representation of the PST.

A PST is represented using pipe characters (|) to indicate choices, with parentheses indicating grouping and weights indicated with preceding counts delimited by colons (:). For example, a PST containing the strings:

 2:AACCGG
 1:AACCT
 1:AACT

would be represented by the motif:

 AAC(3:C(2:GG)|(1:T))|(1:T)

Definition at line 207 of file pst.h.

bool Sleipnir::CPST::GetPWM ( CFullMatrix< uint16_t > &  MatPWM,
const char *  szSymbols 
) const [inline]

Retrieves a PWM approximately equivalent to the current PST.

Parameters:
MatPWMOutput PWM approximating the current PST.
szSymbolsString of symbols mapped to the indices of the current PST.
Remarks:
szSymbols must be of the same length as the current PST's arity; it is typically "ACGT". GetPWM is most useful if the PST first has its reverse complements removed. Technically returns a PSSM or PFM of counts, not a PWM of continuous probabilities.
See also:
RemoveRCs

Definition at line 313 of file pst.h.

References Sleipnir::CFullMatrix< tType >::Clear(), Sleipnir::CFullMatrix< tType >::Get(), GetDepth(), Sleipnir::CFullMatrix< tType >::GetRows(), Sleipnir::CFullMatrix< tType >::Initialize(), and Sleipnir::CMeta::Permute().

size_t Sleipnir::CPST::Integrate ( ) const [inline]

Counts the number of possible discrete strings encoded by the PST.

Returns:
Number of distinct strings encoded by the PST.

Definition at line 341 of file pst.h.

bool Sleipnir::CPST::Open ( const std::string &  strPST) [inline]

Adds the given PST encoded in string form to the current PST.

Parameters:
strPSTPST encoded in string form, or a string from the current PST's alphabet.
Returns:
True if the given PST string was added successfully.
See also:
GetMotif

Definition at line 291 of file pst.h.

Referenced by Sleipnir::CCoalesceMotifLibrary::Open().

void Sleipnir::CPST::RemoveRCs ( float  dPenaltyGap,
float  dPenaltyMismatch,
CPST PSTOut 
) const

Constructs a new PST containing only a single strand of the current PST by finding the best alignment of contained sequences and their reverse complements.

Parameters:
dPenaltyGapAlignment score penalty for gaps.
dPenaltyMismatchAlignment score penalty for mismatches.
PSTOutOutput PST constructed from the single stranded best alignment of strings in the current PST.

Constructs a new PST representing a single strand of the current PST. Which strand is chosen arbitrarily; resolution occurs by tracing all possible strings stored in the current PST (i.e. paths from root to a leaf), aligning each string and its reverse complement with previous strings, and retaining only the best alignments. This is a heuristic, but it correctly preserves most PSTs that are already single-stranded and resolves most reverse complement PSTs. However, it should only be used for tasks like generating approximate PWMs from PSTs, since it can seriously diverge from the original PST in the worst cases.

Definition at line 55 of file pst.cpp.

References Add(), Align(), and Sleipnir::CCoalesceMotifLibrary::GetReverseComplement().

bool Sleipnir::CPST::Simplify ( ) [inline]

Simplifies the current PST by removing all subtrees occurring at relatively low frequency.

Returns:
True if simplification succeeded, false otherwise.
Remarks:
Removes all subtrees with frequency below 1/arity.

Definition at line 357 of file pst.h.


The documentation for this class was generated from the following files: