Sleipnir
Sleipnir::CFASTA Class Reference

Encapsulates a standard FASTA file or a modified ENCODE-style wiggle (WIG) file.

#include <fasta.h>

Inherits Sleipnir::CFASTAImpl.

Public Member Functions

bool Open (const char *szFile, const std::set< std::string > &setstrTypes)
 Opens a FASTA or WIG file and indexes the file without explicitly loading its contents.
void Save (std::ostream &ostm, size_t iWrap=80) const
 Saves a copy of the FASTA file to the given output stream.
bool Get (size_t iGene, std::vector< SFASTASequence > &vecsSequences) const
 Retrieves all sequences (of any type) associated with the given gene index.
bool Get (size_t iGene, std::vector< SFASTAWiggle > &vecsValues) const
 Retrieves all values (of any type) associated with the given gene index.
bool Open (const char *szFile)
 Opens a FASTA or WIG file and indexes the file without explicitly loading its contents.
size_t GetGenes () const
 Returns the total number of genes associated with this FASTA/WIG.
const std::string & GetGene (size_t iGene) const
 Returns the gene ID associated with the given index.
size_t GetGene (const std::string &strGene) const
 Returns the index of the given gene ID.
const std::string GetHeader (size_t iGene, const std::string &strType) const
 Returns the header line associated with the given gene index and sequence type.
const std::set< std::string > & GetTypes () const
 Returns the set of sequence types indexed by this FASTA/WIG.

Detailed Description

Encapsulates a standard FASTA file or a modified ENCODE-style wiggle (WIG) file.

CFASTA performs efficient, disk-based indexing of large FASTA files; it also provides several Sleipnir-specific extensions:

  1. Sequences of different types can be associated with the same gene ID. Unlike in standard FASTA files, gene IDs are assumed to contain no tab characters; if an ID line contains tabs, the second tab-delimited column is interpreted as a sequence type. This allows, for example, a gene sequence, upstream flank, and downstream flank to be stored separately under the same gene ID (the ID and type columns below are separated by a tab):
     > GENE
     gene sequence here
     > GENE 5
     upstream flank here
     > GENE 3
     downstream flank here
    
  2. Introns and exons (or similar sequence subtypes) can be encoded within each sequence by alternating blocks of upper- and lower-case bases. By default, upper-case bases are interpreted as exons and lower-case bases as introns, but CFASTA itself attaches no intrinsic semantics to the subtypes. For example, the following "gene" would begin and end with exons separated by a single intron:
     > GENE
     AACCGGTTacgtTTGGCCAA
    
  3. FASTA files akin to ENCODE wiggle (WIG) files are also supported, in which a single floating point value (instead of a nucleotide or amino acid) is associated with each position in a gene's sequence; each value must appear alone on its own line. This can be used to encode per-base conservation scores, nucleosome/TF occupancies, or other continuous data. For example, the gene sequence above might be accompanied in a separate wiggle file by the scores:
     > GENE
     0.9
     0.1
     0.25
     ...
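
The first two extensions can be illustrated with a small standalone sketch. It does not call CFASTA; the names SHeader, ParseHeader, and SplitByCase are hypothetical, and skipping spaces after the '>' is an assumption rather than documented behavior.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: gene ID plus optional sequence type.
struct SHeader {
    std::string m_strGene;
    std::string m_strType; // empty when the ID line has no second column
};

// Split an ID line such as ">GENE<tab>5" into gene ID and type.
// Assumption: a leading '>' and any spaces directly after it are skipped.
SHeader ParseHeader(const std::string& strLine) {
    std::size_t iBegin = 0;
    if (!strLine.empty() && strLine[0] == '>')
        iBegin = 1;
    while (iBegin < strLine.size() && strLine[iBegin] == ' ')
        ++iBegin;

    SHeader sRet;
    std::size_t iTab = strLine.find('\t', iBegin);
    if (iTab == std::string::npos) {
        sRet.m_strGene = strLine.substr(iBegin);
        return sRet;
    }
    sRet.m_strGene = strLine.substr(iBegin, iTab - iBegin);
    // The type is the second tab-delimited column only.
    std::size_t iNext = strLine.find('\t', iTab + 1);
    sRet.m_strType = strLine.substr(iTab + 1,
        iNext == std::string::npos ? std::string::npos : iNext - iTab - 1);
    return sRet;
}

// Split a sequence into maximal runs of same-case characters.
// fLowerFirst is set to true when the first block is lower-case.
std::vector<std::string> SplitByCase(const std::string& strSeq, bool& fLowerFirst) {
    std::vector<std::string> vecstrBlocks;
    fLowerFirst = false;
    bool fPrevLower = false;
    for (std::size_t i = 0; i < strSeq.size(); ++i) {
        bool fLower = std::islower(static_cast<unsigned char>(strSeq[i])) != 0;
        if (i == 0)
            fLowerFirst = fLower;
        if (i == 0 || fLower != fPrevLower)
            vecstrBlocks.push_back(std::string());
        vecstrBlocks.back() += strSeq[i];
        fPrevLower = fLower;
    }
    return vecstrBlocks;
}
```

Applied to the example sequence AACCGGTTacgtTTGGCCAA, SplitByCase yields three blocks (upper, lower, upper) with fLowerFirst set to false.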
    
Remarks:
Very large FASTA files are supported with very low memory usage: the file is indexed on disk when opened, without loading any sequence data, and the Get methods load sequence data only as needed. Each access therefore incurs a slight runtime penalty.
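
The index-then-seek idea can be sketched independently of Sleipnir as follows. This is a minimal illustration, not CFASTA's implementation: TMapOffsets, BuildIndex, and LoadSequence are hypothetical names, sequence types are ignored, and a later record with the same gene ID simply replaces the earlier one.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical index: gene ID -> stream offset of the record's first data line.
typedef std::map<std::string, std::streampos> TMapOffsets;

// Scan the stream once, recording where each record's data begins.
// No sequence data is retained.
TMapOffsets BuildIndex(std::istream& istm) {
    TMapOffsets mapRet;
    std::string strLine;
    istm.clear();
    istm.seekg(0);
    while (std::getline(istm, strLine)) {
        if (strLine.empty() || strLine[0] != '>')
            continue;
        std::size_t iBegin = strLine.find_first_not_of(' ', 1);
        if (iBegin == std::string::npos)
            continue; // header line without an ID
        // Keep only the first tab-delimited column as the key.
        std::size_t iEnd = strLine.find('\t', iBegin);
        std::string strID = strLine.substr(iBegin,
            iEnd == std::string::npos ? std::string::npos : iEnd - iBegin);
        mapRet[strID] = istm.tellg();
    }
    istm.clear(); // reset EOF so that later seeks succeed
    return mapRet;
}

// Seek to a record and concatenate its lines until the next '>' or EOF.
bool LoadSequence(std::istream& istm, const TMapOffsets& mapIndex,
                  const std::string& strGene, std::string& strSeq) {
    TMapOffsets::const_iterator iter = mapIndex.find(strGene);
    if (iter == mapIndex.end())
        return false;
    istm.clear();
    istm.seekg(iter->second);
    strSeq.clear();
    std::string strLine;
    while (std::getline(istm, strLine) && (strLine.empty() || strLine[0] != '>'))
        strSeq += strLine;
    istm.clear();
    return true;
}
```

Only the map of offsets stays in memory; each LoadSequence call costs one seek plus a read of that record.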

Definition at line 74 of file fasta.h.


Member Function Documentation

bool Sleipnir::CFASTA::Get ( size_t  iGene,
std::vector< SFASTASequence > &  vecsSequences 
) const [inline]

Retrieves all sequences (of any type) associated with the given gene index.

Parameters:
iGene          Index of gene for which sequences are retrieved.
vecsSequences  Zero or more output sequences associated with the given gene index.
Returns:
True if sequences were retrieved successfully; false otherwise.
Remarks:
Must be called only after a successful Open. Performs zero or more seeks on disk to load the sequence associated with the given gene index over all types in the FASTA file's index.
See also:
Open

Definition at line 99 of file fasta.h.

Referenced by Get(), and Save().

bool Sleipnir::CFASTA::Get ( size_t  iGene,
std::vector< SFASTAWiggle > &  vecsValues 
) const [inline]

Retrieves all values (of any type) associated with the given gene index.

Parameters:
iGene       Index of gene for which values are retrieved.
vecsValues  Zero or more output values associated with the given gene index.
Returns:
True if values were retrieved successfully; false otherwise.
Remarks:
Must be called only after a successful Open. Performs zero or more seeks on disk to load the values associated with the given gene index over all types in the WIG file's index.
See also:
Open

Definition at line 123 of file fasta.h.

References Get().
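
The one-value-per-line layout that this overload reads can be sketched as below. This is a standalone illustration under stated assumptions, not the library's parser: ReadValues is a hypothetical name, values are stored as float, and blank lines are skipped.

```cpp
#include <cctype>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read one floating point value per line until the next header ('>') or EOF.
// Returns false if any non-blank line is not a single number.
bool ReadValues(std::istream& istm, std::vector<float>& vecdValues) {
    std::string strLine;
    vecdValues.clear();
    while (istm.peek() != '>' && std::getline(istm, strLine)) {
        if (strLine.empty())
            continue;
        char* pcEnd = 0;
        double d = std::strtod(strLine.c_str(), &pcEnd);
        if (pcEnd == strLine.c_str())
            return false; // no number at the start of the line
        // Allow trailing whitespace (e.g. '\r'), but nothing else.
        while (*pcEnd && std::isspace(static_cast<unsigned char>(*pcEnd)))
            ++pcEnd;
        if (*pcEnd)
            return false;
        vecdValues.push_back(static_cast<float>(d));
    }
    return true;
}
```

The stream is left positioned at the next header line, so a caller could read consecutive records in a loop.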

const std::string& Sleipnir::CFASTA::GetGene ( size_t  iGene) const [inline]

Returns the gene ID associated with the given index.

Parameters:
iGene  Gene index for which ID is returned.
Returns:
Gene ID associated with the given index.

Reimplemented from Sleipnir::CFASTAImpl.

Definition at line 171 of file fasta.h.

Referenced by Save().

size_t Sleipnir::CFASTA::GetGene ( const std::string &  strGene) const [inline]

Returns the index of the given gene ID.

Parameters:
strGene  Gene ID for which index is returned.
Returns:
Index of the given gene ID.

Definition at line 185 of file fasta.h.

size_t Sleipnir::CFASTA::GetGenes ( ) const [inline]

Returns the total number of genes associated with this FASTA/WIG.

Returns:
Number of genes associated with this FASTA/WIG.

Definition at line 157 of file fasta.h.

Referenced by Save().

const std::string Sleipnir::CFASTA::GetHeader ( size_t  iGene,
const std::string &  strType 
) const [inline]

Returns the header line associated with the given gene index and sequence type.

Parameters:
iGene    Gene index for which header is retrieved.
strType  Sequence type for which header is retrieved.
Returns:
Header line associated with the given gene/type pair.
See also:
GetGene

Definition at line 207 of file fasta.h.

Referenced by Save().

const std::set<std::string>& Sleipnir::CFASTA::GetTypes ( ) const [inline]

Returns the set of sequence types indexed by this FASTA/WIG.

Returns:
Set of sequence types indexed by this FASTA/WIG.

Definition at line 221 of file fasta.h.

bool Sleipnir::CFASTA::Open ( const char *  szFile,
const std::set< std::string > &  setstrTypes 
)

Opens a FASTA or WIG file and indexes the file without explicitly loading its contents.

Parameters:
szFile       Path to FASTA/WIG file to open.
setstrTypes  If nonempty, set of sequence types to be loaded; types not in the set are ignored.
Returns:
True if file was opened and indexed successfully; false otherwise.
Remarks:
Supports FASTA and WIG files as described in CFASTA. No data is loaded on open, but an index is created over all genes and types of interest; a file handle is held open, and data is loaded as needed by the Get methods.
See also:
Save

Definition at line 62 of file fasta.cpp.

References Sleipnir::CMeta::Tokenize().

Referenced by Open().
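
The filtering rule described for setstrTypes can be written as a one-line predicate. This is a sketch of the documented semantics only; IsTypeWanted is a hypothetical helper, and how untyped records (empty type string) are treated by the library is not covered here.

```cpp
#include <set>
#include <string>

// An empty filter set keeps every type; otherwise only listed types are kept.
bool IsTypeWanted(const std::set<std::string>& setstrTypes,
                  const std::string& strType) {
    return setstrTypes.empty() || setstrTypes.count(strType) != 0;
}
```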

bool Sleipnir::CFASTA::Open ( const char *  szFile) [inline]

Opens a FASTA or WIG file and indexes the file without explicitly loading its contents.

Parameters:
szFile  Path to FASTA/WIG file to open.
Returns:
True if file was opened and indexed successfully; false otherwise.
Remarks:
Supports FASTA and WIG files as described in CFASTA. No data is loaded on open, but an index is created over all genes and types of interest; a file handle is held open, and data is loaded as needed by the Get methods.
See also:
Save

Definition at line 145 of file fasta.h.

References Open().

void Sleipnir::CFASTA::Save ( std::ostream &  ostm,
size_t  iWrap = 80 
) const

Saves a copy of the FASTA file to the given output stream.

Parameters:
ostm   Output stream to which FASTA file is saved.
iWrap  If given, column at which output FASTA is line-wrapped.
Remarks:
Currently only supports FASTA files, not WIGs.
See also:
Open

Definition at line 135 of file fasta.cpp.

References Get(), GetGene(), GetGenes(), GetHeader(), Sleipnir::SFASTASequence::m_fIntronFirst, Sleipnir::SFASTABase::m_strType, and Sleipnir::SFASTASequence::m_vecstrSequences.
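
The effect of iWrap can be illustrated with a standalone helper. WriteWrapped is a hypothetical function, not part of Sleipnir, and treating a wrap column of 0 as "no wrapping" is an assumption of this sketch.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Write a sequence with at most iWrap characters per output line.
// Assumption: iWrap == 0 disables wrapping.
void WriteWrapped(std::ostream& ostm, const std::string& strSeq,
                  std::size_t iWrap = 80) {
    if (iWrap == 0) {
        ostm << strSeq << '\n';
        return;
    }
    for (std::size_t i = 0; i < strSeq.size(); i += iWrap)
        ostm << strSeq.substr(i, iWrap) << '\n';
}
```

With a wrap column of 4, the ten-base sequence AACCGGTTAC is written as the three lines AACC, GGTT, and AC.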


The documentation for this class was generated from the following files: