Sleipnir
Implements a heavily optimized discrete naive Bayesian classifier.
#include <bayesnet.h>
Public Member Functions
bool Open (const CBayesNetSmile &BNSmile)
    Construct a new minimal Bayes net from the given SMILE-based network.
bool Open (std::istream &istm)
    Load a minimal Bayes net from the given binary stream.
bool OpenCounts (const char *szFileCounts, const std::map< std::string, size_t > &mapstriNodes, const std::vector< unsigned char > &vecbDefaults, const std::vector< float > &vecdAlphas, float dPseudocounts=HUGE_VAL, const CBayesNetMinimal *pBNDefault=NULL)
    Constructs a naive Bayesian classifier using count data for each network node.
void Save (std::ostream &ostm) const
    Save a minimal Bayes net to the given binary stream.
float Evaluate (const std::vector< unsigned char > &vecbDatum, size_t iOffset=0) const
    Perform Bayesian inference to obtain the class probability given evidence for some number of nodes.
bool Evaluate (const std::vector< unsigned char > &vecbData, float *adResults, size_t iGenes, size_t iStart=0) const
    Repeatedly perform Bayesian inference to obtain the class probability given evidence for some number of nodes.
float Regularize (std::vector< float > &vecdAlphas) const
const CDataMatrix & GetCPT (size_t iNode) const
    Return the conditional probability table matrix for the indicated node.
size_t GetNodes () const
    Return the total number of nodes in the Bayes net.
void SetID (const std::string &strID)
    Sets the string identifier of the network.
const std::string & GetID () const
    Returns the string identifier of the network.
const unsigned char GetDefault (size_t iNode) const
    Returns the default value (if no input is provided) for the requested node.
CBayesNetMinimal provides a custom implementation of a discrete naive Bayesian classifier heavily optimized for rapid inference. The intended use is to learn an appropriate network and parameters offline using one of the more complex Bayes net implementations. The resulting network can then be converted to a minimal form and used for online (realtime) inference. A minimal Bayes net always consists of one output (class) node and zero or more data nodes, all discrete and taking one or more different values.
Definition at line 291 of file bayesnet.h.
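The inference a naive Bayesian classifier performs can be sketched in a few lines. The following is a minimal standalone illustration, not Sleipnir's implementation: the function NaivePosterior, the TCPT typedef, and the [class][value] table layout are all hypothetical, and each evidence value occupies a full byte here for readability. It returns the posterior of the last class value and skips any node whose evidence is 0xF.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPT layout for this sketch: TCPT[ class ][ value ] = P(value | class).
typedef std::vector<std::vector<float> > TCPT;

// Posterior probability of the last (largest) class value given one evidence
// value per data node; 0xF marks a node with no evidence, which is skipped.
float NaivePosterior( const std::vector<float>& vecdPrior, const std::vector<TCPT>& vecCPTs,
	const std::vector<unsigned char>& vecbEvidence ) {
	assert( !vecdPrior.empty( ) && ( vecCPTs.size( ) == vecbEvidence.size( ) ) );
	std::vector<double> vecdLogs( vecdPrior.size( ) );
	double dMax = -HUGE_VAL;

	for( std::size_t iClass = 0; iClass < vecdPrior.size( ); ++iClass ) {
		// Work in log space so products of many small probabilities do not underflow.
		double dLog = std::log( (double)vecdPrior[ iClass ] );
		for( std::size_t iNode = 0; iNode < vecCPTs.size( ); ++iNode )
			if( vecbEvidence[ iNode ] != 0xF )
				dLog += std::log( (double)vecCPTs[ iNode ][ iClass ][ vecbEvidence[ iNode ] ] );
		vecdLogs[ iClass ] = dLog;
		if( dLog > dMax )
			dMax = dLog;
	}

	// Normalize over classes, subtracting the maximum for numerical stability.
	double dSum = 0;
	for( std::size_t iClass = 0; iClass < vecdLogs.size( ); ++iClass )
		dSum += std::exp( vecdLogs[ iClass ] - dMax );
	return (float)( std::exp( vecdLogs.back( ) - dMax ) / dSum );
}
```

With class priors of 0.9 and 0.1 and a single two-valued node whose CPT rows are (0.8, 0.2) and (0.1, 0.9), evidence value 1 gives 0.1 * 0.9 / (0.9 * 0.2 + 0.1 * 0.9), or about 0.333; with the evidence missing, the result is simply the prior, 0.1.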
float Sleipnir::CBayesNetMinimal::Evaluate (const std::vector< unsigned char > &vecbDatum, size_t iOffset=0) const
Perform Bayesian inference to obtain the class probability given evidence for some number of nodes.
vecbDatum | Values for each evidence node; 0xF indicates missing data (no evidence) for a particular node. Note that each evidence value is stored in four bits, not a full byte. |
iOffset | Position of the first piece of evidence within vecbDatum; zero by default. This can be used to store multiple data in a single vector and rapidly perform inference for each subsequent data setting. |
Definition at line 829 of file bayesnetfn.cpp.
References Sleipnir::CFullMatrix< tType >::Get(), Sleipnir::CFullMatrix< tType >::GetColumns(), Sleipnir::CMeta::GetNaN(), and Sleipnir::CFullMatrix< tType >::GetRows().
Referenced by Evaluate().
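The four-bit storage described for vecbDatum can be illustrated with a pair of helpers. This is a sketch under stated assumptions, not code from Sleipnir: PackEvidence and UnpackEvidence are hypothetical names, and placing even-numbered nodes in the low nibble and odd-numbered nodes in the high nibble is an assumption that this page does not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pack one four-bit value per node, two nodes per byte; 0xF means "missing".
// ASSUMPTION: even-numbered nodes use the low nibble, odd-numbered the high nibble.
std::vector<unsigned char> PackEvidence( const std::vector<unsigned char>& vecbValues ) {
	// Start from 0xFF so that an unused padding nibble reads back as missing.
	std::vector<unsigned char> vecbPacked( ( vecbValues.size( ) + 1 ) / 2, 0xFF );

	for( std::size_t i = 0; i < vecbValues.size( ); ++i ) {
		assert( vecbValues[ i ] <= 0xF );
		unsigned char& b = vecbPacked[ i / 2 ];
		if( i % 2 )
			b = (unsigned char)( ( b & 0x0F ) | ( vecbValues[ i ] << 4 ) );
		else
			b = (unsigned char)( ( b & 0xF0 ) | vecbValues[ i ] );
	}
	return vecbPacked;
}

// Read back the value of node iNode from a packed vector, starting at byte iOffset.
unsigned char UnpackEvidence( const std::vector<unsigned char>& vecbPacked, std::size_t iNode,
	std::size_t iOffset = 0 ) {
	const unsigned char b = vecbPacked[ iOffset + iNode / 2 ];
	return (unsigned char)( ( iNode % 2 ) ? ( b >> 4 ) : ( b & 0xF ) );
}
```

Under these assumptions, three nodes with values 1, missing, and 3 pack into two bytes, and the spare nibble in the second byte is left as 0xF.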
bool Sleipnir::CBayesNetMinimal::Evaluate (const std::vector< unsigned char > &vecbData, float *adResults, size_t iGenes, size_t iStart=0) const
Repeatedly perform Bayesian inference to obtain the class probability given evidence for some number of nodes.
vecbData | Values for each evidence node; 0xF indicates missing data (no evidence) for a particular node. Note that each evidence value is stored in four bits, not a full byte, so one set of evidence for N nodes occupies ceil(N/2) bytes. Multiple sets of evidence can be included in vecbData, e.g. for N nodes, bytes 0 through ceil(N/2)-1 comprise one set of evidence, ceil(N/2) through 2*ceil(N/2)-1 the next, and so forth. |
adResults | Array into which posterior probabilities of the largest value of the class node are inserted given the evidence (generally probabilities of functional relationships). |
iGenes | Number of inferences to perform and probabilities to generate. |
iStart | First gene to process; this means that the first output probability is placed into the iStart element of adResults, and the first element read from vecbData is at iStart * ceil(N/2). |
Perform Bayesian inference iGenes - iStart times using evidence from vecbData, which consists of zero or more sets of evidence values for the N non-root nodes in the Bayes net. In pseudocode:
for( i = iStart; i < iGenes; ++i ) adResults[ i ] = Evaluate( vecbData, i * floor((N+1)/2) );
Definition at line 898 of file bayesnetfn.cpp.
References Evaluate().
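The pseudocode above can be expanded into a small runnable sketch of the offset arithmetic. EvidenceBytes and EvaluateAll are hypothetical names, the single-datum evaluator is passed in as a callback so the example stands alone, and the bounds check is an assumption about sensible behavior rather than a description of what Sleipnir does.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Bytes occupied by one set of evidence holding iNodes four-bit values: ceil(N/2).
std::size_t EvidenceBytes( std::size_t iNodes ) {
	return ( iNodes + 1 ) / 2;
}

// Hypothetical stand-in for the batch overload: calls fnEvaluate once per gene,
// handing it the byte offset at which that gene's evidence begins.
bool EvaluateAll( const std::vector<unsigned char>& vecbData, float* adResults,
	std::size_t iGenes, std::size_t iStart, std::size_t iNodes,
	const std::function<float( const std::vector<unsigned char>&, std::size_t )>& fnEvaluate ) {
	const std::size_t iChunk = EvidenceBytes( iNodes );

	// ASSUMPTION: reject calls that would read past the end of vecbData.
	if( !adResults || ( ( iGenes * iChunk ) > vecbData.size( ) ) )
		return false;
	for( std::size_t i = iStart; i < iGenes; ++i )
		adResults[ i ] = fnEvaluate( vecbData, i * iChunk );
	return true;
}
```

Elements of adResults before iStart are left untouched, which matches the description of iStart: results are written at the same index as the gene they belong to.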
const CDataMatrix& Sleipnir::CBayesNetMinimal::GetCPT (size_t iNode) const [inline]
Return the conditional probability table matrix for the indicated node.
iNode | Index of node whose CPT is returned (zero-based). |
Definition at line 320 of file bayesnet.h.
Referenced by Sleipnir::CBayesNetSmile::Open().
const unsigned char Sleipnir::CBayesNetMinimal::GetDefault (size_t iNode) const [inline]
Returns the default value (if no input is provided) for the requested node.
iNode | Node for which default value is returned. |
Definition at line 379 of file bayesnet.h.
Referenced by Sleipnir::CBayesNetSmile::Open().
const std::string& Sleipnir::CBayesNetMinimal::GetID () const [inline]
Returns the string identifier of the network.
Definition at line 362 of file bayesnet.h.
size_t Sleipnir::CBayesNetMinimal::GetNodes () const [inline]
Return the total number of nodes in the Bayes net.
Definition at line 334 of file bayesnet.h.
Referenced by Sleipnir::CBayesNetSmile::Open().
bool Sleipnir::CBayesNetMinimal::Open (const CBayesNetSmile &BNSmile)
Construct a new minimal Bayes net from the given SMILE-based network.
BNSmile | SMILE-based network from which to copy node parameters; must have naive structure. |
Definition at line 719 of file bayesnetfn.cpp.
References Sleipnir::CBayesNetSmile::GetCPT(), Sleipnir::CBayesNetSmile::GetDefault(), Sleipnir::CBayesNetSmile::GetNodes(), and Sleipnir::CFullMatrix< tType >::GetRows().
bool Sleipnir::CBayesNetMinimal::Open (std::istream &istm)
Load a minimal Bayes net from the given binary stream.
istm | Stream from which Bayes net is loaded. |
Definition at line 756 of file bayesnetfn.cpp.
References Sleipnir::CFullMatrix< tType >::GetRows(), and Sleipnir::CFullMatrix< tType >::Open().
bool Sleipnir::CBayesNetMinimal::OpenCounts (const char *szFileCounts, const std::map< std::string, size_t > &mapstriNodes, const std::vector< unsigned char > &vecbDefaults, const std::vector< float > &vecdAlphas, float dPseudocounts=HUGE_VAL, const CBayesNetMinimal *pBNDefault=NULL)
Constructs a naive Bayesian classifier using count data for each network node.
szFileCounts | Text file containing counts from which CPTs are derived. |
mapstriNodes | Mapping of node identifiers in counts file to integer indices (zero-based). |
vecbDefaults | If non-empty, vector of default values for each node if data is missing (-1 for none). |
vecdAlphas | If non-empty, vector of prior beliefs alpha for each node. |
dPseudocounts | If not equal to NaN, effective sample size to use for all nodes. |
pBNDefault | If non-null, Bayes net to use for default values when a distribution's counts are too sparse to use accurately. |
Creates a naive Bayesian classifier by estimating maximum likelihood parameter values from counts for each node's data values. These counts should be given in a text file where each set of counts is tab delimited in the form:
network_name number_of_nodes
class prior counts
node_name_1
node 1 counts for class 0
node 1 counts for class 1
node_name_2
...
For example, suppose we are constructing a network with two output classes and three datasets, which can take two, five, or two distinct values, respectively. Valid count data might resemble:
my_network_name 3
90 10
dataset_name_1
80 20
1 9
dataset_name_2
30 40 60 40 30
5 10 20 30 35
dataset_name_3
15 19
30 36
These would generate prior probabilities of 0.9 and 0.1 for the two classes, for example; CPTs for each node would similarly be calculated by dividing each set of counts by their sum. If default values are provided, they will be recorded and used during inference if there is no data available for the appropriate nodes. If a fallback network is provided, probability distributions with too few counts to estimate accurately will be replaced with fallback values.
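Dividing each set of counts by its sum is simple enough to show directly. The helper below is a hypothetical sketch (NormalizeCounts is not a Sleipnir function), and returning all zeros for an empty row is a choice made only for this example.

```cpp
#include <cstddef>
#include <vector>

// Convert one row of counts into maximum likelihood probabilities by
// dividing each count by the row total.
std::vector<float> NormalizeCounts( const std::vector<float>& vecdCounts ) {
	std::vector<float> vecdRet( vecdCounts.size( ), 0 );
	float dSum = 0;

	for( std::size_t i = 0; i < vecdCounts.size( ); ++i )
		dSum += vecdCounts[ i ];
	// ASSUMPTION: a row with no counts is left as zeros in this sketch.
	if( dSum <= 0 )
		return vecdRet;
	for( std::size_t i = 0; i < vecdCounts.size( ); ++i )
		vecdRet[ i ] = vecdCounts[ i ] / dSum;
	return vecdRet;
}
```

Applied to the class prior counts 90 and 10 from the example, this yields 0.9 and 0.1; applied to the class 1 row of dataset_name_1 (1 and 9), it yields 0.1 and 0.9.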
The parameters can be regularized by providing prior belief weights alpha and an effective sample size (pseudocounts). If given, CPT parameters will be calculated as if there were the requested pseudocount number of data points and a uniform prior for each node with relative weight alpha. For example, if dataset_name_1 in the example above was given a pseudocount of 5 and an alpha of 6, the CPT parameters would be calculated as:
P(0|0) = (5 * 80 / (80 + 20) + 6 * 1 / 2) / (5 + 6) = 0.636
P(1|0) = (5 * 20 / (80 + 20) + 6 * 1 / 2) / (5 + 6) = 0.364
P(0|1) = (5 * 1 / (1 + 9) + 6 * 1 / 2) / (5 + 6) = 0.318
P(1|1) = (5 * 9 / (1 + 9) + 6 * 1 / 2) / (5 + 6) = 0.682
Regularization "smooths" the parameters towards a uniform prior belief with strength alpha relative to the effective sample size (pseudocounts), so these probabilities are closer to 0.5 than they would be otherwise.
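The regularized calculation can be written as a short function. This is a sketch of the arithmetic shown in the worked example, not Sleipnir's Regularize or OpenCounts code; RegularizeCounts is a hypothetical name, and taking the uniform prior to be 1 divided by the number of values the node can take is an assumption read off the "1 / 2" terms in that example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Blend the maximum likelihood estimate for one row of counts with a uniform
// prior: the data carries weight dPseudocounts and the prior carries weight dAlpha.
std::vector<float> RegularizeCounts( const std::vector<float>& vecdCounts,
	float dPseudocounts, float dAlpha ) {
	assert( !vecdCounts.empty( ) && ( ( dPseudocounts + dAlpha ) > 0 ) );
	float dSum = 0;
	for( std::size_t i = 0; i < vecdCounts.size( ); ++i )
		dSum += vecdCounts[ i ];
	assert( dSum > 0 );

	// ASSUMPTION: the uniform prior assigns 1/K to each of the node's K values.
	const float dUniform = 1.0f / vecdCounts.size( );
	std::vector<float> vecdRet( vecdCounts.size( ) );
	for( std::size_t i = 0; i < vecdCounts.size( ); ++i )
		vecdRet[ i ] = ( dPseudocounts * vecdCounts[ i ] / dSum + dAlpha * dUniform ) /
			( dPseudocounts + dAlpha );
	return vecdRet;
}
```

Calling it with the counts 80 and 20, a pseudocount of 5, and an alpha of 6 gives 7/11 and 4/11, and the results for any row always sum to one.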
vecbDefaults and vecdAlphas must be of the same length as the number of classifier nodes (including the root node), which must also agree with the maximum node index in mapstriNodes.
Definition at line 991 of file bayesnetfn.cpp.
References Sleipnir::CFullMatrix< tType >::GetRows(), Sleipnir::CFullMatrix< tType >::Initialize(), Sleipnir::CFullMatrix< tType >::Set(), and Sleipnir::CMeta::Tokenize().
void Sleipnir::CBayesNetMinimal::Save (std::ostream &ostm) const
Save a minimal Bayes net to the given binary stream.
ostm | Stream to which Bayes net is saved. |
Definition at line 790 of file bayesnetfn.cpp.
References Sleipnir::CFullMatrix< tType >::Save().
void Sleipnir::CBayesNetMinimal::SetID (const std::string &strID) [inline]
Sets the string identifier of the network.
strID | String identifier for the network. |
Definition at line 348 of file bayesnet.h.