Sleipnir
Greetings, and thanks for your interest in the Sleipnir library! Sleipnir is a C++ library enabling efficient analysis, integration, mining, and machine learning over genomic data. This includes a particular focus on microarrays, since they make up the bulk of available data for many organisms, but Sleipnir can also integrate a wide variety of other data types, from pairwise physical interactions to sequence similarity or shared transcription factor binding sites. All analysis is done with attention to speed and memory usage, enabling the integration of hundreds of datasets covering tens of thousands of genes. In addition to the core library, Sleipnir comes with a variety of pre-made tools, providing solutions to common data processing tasks and examples to help you use Sleipnir in your own programs. Sleipnir is free, open source, fully documented, and ready to be used by itself or as a component in your computational biology analyses.
https://github.com/FunctionLab/sleipnir/
Sleipnir and its associated tools are provided as source code that can be compiled under Linux (using gcc), Windows (using Visual Studio or cygwin), or MacOS (using gcc). For more information, see Building Sleipnir and Contributing to Sleipnir.
If you use Sleipnir, please cite our publication:
Curtis Huttenhower, Mark Schroeder, Maria D. Chikina, and Olga G. Troyanskaya "The Sleipnir library for computational functional genomics", Bioinformatics 2008 PMID 18499696
We avoid distributing binaries directly due to licensing issues, and a typical build on a "normal" desktop computer should take around an hour, but if you have problems building Sleipnir or need a binary distribution for some other reason, please contact us! We're happy to help, and if you have suggestions or contributions, we'll post them here with appropriate credit.
While it is possible (on Linux/Mac OS, at least) to build Sleipnir with very few additional libraries, there are a number of external packages that will add to its functionality. A few of these are used by the core Sleipnir library, the remainder by the tools included with Sleipnir. In general, these libraries should be built and installed before Sleipnir. On Linux/Mac OS, the configure tool will automatically find them in many cases, and it can be pointed at them using the --with flags if necessary. On Windows with Visual Studio, you can use the Additional Include/Library Directories properties; see below for more details. External libraries usable with Sleipnir are:
General instructions are in this section. If you want to build the latest Mercurial checkout on Ubuntu, see "Ubuntu from Mercurial (Current as of Ubuntu 12.04)" for detailed instructions.
Build and install any prerequisite libraries first, noting where each one is installed (e.g. /usr/local/smile or /usr/local/svm_perf).
Run ./configure. If you've installed prerequisite libraries that it doesn't find automatically, provide an appropriate --with switch for each one. For example, to build Sleipnir with SMILE and SVM Perf installed in custom directories under /usr/local/, type:
./configure --with-smile=/usr/local/smile/ --with-svm-perf=/usr/local/svm_perf/
To install Sleipnir to a custom location, add the --prefix=/custom/path/ flag when you run configure.
Once configure has completed successfully, run make and make install.

sudo apt-get install mercurial gengetopt libboost-regex-dev libboost-graph-dev liblog4cpp5-dev build-essential libgsl0-dev
cd ~/Downloads
mkdir smile
cd smile
wget http://genie.sis.pitt.edu/download/smile_linux_x64_gcc_4_4_5.tar.gz
tar -xzf smile_linux_x64_gcc_4_4_5.tar.gz
rm smile_linux_x64_gcc_4_4_5.tar.gz
cd ..
sudo mv smile /usr/local/smile
cd ~/Downloads
mkdir svmperf
cd svmperf
wget http://download.joachims.org/svm_perf/current/svm_perf.tar.gz
tar -xzf svm_perf.tar.gz
rm svm_perf.tar.gz
wget http://libsleipnir.bitbucket.org/SVMperf/Makefile -O Makefile
make
cd ..
sudo mv svmperf /usr/local
cd ~
hg clone https://bitbucket.org/libsleipnir/sleipnir
cd sleipnir
./gen_auto
./gen_tools_am
./configure --with-smile=/usr/local --with-svm-perf=/usr/local/svmperf/
make
Assuming that every step completed successfully, you can now install Sleipnir to /usr/local with:
sudo make install
If you want to install Sleipnir to another location, adjust the ./configure step accordingly.
This section assumes that you're building Sleipnir on Windows using Visual Studio. I'm fairly certain that Sleipnir can be built using cygwin as well by approximately following the Linux/Mac OS instructions.
... PTW32_STATIC_LIB).
... extlib. If you have built them elsewhere, make sure to update the Additional Include and Library Directories properties appropriately.
... LDFLAGS=-static when running configure). On Windows or Mac OS, when in doubt, link dynamically (e.g. in Visual Studio, using the DLL runtime libraries). Mac OS is not consistently able to link statically, and the SMILE library will only link statically in release (not debug) mode on Windows.
... --version flag.
... --prefix argument is optional):
./bootstrap.sh --with-libraries=graph --prefix=/desired/boost/install/path/
./b2 install
... --with on Linux/Mac OS and stored in the Additional Include/Library Directories properties in Visual Studio. These must point to the directories where you've installed the necessary prerequisite libraries, including both library and header files.
... --with argument to the prerequisite library's source directory.
... --with, beware of Boost's tendency to append the compiler version to its library names under certain circumstances. If your Boost installation includes something like gcc41 in the library file names, use --with-boost-graph-lib to give the path to the Boost graph library file rather than its parent directory. Remember, Boost is only used for certain tools, so it won't hurt if you need to exclude it.
... CXXFLAGS=-fno-threadsafe-statics. This works around a bug in certain versions of g++ and pthreads.

Sleipnir can be used to satisfy a variety of needs in bioinformatic data processing, from simple data normalization to complex integration and machine learning. The tools provided with Sleipnir can be used by themselves, or you can integrate the Sleipnir library into your own tools.
The following tasks are examples of what can be achieved using only prebuilt tools provided with Sleipnir. No programming necessary! To see what else can be done if you're writing your own code with Sleipnir, check out the Core Library section below.
You're investigating four different knockout strains of yeast. To assay their transcriptional response to nutrient limitation, you've grown the four cultures on media containing nothing but cheetos for two days, resulting in four two-color microarray time courses. Rather than using a pooled reference, you've used the zero time point of each time course as its reference. This leaves you with four PCL datasets, each containing twelve conditions, and each using a different reference. Your microarray technique is good but not great, so there are some missing values, and the different reference channels make it difficult to compare the different datasets. What can you do?
Suppose the four datasets are named pza1.pcl, ber1.pcl, rmn1.pcl, and cke1.pcl. First, use KNNImputer to impute and remove missing values for each file:
KNNImputer -i pza1.pcl -o pza1_imputed.pcl
Combiner -o combined_imputed.pcl *_imputed.pcl
MCluster -o combined_imputed.gtr -i combined_imputed.pcl > combined_imputed.cdt
Distancer -i pza1_imputed.pcl -o pza1_imputed.dab
Combiner -t dat -o combined_imputed_normalized.dab *_imputed.dab
MCluster -o combined_imputed_normalized.gtr -i combined_imputed_normalized.dab < combined_imputed.pcl > combined_imputed_normalized.cdt
In your previous microarray experiment, you discover that your ber1 knockout strain developed an aneuploidy halfway through your time course. The end of the right arm of chromosome one was duplicated in the last six conditions, artificially doubling the expression level of all of its genes. How can you keep this huge upregulation from driving your clustering?
First, create a PCL file of weights for every gene in every condition. Let's assume your original ber1_imputed.pcl file looks like this:
ORF      NAME     GWEIGHT  TIME1  TIME2  ...  TIME12
EWEIGHT                    1      1      ...  1
YAL001C  TFC3     1        0.1    0.2    ...  0.12
YAL002W  VPS8     1        -0.1   -0.2   ...  -0.12
...
YAR070C  YAR070C  1        1.1    1.2    ...  1.12
YAR071W  PHO11    1        2.1    2.2    ...  2.12
YAR073W  IMD1     1        -1.1   -1.2   ...  -1.12
YAR075W  YAR075W  1        -2.12  -2.11  ...  -2.1
...
YPR203W  YPR203W  1        0.12   0.11   ...  0.1
YPR204W  YPR204W  1        -0.12  -0.11  ...  -0.1
The four YAR genes listed here have been duplicated, and their expression levels are correspondingly high. Create a weights PCL file with exactly the same structure, save that the expression values are all replaced by the desired weights of each gene in each condition. A weight of 1.0 means that the gene should be counted normally, a weight of 0.5 means that it should contribute half as much weight, 2.0 twice as much, and so forth:
ORF      NAME     GWEIGHT  TIME1  TIME2  ...  TIME12
EWEIGHT                    1      1      ...  1
YAL001C  TFC3     1        1.0    1.0    ...  1.0
YAL002W  VPS8     1        1.0    1.0    ...  1.0
...
YAR070C  YAR070C  1        1.0    1.0    ...  0.5
YAR071W  PHO11    1        1.0    1.0    ...  0.5
YAR073W  IMD1     1        1.0    1.0    ...  0.5
YAR075W  YAR075W  1        1.0    1.0    ...  0.5
...
YPR203W  YPR203W  1        1.0    1.0    ...  1.0
YPR204W  YPR204W  1        1.0    1.0    ...  1.0
Each of the four duplicated YAR genes should be assigned a weight of 0.5 in the conditions where it was duplicated; thus, the whole row for PHO11 should be:
YAR071W PHO11 1 1.0 1.0 1.0 1.0 1.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5
Let's name this file ber1_weights.pcl. Now, run MCluster with the expression file and the weights file:
MCluster -o ber1_weighted.gtr -w ber1_weights.pcl -i ber1_imputed.pcl > ber1_weighted.cdt
The resulting cluster output will still contain the doubled expression values, so you can see what the genes' actual expression levels were, but they won't contribute abnormally much to the clustering.
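Writing a weights file by hand for many genes is tedious, so the rows can be generated instead. The following is a minimal sketch, not part of Sleipnir: WeightRow and its parameters are hypothetical names, and it assumes the usual tab-delimited PCL layout with a GWEIGHT of 1. It produces one row in which the first conditions keep weight 1.0 and the remaining ones receive a reduced weight.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not a Sleipnir function): build one tab-delimited row
// of a weights PCL. Conditions before `firstAffected` get weight 1.0; the
// rest get `weight`.
std::string WeightRow(const std::string& orf, const std::string& name,
                      std::size_t conditions, std::size_t firstAffected,
                      double weight) {
    assert(firstAffected <= conditions);
    std::ostringstream row;
    row << std::fixed << std::setprecision(1);
    row << orf << '\t' << name << '\t' << 1;  // GWEIGHT column stays 1
    for (std::size_t i = 0; i < conditions; ++i)
        row << '\t' << (i < firstAffected ? 1.0 : weight);
    return row.str();
}
```

For the PHO11 row shown earlier, the call would be WeightRow("YAR071W", "PHO11", 12, 6, 0.5), which yields six 1.0 entries followed by six 0.5 entries.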
Suppose you've just downloaded the latest and greatest versions of the Gene Ontology, MIPS Funcat, and KEGG Orthology. You're still chasing down information on your four knockout yeast strains, so you also get the GO yeast annotations and Funcat yeast annotations. This should give you five files:
gene_ontology.obo and gene_association.sgd for GO.
funcat-2.0_scheme and funcat-2.0_data_18052006 (or something similar) for MIPS.
ko for KEGG (the orthology file can be hard to find; it's on their FTP site).

Let's load them into OntoShell and look around:
OntoShell -o gene_ontology.obo -g gene_association.sgd -m funcat-2.0_scheme -a funcat-2.0_data_18052006 -k ko -K SCE
This should produce a command line from which you can explore the three ontologies simultaneously:
/> ls
- ROOT
O KEGG 1517
O GOBP 6462
O GOMF 6310
O GOCC 6434
O MIPS 6773
O MIPSP 0
/> cat PHO11
YAR071W (PHO11) One of three repressible acid phosphatases, a glycoprotein that is transported to the cell surface by the secretory pathway
KEGG: ko00361 Metabolism; Xenobiotics Biodegradation and Metab...
      ko00740 Metabolism; Metabolism of Cofactors and Vitamins...
GOBP: GO:0006796 phosphate metabolic process
GOCC: GO:0005576 extracellular region
GOMF: GO:0003993 acid phosphatase activity
MIPS: 01.04.01 phosphate utilization
      01.05.01 C-compound and carbohydrate utilization
      01.07 metabolism of vitamins, cofactors, and prostheti...
/> ls -g GOBP/GO:0007624
- GO:0007624 1 0 ultradian rhythm
P GO:0048511 0 1 rhythmic process
YGL181W(GTS1,FHT1,LSR1)
For more information on specific OntoShell commands and capabilities, please see its documentation.
Suppose you've discovered four genes showing unusual activity during your cheeto time courses. Create a gene list text file for those four genes:
YAR014C
YNL161W
YKL189W
YOR353C
Suppose this is named cheeto_genes.txt. We can test for functional enrichment among this gene set across all three catalogs in OntoShell:
/> find -g -l cheeto_genes.txt 0.01
KEGG: ko04150 0.00791035 1 1 12 1517 Environmental Information Processing; Signal Transduction; ...
GOBP: GO:0000903 6.55773e-011 4 4 8 6462 cellular morphogenesis during vegetative growth
      GO:0016049 4.14193e-006 4 4 103 6462 cell growth
      GO:0008361 1.13194e-005 4 4 132 6462 regulation of cell size
      ...
GOCC: GO:0030427 6.15338e-006 4 4 153 6434 site of polarized growth
      GO:0005933 7.37199e-006 4 4 160 6434 cellular bud
      GO:0043332 1.07503e-005 3 4 34 6434 mating projection tip
      ...
MIPS: 40.01 4.53959e-005 4 4 239 6773 cell growth / morphogenesis
      40 7.86743e-005 4 4 274 6773 CELL FATE
      40.01.03 0.00519136 2 4 37 6773 directional cell growth (morphogenesis)
So it looks like eating nothing but cheetos has something to do with vegetative growth! You could run this same command directly from the command line to save the output in a file for later reference:
OntoShell -o gene_ontology.obo -g gene_association.sgd -m funcat-2.0_scheme -a funcat-2.0_data_18052006 -k ko -K SCE -x 'find -g -l cheeto_genes.txt 0.01' > cheeto_genes_enriched_terms.txt
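To give a feel for where an enrichment p-value can come from, here is a minimal sketch of a hypergeometric tail test. This is an assumption about the statistic, not a description of OntoShell's implementation: the function names are hypothetical, and the reading of the four integer columns as hits, query size, term size, and catalog size is inferred from the output above. Under that reading, the KEGG row (1, 1, 12, 1517) gives 12/1517, which matches the reported 0.00791035; the sketch has not been checked against the other rows, which may involve further adjustment.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Natural log of the binomial coefficient C(n, k); requires k <= n.
double LogChoose(unsigned n, unsigned k) {
    return std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0);
}

// Hypothetical helper: P(X >= hits) when `query` genes are drawn without
// replacement from `total` genes, `term` of which carry the annotation.
double HypergeometricTail(unsigned hits, unsigned query, unsigned term,
                          unsigned total) {
    assert(term <= total && query <= total && hits <= query);
    const double denom = LogChoose(total, query);
    double p = 0.0;
    for (unsigned i = hits; i <= std::min(query, term); ++i) {
        if (query - i > total - term) continue;  // not enough unannotated genes
        p += std::exp(LogChoose(term, i) +
                      LogChoose(total - term, query - i) - denom);
    }
    return p;
}
```

Working in log space keeps the binomial coefficients from overflowing for catalogs with thousands of genes.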
You've done about as much by-hand analysis of your cheeto time courses as you can, so you're ready to throw some machine learning algorithms at them. Suppose you want to construct a predicted functional relationship network specific to your four datasets and the process of "cellular morphogenesis during vegetative growth".
translation GO:0043037
cytoskeleton organization and biogenesis GO:0007010
transcription from RNA polymerase II promoter GO:0006366
...
boron transport GO:0046713
... positives and run BNFunc:
BNFunc -o gene_ontology.obo -a gene_association.sgd -i GO_functional_slim.txt -d positives
... -l flag as well:
Answerer -p positives -n positives -l 0.05 -o answers.dab
... pza1_imputed.dab and so forth in the example above, and let's add four quantization QUANT files for them and one for the answer file. The answer file's easy; it just contains 0s and 1s, so create a text file answers.quant containing one line:
0.5 1.5
... pza1_imputed.quant and so forth, each containing the single tab-delimited line:
-1.5 -0.5 0.5 1.5 2.5 3.5 4.5
... cheeto_genes.txt file you created earlier? Assuming all of your files are together in the current directory, run:
BNCreator -w answers.dab -o cheeto_network.xdsl -d . -c cheeto_genes.txt
BNCreator -i cheeto_network.xdsl -o cheeto_network.dab -d .
... cheeto_network.dab each represent a probability of functional interaction between each gene pair. You can do all sorts of interesting analyses on this network, including visualizing portions of it using the bioPIXIE algorithm. If you want to see what the portion of the network around your original genes of interest looks like, use Dat2Graph:
Dat2Graph -i cheeto_network.dab -q cheeto_genes.txt -k 5 > cheeto_genes_subnetwork.dot
Dat2Dab -i cheeto_network.dab -o cheeto_network.dat
YAL001C YAL040C 0.0155878
YAL001C YAL041W 0.242001
YAL001C YAL056W 0.345961
...
MCluster -o combined_imputed_fr_predictions.gtr -i cheeto_network.dab < combined_imputed.pcl > combined_imputed_fr_predictions.cdt
Cliquer -i cheeto_network.dab -r 3 -w 0.33 > cheeto_network_clusters.txt
16.203 YAR071W YBR093C YHR215W YBR092C YDL106C
14.1403 YBR066C YDR043C YHL027W YDR477W YBR112C YJL089W
15.3548 YBR093C YML077W YML123C YHR136C
14.1379 YIL108W YOR032C YFR028C YGL003C YGR225W
13.2328 YMR238W YOL131W YLR295C YKL046C YDR309C YHR061C YMR055C
...
... cheeto_genes.txt. Then use Hubber to run:
Hubber -i cheeto_network.dab -g 100 cheeto_genes.txt > cheeto_gene_predictions.txt
YER033C|22.51|0 YDR389W|17.9219|0 YPL204W|16.5705|0 ...
While the tools provided with Sleipnir satisfy a variety of common data processing needs, the library's real potential lies in its ability to be integrated into anyone's bioinformatic analyses. If you're thinking of developing your own tools using the Sleipnir library, here are some ideas.
Keep in mind that the best way to develop using Sleipnir is to start from one of the pre-existing tools. Copy the code and/or project file for the tool most similar to your intended goal and start modifying! This will automatically ensure that you retain the required skeleton for interacting with Sleipnir:
... EnableXdslFormat, for example (although it's not required any more), and on Windows, pthreads needs some setup/teardown (see below).

A skeletal main function using Sleipnir (and Windows pthreads) might resemble:
#include "cmdline.h"
#include "dat.h"
#include "meta.h"
#include "pcl.h"
using namespace Sleipnir;

int main( int iArgs, char** aszArgs ) {
    gengetopt_args_info sArgs;
    // ... other variables here ...

    if( cmdline_parser( iArgs, aszArgs, &sArgs ) ) {
        cmdline_parser_print_help( );
        return 1; }
    CMeta::Startup( sArgs.verbosity_arg );
#ifdef WIN32
    pthread_win32_process_attach_np( );
#endif // WIN32

    // ... do stuff here ...

#ifdef WIN32
    pthread_win32_process_detach_np( );
#endif // WIN32
    return 0; }
You've just downloaded the entire GEO database of microarrays for C. elegans. You'd like to explore these microarrays to find the gene pairs most highly correlated across the largest number of tissues.
Want to cluster genes using more than just expression correlation? You can feed Sleipnir's built-in clustering algorithms similarity scores based on anything, or calculate your own similarity measures in real time. Here are some ideas:
Sleipnir contains rudimentary support for continuous naive Bayesian classifiers using any distribution easily fittable by maximum likelihood: normal, beta, exponential, etc. It also has limited support for using the PNL graphical models library from Intel, which supports a variety of sophisticated continuous models. Some potential avenues of interest include:
By providing a uniform interface to a variety of functional catalogs (GO, MIPS, KEGG, SGD features, and MIPS phenotypes, to name a few), Sleipnir offers not only an opportunity for data analysis but for comparative functional annotation.
While Sleipnir includes a wide variety of data structures and analysis tools, a few formats and concepts recur frequently in its design. The most important of these is the symmetric matrix, encapsulated by the Sleipnir::CDat class. A Sleipnir::CDat represents a set of pairwise scores between genes; these can be encoded as a DAT text file of the form:
GENE1 GENE2 VALUE1
GENE1 GENE3 VALUE2
GENE2 GENE3 VALUE3
YPL149W YBR217W 15.6
YPL149W YKL126W -0.62
...
Equivalently, a Sleipnir::CDat can be encoded as a DAB binary file, which stores identical information as a symmetric matrix (i.e. a half matrix):
| | GENE1 | GENE2 | GENE3 | ... |
|---|---|---|---|---|
| GENE1 | | VALUE1 | VALUE2 | |
| GENE2 | | | VALUE3 | |
| ... | | | | |
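The half-matrix idea can be illustrated with a small sketch. This is not Sleipnir's CDat implementation; HalfMatrix and its methods are hypothetical names. It stores only the entries above the diagonal in a flat array, so n genes need n(n-1)/2 values, and a lookup of (i, j) or (j, i) reaches the same cell.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative only: a symmetric score matrix without a diagonal, stored as
// the strict upper triangle in row-major order.
class HalfMatrix {
public:
    explicit HalfMatrix(std::size_t genes)
        : m_genes(genes), m_values(genes * (genes - 1) / 2, 0.0f) {}

    void Set(std::size_t i, std::size_t j, float value) {
        m_values[Index(i, j)] = value;
    }
    float Get(std::size_t i, std::size_t j) const {
        return m_values[Index(i, j)];
    }
    std::size_t Size() const { return m_values.size(); }

private:
    std::size_t Index(std::size_t i, std::size_t j) const {
        assert(i != j && i < m_genes && j < m_genes);
        if (i > j) std::swap(i, j);  // (i, j) and (j, i) share one cell
        // Rows before i hold (n-1) + (n-2) + ... + (n-i) entries.
        return i * m_genes - i * (i + 1) / 2 + (j - i - 1);
    }

    std::size_t m_genes;
    std::vector<float> m_values;
};
```

With three genes, the cells (0,1), (0,2), and (1,2) map to offsets 0, 1, and 2, mirroring VALUE1, VALUE2, and VALUE3 in the table above.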
Continuous values in a Sleipnir::CDat can be discretized automatically (e.g. for machine learning) using QUANT files. These are one-line tab-delimited text files indicating the bin edges for discretization. For example, suppose we have a DAT file named example.dat:
A B 0.2
A C 0.9
B C 0.6
We can pair it with example.quant, which will contain:
0.3 0.6 0.9
This is equivalent to the discretized scores:
A B 0
A C 2
B C 1
Each bin edge (except the last) represents an inclusive upper bound. That is, given a value, it falls into the first bin where it's less than or equal to the edge. In interval notation, this means the QUANT above is equivalent to (-infinity, 0.3], (0.3, 0.6], (0.6, infinity).
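The binning rule can be written down in a few lines. The sketch below is a hypothetical helper, not Sleipnir's own quantization code; it assumes the edges are sorted in ascending order and treats every edge except the last as an inclusive upper bound, as described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: map a continuous value to a bin index using
// QUANT-style bin edges (sorted ascending, at least one edge).
std::size_t Quantize(double value, const std::vector<double>& edges) {
    assert(!edges.empty());
    // The last edge is not an upper bound: anything above the other edges
    // lands in the final bin, so it is excluded from the search.
    const auto last = edges.end() - 1;
    // lower_bound finds the first edge that is >= value, i.e. the first bin
    // whose inclusive upper bound admits the value.
    const auto it = std::lower_bound(edges.begin(), last, value);
    return static_cast<std::size_t>(it - edges.begin());
}
```

With the edges 0.3, 0.6, 0.9 from example.quant, the values 0.2, 0.9, and 0.6 map to bins 0, 2, and 1, matching the discretized scores above.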
Generally speaking, each Sleipnir::CDat represents the result of a single experimental assay (or group of related assays, e.g. one microarray time course). A group of datasets can be manipulated in tandem using Sleipnir::IDataset, an interface made to simplify machine learning or other analysis of many datasets simultaneously. This allows you to ask questions like, "Given some gene pair A and B, how did they interact in these four assays, and what does my gold standard say about their interaction?"
For more information, see the Sleipnir::CDat and Sleipnir::IDataset documentation.
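If you need to read the DAT text format outside of Sleipnir, a few lines are enough. The following is a minimal sketch with hypothetical names (GenePair, ReadDat), not Sleipnir's own parser; it assumes whitespace-separated fields of the form gene, gene, score, and stores each pair in sorted order because the scores are symmetric.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

using GenePair = std::pair<std::string, std::string>;

// Hypothetical reader: collect pairwise scores from DAT-style text, keyed by
// an order-independent gene pair. Blank or malformed lines are skipped.
std::map<GenePair, double> ReadDat(std::istream& in) {
    std::map<GenePair, double> scores;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string a, b;
        double value;
        if (!(fields >> a >> b >> value)) continue;
        if (b < a) std::swap(a, b);  // A-B and B-A are the same pair
        scores[GenePair(a, b)] = value;
    }
    return scores;
}
```

Passing std::cin reads a DAT stream from standard input, which fits the tools' convention of using DAT formatting for standard input and output.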
Sleipnir often uses files containing gene lists, which are simple text files with one gene ID per line:
YPL149W
YHR171W
YBR217W
...
Other text-based files include the tab-delimited zeros or defaults file format, containing two columns, the first a node ID and the second an integer value:
MICROARRAY 0
TF 2
SYNL_TRAD 1
...
Similarly, an ontology slim file contains one term per line with two tab-delimited columns, the first a description of some functional catalog term and the second its ID:
autophagy GO:0006914
mitochondrion organization and biogenesis GO:0007005
translation GO:0006412
...
Finally, Sleipnir also takes advantage of several predefined file formats, primarily PCLs and the associated CDT/GTR file pairing system (see Sleipnir::CPCL).
In Sleipnir's tools, standard input and standard output are used as defaults almost everywhere; a "DAT/DAB" file generally means any appropriate Sleipnir::CDat format. If standard input or output is being used, DAT formatting is generally assumed; if an explicit input or output filename is given, DAB formatting is generally assumed.
While we don't (currently) have the resources to make Sleipnir a full-blown community project, we'd love to include (with full credit, of course) any patches submitted by the community. If you're interested in developing new Sleipnir tools or library components, the following steps may be useful:
The Sleipnir repository is at https://bitbucket.org/libsleipnir/sleipnir/overview. The sleipnir branch always contains the latest development version of Sleipnir, and official versioned releases appear under tags. If you'd like to submit patches to us for inclusion in Sleipnir, please try to do so against the current development version (sleipnir). This repository ties in with our ticket system, which also provides a good way to submit patches.
... SIZE_MAX definition on Mac OS X - thanks to Alice Koechlin!
... half2relative.rb and half2weights.rb scripts to MIer - thanks to Arjun Krishnan!

Sleipnir is provided under the Creative Commons Attribution 3.0 license.
You are free to share, copy, distribute, transmit, or adapt this work PROVIDED THAT you attribute the work to the authors listed above. For more information, please see the following web page: http://creativecommons.org/licenses/by/3.0/