Sleipnir
The Sleipnir Library for Computational Functional Genomics

Greetings, and thanks for your interest in the Sleipnir library! Sleipnir is a C++ library enabling efficient analysis, integration, mining, and machine learning over genomic data. This includes a particular focus on microarrays, since they make up the bulk of available data for many organisms, but Sleipnir can also integrate a wide variety of other data types, from pairwise physical interactions to sequence similarity or shared transcription factor binding sites. All analysis is done with attention to speed and memory usage, enabling the integration of hundreds of datasets covering tens of thousands of genes. In addition to the core library, Sleipnir comes with a variety of pre-made tools, providing solutions to common data processing tasks and examples to help you use Sleipnir in your own programs. Sleipnir is free, open source, fully documented, and ready to be used by itself or as a component in your computational biology analyses.

  1. Download
  2. Citation
  3. Building Sleipnir
  4. Example Uses
  5. Philosophy
  6. Contributing to Sleipnir
  7. Version History
  8. License
  1. Tool Documentation
  2. Library Documentation

Download

Sleipnir and its associated tools are provided as source code that can be compiled under Linux (using gcc), Windows (using Visual Studio or cygwin), or MacOS (using gcc). For more information, see Building Sleipnir and Contributing to Sleipnir.

Citation

If you use Sleipnir, please cite our publication:

Curtis Huttenhower, Mark Schroeder, Maria D. Chikina, and Olga G. Troyanskaya, "The Sleipnir library for computational functional genomics", Bioinformatics, 2008. PMID: 18499696.

Building Sleipnir

We avoid distributing binaries directly due to licensing issues, and a typical build on a "normal" desktop computer should take around an hour. If you have problems building Sleipnir or need a binary distribution for some other reason, please contact us! We're happy to help, and if you have suggestions or contributions, we'll post them here with appropriate credit.

Prerequisites

While it is possible (on Linux/Mac OS, at least) to build Sleipnir with very few additional libraries, there are a number of external packages that will add to its functionality. A few of these are used by the core Sleipnir library, the remainder by the tools included with Sleipnir. In general, these libraries should be built and installed before Sleipnir. On Linux/Mac OS, the configure tool will automatically find them in many cases, and it can be pointed at them using the --with flags if necessary. On Windows with Visual Studio, you can use the Additional Include/Library Directories properties; see below for more details. External libraries usable with Sleipnir are:

Requirements

Recommendations

Suggestions

Linux/MacOS

General instructions are in this section. If you want to build the latest Mercurial checkout on Ubuntu, the Ubuntu from Mercurial (Current as of Ubuntu 12.04) section below provides detailed instructions.

  1. Obtain any Prerequisites you need/want. These can often be installed using your favorite Linux package manager. If you need to compile/install them to a nonstandard location by hand, please note the directory prefix where they are installed.
  2. If you're using SVM Perf, please use the Makefile provided on the Sleipnir site (the Ubuntu instructions below show where to download it) to build SVM Perf as a library rather than an executable.
  3. Note that SVM Perf and SMILE are both nonstandard in that they expect header files and libraries to reside in the same directory (e.g. /usr/local/smile or /usr/local/svm_perf).
  4. Download and unpack Sleipnir. If you obtained Sleipnir from the Mercurial repository, you will need to run both gen_auto and gen_tools_am (a step which requires GNU Autotools).
  5. In the Sleipnir directory, run ./configure. If you've installed prerequisite libraries that it doesn't find automatically, provide an appropriate --with switch for each one. For example, to build Sleipnir with SMILE and SVM Perf installed in custom directories under /usr/local/, type:
     ./configure --with-smile=/usr/local/smile/ --with-svm-perf=/usr/local/svm_perf/
    
  6. If you'd like to install Sleipnir itself to a custom location, include a --prefix=/custom/path/ flag when you run configure (see the example after this list).
  7. After configure's completed successfully, run make and make install.
  8. Tools that use Sleipnir will be built and installed automatically if Gengetopt and any other prerequisite libraries are available.
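
For example, a minimal sketch of a build installed under your home directory (the paths are illustrative):

     ./configure --prefix=$HOME/sleipnir --with-smile=/usr/local/smile/ --with-svm-perf=/usr/local/svm_perf/
     make
     make install

Because the install prefix is under your home directory, the make install step does not require root privileges.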

Ubuntu from Mercurial (Current as of Ubuntu 12.04)

  1. Obtain the Mercurial, Gengetopt, Boost, log4cpp, GSL, and build-essential packages. In a terminal, type:
     sudo apt-get install mercurial gengetopt libboost-regex-dev libboost-graph-dev liblog4cpp5-dev build-essential libgsl0-dev
    
  2. If desired, download and install SMILE:
    1. From http://genie.sis.pitt.edu/downloads.html, download the appropriate package (x64 or x86) for gcc version 4 or above (currently 4.4.5). If you have registered as a SMILE user and meet the appropriate requirements, the following commands should work for x64 (and assume you have a Downloads directory):
        cd ~/Downloads
        mkdir smile
        cd smile
        wget http://genie.sis.pitt.edu/download/smile_linux_x64_gcc_4_4_5.tar.gz
        tar -xzf smile_linux_x64_gcc_4_4_5.tar.gz
        rm smile_linux_x64_gcc_4_4_5.tar.gz
        cd ..
        sudo mv smile /usr/local/smile
      
  3. Currently Sleipnir requires SVMperf, so you must complete the following steps:
    1. Visit http://www.cs.cornell.edu/People/tj/svm_light/svm_perf.html and make sure that you meet the conditions of use (currently: "The program is free for scientific use. Please contact me, if you are planning to use the software for commercial purposes. The software must not be further distributed without prior permission of the author. If you use SVMperf in your scientific work, please cite the appropriate publications (available from the SVMperf website)").
    2. Assuming you meet the conditions, the following steps in a terminal will download, compile, and install SVMperf as required by Sleipnir.
            cd ~/Downloads
            mkdir svmperf
            cd svmperf
            wget http://download.joachims.org/svm_perf/current/svm_perf.tar.gz
            tar -xzf svm_perf.tar.gz
            rm svm_perf.tar.gz
            wget http://libsleipnir.bitbucket.org/SVMperf/Makefile -O Makefile
            make
            cd ..
            sudo mv svmperf /usr/local
      
  4. Get Sleipnir (the following assumes you want Sleipnir to live in ~/sleipnir; if this is not correct, adjust the paths accordingly):
     cd ~
     hg clone https://bitbucket.org/libsleipnir/sleipnir
    
  5. Move to the Sleipnir directory and run the autotools scripts:
      cd sleipnir
      ./gen_auto
      ./gen_tools_am
    
  6. Configure and build Sleipnir:
      ./configure --with-smile=/usr/local/smile/ --with-svm-perf=/usr/local/svmperf/
      make
    
  7. Assuming that all completed successfully, you can now install Sleipnir to /usr/local with:
      sudo make install
    
     If you want to install Sleipnir to another location, adjust the ./configure step accordingly (e.g. by adding a --prefix flag).

Windows

This section assumes that you're building Sleipnir on Windows using Visual Studio. Sleipnir should also build under Cygwin by approximately following the Linux/Mac OS instructions.

  1. Obtain any Prerequisites you need/want. A few of these have Windows installers, but most will need to be built using Visual Studio. In general, you can do this by:
    1. Unpack the library being built.
    2. Create an empty Visual Studio C++ project. Add all of the library's .c, .cpp, and/or .h files to the project.
    3. Make sure the project's Configuration Type property is "Static Library" and its Runtime Library property is "Multi-threaded" (or "Multi-threaded Debug" as appropriate).
    4. Some libraries have preprocessor definitions that must be set to ensure that they are built as static libraries (e.g. PTW32_STATIC_LIB).
    5. Build the project.
    gengetopt is an exception, since it's an executable program; make sure its Configuration Type is "Application".
  2. Download and unpack Sleipnir.
  3. Open up the Sleipnir solution or individual projects. By default, Sleipnir expects external libraries to be built in a directory named extlib. If you have built them elsewhere, make sure to update the Additional Include and Library Directories properties appropriately.
  4. Build the Sleipnir library project first.
  5. If you have built the gengetopt executable in a non-default location, make sure to modify the .ggo build rule's Command Line value under the Custom Build Rules menu item.
  6. Build any desired Sleipnir tools.

Troubleshooting

Example Uses

Sleipnir can be used to satisfy a variety of needs in bioinformatic data processing, from simple data normalization to complex integration and machine learning. The tools provided with Sleipnir can be used by themselves, or you can integrate the Sleipnir library into your own tools.

Tools

The following tasks are examples of what can be achieved using only prebuilt tools provided with Sleipnir. No programming necessary! To see what else can be done if you're writing your own code with Sleipnir, check out the Core Library section below.

Microarray Processing

You're investigating four different knockout strains of yeast. To assay their transcriptional response to nutrient limitation, you've grown the four cultures on media containing nothing but cheetos for two days, resulting in four two-color microarray time courses. Rather than using a pooled reference, you've used the zero time point of each time course as its reference. This leaves you with four PCL datasets, each containing twelve conditions, and each using a different reference. Your microarray technique is good but not great, so there are some missing values, and the different reference channels make it difficult to compare the different datasets. What can you do?
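
One possible pipeline, sketched below with the prebuilt tools: impute the missing values in each PCL with KNNImputer, then reduce each imputed time course to pairwise gene similarity scores with Distancer, which can be compared across datasets even though each used a different reference. KNNImputer and Distancer are real Sleipnir tools, but the option names and the strain2/strain3/strain4 filenames below are assumptions; check each tool's documentation before running.

 # Sketch only: options follow Sleipnir's usual -i/-o conventions and may differ; see each tool's --help.
 # ber1 is the strain discussed in the next section; strain2/strain3/strain4 are placeholder names.
 for STRAIN in ber1 strain2 strain3 strain4; do
     # Fill in missing expression values by k-nearest-neighbor imputation
     KNNImputer -i ${STRAIN}.pcl -o ${STRAIN}_imputed.pcl
     # Summarize each time course as pairwise similarity scores between genes
     Distancer -i ${STRAIN}_imputed.pcl -o ${STRAIN}.dab
 done

The imputed PCLs (e.g. ber1_imputed.pcl) are also the starting point for the clustering example in the next section.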

Clustering With Aneuploidies

In your previous microarray experiment, you discover that your ber1 knockout strain developed an aneuploidy halfway through your time course. The end of the right arm of chromosome one was duplicated in the last six conditions, artificially doubling the expression level of all of its genes. How can you keep this huge upregulation from driving your clustering?

First, create a PCL file of weights for every gene in every condition. Let's assume your original ber1_imputed.pcl file looks like this:

 ORF    NAME    GWEIGHT TIME1   TIME2   ... TIME12
 EWEIGHT            1   1   ... 1
 YAL001C    TFC3    1   0.1 0.2 ... 0.12
 YAL002W    VPS8    1   -0.1    -0.2    ... -0.12
 ...
 YAR070C    YAR070C 1   1.1 1.2 ... 1.12
 YAR071W    PHO11   1   2.1 2.2 ... 2.12
 YAR073W    IMD1    1   -1.1    -1.2    ... -1.12
 YAR075W    YAR075W 1   -2.12   -2.11   ... -2.1
 ...
 YPR203W    YPR203W 1   0.12    0.11    ... 0.1
 YPR204W    YPR204W 1   -0.12   -0.11   ... -0.1

The four YAR genes listed here have been duplicated, and their expression levels are correspondingly high. Create a weights PCL file with exactly the same structure, except that the expression values are all replaced by the desired weight of each gene in each condition. A weight of 1.0 means that the gene is counted normally, 0.5 means that it contributes half as much, 2.0 twice as much, and so forth:

 ORF    NAME    GWEIGHT TIME1   TIME2   ... TIME12
 EWEIGHT            1   1   ... 1
 YAL001C    TFC3    1   1.0 1.0 ... 1.0
 YAL002W    VPS8    1   1.0 1.0 ... 1.0
 ...
 YAR070C    YAR070C 1   1.0 1.0 ... 0.5
 YAR071W    PHO11   1   1.0 1.0 ... 0.5
 YAR073W    IMD1    1   1.0 1.0 ... 0.5
 YAR075W    YAR075W 1   1.0 1.0 ... 0.5
 ...
 YPR203W    YPR203W 1   1.0 1.0 ... 1.0
 YPR204W    YPR204W 1   1.0 1.0 ... 1.0

Each of the four duplicated YAR genes should be assigned a weight of 0.5 in the conditions where it was duplicated; thus, the whole row for PHO11 should be:

 YAR071W    PHO11   1   1.0 1.0 1.0 1.0 1.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5

Let's name this file ber1_weights.pcl. Now, run MCluster with the expression file and the weights file:

 MCluster -o ber1_weighted.gtr -w ber1_weights.pcl -i ber1_imputed.pcl > ber1_weighted.cdt

The resulting cluster output will still contain the doubled expression values, so you can see what the genes' actual expression levels were, but they won't contribute disproportionately to the clustering.

Exploring Functional Catalogs

Suppose you've just downloaded the latest and greatest versions of the Gene Ontology, MIPS Funcat, and KEGG Orthology. You're still chasing down information on your four knockout yeast strains, so you also get the GO yeast annotations and Funcat yeast annotations. This should give you five files:

 gene_ontology.obo
 gene_association.sgd
 funcat-2.0_scheme
 funcat-2.0_data_18052006
 ko

Let's load them into OntoShell and look around:

 OntoShell -o gene_ontology.obo -g gene_association.sgd -m funcat-2.0_scheme -a funcat-2.0_data_18052006
        -k ko -K SCE

This should produce a command line from which you can explore the three ontologies simultaneously:

/> ls
- ROOT
O KEGG  1517
O GOBP  6462
O GOMF  6310
O GOCC  6434
O MIPS  6773
O MIPSP 0
/> cat PHO11
YAR071W (PHO11)
One of three repressible acid phosphatases, a glycoprotein that is transported to the cell surface by the secretory pathway
KEGG: ko00361            Metabolism; Xenobiotics Biodegradation and Metab...
      ko00740            Metabolism; Metabolism of Cofactors and Vitamins...
GOBP: GO:0006796         phosphate metabolic process
GOCC: GO:0005576         extracellular region
GOMF: GO:0003993         acid phosphatase activity
MIPS: 01.04.01           phosphate utilization
      01.05.01           C-compound and carbohydrate utilization
      01.07              metabolism of vitamins, cofactors, and prostheti...
/> ls -g GOBP/GO:0007624
- GO:0007624         1     0     ultradian rhythm
P GO:0048511         0     1     rhythmic process
 YGL181W(GTS1,FHT1,LSR1)

For more information on specific OntoShell commands and capabilities, please see its documentation.

Suppose you've discovered four genes showing unusual activity during your cheeto time courses. Create a gene list text file for those four genes:

 YAR014C
 YNL161W
 YKL189W
 YOR353C

Suppose this is named cheeto_genes.txt. We can test for functional enrichment among this gene set across all three catalogs in OntoShell:

/> find -g -l cheeto_genes.txt 0.01
KEGG:
ko04150            0.00791035    1    1    12   1517 Environmental Information Processing; Signal Transduction; ...
GOBP:
GO:0000903         6.55773e-011  4    4    8    6462 cellular morphogenesis during vegetative growth
GO:0016049         4.14193e-006  4    4    103  6462 cell growth
GO:0008361         1.13194e-005  4    4    132  6462 regulation of cell size
...
GOCC:
GO:0030427         6.15338e-006  4    4    153  6434 site of polarized growth
GO:0005933         7.37199e-006  4    4    160  6434 cellular bud
GO:0043332         1.07503e-005  3    4    34   6434 mating projection tip
...
MIPS:
40.01              4.53959e-005  4    4    239  6773 cell growth / morphogenesis
40                 7.86743e-005  4    4    274  6773 CELL FATE
40.01.03           0.00519136    2    4    37   6773 directional cell growth (morphogenesis)

So it looks like eating nothing but cheetos has something to do with vegetative growth! You could run this same command directly from the command line to save the output in a file for later reference:

 OntoShell -o gene_ontology.obo -g gene_association.sgd -m funcat-2.0_scheme -a funcat-2.0_data_18052006
        -k ko -K SCE -x 'find -g -l cheeto_genes.txt 0.01' > cheeto_genes_enriched_terms.txt

Bayesian Data Integration

You've done about as much by-hand analysis of your cheeto time courses as you can, so you're ready to throw some machine learning algorithms at them. Suppose you want to construct a predicted functional relationship network specific to your four datasets and the process of "cellular morphogenesis during vegetative growth".
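
Very roughly, the prebuilt tools can cover this end to end: Answerer can turn gene sets for the process of interest into a gold-standard DAB of related and unrelated gene pairs, Distancer can turn each dataset into a DAB of pairwise scores, and BNCreator can learn a naive Bayesian classifier from those datasets plus the gold standard. The sketch below is schematic only; every option name in it is an assumption, so consult each tool's documentation for the real flags.

 # Schematic only: every option below is an assumption; check each tool's --help before running.
 # Gold standard of related/unrelated gene pairs, built from gene set(s) for the process of interest
 Answerer -p positives/ -o answers.dab
 # Learn a naive Bayesian classifier over the four datasets; each *.dab is paired with a *.quant
 # file so its continuous scores can be discretized (see the QUANT discussion under Philosophy)
 BNCreator -w answers.dab -o cheeto.xdsl ber1.dab strain2.dab strain3.dab strain4.dab
 # The learned network can then be evaluated over the same datasets to produce a DAB of predicted
 # functional relationships; see the Bayes-net tools' documentation for the evaluation options.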

Core Library

While the tools provided with Sleipnir satisfy a variety of common data processing needs, the library's real potential lies in its ability to be integrated into anyone's bioinformatic analyses. If you're thinking of developing your own tools using the Sleipnir library, here are some ideas.

An Important Note

Keep in mind that the best way to develop using Sleipnir is to start from one of the pre-existing tools. Copy the code and/or project file for the tool most similar to your intended goal and start modifying! This will automatically ensure that you retain the required skeleton for interacting with Sleipnir.

A skeletal main function using Sleipnir (and Windows pthreads) might resemble:

 #include "cmdline.h"
 #include "dat.h"
 #include "meta.h"
 #include "pcl.h"
 using namespace Sleipnir;
 
 int main( int iArgs, char** aszArgs ) {
    gengetopt_args_info sArgs;
    ... other variables here ...
 
    if( cmdline_parser( iArgs, aszArgs, &sArgs ) ) {
        cmdline_parser_print_help( );
        return 1; }
    CMeta::Startup( sArgs.verbosity_arg );
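 // pthreads-win32 must be attached/detached explicitly per process when it is linked as a static library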
 #ifdef WIN32
    pthread_win32_process_attach_np( );
 #endif // WIN32
 
    ... do stuff here ...
 
 #ifdef WIN32
    pthread_win32_process_detach_np( );
 #endif // WIN32
    return 0; }

Rapid Data Mining

You've just downloaded the entire GEO database of microarrays for C. elegans. You'd like to explore these microarrays to find the gene pairs most highly correlated across the largest number of tissues.
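
Before writing custom code against the library, a rough first pass is possible with the prebuilt tools. The sketch below assumes the usual -i/-o option conventions (verify with each tool's --help) and hypothetical filenames under a geo/ directory: score each dataset's gene pairs with Distancer, average the per-dataset scores with Combiner, and dump the combined scores as DAT text to pull out the top pairs.

 # Sketch only: option names are assumptions; check each tool's --help.
 for F in geo/*.pcl; do
     Distancer -i "$F" -o "${F%.pcl}.dab"               # pairwise correlations within one dataset
 done
 Combiner -o combined.dab geo/*.dab                      # average the scores across all datasets
 Dat2Dab -i combined.dab | sort -k3,3 -gr | head -n 100  # top-scoring gene pairs, as DAT text

Your own code could do the same thing in memory with Sleipnir::CPCL and Sleipnir::CDat, keeping a running list of top pairs instead of writing intermediate files.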

Integrative Clustering Algorithms

Want to cluster genes using more than just expression correlation? You can feed Sleipnir's built-in clustering algorithms similarity scores based on anything, or calculate your own similarity measures in real time. Here are some ideas:

Continuous Bayesian Networks

Sleipnir contains rudimentary support for continuous naive Bayesian classifiers using any distribution easily fittable by maximum likelihood: normal, beta, exponential, etc. It also has limited support for using the PNL graphical models library from Intel, which supports a variety of sophisticated continuous models. Some potential avenues of interest include:

Functional Ontology Comparisons

By providing a uniform interface to a variety of functional catalogs (GO, MIPS, KEGG, SGD features, and MIPS phenotypes, to name a few), Sleipnir offers an opportunity not only for data analysis but also for comparative functional annotation.

Philosophy

While Sleipnir includes a wide variety of data structures and analysis tools, a few formats and concepts recur frequently in its design. The most important of these is the symmetric matrix, encapsulated by the Sleipnir::CDat class. A Sleipnir::CDat represents a set of pairwise scores between genes; these can be encoded as a DAT text file of the form:

 GENE1  GENE2   VALUE1
 GENE1  GENE3   VALUE2
 GENE2  GENE3   VALUE3
 YPL149W    YBR217W 15.6
 YPL149W    YKL126W -0.62
 ...

Equivalently, a Sleipnir::CDat can be encoded as a DAB binary file, which stores identical information as a symmetric matrix (i.e. a half matrix):

        GENE1   GENE2   GENE3   ...
 GENE1          VALUE1  VALUE2
 GENE2                  VALUE3
 ...

Continuous values in a Sleipnir::CDat can be discretized automatically (e.g. for machine learning) using QUANT files. These are one-line tab-delimited text files indicating the bin edges for discretization. For example, suppose we have a DAT file named example.dat:

 A  B   0.2
 A  C   0.9
 B  C   0.6

We can pair it with example.quant, which will contain:

 0.3    0.6 0.9

This is equivalent to the discretized scores:

 A  B   0
 A  C   2
 B  C   1

Each bin edge (except the last) represents an inclusive upper bound. That is, given a value, it falls into the first bin where it's less than or equal to the edge. In interval notation, this means the QUANT above is equivalent to (-infinity, 0.3], (0.3, 0.6], (0.6, infinity).

Generally speaking, each Sleipnir::CDat represents the result of a single experimental assay (or group of related assays, e.g. one microarray time course). A group of datasets can be manipulated in tandem using Sleipnir::IDataset, an interface made to simplify machine learning or other analysis of many datasets simultaneously. This allows you to ask questions like, "Given some gene pair A and B, how did they interact in these four assays, and what does my gold standard say about their interaction?"

For more information, see the Sleipnir::CDat and Sleipnir::IDataset documentation.

Sleipnir often uses files containing gene lists, which are simple text files with one gene ID per line:

 YPL149W
 YHR171W
 YBR217W
 ...

Other text-based files include the tab-delimited zeros or defaults file format, containing two columns, the first a node ID and the second an integer value:

 MICROARRAY 0
 TF 2
 SYNL_TRAD  1
 ...

Similarly, an ontology slim file contains one term per line with two tab-delimited columns, the first a description of some functional catalog term and the second its ID:

 autophagy  GO:0006914
 mitochondrion organization and biogenesis  GO:0007005
 translation    GO:0006412
 ...

Finally, Sleipnir also takes advantage of several predefined file formats, primarily PCLs and the associated CDT/GTR file pairing system (see Sleipnir::CPCL).

In Sleipnir's tools, standard input and standard output are used as defaults almost everywhere; a "DAT/DAB" file generally means any appropriate Sleipnir::CDat format. If standard input or output is being used, DAT formatting is generally assumed; if an explicit input or output filename is given, DAB formatting is generally assumed.
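
For example, assuming the Dat2Dab tool follows the usual -i/-o conventions, converting between the two encodings relies on exactly these defaults:

 Dat2Dab -o example.dab < example.dat    # DAT text on standard input, binary DAB written to a file
 Dat2Dab -i example.dab > example.dat    # binary DAB read from a file, DAT text on standard output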

Contributing to Sleipnir

While we don't (currently) have the resources to make Sleipnir a full-blown community project, we'd love to include (with full credit, of course) any patches submitted by the community. If you're interested in developing new Sleipnir tools or library components, the following steps may be useful:

Version History

License

Sleipnir is provided under the Creative Commons Attribution 3.0 license.

You are free to share, copy, distribute, transmit, or adapt this work PROVIDED THAT you attribute the work to the authors listed above. For more information, please see the following web page: http://creativecommons.org/licenses/by/3.0/