Sleipnir: MEFIT

MEFIT performs all steps necessary to produce context-specific predicted functional relationship networks (DAT/DAB files) from input microarray PCL files, as described in Huttenhower et al. 2006. It essentially combines the work performed by Answerer, Distancer, BNCreator, and BNTruster into a single tool.

Usage

Basic Usage

 MEFIT -r <related_dir> -u <unrelated.txt> -b <bins.quant> -o <learned_dir> -O <learned.xdsl>
        -p <predictions_dir> -t <trusts.txt> <data.pcl>*

First, construct a gold standard by reading related gene set text files from related_dir and unrelated gene pairs from unrelated.txt; any two genes coannotated to some set in related_dir are considered functionally related, and any gene pair listed in unrelated.txt is considered to be unrelated. Next, compute normalized pairwise correlations for all genes in the given data.pcl microarray data files and discretize these scores based on the QUANT file bins.quant. Learn a global Bayesian classifier learned.xdsl and context-specific classifiers for each positive gene set, stored in learned_dir. Finally, save a table of functional activity scores for each dataset in each context in trusts.txt, and save context-specific predicted functional relationship networks in predictions_dir.
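
The first two steps above (building the gold standard and discretizing scores) can be sketched in Python. This is only an illustration under stated assumptions, not MEFIT's actual implementation: the function names and gene IDs are hypothetical, a positive label is assumed to override a conflicting negative one, and each QUANT edge is assumed to be an inclusive upper bound with the last bin absorbing anything larger. See Answerer and Sleipnir::CDataPair for the authoritative behavior.

```python
from itertools import combinations


def gold_standard(related_sets, unrelated_pairs):
    """Label gene pairs: 1 = related, 0 = unrelated.

    related_sets: iterable of gene-ID collections (one per file in related_dir).
    unrelated_pairs: iterable of (gene, gene) tuples (lines of unrelated.txt).
    """
    labels = {}
    # Explicitly listed pairs are negatives.
    for a, b in unrelated_pairs:
        labels[frozenset((a, b))] = 0
    # Any two genes coannotated to the same set are positives.
    # Assumption: a positive overrides a conflicting negative.
    for genes in related_sets:
        for a, b in combinations(sorted(set(genes)), 2):
            labels[frozenset((a, b))] = 1
    return labels


def quantize(score, edges):
    """Return the index of the first bin whose edge is >= score.

    Assumption: each edge is an inclusive upper bound, and the last bin
    absorbs any score beyond the final edge.
    """
    for i, edge in enumerate(edges):
        if score <= edge:
            return i
    return len(edges) - 1
```

For example, with the hypothetical bin edges `[-1.5, -0.5, 0.5, 1.5, 2.5]`, a normalized correlation of `0.5` would fall into bin 2 under these assumptions, and any score above `2.5` would fall into the last bin.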

Detailed Usage

package "MEFIT"
version "1.2"
purpose "Microarray Expression Functional Integration Technique (Huttenhower et al,
    Bioinformatics 2006)

MEFIT takes as input:
1. A collection of microarray data sets (PCL files provided on the command line)
2. A collection of known biological functions (lists of related genes provided
   using the -r flag)
3. A collection of known unrelated gene pairs (provided using the -u flag)

It produces as output:
1. A global Bayesian network learned by considering all of the data sets
   independently of biological function (specified using the -O flag)
2. One Bayesian network per biological function (placed in the directory
   specified by the -o flag)
3. Predicted probabilities of functional relationships within each biological
   function of interest (placed in the directory specified by the -p flag)
4. Trust scores for each input data set and function indicating how
   predictive a data set is within a function (specified by the -t flag)"

section "Inputs"
option  "related"       r   "Directory containing lists of known related genes"
                            string  typestr="directory" yes
option  "unrelated"     u   "List of known unrelated gene pairs"
                            string  typestr="filename"  yes

option  "distance"      d   "Similarity measure"
                            values="pearson","euclidean","kendalls","kolm-smir",
                            "spearman","pearnorm"   default="pearnorm"
option  "bins"          b   "Tab separated QUANT bin cutoffs"
                            string  typestr="filename"

section "Outputs"
option  "output"        o   "Directory to contain learned per-function Bayesian networks"
                            string  typestr="directory" yes
option  "global"        O   "Global learned Bayesian network"
                            string  typestr="filename"  yes
option  "predictions"   p   "Directory to contain predicted probabilities of functional relationship"
                            string  typestr="directory" yes
option  "trusts"        t   "Trust scores learned per data set and function"
                            string  typestr="filename"  yes

section "Learning/Evaluation/Features"
option  "genes"         g   "Subset of genes to include in evaluation"
                            string  typestr="filename"
option  "genex"         G   "Subset of genes to exclude from evaluation"
                            string  typestr="filename"
option  "zero"          z   "Zero missing values"
                            flag    off
option  "cutoff"        c   "Include only confidences above cutoff"
                            double  default="0"

option  "skip"          s   "Additional columns to skip in input PCLs"
                            int default="2"
option  "xdsl"          x   "Output XDSL files in place of DSLs"
                            flag    on
option  "dab"           a   "Output DAB files in place of DATs"
                            flag    on
option  "random"        R   "Seed random generator"
                            int default="0"
option  "verbosity"     v   "Message verbosity"
                            int default="5"

Flag	Default	Type	Description
None	None	PCL text files	Microarray datasets which will be integrated by MEFIT. Each dataset will correspond to one node in each of the learned Bayesian classifiers and will be assigned a trust score in each biological context. All input PCLs must have the same number of skip columns `-s`.
-r	None	Directory	Input directory containing related (positive) gene lists. Each gene list is a text file containing one systematic gene ID per line (see Answerer).
-u	None	Gene pair text file	Input tab-delimited text file containing two columns; each line is a gene pair which is known to be functionally unrelated (e.g. annotated to two different Gene Ontology terms; see Answerer).
-d	pearnorm	pearnorm, pearson, euclidean, kendalls, kolm-smir, or spearman	Similarity measure to be used for converting microarray data into pairwise similarity scores. `pearnorm` is the recommended Fisher's z-transformed Pearson correlation.
-b	None	QUANT text file	Input tab-delimited QUANT file containing exactly one line of bin edges; these are used to discretize pairwise similarity scores. For details, see Sleipnir::CDataPair.
-o	None	Directory	Output directory in which learned context-specific Bayesian classifiers are saved as (X)DSL files (see BNCreator).
-O	None	(X)DSL file	Output file in which the learned global (non-context-specific) Bayesian classifier is saved (see BNCreator).
-p	None	Directory	Directory in which predicted context-specific functional relationships (DAT/DAB files) are saved (see BNCreator).
-t	None	PCL text file	Output PCL file in which dataset/context functional activity scores are saved (see BNTruster).
-g	None	Text gene list	If given, use only gene pairs for which both genes are in the list. For details, see Sleipnir::CDat::FilterGenes.
-G	None	Text gene list	If given, use only gene pairs for which neither gene is in the list. For details, see Sleipnir::CDat::FilterGenes.
-z	off	Flag	If on, assume that all missing gene pairs in all datasets have a value of 0 (i.e. the first bin).
-c	0	Double	If given, remove all input edges below the given cutoff (after optional normalization).
-s	2	Integer	Number of columns to skip between the initial ID column and the first experimental (data) column in the input PCL.
-x	on	Flag	If on, assume XDSL files will be used instead of DSL files.
-a	on	Flag	If on, output DAB files instead of DAT files.
-R	0	Integer	Seed for the random number generator.
-v	5	Integer	Message verbosity level.
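
The gene-pair filtering described for `-g`, `-G`, and `-c` can be illustrated with a short Python sketch. It is not Sleipnir code: `filter_pairs` and its parameters are hypothetical names, and it assumes a score exactly equal to the cutoff is kept. Sleipnir::CDat::FilterGenes remains the authoritative reference.

```python
def filter_pairs(scores, include=None, exclude=None, cutoff=None):
    """Filter a {(gene_a, gene_b): score} mapping.

    include: keep a pair only if both genes are in this set (like -g).
    exclude: keep a pair only if neither gene is in this set (like -G).
    cutoff:  drop pairs scoring below this value (like -c).
             Assumption: a score equal to the cutoff is kept.
    """
    kept = {}
    for (a, b), score in scores.items():
        if include is not None and not (a in include and b in include):
            continue
        if exclude is not None and (a in exclude or b in exclude):
            continue
        if cutoff is not None and score < cutoff:
            continue
        kept[(a, b)] = score
    return kept
```

With three scored pairs A-B, A-C, and B-D, passing `include={"A", "B", "C"}` or `exclude={"D"}` both drop B-D, since it is the only pair that touches gene D.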