Sleipnir: COALESCE

COALESCE performs regulatory module prediction (expression biclustering and de novo sequence motif analysis) as described in Huttenhower et al. 2009. The algorithm consumes gene expression data and, if motif discovery is performed, DNA sequence; it outputs zero or more predicted regulatory modules, each consisting of a set of genes, a subset of conditions under which they are coexpressed, and zero or more (under)enriched sequence motifs. It predicts modules serially by seeding a set of correlated genes, selecting conditions under which they are significantly coregulated, selecting motifs that are significantly under- or over-enriched, and using Bayesian integration to combine these features to add highly probable genes (and remove improbable ones).

Usage

Basic Usage

 COALESCE -i <input.pcl>

Perform expression biclustering on the expression data in input.pcl (which may contain missing values).

 COALESCE -i <input.pcl> -f <input.fasta>

Perform regulatory module prediction (biclustering plus motif discovery) on the expression data in input.pcl and using the sequence data in input.fasta. The latter may consist of several subtypes of sequences (e.g. upstream/downstream flanks) denoted by tab-delimited markers in the FASTA IDs, and introns and exons (or flanks and UTRs) may be differentiated by capitalization, e.g.:

 > GENE1    5
 aaccGGTT
 > GENE1
 TGCAtgcaACGT
 > GENE1    3
 TTGGccaa
 > GENE2    5
 ...

where aacc represents GENE1's upstream flank, GGTT its 5' UTR, TGCA its first exon, TTGG its 3' UTR, and so forth.
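
The layout above can be checked before running COALESCE with a small reader. The following is a minimal sketch, not part of Sleipnir: the function name `parse_typed_fasta` and the returned structure are hypothetical, and it assumes only what is stated above (a tab separates the gene ID from an optional subtype marker, and capitalization changes mark region boundaries).

```python
from collections import defaultdict
from itertools import groupby


def parse_typed_fasta(text):
    """Return {gene: {subtype: [(is_upper, segment), ...]}}.

    Hypothetical helper: the subtype is "" when the FASTA ID has no marker.
    """
    records = defaultdict(dict)
    gene, subtype, chunks = None, "", []

    def flush():
        # Store the record collected so far, split into same-case runs.
        if gene is not None:
            seq = "".join(chunks)
            runs = [(is_upper, "".join(group))
                    for is_upper, group in groupby(seq, key=str.isupper)]
            records[gene][subtype] = runs

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            fields = line[1:].strip().split("\t")
            gene = fields[0].strip()
            subtype = fields[1].strip() if len(fields) > 1 else ""
            chunks = []
        else:
            chunks.append(line)
    flush()
    return dict(records)


example = ">GENE1\t5\naaccGGTT\n>GENE1\nTGCAtgcaACGT\n>GENE1\t3\nTTGGccaa\n"
parsed = parse_typed_fasta(example)
print(parsed["GENE1"]["5"])  # [(False, 'aacc'), (True, 'GGTT')]
```

Each run is reported with its case rather than a biological label, because what a lowercase or uppercase run means (flank, UTR, intron, exon) depends on the sequence subtype.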

 COALESCE -i <input.pcl> -f <input.fasta> -o <output_dir>

Perform regulatory module prediction on the expression data in input.pcl and using the sequence data in input.fasta, depositing PCL and motif files progressively in output_dir (in addition to the module summaries printed to standard output during normal execution).

 COALESCE -i <input.pcl> -j <modules.txt/modules_dir>

Postprocess a set of preliminary predicted modules stored in standard output format in modules.txt (or in separate files in modules_dir) that were generated from data in input.pcl.
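
When the invocations above are scripted, the argument list can be assembled programmatically. This is a rough sketch using only the `-i`, `-f`, `-o`, and `-j` flags documented on this page; the helper name `coalesce_command`, the file names, and the assumption that a `COALESCE` executable is on the `PATH` are illustrative and not part of Sleipnir.

```python
import shlex


def coalesce_command(pcl, fasta=None, output_dir=None, postprocess=None,
                     executable="COALESCE"):
    """Build an argument list for COALESCE (hypothetical helper)."""
    argv = [executable, "-i", pcl]
    if fasta:
        argv += ["-f", fasta]        # enable motif discovery
    if output_dir:
        argv += ["-o", output_dir]   # deposit PCL and motif files here
    if postprocess:
        argv += ["-j", postprocess]  # modules file or directory to postprocess
    return argv


argv = coalesce_command("input.pcl", fasta="input.fasta", output_dir="output_dir")
print(shlex.join(argv))  # COALESCE -i input.pcl -f input.fasta -o output_dir
```

The resulting list can be passed to `subprocess.run(argv, check=True)`; keeping the arguments as a list avoids shell quoting problems with unusual file names.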

Detailed Usage

package "COALESCE"
version "1.0"
purpose "Implements a Bayesian Iterative Signature Algorithm for biclustering and TFBS discovery."

section "Main"
option  "input"         i   "Input PCL file"
                            string  typestr="filename"
option  "fasta"         f   "Input FASTA file"
                            string  typestr="filename"
option  "datasets"      d   "Condition groupings into dataset blocks"
                            string  typestr="filename"
option  "output"        o   "Directory for output files (PCLs/motifs)"
                            string  typestr="directory"

section "Algorithm Parameters"
option  "prob_gene"     p   "Probability threshhold for gene inclusion"
                            double  default="0.95"
option  "pvalue_cond"   c   "P-value threshhold for condition inclusion"
                            double  default="0.05"
option  "pvalue_motif"  m   "P-value threshhold for motif inclusion"
                            double  default="0.05"
option  "zscore_cond"   C   "Z-score threshhold for condition inclusion"
                            double  default="0.5"
option  "zscore_motif"  M   "Z-score threshhold for motif inclusion"
                            double  default="0.5"

section "Sequence Parameters"
option  "k"             k   "Sequence kmer length"
                            int default="7"
option  "pvalue_merge"  g   "P-value threshhold for motif merging"
                            double  default="0.05"
option  "cutoff_merge"  G   "Edit distance cutoff for motif merging"
                            double  default="2.5"
option  "penalty_gap"   y   "Edit distance penalty for gaps"
                            double  default="1"
option  "penalty_mismatch"  Y   "Edit distance penalty for mismatches"
                            double  default="2.1"

section "Performance Parameters"
option  "pvalue_correl" n   "P-value threshhold for significant correlation"
                            double  default="0.05"
option  "number_correl" N   "Maximum number of pairs to sample for significant correlation"
                            int default="100000"
option  "sequences"     q   "Sequence types to use (comma separated)"
                            string
option  "bases"         b   "Resolution of bases per motif match"
                            int default="2500"
option  "size_minimum"  z   "Minimum gene count for clusters of interest"
                            int default="5"
option  "size_merge"    E   "Maximum motif count for realtime merging"
                            int default="100"
option  "size_maximum"  Z   "Maximum motif count to consider a cluster saturated"
                            int default="1000"

section "Postprocessing Parameters"
option  "postprocess"   j   "Input file/directory of clusters to postprocess"
                            string  typestr="directory"
option  "known_motifs"  K   "File containing known motifs"
                            string  typestr="filename"
option  "known_cutoff"  F   "Score cutoff for known motif labeling"
                            double  default="0.05"
option  "known_type"    S   "Type of known motif matching"
                            values="pvalue","rmse","js" default="pvalue"
option  "cutoff_postprocess"    J   "Similarity cutoff for cluster merging"
                            double  default="1"
option  "fraction_postprocess"  L   "Overlap fraction for postprocessing gene/condition inclusion"
                            double  default="0.5"
option  "cutoff_trim"   T   "Cocluster stdev cutoff for cluster trimming"
                            double  default="1"
option  "remove_rcs"    R   "Convert RCs and RC-like PSTs to single strand"
                            flag    on
option  "min_info"      u   "Uninformative motif threshhold (bits)"
                            double  default="0.3"
option  "max_motifs"    x   "Maximum motifs to merge exactly"
                            int default="2500"

section "Miscellaneous"
option  "normalize"     e   "Automatically detect/normalize single channel data"
                            flag    off
option  "progressive"   O   "Generate output progressively"
                            flag    on
option  "seed"          D   "Expression pattern with which to seed first cluster"
                            string  typestr="filename"

section "Optional"
option  "threads"       t   "Maximum number of concurrent threads"
                            int default="1"
option  "skip"          s   "Columns to skip in input PCL"
                            int default="2"
option  "random"        r   "Seed random generator"
                            int default="0"
option  "verbosity"     v   "Message verbosity"
                            int default="5"

Flag	Default	Type	Description
-i	stdin	PCL file	Input expression data to be biclustered; may contain missing values.
-f	None	FASTA file	If given, input sequence data to be mined for regulatory motifs. Can contain sub-types of sequence as described above; only gene IDs also present in `-i` will be analyzed.
-d	None	Text file	If given, input description of dataset blocks with expected covariance in `-i`. Each line of the file is considered to be one dataset, consisting of a tab-delimited list of one or more condition identifiers from `-i`. Conditions not listed in the file will be treated as independent single-condition datasets.
-o	None	Directory	If given, output directory into which regulatory module PCL and predicted motif files are placed. Otherwise, module information is printed to standard output.
Algorithm Parameters
-p	0.95	Double (probability)	Probability threshold for including genes in a regulatory module.
-c	0.05	Double (p-value)	P-value threshold for including conditions in a regulatory module.
-m	0.05	Double (p-value)	P-value threshold for including motifs in a regulatory module.
-C	0.5	Double (z-score)	Z-score threshold for including conditions in a regulatory module.
-M	0.5	Double (z-score)	Z-score threshold for including motifs in a regulatory module.
Sequence Parameters
-k	7	Integer	Number of base pairs in minimal k-mer motif seeds; longer motifs are built out of units of this length.
-g	0.05	Double (p-value)	P-value threshold for merging motifs with similar distributions among genes; must also meet `-G`.
-G	2.5	Double	Edit distance threshold for merging motifs with similar sequences; must also meet `-g`.
-y	1	Double	Edit distance penalty for gaps when comparing motifs.
-Y	2.1	Double	Edit distance penalty for mismatches when comparing motifs.
Performance Parameters
-n	0.05	Double (p-value)	P-value threshold for determining significantly correlated genes when seeding a new module.
-N	100000	Integer	Maximum number of gene pairs to sample when selecting genes to seed a new module.
-q	None	String	If given, sequence subtypes to use when predicting motifs; otherwise, all sequence types are analyzed.
-b	2500	Integer (base pairs)	Resolution in base pairs with which motif frequencies are tracked, i.e. in units of 1 / `-b`.
-z	5	Integer	Minimum number of genes required for a module to be preserved.
-E	100	Integer	Maximum number of motifs to be merged during module convergence.
-Z	1000	Integer	Maximum number of motifs to be associated with a module during convergence.
Postprocessing Parameters
-j	None	Text file/directory	Input text file of modules (as produced on standard output) or directory of module files (as produced by `-o`) to be postprocessed.
-K	None	Text file	If given, text file containing known TFs and motifs. Each line should be tab-delimited text in which the first column contains a TF identifier (not necessarily unique), and the subsequent 4n columns contain the PWM values (as floating point fractions) of the n base pairs of the motif, e.g. `GATA 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0`.
-F	0.05	Double (p-value)	Score threshold for labeling a predicted motif as a known TF based on `-K`.
-S	pvalue	pvalue, rmse, or js	Scoring method for labeling a predicted motif as a known TF; options are p-value of PWM correlation, root-mean-square error between PWMs, or Jensen-Shannon divergence between PWMs.
-J	1	Double (fraction)	Minimum overlap fraction for two preliminary modules to be merged.
-L	0.5	Double (fraction)	Minimum fraction of merged preliminary modules in which a gene must be present in order to be maintained in the resulting postprocessed module.
-T	1	Double (z-score)	Z-score threshold of cocluster frequencies of genes in a preliminary module above which they must occur to be maintained in the resulting postprocessed module.
-R	On	Flag	If given, convert reverse complement motifs to a single strand before outputting their PWMs.
-u	0.3	Double (bits)	Information threshold (in bits) for a preliminary motif to be preserved in a postprocessed module.
-x	2500	Integer	Maximum number of motifs to be merged exactly into a postprocessed module; motif counts above this threshold are merged heuristically.
Miscellaneous Parameters
-e	Off	Flag	If given, automatically detect and normalize single-channel conditions by log-transforming them against the per-gene median.
-O	On	Flag	If given, generate standard output progressively as modules are finalized.
Standard Parameters
-t	1	Integer	Number of simultaneous threads to use for applicable stages of module formation.
-s	2	Integer	Number of columns to skip between the initial ID column and the first experimental (data) column in the input PCL.
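
The `-K` and `-d` rows above describe two small tab-delimited file formats. The sketch below shows one way to read a known-motif line and to write a dataset-block file. The helper names are hypothetical and not part of Sleipnir, and the A, C, G, T column order within each position is an assumption inferred from the `GATA` example rather than stated in this documentation.

```python
BASES = "ACGT"  # assumed column order within each motif position


def parse_known_motif(line):
    """Split one -K line into (TF identifier, list of 4-value PWM columns)."""
    fields = line.rstrip("\n").split("\t")
    name = fields[0]
    values = [float(v) for v in fields[1:]]
    if not values or len(values) % 4 != 0:
        raise ValueError("expected 4n PWM values after the TF identifier")
    pwm = [values[i:i + 4] for i in range(0, len(values), 4)]
    return name, pwm


def consensus(pwm):
    """Most probable base at each position, under the assumed base order."""
    return "".join(BASES[max(range(4), key=column.__getitem__)]
                   for column in pwm)


def format_datasets(blocks):
    """Render -d content: one dataset per line, condition IDs tab-delimited."""
    return "".join("\t".join(block) + "\n" for block in blocks)


line = "GATA\t0\t0\t1\t0\t1\t0\t0\t0\t0\t0\t0\t1\t1\t0\t0\t0"
name, pwm = parse_known_motif(line)
print(name, consensus(pwm))  # GATA GATA

# Hypothetical condition identifiers; real ones must match those in the -i file.
print(format_datasets([["heat_1", "heat_2"], ["cold_1"]]), end="")
```

Conditions left out of the dataset file need no entry, since they are treated as independent single-condition datasets.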