Sleipnir
COALESCE performs regulatory module prediction (expression biclustering and de novo sequence motif analysis) as described in Huttenhower et al. 2009. The algorithm consumes gene expression data and, if motif discovery is performed, DNA sequence; it outputs zero or more predicted regulatory modules, each consisting of a set of genes, a subset of conditions under which they are coexpressed, and zero or more over- or under-enriched sequence motifs. It predicts modules serially: it seeds a set of correlated genes, selects conditions under which they are significantly coregulated, selects motifs that are significantly over- or under-enriched, and uses Bayesian integration to combine these features, adding highly probable genes and removing improbable ones.
COALESCE -i <input.pcl>
Perform expression biclustering on the expression data in input.pcl (which may contain missing values).
COALESCE -i <input.pcl> -f <input.fasta>
Perform regulatory module prediction (biclustering plus motif discovery) on the expression data in input.pcl, using the sequence data in input.fasta. The latter may consist of several subtypes of sequences (e.g. upstream/downstream flanks) denoted by tab-delimited markers in the FASTA IDs, and introns and exons (or flanks and UTRs) may be differentiated by capitalization, e.g.:
    > GENE1 5
    aaccGGTT
    > GENE1
    TGCAtgcaACGT
    > GENE1 3
    TTGGccaa
    > GENE2 5
    ...
where aacc represents GENE1's upstream flank, GGTT its 5' UTR, TGCA its first exon, TTGG its 3' UTR, and so forth.
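To make the FASTA convention above concrete, here is a minimal Python sketch that reads such a file into per-gene, per-subtype runs of upper- and lower-case sequence. It is not part of Sleipnir: the function name `parse_typed_fasta` and the returned structure are hypothetical, and it assumes gene IDs contain no whitespace and that a missing subtype marker can be represented by an empty string.

```python
from itertools import groupby


def parse_typed_fasta(text):
    """Return {gene: {subtype: [(is_upper, run), ...]}} for typed FASTA text.

    Hypothetical helper for illustration only; COALESCE does its own parsing.
    """
    records = {}
    gene = None
    subtype = ""
    chunks = []

    def flush():
        # Store the record collected so far, split into same-case runs.
        if gene is None:
            return
        seq = "".join(chunks)
        runs = [(upper, "".join(chars))
                for upper, chars in groupby(seq, key=str.isupper)]
        records.setdefault(gene, {})[subtype] = runs

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            # Header: gene ID, then an optional subtype marker.
            # Assumption: any whitespace (including tabs) separates them.
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without a gene ID")
            gene = fields[0]
            subtype = fields[1] if len(fields) > 1 else ""
            chunks = []
        else:
            chunks.append(line)
    flush()
    return records


example = ">GENE1\t5\naaccGGTT\n>GENE1\nTGCAtgcaACGT\n>GENE1\t3\nTTGGccaa\n"
for sub, runs in parse_typed_fasta(example)["GENE1"].items():
    print(repr(sub), runs)
```

In this sketch the case of each run is kept as a boolean, so a caller can decide how to interpret it for a given subtype (flank versus UTR, or intron versus exon), as the description above suggests.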
COALESCE -i <input.pcl> -f <input.fasta> -o <output_dir>
Perform regulatory module prediction on the expression data in input.pcl, using the sequence data in input.fasta and depositing PCL and motif files progressively in output_dir (in addition to the module summaries printed to standard output during normal execution).
COALESCE -i <input.pcl> -j <modules.txt/modules_dir>
Postprocess a set of preliminary predicted modules stored in standard output format in modules.txt (or in separate files in modules_dir) that were generated from data in input.pcl.
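The invocations above can be chained into a two-step workflow: predict modules into a directory, then postprocess that directory. The following Python sketch only assembles the argument lists; the helper name `coalesce_command` and the file names are illustrative assumptions, and only flags documented on this page (-i, -f, -o, -j, -t) are used.

```python
import shlex


def coalesce_command(pcl, fasta=None, output_dir=None, postprocess=None,
                     threads=None):
    """Build a COALESCE argv list (hypothetical helper, not part of Sleipnir)."""
    argv = ["COALESCE", "-i", pcl]
    if fasta:
        argv += ["-f", fasta]            # enable motif discovery
    if output_dir:
        argv += ["-o", output_dir]       # write PCL/motif files here
    if postprocess:
        argv += ["-j", postprocess]      # file or directory of modules
    if threads is not None:
        argv += ["-t", str(threads)]
    return argv


# Step 1: predict modules, writing per-module files into modules_dir.
predict = coalesce_command("input.pcl", fasta="input.fasta",
                           output_dir="modules_dir", threads=4)
# Step 2: postprocess the preliminary modules produced in step 1.
post = coalesce_command("input.pcl", postprocess="modules_dir")

print(shlex.join(predict))
print(shlex.join(post))
```

Each list can be passed to `subprocess.run` as-is once the COALESCE binary is on the PATH; building a list rather than a shell string avoids quoting problems with unusual file names.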
package "COALESCE"
version "1.0"
purpose "Implements a Bayesian Iterative Signature Algorithm for biclustering and TFBS discovery."
section "Main"
option "input" i "Input PCL file"
string typestr="filename"
option "fasta" f "Input FASTA file"
string typestr="filename"
option "datasets" d "Condition groupings into dataset blocks"
string typestr="filename"
option "output" o "Directory for output files (PCLs/motifs)"
string typestr="directory"
section "Algorithm Parameters"
option "prob_gene" p "Probability threshhold for gene inclusion"
double default="0.95"
option "pvalue_cond" c "P-value threshhold for condition inclusion"
double default="0.05"
option "pvalue_motif" m "P-value threshhold for motif inclusion"
double default="0.05"
option "zscore_cond" C "Z-score threshhold for condition inclusion"
double default="0.5"
option "zscore_motif" M "Z-score threshhold for motif inclusion"
double default="0.5"
section "Sequence Parameters"
option "k" k "Sequence kmer length"
int default="7"
option "pvalue_merge" g "P-value threshhold for motif merging"
double default="0.05"
option "cutoff_merge" G "Edit distance cutoff for motif merging"
double default="2.5"
option "penalty_gap" y "Edit distance penalty for gaps"
double default="1"
option "penalty_mismatch" Y "Edit distance penalty for mismatches"
double default="2.1"
section "Performance Parameters"
option "pvalue_correl" n "P-value threshhold for significant correlation"
double default="0.05"
option "number_correl" N "Maximum number of pairs to sample for significant correlation"
int default="100000"
option "sequences" q "Sequence types to use (comma separated)"
string
option "bases" b "Resolution of bases per motif match"
int default="2500"
option "size_minimum" z "Minimum gene count for clusters of interest"
int default="5"
option "size_merge" E "Maximum motif count for realtime merging"
int default="100"
option "size_maximum" Z "Maximum motif count to consider a cluster saturated"
int default="1000"
section "Postprocessing Parameters"
option "postprocess" j "Input file/directory of clusters to postprocess"
string typestr="directory"
option "known_motifs" K "File containing known motifs"
string typestr="filename"
option "known_cutoff" F "Score cutoff for known motif labeling"
double default="0.05"
option "known_type" S "Type of known motif matching"
values="pvalue","rmse","js" default="pvalue"
option "cutoff_postprocess" J "Similarity cutoff for cluster merging"
double default="1"
option "fraction_postprocess" L "Overlap fraction for postprocessing gene/condition inclusion"
double default="0.5"
option "cutoff_trim" T "Cocluster stdev cutoff for cluster trimming"
double default="1"
option "remove_rcs" R "Convert RCs and RC-like PSTs to single strand"
flag on
option "min_info" u "Uninformative motif threshhold (bits)"
double default="0.3"
option "max_motifs" x "Maximum motifs to merge exactly"
int default="2500"
section "Miscellaneous"
option "normalize" e "Automatically detect/normalize single channel data"
flag off
option "progressive" O "Generate output progressively"
flag on
option "seed" D "Expression pattern with which to seed first cluster"
string typestr="filename"
section "Optional"
option "threads" t "Maximum number of concurrent threads"
int default="1"
option "skip" s "Columns to skip in input PCL"
int default="2"
option "random" r "Seed random generator"
int default="0"
option "verbosity" v "Message verbosity"
int default="5"
Flag | Default | Type | Description |
---|---|---|---|
-i | stdin | PCL file | Input expression data to be biclustered; may contain missing values. |
-f | None | FASTA file | If given, input sequence data to be mined for regulatory motifs. Can contain sub-types of sequence as described above; only gene IDs also present in -i will be analyzed. |
-d | None | Text file | If given, input description of dataset blocks with expected covariance in -i . Each line of the file is considered to be one dataset, consisting of a tab-delimited list of one or more condition identifiers from -i . Conditions not listed in the file will be treated as independent single-condition datasets. |
-o | None | Directory | If given, output directory into which regulatory module PCL and predicted motif files are placed. Otherwise, module information is printed to standard output. |
Algorithm Parameters | |||
-p | 0.95 | Double (probability) | Probability threshold for including genes in a regulatory module. |
-c | 0.05 | Double (p-value) | P-value threshold for including conditions in a regulatory module. |
-m | 0.05 | Double (p-value) | P-value threshold for including motifs in a regulatory module. |
-C | 0.5 | Double (z-score) | Z-score threshold for including conditions in a regulatory module. |
-M | 0.5 | Double (z-score) | Z-score threshold for including motifs in a regulatory module. |
Sequence Parameters | |||
-k | 7 | Integer | Number of base pairs in minimal k-mer motif seeds; longer motifs are built out of units of this length. |
-g | 0.05 | Double (p-value) | P-value threshold for merging motifs with similar distributions among genes; must also meet -G . |
-G | 2.5 | Double | Edit distance threshold for merging motifs with similar sequences; must also meet -g . |
-y | 1 | Double | Edit distance penalty for gaps when comparing motifs. |
-Y | 2.1 | Double | Edit distance penalty for mismatches when comparing motifs. |
Performance Parameters | |||
-n | 0.05 | Double (p-value) | P-value threshold for determining significantly correlated genes when seeding a new module. |
-N | 100000 | Integer | Maximum number of gene pairs to sample when selecting genes to seed a new module. |
-q | None | String | If given, sequence subtypes to use when predicting motifs; otherwise, all sequence types are analyzed. |
-b | 2500 | Integer (base pairs) | Resolution in base pairs with which motif frequencies are tracked, i.e. in units of 1 / -b . |
-z | 5 | Integer | Minimum number of genes required for a module to be preserved. |
-E | 100 | Integer | Maximum number of motifs to be merged during module convergence. |
-Z | 1000 | Integer | Maximum number of motifs to be associated with a module during convergence. |
Postprocessing Parameters | |||
-j | None | Text file/directory | Input text file of modules (as produced on standard output) or directory of module files (as produced by -o ) to be postprocessed. |
-K | None | Text file | If given, text file containing known TFs and motifs. Each line should be tab-delimited text in which the first column contains a TF identifier (not necessarily unique), and the subsequent 4n columns contain the PWM values (as floating point fractions) of the n base pairs of the motif, e.g. GATA 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 . |
-F | 0.05 | Double (p-value) | Score threshold for labeling a predicted motif as a known TF based on -K . |
-S | pvalue | pvalue, rmse, or js | Scoring method for labeling a predicted motif as a known TF; options are p-value of PWM correlation, root-mean-square error between PWMs, or Jensen-Shannon divergence between PWMs. |
-J | 1 | Double (fraction) | Minimum overlap fraction for two preliminary modules to be merged. |
-L | 0.5 | Double (fraction) | Minimum fraction of merged preliminary modules in which a gene must be present in order to be maintained in the resulting postprocessed module. |
-T | 1 | Double (z-score) | Z-score threshold on cocluster frequency: genes in a preliminary module must cocluster more often than this to be maintained in the resulting postprocessed module. |
-R | On | Flag | If given, convert reverse complement motifs to a single strand before outputting their PWMs. |
-u | 0.3 | Double (bits) | Information threshold (in bits) for a preliminary motif to be preserved in a postprocessed module. |
-x | 2500 | Integer | Maximum number of motifs to be merged exactly into a postprocessed module; motif counts above this threshold are merged heuristically. |
Miscellaneous Parameters | |||
-e | Off | Flag | If given, automatically detect and normalize single-channel conditions by log-transforming them against the per-gene median. |
-O | On | Flag | If given, generate standard output progressively as modules are finalized. |
Standard Parameters | |||
-t | 1 | Integer | Number of simultaneous threads to use for applicable stages of module formation. |
-s | 2 | Integer | Number of columns to skip between the initial ID column and the first experimental (data) column in the input PCL. |
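
The text-file formats described for -d (dataset blocks) and -K (known motifs) in the table above can be produced with a few lines of Python. This is a sketch rather than Sleipnir code: the function names and the condition identifiers are hypothetical, and the A, C, G, T column order within each position is an inference from the GATA example in the -K row, not something the table states explicitly.

```python
def dataset_lines(datasets):
    """One line per dataset: tab-delimited condition IDs (the -d format)."""
    return ["\t".join(conditions) for conditions in datasets]


def known_motif_line(tf_id, pwm):
    """One -K line: TF identifier followed by 4n tab-delimited PWM fractions.

    pwm is a sequence of n positions, each holding 4 fractions.
    """
    values = []
    for position in pwm:
        if len(position) != 4:
            raise ValueError("each PWM position needs exactly 4 values")
        values.extend(position)
    return "\t".join([tf_id] + [format(v, "g") for v in values])


def one_hot_pwm(consensus, order="ACGT"):
    """Turn a consensus string into a 0/1 PWM.

    Assumption: columns are ordered A, C, G, T, as the GATA example implies.
    """
    return [[1 if base == letter else 0 for letter in order]
            for base in consensus.upper()]


# Hypothetical condition IDs; they must match column headers in the -i PCL.
print("\n".join(dataset_lines([["heat_05min", "heat_10min", "heat_20min"],
                               ["osmotic_15min", "osmotic_30min"]])))
print(known_motif_line("GATA", one_hot_pwm("GATA")))
```

Conditions omitted from the -d file need no entry, since the table notes that they are treated as independent single-condition datasets.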