Sleipnir
Synthesizer

Synthesizer creates synthetic microarray data (PCLs) and gene sequences (FASTAs) based on a simple coexpression model. It can spike in zero or more transcription factors, which will influence both the expression and sequence composition (by insertion of discrete binding sites) of their synthetic targets. Synthesizer was created primarily to interact with COALESCE, but it can be used independently, e.g. for testing expression clustering algorithms.
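
The coexpression model described above can be illustrated with a minimal Python sketch. It is an assumed reading of the model, not Synthesizer's actual implementation: the function name `synthesize_expression`, the additive combination of TF effects, and the choice of one signed effect per TF and condition are all hypothetical. The keyword defaults mirror the option defaults documented on this page.

```python
import random

def synthesize_expression(genes=5000, conditions=100, tfs=10,
                          tf_gene=0.01, tf_condition=0.1,
                          mean=0.0, stdev=1.0, tf_mean=2.0, tf_stdev=1.0,
                          seed=0):
    """Hypothetical sketch: baseline noise plus additive TF module effects."""
    rng = random.Random(seed)
    # Baseline expression: independent normal draws for every gene/condition.
    data = [[rng.gauss(mean, stdev) for _ in range(conditions)]
            for _ in range(genes)]
    modules = []
    for _ in range(tfs):
        # Each gene is a target with probability tf_gene; each condition
        # is active with probability tf_condition.
        targets = [g for g in range(genes) if rng.random() < tf_gene]
        active = [c for c in range(conditions) if rng.random() < tf_condition]
        # Assumption: one effect per active condition, with a random sign
        # (up- or downregulation) and normally distributed magnitude.
        effects = {c: rng.choice((-1.0, 1.0)) * rng.gauss(tf_mean, tf_stdev)
                   for c in active}
        for g in targets:
            for c, effect in effects.items():
                data[g][c] += effect
        modules.append({"targets": targets, "effects": effects})
    return data, modules
```

Calling `synthesize_expression(genes=50, conditions=8, tfs=3)` returns a 50 by 8 matrix and a list of three module descriptions, which play roughly the role of the description that Synthesizer writes to standard output.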

Usage

Basic Usage

 Synthesizer -o <data.pcl> -O <data.fasta> > <description.txt>

Creates synthetic expression data in data.pcl and gene promoter sequences in data.fasta, and stores a description of the spiked TFs in description.txt.
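
When Synthesizer is driven from a script, the same invocation can be assembled programmatically. The following Python sketch is one possible wrapper, assuming a `Synthesizer` binary is available on the PATH; the helper names `build_command` and `run_synthesizer` are hypothetical, and only flags documented on this page are used.

```python
import subprocess

def build_command(pcl, fasta, genes=None, conditions=None, tfs=None,
                  input_fasta=None):
    """Build the Synthesizer argument list (assumes the binary is on PATH)."""
    cmd = ["Synthesizer", "-o", pcl, "-O", fasta]
    optional = (("-g", genes), ("-c", conditions),
                ("-n", tfs), ("-f", input_fasta))
    for flag, value in optional:
        # Omitted options fall back to Synthesizer's own defaults.
        if value is not None:
            cmd += [flag, str(value)]
    return cmd

def run_synthesizer(description_path, **options):
    """Run Synthesizer, redirecting its standard output to a description file."""
    with open(description_path, "w") as out:
        subprocess.run(build_command(**options), stdout=out, check=True)
```

For example, `run_synthesizer("description.txt", pcl="data.pcl", fasta="data.fasta")` corresponds to the basic command line above.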

 Synthesizer -o <data.pcl> -O <data.fasta> -g <genes> -c <conditions>
        -n <tfs> -f <sequence.fasta> > <description.txt>

Creates synthetic expression data in data.pcl with <conditions> conditions and gene promoter sequences in data.fasta for <genes> genes, generated from an HMM built from the sequences in sequence.fasta, and stores a description of the <tfs> spiked TFs in description.txt.
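
To make the role of the input FASTA file more concrete, the Python sketch below trains a simple sequence model and samples from it. It assumes the "HMM" behaves like a Markov chain over nucleotides whose order equals the degree option; that is an interpretation rather than a statement about Synthesizer's code, and the function names `train_markov` and `generate_sequence` are hypothetical.

```python
import random
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(sequences, degree=3):
    """Count which base follows each length-`degree` context."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        seq = seq.upper()
        for i in range(degree, len(seq)):
            context, base = seq[i - degree:i], seq[i]
            # Skip windows containing ambiguous bases such as N.
            if all(ch in ALPHABET for ch in context + base):
                counts[context][base] += 1
    return counts

def generate_sequence(counts, degree, length, rng):
    """Sample a sequence; unseen contexts fall back to uniform bases."""
    out = [rng.choice(ALPHABET) for _ in range(degree)]
    while len(out) < length:
        context = "".join(out[-degree:]) if degree else ""
        following = counts.get(context)
        if following:
            bases, weights = zip(*sorted(following.items()))
            out.append(rng.choices(bases, weights=weights)[0])
        else:
            out.append(rng.choice(ALPHABET))
    return "".join(out[:length])
```

A per-gene length could then be drawn between the minimum and maximum sequence lengths, for instance with `rng.randint(1000, 3000)`, before calling `generate_sequence`.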

Detailed Usage

package "Synthesizer"
version "1.0"
purpose "Creates configurable synthetic microarray and sequence data."

section "Main"
option  "output_pcl"    o   "PCL expression output file"
                            string  typestr="filename"
option  "output_fasta"  O   "FASTA sequence output file"
                            string  typestr="filename"
option  "genes"         g   "Number of synthesized genes"
                            int default="5000"
option  "conditions"    c   "Number of synthesized conditions"
                            int default="100"

section "Regulators"
option  "tfs"           n   "Number of transcription factors"
                            int default="10"
option  "tf_gene"       q   "Probability of TF activity in a gene"
                            double  default="0.01"
option  "tf_condition"  Q   "Probability of TF activity in a condition"
                            double  default="0.1"
option  "tf_min"        t   "Minimum TFBS length"
                            int default="5"
option  "tf_max"        T   "Maximum TFBS length"
                            int default="12"

section "Expression"
option  "mean"          m   "Expression mean"
                            double  default="0"
option  "stdev"         s   "Expression standard deviation"
                            double  default="1"
option  "tf_mean"       M   "Up/downregulation mean"
                            double  default="2"
option  "tf_stdev"      S   "Up/downregulation standard deviation"
                            double  default="1"

section "Sequence"
option  "fasta"         f   "Input FASTA file"
                            string  typestr="filename"
option  "degree"        d   "Degree of sequence model HMM"
                            int default="3"
option  "seq_min"       l   "Minimum sequence length"
                            int default="1000"
option  "seq_max"       L   "Maximum sequence length"
                            int default="3000"
option  "tf_copm"       p   "Minimum TFBS copies"
                            int default="1"
option  "tf_copx"       P   "Maximum TFBS copies"
                            int default="5"
option  "tf_types"      y   "Sequence types containing TFBSs, comma separated"
                            string

section "Optional"
option  "wrap"          w   "Wrap width for FASTA output"
                            int default="60"
option  "random"        r   "Seed random generator"
                            int default="0"
option  "verbosity"     v   "Message verbosity"
                            int default="5"

| Flag | Default | Type | Description |
|------|---------|------|-------------|
| -o | None | PCL file | If given, synthetic gene expression data. Contains the number of genes specified by -g, conditions -c, and spiked TF modules -n, each influencing gene expression as per -M. |
| -O | None | FASTA file | If given, synthetic gene sequence data. Contains the number of genes specified by -g and spiked TF modules -n, each influencing binding sites as per -p and -P. |
| -g | 5000 | Integer | Number of synthetic genes to create in PCL and FASTA outputs. |
| -c | 100 | Integer | Number of synthetic expression conditions to create in PCL output. |
| -n | 10 | Integer | Number of synthetic transcription factors to create, influencing expression targets in PCL output and binding sites in FASTA sequence output. |
| -q | 0.01 | Double (probability) | Probability of a synthetic TF targeting a gene. |
| -Q | 0.1 | Double (probability) | Probability of a synthetic TF being active in an expression condition. |
| -t | 5 | Integer (base pairs) | Minimum length of the randomly generated synthetic motif associated with a TF. |
| -T | 12 | Integer (base pairs) | Maximum length of the randomly generated synthetic motif associated with a TF. |
| -m | 0 | Double | Mean of baseline (i.e. not influenced by any TF) randomly generated expression values. |
| -s | 1 | Double | Standard deviation of baseline (i.e. not influenced by any TF) randomly generated expression values. |
| -M | 2 | Double | Mean of synthetic TF effect on expression. Actual expression effects are chosen randomly as either positive or negative from a normal distribution with this average. |
| -S | 1 | Double | Standard deviation of synthetic TF effect on expression. Actual expression effects are chosen randomly as either positive or negative from a normal distribution with this standard deviation. |
| -f | None | FASTA file | Input sequence file from which an HMM will be built to generate the output synthetic sequences. Degree of the HMM is specified by -d. |
| -d | 3 | Integer | Degree of the HMM used to generate synthetic output sequences; built from the input FASTA file -f. |
| -l | 1000 | Integer | Minimum length per gene of randomly generated output sequences. |
| -L | 3000 | Integer | Maximum length per gene of randomly generated output sequences. |
| -p | 1 | Integer | Minimum number of synthetic TFBSs present in a gene's sequence once it has been determined to be a target of a synthetic TF. |
| -P | 5 | Integer | Maximum number of synthetic TFBSs present in a gene's sequence once it has been determined to be a target of a synthetic TF. |
| -y | None | String | If given, comma-separated list of output sequence types that should contain spiked TFBSs. 5 is a common value to include spiked TFBSs only in upstream flank sequences. |
| -w | 60 | Integer | Wrap width of generated FASTA files. |
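
The interplay of the motif length options (-t, -T), the copy-number options (-p, -P), and the wrap width (-w) can be sketched in Python as follows. This is an illustration under stated assumptions, not Synthesizer's implementation: the helper names are hypothetical, motifs are drawn uniformly over A, C, G, and T, and binding sites are spliced in at random positions (lengthening the sequence), since this page does not say whether sites are inserted or overwrite existing bases.

```python
import random

def random_motif(rng, tf_min=5, tf_max=12):
    """Draw a motif whose length lies between tf_min and tf_max (inclusive)."""
    length = rng.randint(tf_min, tf_max)
    return "".join(rng.choice("ACGT") for _ in range(length))

def spike_sites(sequence, motif, rng, copies_min=1, copies_max=5):
    """Splice a random number of motif copies into a target gene's sequence."""
    copies = rng.randint(copies_min, copies_max)
    # Positions refer to the original sequence, so no copy splits another.
    positions = sorted(rng.randint(0, len(sequence)) for _ in range(copies))
    pieces, previous = [], 0
    for pos in positions:
        pieces.append(sequence[previous:pos])
        pieces.append(motif)
        previous = pos
    pieces.append(sequence[previous:])
    return "".join(pieces), copies

def wrap_fasta(name, sequence, width=60):
    """Format one FASTA record with lines of at most `width` characters."""
    lines = [">" + name]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)
```

For example, `wrap_fasta("G1", spike_sites(seq, random_motif(rng), rng)[0])` yields one record for a hypothetical target gene `G1`, given an existing sequence `seq` and a `random.Random` instance `rng`.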