Sleipnir
Synthesizer creates synthetic microarray data (PCLs) and gene sequences (FASTAs) based on a simple coexpression model. It can spike in zero or more transcription factors, which will influence both the expression and sequence composition (by insertion of discrete binding sites) of their synthetic targets. Synthesizer was created primarily to interact with COALESCE, but it can be used independently, e.g. for testing expression clustering algorithms.
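The coexpression model is only summarized above, so the following is a minimal sketch of one way to read it, not Synthesizer's actual implementation. It assumes that baseline values are independent normal draws, that each TF picks its target genes and active conditions independently, and that a TF adds one signed, normally distributed effect per active condition to all of its targets. The function name `synthesize_expression` and the additive, shared-effect behavior are assumptions made for illustration; the keyword defaults mirror the documented option defaults.

```python
import random

def synthesize_expression(genes=5000, conditions=100, tfs=10,
                          tf_gene=0.01, tf_condition=0.1,
                          mean=0.0, stdev=1.0,
                          tf_mean=2.0, tf_stdev=1.0, seed=0):
    """Toy coexpression model: baseline noise plus additive TF effects."""
    rng = random.Random(seed)
    # Baseline expression: every gene/condition cell is drawn independently.
    data = [[rng.gauss(mean, stdev) for _ in range(conditions)]
            for _ in range(genes)]
    modules = []
    for _ in range(tfs):
        # Each TF independently picks its target genes and active conditions.
        targets = [g for g in range(genes) if rng.random() < tf_gene]
        active = [c for c in range(conditions) if rng.random() < tf_condition]
        # Assumption: one signed effect per active condition, shared by all targets.
        effects = {c: rng.choice((-1, 1)) * rng.gauss(tf_mean, tf_stdev)
                   for c in active}
        for g in targets:
            for c, effect in effects.items():
                data[g][c] += effect
        modules.append({"targets": targets, "conditions": active})
    return data, modules

# Small example: 50 genes, 20 conditions, 3 spiked TFs.
data, modules = synthesize_expression(genes=50, conditions=20, tfs=3,
                                      tf_gene=0.2, tf_condition=0.3, seed=1)
print(len(data), len(data[0]), len(modules))
```

Genes that share a TF end up shifted in the same conditions, which is what a clustering algorithm run on the synthetic PCL would be expected to recover.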
Synthesizer -o <data.pcl> -O <data.fasta> > <description.txt>
Create synthetic expression data data.pcl and gene promoter sequences data.fasta, storing a description of the spiked TFs in description.txt.
Synthesizer -o <data.pcl> -O <data.fasta> -g <genes> -c <conditions> -n <tfs> -f <sequence.fasta> > <description.txt>
Create synthetic expression data data.pcl with <conditions> conditions and gene promoter sequences data.fasta for <genes> genes, based on an HMM of the sequence in sequence.fasta, storing a description of the <tfs> spiked TFs in description.txt.
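When Synthesizer is driven from a script, the two usage patterns above can be assembled programmatically. The sketch below only builds the argument list from the flags shown in the usage lines; the helper name `synthesizer_command` is hypothetical, and actually running the result assumes a `Synthesizer` binary is available on the PATH.

```python
import shlex

def synthesizer_command(pcl, fasta, genes=None, conditions=None,
                        tfs=None, model_fasta=None):
    """Build a Synthesizer argument list; unset options fall back to tool defaults."""
    argv = ["Synthesizer", "-o", pcl, "-O", fasta]
    optional = (("-g", genes), ("-c", conditions),
                ("-n", tfs), ("-f", model_fasta))
    for flag, value in optional:
        if value is not None:
            argv += [flag, str(value)]
    return argv

argv = synthesizer_command("data.pcl", "data.fasta", genes=1000,
                           conditions=50, tfs=5, model_fasta="sequence.fasta")
# The TF description goes to standard output, so it is redirected to a file.
print(shlex.join(argv) + " > description.txt")
```

The list can be passed to `subprocess.run` with `stdout` pointed at an open file to capture the description of the spiked TFs.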
package "Synthesizer"
version "1.0"
purpose "Creates configurable synthetic microarray and sequence data."
section "Main"
option "output_pcl" o "PCL expression output file"
string typestr="filename"
option "output_fasta" O "FASTA sequence output file"
string typestr="filename"
option "genes" g "Number of synthesized genes"
int default="5000"
option "conditions" c "Number of synthesized conditions"
int default="100"
section "Regulators"
option "tfs" n "Number of transcription factors"
int default="10"
option "tf_gene" q "Probability of TF activity in a gene"
double default="0.01"
option "tf_condition" Q "Probability of TF activity in a condition"
double default="0.1"
option "tf_min" t "Minimum TFBS length"
int default="5"
option "tf_max" T "Maximum TFBS length"
int default="12"
section "Expression"
option "mean" m "Expression mean"
double default="0"
option "stdev" s "Expression standard deviation"
double default="1"
option "tf_mean" M "Up/downregulation mean"
double default="2"
option "tf_stdev" S "Up/downregulation standard deviation"
double default="1"
section "Sequence"
option "fasta" f "Input FASTA file"
string typestr="filename"
option "degree" d "Degree of sequence model HMM"
int default="3"
option "seq_min" l "Minimum sequence length"
int default="1000"
option "seq_max" L "Maximum sequence length"
int default="3000"
option "tf_copm" p "Minimum TFBS copies"
int default="1"
option "tf_copx" P "Maximum TFBS copies"
int default="5"
option "tf_types" y "Sequence types containing TFBSs, comma separated"
string
section "Optional"
option "wrap" w "Wrap width for FASTA output"
int default="60"
option "random" r "Seed random generator"
int default="0"
option "verbosity" v "Message verbosity"
int default="5"
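The sequence options above can be illustrated with a short sketch. It is not Synthesizer's code: a plain fixed-order Markov chain stands in for the sequence HMM, the template sequence is made up, and the helper names (`train_markov`, `generate`, `spike_motif`) are hypothetical. The numeric ranges follow the defaults of `degree`, `seq_min`/`seq_max`, `tf_min`/`tf_max`, and `tf_copm`/`tf_copx`.

```python
import random
from collections import defaultdict

def train_markov(sequence, degree=3):
    """Count which base follows each context of `degree` bases."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(sequence) - degree):
        counts[sequence[i:i + degree]][sequence[i + degree]] += 1
    return counts

def generate(counts, degree, length, rng):
    """Emit `length` bases; unseen contexts fall back to a uniform choice."""
    seq = [rng.choice("ACGT") for _ in range(degree)]
    while len(seq) < length:
        followers = counts.get("".join(seq[len(seq) - degree:]))
        if followers:
            bases, weights = zip(*sorted(followers.items()))
            seq.append(rng.choices(bases, weights=weights)[0])
        else:
            seq.append(rng.choice("ACGT"))
    return "".join(seq[:length])

def spike_motif(seq, motif, copies, rng):
    """Insert `copies` intact copies of `motif` at random cut points of `seq`."""
    cuts = sorted(rng.randint(0, len(seq)) for _ in range(copies))
    pieces, last = [], 0
    for cut in cuts:
        pieces += [seq[last:cut], motif]
        last = cut
    pieces.append(seq[last:])
    return "".join(pieces)

rng = random.Random(0)
template = "ACGTTGCAAGGCTTACGATCGATCGGCTA" * 20     # made-up stand-in for the -f input
model = train_markov(template, degree=3)            # degree
background = generate(model, 3, rng.randint(1000, 3000), rng)            # seq_min..seq_max
motif = "".join(rng.choice("ACGT") for _ in range(rng.randint(5, 12)))   # tf_min..tf_max
copies = rng.randint(1, 5)                          # tf_copm..tf_copx
spiked = spike_motif(background, motif, copies, rng)
print(len(background), len(motif), copies, len(spiked))
```

Choosing all cut points in the background first, then splicing, keeps every inserted copy intact; inserting one copy at a time could split an earlier copy.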
Flag | Default | Type | Description |
---|---|---|---|
-o | None | PCL file | If given, synthetic gene expression data. Contains the number of genes specified by -g, conditions by -c, and spiked TF modules by -n, each influencing gene expression as per -M. |
-O | None | FASTA file | If given, synthetic gene sequence data. Contains the number of genes specified by -g and spiked TF modules by -n, each contributing binding sites as per -p and -P. |
-g | 5000 | Integer | Number of synthetic genes to create in PCL and FASTA outputs. |
-c | 100 | Integer | Number of synthetic expression conditions to create in PCL output. |
-n | 10 | Integer | Number of synthetic transcription factors to create, influencing expression targets in PCL output and binding sites in FASTA sequence output. |
-q | 0.01 | Double (probability) | Probability of a synthetic TF targeting a gene. |
-Q | 0.1 | Double (probability) | Probability of a synthetic TF being active in an expression condition. |
-t | 5 | Integer (base pairs) | Minimum length of the randomly generated synthetic motif associated with a TF. |
-T | 12 | Integer (base pairs) | Maximum length of the randomly generated synthetic motif associated with a TF. |
-m | 0 | Double | Mean of baseline (i.e. not influenced by any TF) randomly generated expression values. |
-s | 1 | Double | Standard deviation of baseline (i.e. not influenced by any TF) randomly generated expression values. |
-M | 2 | Double | Mean of synthetic TF effect on expression. Actual expression effects are chosen randomly as either positive or negative from a normal distribution with this average. |
-S | 1 | Double | Standard deviation of synthetic TF effect on expression. Actual expression effects are chosen randomly as either positive or negative from a normal distribution with this standard deviation. |
-f | None | FASTA file | Input sequence file from which an HMM will be built to generate the output synthetic sequences. Degree of the HMM is specified by -d. |
-d | 3 | Integer | Degree of the HMM used to generate synthetic output sequences; built from the input FASTA file -f. |
-l | 1000 | Integer | Minimum length per gene of randomly generated output sequences. |
-L | 3000 | Integer | Maximum length per gene of randomly generated output sequences. |
-p | 1 | Integer | Minimum number of synthetic TFBSs present in a gene's sequence once it has been determined to be a target of a synthetic TF. |
-P | 5 | Integer | Maximum number of synthetic TFBSs present in a gene's sequence once it has been determined to be a target of a synthetic TF. |
-y | None | String | If given, comma-separated list of output sequence types that should contain spiked TFBSs. 5 is a common value to include spiked TFBSs only in upstream flank sequences. |
-w | 60 | Integer | Wrap width of generated FASTA files. |