Sleipnir
SeekMiner is the main program for integrating coexpression across thousands of microarray datasets. Users supply a set of genes as input (the query), and the program returns other genes that are coexpressed with the query genes.
The main challenge in answering a query is finding the right datasets. Because not all microarray datasets are relevant to the query's coexpression, SeekMiner favors datasets in which the query genes are highly correlated with each other. The intuition is that coregulation among the query genes suggests that they participate in the same biological process and that this process is highly active in the dataset. Datasets that meet this criterion are therefore very informative for the search.
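The intuition can be illustrated with a toy calculation. The sketch below is not SeekMiner's weighting algorithm; it is a simplified stand-in that scores each dataset by the mean pairwise Pearson correlation among the query genes. The data layout (a dict of gene name to expression vector), the dataset names, and the function names are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def query_coherence(dataset, query):
    """Mean pairwise correlation among the query genes present in a dataset.

    `dataset` maps gene name -> expression vector (hypothetical layout).
    """
    present = [g for g in query if g in dataset]
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    return mean(pearson(dataset[a], dataset[b]) for a, b in pairs)


# Toy compendium: the query genes move together in "ds_active" but not in "ds_flat".
compendium = {
    "ds_active": {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [1, 3, 5, 7]},
    "ds_flat": {"g1": [1, 2, 3, 4], "g2": [4, 1, 3, 2], "g3": [2, 4, 1, 3]},
}
query = ["g1", "g2", "g3"]
weights = {name: query_coherence(ds, query) for name, ds in compendium.items()}
```

In this toy example the dataset where the query genes are coregulated receives the larger weight, which is the behavior the paragraph above describes.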
In addition to the default coregulation-based weighting, SeekMiner supports other methods of scoring datasets, such as rank-based methods (order statistics) and equal weighting.
Users can easily compare methods, adjust parameters of the search algorithms, specify the datasets to be integrated, and test a number of queries of varying length in order to achieve their desired results.
SeekMiner -x <dset_platform_map> -i <gene_map> -q <query> -P <platform_dir> -p <prep_dir> -n <num_db> -d <db_dir> -Q <quant> -o <output_dir> -V <weight_method> -z <distance_measure> -m [-D <search_dset>]
This performs the coexpression search for a list of queries and writes the gene rankings and the dataset weights to output_dir.
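As a convenience, the argument list can be assembled programmatically. The sketch below only builds the command from the flags in the usage line above; it does not run SeekMiner. Every path in the example is a hypothetical placeholder, and the helper name build_seekminer_command is an assumption, not part of Sleipnir.

```python
import shlex


def build_seekminer_command(paths, num_db=1000, weight_method="CV",
                            distance_measure="z_score", subtract_gene_avg=True,
                            search_dset=None):
    """Assemble the SeekMiner argument list following the usage line above."""
    cmd = [
        "SeekMiner",
        "-x", paths["dset_platform_map"],
        "-i", paths["gene_map"],
        "-q", paths["query"],
        "-P", paths["platform_dir"],
        "-p", paths["prep_dir"],
        "-n", str(num_db),
        "-d", paths["db_dir"],
        "-Q", paths["quant"],
        "-o", paths["output_dir"],
        "-V", weight_method,
        "-z", distance_measure,
    ]
    if subtract_gene_avg:
        cmd.append("-m")  # flag without an argument
    if search_dset is not None:
        cmd += ["-D", search_dset]  # optional query-specific dataset list
    return cmd


# All paths below are hypothetical placeholders.
example_paths = {
    "dset_platform_map": "dataset.map",
    "gene_map": "gene_map.txt",
    "query": "queries.txt",
    "platform_dir": "plat/",
    "prep_dir": "prep/",
    "db_dir": "db/",
    "quant": "quant2",
    "output_dir": "results/",
}
command = build_seekminer_command(example_paths)
command_line = shlex.join(command)  # printable, shell-quoted form
```

The resulting list can be passed to a process launcher such as subprocess.run once the placeholder paths are replaced with real ones.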
SeekMiner supports the following weighting methods (-V):
CV (default): query cross-validated weighting, where we iteratively use a subset of the query to construct a search instance that retrieves the remaining query genes. This measures the coregulation of the query genes in a cross-validation setup.
EQUAL: equal weighting, where all datasets are weighted equally.
ORDER_STAT: order statistics weighting, outlined in Adler et al. (2009). This method computes a P-value statistic by comparing the rank of a correlation across datasets to the ranks that would have been generated by a null distribution (where correlations are assumed to be randomly scattered and all ranks are equally likely).
The use of -V CV is highly recommended.
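To make the cross-validation idea concrete, here is a minimal sketch under stated assumptions. It reads "Leave-One-In" as using one query gene at a time to rank all other genes, and it scores how highly the remaining query genes are retrieved with rank-biased precision (RBP). The function names, the nested-dict score layout, and the toy scores are hypothetical; SeekMiner's actual implementation may differ in its partitioning and scoring details.

```python
def rbp(ranked_genes, relevant, p=0.99):
    """Rank-biased precision: (1 - p) * sum of p**rank over relevant hits (rank from 0)."""
    return (1 - p) * sum(p ** i for i, g in enumerate(ranked_genes) if g in relevant)


def cv_weight_leave_one_in(scores, query, p=0.99):
    """Weight one dataset by how well each single query gene retrieves the others.

    `scores[a][b]` is the (hypothetical) gene-gene score of a and b in this dataset.
    """
    per_partition = []
    for seed in query:
        held_out = set(query) - {seed}
        candidates = [g for g in scores[seed] if g != seed]
        ranking = sorted(candidates, key=lambda g: scores[seed][g], reverse=True)
        per_partition.append(rbp(ranking, held_out, p))
    return sum(per_partition) / len(per_partition)


# Toy scores: in dataset A the query genes rank each other first; in B they rank last.
query = ["q1", "q2"]
scores_a = {"q1": {"q2": 3.0, "x1": 0.1, "x2": -0.5},
            "q2": {"q1": 3.0, "x1": 0.2, "x2": 0.0}}
scores_b = {"q1": {"q2": -1.0, "x1": 2.0, "x2": 1.5},
            "q2": {"q1": -1.0, "x1": 1.0, "x2": 0.5}}
weight_a = cv_weight_leave_one_in(scores_a, query)
weight_b = cv_weight_leave_one_in(scores_b, query)
```

With p close to 1, RBP decays slowly with rank, so the weight differences between datasets only become pronounced over long rankings; the toy example has just three candidates per partition.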
Users can select between Pearson correlations (-z pearson) and z-scores of Pearson correlations (-z z_score). Z-scores are the recommended choice because they normalize the correlation distribution to a standard normal distribution that can be compared across datasets. In addition, SeekMiner provides the following transformations on z-scores to further boost signals:
--score_cutoff: cuts off z-scores at a specified value. Z-scores that fall below the cutoff are set to zero.
--norm_subavg: subtracts each gene's average z-score. This prevents highly connected genes from constantly being returned at the top of the ranking.
--norm_subavg_plat: normalizes the z-score by subtracting the average across the platform and dividing by its standard deviation. This is designed to handle potential platform biases in the z-scores.
--square_z: squares the z-score. This is another way to boost highly correlated gene pairs.
Enabling --norm_subavg is highly recommended.
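The transformations can be summarized as simple arithmetic on a single gene-gene z-score. The sketch below is illustrative only: the function names are hypothetical, and the order in which the steps are applied (gene-average subtraction, then cutoff, then squaring) is an assumption rather than a statement of SeekMiner's internal order.

```python
def platform_normalize(z, plat_avg, plat_stdev):
    """--norm_subavg_plat: subtract the platform average, divide by its stdev."""
    return (z - plat_avg) / plat_stdev


def transform_z(z, gene_avg=0.0, cutoff=None, square=False):
    """Apply the z-score transformations to one score (assumed order of steps)."""
    z = z - gene_avg                      # --norm_subavg
    if cutoff is not None and z < cutoff:
        z = 0.0                           # --score_cutoff: below cutoff becomes zero
    if square:
        z = z * z                         # --square_z
    return z


# Toy values: a strong score survives and is boosted, a weak one is zeroed.
strong = transform_z(2.5, gene_avg=0.5, cutoff=1.0, square=True)
weak = transform_z(0.8, gene_avg=0.5, cutoff=1.0, square=True)
```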
Users may also define, in a query-specific way, the datasets that they wish to integrate, using the -D argument. If this argument is absent, all datasets in the compendium are integrated. If -D is used, the search datasets must be selected from the datasets defined in dset_platform_map.
The output files are organized by query. For the first query (file name prefix 0), the final results consist of three files: 0.query, 0.dweight, and 0.gscore.
0.query stores the space-delimited query gene set as text.
0.dweight stores the dataset weights as a binary one-dimensional float vector (see SeekEvaluator for displaying a DWEIGHT extension file).
0.gscore stores the gene scores as a binary one-dimensional float vector (see SeekEvaluator for displaying a GSCORE extension file).
-x
dset_platform_map
Tab-delimited text file containing two columns: the dataset name and the corresponding platform name. Below are a few sample lines:
GSE15913.GPL570.pcl	GPL570
GSE16122.GPL2005.pcl	GPL2005
GSE16797.GPL570.pcl	GPL570
GSE16836.GPL570.pcl	GPL570
GSE17351.GPL570.pcl	GPL570
GSE17537.GPL570.pcl	GPL570
Note that although the dataset name looks like a file name, it does not need to be a valid file name, as long as it properly and uniquely describes the dataset. Here, each dataset is uniquely identified by a combination of a GSE ID and a GPL ID. In addition, the ordering of the datasets in this file must match the order of the datasets in the CDatabaselet (i.e., the DB files).
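A small parser makes the format requirements explicit: two tab-separated columns, unique dataset names, and preserved order. This is a sketch, not Sleipnir code; the function name parse_dset_platform_map is hypothetical, and tolerating blank lines is an assumption about the format.

```python
def parse_dset_platform_map(text):
    """Parse tab-delimited 'dataset<TAB>platform' lines, preserving file order."""
    entries = []
    seen = set()
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines (an assumption about the format)
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 tab-separated columns")
        dataset, platform = fields
        if dataset in seen:
            raise ValueError(f"line {line_no}: duplicate dataset name {dataset!r}")
        seen.add(dataset)
        entries.append((dataset, platform))
    return entries


# Sample rows taken from the dset_platform_map description.
sample = (
    "GSE15913.GPL570.pcl\tGPL570\n"
    "GSE16122.GPL2005.pcl\tGPL2005\n"
    "GSE16797.GPL570.pcl\tGPL570\n"
)
datasets = parse_dset_platform_map(sample)
platforms = sorted({platform for _, platform in datasets})
```

Keeping the entries as an ordered list rather than a dict reflects the requirement that the dataset order match the DB files.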
-i
gene_map
Tab-delimited gene-map file. Maps each gene to an ID between 0 and N, where N is the genome size. Example:
1	1
2	10
3	100
4	1000
5	10000
6	100008589
7	100009676
8	10001
9	10002
10	10003
11	100033413
12	100033414
The ordering of the genes in this file must match the order of genes in the CDatabaselets (DB files).
-q
query
The file can contain multiple queries, listed one query per line. The genes in each query are separated by spaces. Example:
10003 10002 10001 634 6265
The gene names must be selected from the genes in the gene_map. The maximum length of a query depends on the amount of memory available on the system. It is recommended to keep each query under 100 genes.
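The constraints on the query file can be checked before a search is launched. The sketch below assumes the gene map has two tab-separated columns (ID, then gene name) as in the example; the function names and the toy inputs are hypothetical.

```python
def parse_gene_map(text):
    """Parse 'id<TAB>gene' lines into {gene_name: id}."""
    gene_to_id = {}
    for line in text.splitlines():
        if line.strip():
            gene_id, gene = line.split("\t")
            gene_to_id[gene] = int(gene_id)
    return gene_to_id


def parse_queries(text, gene_to_id):
    """Parse one space-separated query per line; reject genes missing from the map."""
    queries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        genes = line.split()
        if not genes:
            continue
        unknown = [g for g in genes if g not in gene_to_id]
        if unknown:
            raise ValueError(f"query line {line_no}: genes not in gene_map: {unknown}")
        queries.append(genes)
    return queries


# Toy inputs: three gene-map rows from the example, and two made-up queries.
gene_to_id = parse_gene_map("8\t10001\n9\t10002\n10\t10003\n")
queries = parse_queries("10003 10002 10001\n10001 10002\n", gene_to_id)
# Queries at or above the recommended size (100 genes is advice, not a hard limit).
oversized = [q for q in queries if len(q) >= 100]
```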
-D
search_dset
This file defines the list of datasets to be used for the query coexpression search. The list is defined in a query-specific way. An example is provided below:
GSE15913.GPL570.pcl GSE16122.GPL2005.pcl GSE16836.GPL570.pcl ...
GSE14933.GPL570.pcl GSE15162.GPL2005.pcl GSE15566.GPL570.pcl ...
...
where each line corresponds to one query and contains a space-separated list of datasets for that query. The dataset names must be selected from the file dset_platform_map.
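A quick consistency check between the search dataset file and the dataset map might look like the sketch below. It assumes one line per query, in the same order as the query file; the function name and the toy inputs are hypothetical.

```python
def validate_search_dsets(search_text, known_datasets, num_queries):
    """Return per-query dataset lists after checking them against the dataset map."""
    per_query = [line.split() for line in search_text.splitlines() if line.strip()]
    if len(per_query) != num_queries:
        raise ValueError(
            f"expected {num_queries} dataset lines, found {len(per_query)}"
        )
    known = set(known_datasets)
    for query_index, names in enumerate(per_query):
        unknown = sorted(set(names) - known)
        if unknown:
            raise ValueError(f"query {query_index}: unknown datasets {unknown}")
    return per_query


# Toy inputs: names reuse the examples from the text; the pairing is made up.
known = ["GSE15913.GPL570.pcl", "GSE16122.GPL2005.pcl", "GSE16836.GPL570.pcl"]
search_text = (
    "GSE15913.GPL570.pcl GSE16122.GPL2005.pcl\n"
    "GSE16836.GPL570.pcl\n"
)
per_query_datasets = validate_search_dsets(search_text, known, num_queries=2)
```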
-P
platform_dir
Directory that contains the following 3 files:
all_platforms.gplatavg: the platform average z-scores
all_platforms.gplatstdev: the platform z-score standard deviations
all_platforms.gplatorder: the order of platforms
These binary files are generated by SeekPrep. This directory must be specified when using --norm_subavg_plat.
-p
prep_dir
Directory that contains the gene presence files (.gpres) and the gene average files (.gavg). There should be one pair of these files for every dataset specified in dset_platform_map. Generated by SeekPrep.
-d
db_dir
Directory that contains the CDatabase (all of the DB files).
-Q
quant
The quant file specifies how the z-scores are binned. This is necessary for reading the z-scores correctly, because they are stored on disk as binned values; the quant file is used to convert them back to z-scores when they are read. Currently, the maximum number of bins supported is 255. A snapshot of the quant file is below:
-5.00 -4.96 -4.92 -4.88 -4.84 -4.80 -4.76 -4.72 -4.68 -4.64 -4.60 -4.56 -4.52 ...
The bin boundaries are separated by spaces.
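The role of the quant file can be sketched as a lookup between bin indices and boundary values. The mapping used here is an assumption for illustration: a stored bin index i is decoded to the i-th boundary, and a z-score is encoded as the index of the first boundary that is greater than or equal to it, clamped to the last bin. The function names are hypothetical, and the sample is a truncated prefix of the snapshot above.

```python
from bisect import bisect_left

MAX_BINS = 255  # limit stated in the text above


def parse_quant(text):
    """Read space-separated bin boundaries and check basic sanity."""
    boundaries = [float(token) for token in text.split()]
    if not boundaries:
        raise ValueError("quant file is empty")
    if len(boundaries) > MAX_BINS:
        raise ValueError(f"too many bins: {len(boundaries)} > {MAX_BINS}")
    if boundaries != sorted(boundaries):
        raise ValueError("bin boundaries must be in increasing order")
    return boundaries


def bin_to_zscore(boundaries, bin_index):
    """Decode a stored bin index back to a z-score (assumed mapping)."""
    return boundaries[bin_index]


def zscore_to_bin(boundaries, z):
    """Encode a z-score as the first bin whose boundary is >= z, clamped to the last bin."""
    return min(bisect_left(boundaries, z), len(boundaries) - 1)


# Truncated toy sample: the first five boundaries of the snapshot.
boundaries = parse_quant("-5.00 -4.96 -4.92 -4.88 -4.84")
decoded = bin_to_zscore(boundaries, 1)
encoded = zscore_to_bin(boundaries, -4.95)
```

Because values are snapped to bin boundaries, a z-score read back from disk is only accurate to within the bin width.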
-o
output_dir
Directory that will contain the search results.
-u
sinfo_dir
Directory that contains the SINFO files, which list the average and standard deviation of a dataset's z-scores over all pairs of genes. If this directory is provided, there should be one SINFO file for every dataset in dset_platform_map. Generated by SeekPrep.
package "SeekMiner"
version "1.0"
purpose "Performs cross-platform microarray query-guided search"
section "Main"
option "dset" x "Input a set of datasets"
string typestr="filename" yes
option "search_dset" D "A set of datasets to search. If not specified, search all datasets."
string typestr="filename" default="NA"
option "input" i "Input gene mapping"
string typestr="filename" yes
option "query" q "Query gene list"
string typestr="filename" yes
option "dir_in" d "Database directory"
string typestr="directory" yes
option "dir_prep_in" p "Prep directory (containing .gavg, .gpres files)"
string typestr="directory" yes
option "dir_platform" P "Platform directory (containing .gplatavg, .gplatstdev, .gplatorder files)"
string typestr="directory" yes
option "dir_sinfo" u "Sinfo Directory (containing .sinfo files)"
string typestr="directory" default="NA"
option "dir_gvar" U "Gene variance directory (containing .gexpvar files)"
string typestr="directory" default="NA"
option "quant" Q "quant file (assuming all datasets use the same quantization)"
string typestr="filename" yes
option "num_db" n "Number of databaselets in database"
int default="1000" yes
option "num_threads" T "Number of threads"
int default="8"
option "per_g_required" H "Fraction (max 1.0) of genome required to be present in a dataset. Datasets not meeting the minimum required genes are skipped."
float default="0.0"
section "Dataset weighting"
option "weighting_method" V "Weighting method: query cross-validated weighting (CV), equal weighting (EQUAL), order statistics weighting (ORDER_STAT), variance weighting (VAR), user-given weighting (USER), SPELL weighting (AVERAGE_Z)"
values="CV","EQUAL","ORDER_STAT","VAR","USER","AVERAGE_Z" default="CV"
section "Optional - Functional Network Expansion"
option "func_db" w "Functional network db path"
string typestr="directory"
option "func_n" f "Functional network number of databaselets"
int default="1000"
option "func_prep" W "Functional network prep & platform directory"
string typestr="directory"
option "func_quant" R "Functional network quant file"
string typestr="filename"
option "func_dset" F "Functional network dset-list file (1 dataset)"
string typestr="filename"
option "func_logit" l "Functional network, integrate using logit values"
flag off
section "Optional - User-given Weighting"
option "user_weight_list" J "List of pre-computed dataset weight files (.dweight)"
string typestr="filename"
section "Optional - Random simulations"
option "random" S "Generate random ranking score"
flag off
option "num_random" t "Number of repetitions of generating random rankings"
int default="10"
section "Optional - Distance matrix transformations"
option "dist_measure" z "Distance measure"
values="pearson","z_score" default="z_score"
option "norm_subavg" m "If z_score is selected, subtract each result gene's average z-score in the dataset."
flag off
option "norm_subavg_plat" M "If z_score is selected, subtract each query gene's average score across platforms and divide by its stdev. Performed after --norm_subavg."
flag off
option "score_cutoff" c "Cutoff on the gene-gene score before adding, default: no cutoff"
float default="-9999"
option "square_z" e "If z_score is selected, take the square of the z-scores. Usually used in conjunction with --score_cutoff."
flag off
section "Options for Dataset weighting"
option "per_q_required" C "Fraction (max 1.0) of query required to correlate with a gene, in order to count the gene's query score. A gene may not correlate with a query gene if it is absent, or its correlation with query does not pass cut-off (specified by --score_cutoff). Use this with caution. Be careful if using with --score_cutoff."
float default="0.0"
section "Options for CV-based dataset weighting"
option "CV_partition" I "The query partitioning method (for CV weighting): Leave-One-In, Leave-One-Out, X-Fold."
values="LOI","LOO","XFOLD" default="LOI"
option "CV_fold" X "The number of folds (for X-fold partitioning)."
int default="5"
option "CV_rbp_p" G "The parameter p for RBP scoring of each partition for its query gene retrieval (for CV weighting)."
float default="0.99"
section "MISC"
option "is_nibble" N "Whether the input DB is nibble type"
flag off
option "buffer" b "Number of Databaselets to store in memory"
int default="20"
option "output_text" O "Output results (gene scores and dataset weights) as text"
flag off
option "output_dir" o "Output directory"
string typestr="directory" yes
option "output_w_comp" Y "Output dataset weight components (generates .dweight_comp file)"
flag off
option "simulate_w" E "If equal weighting or order-statistics weighting is selected, output simulated dataset weights"
flag off
option "additional_db" B "Utilize a second CDatabase collection. Path to the second CDatabase's setting file."
string default="NA"