Sleipnir: SeekMiner

SeekMiner is the main program for integrating coexpression across thousands of microarray datasets. Users supply a set of genes as input (the query), and the program returns other genes that are coexpressed with the query genes.

The main challenge in performing the user's query is finding the right datasets. Because not all microarray datasets are relevant to exploring the query's coexpression, SeekMiner favors those datasets in which the query genes are highly correlated with each other. The intuition is that coregulation among the query genes suggests both that they participate in the same biological process and that this process is highly active in the dataset. Datasets that pass this criterion are therefore very informative to the search process.

In addition to the default coregulation-based weighting, SeekMiner supports other methods of scoring datasets, such as rank-based methods (order statistics) and equal weighting.

Users can easily compare methods, adjust parameters in the search algorithms, specify the datasets to be integrated, and test a number of different queries of varying lengths in order to achieve their desired results.

Usage

Basic Usage

 SeekMiner -x <dset_platform_map> -i <gene_map> -q <query> -P <platform_dir> -p <prep_dir> -n <num_db>
 -d <db_dir> -Q <quant> -o <output_dir> -V <weight_method> -z <distance_measure> -m [-D <search_dset>]

This performs the coexpression search for a list of queries, and outputs the gene-ranking and the dataset weights in the output_dir.
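As an illustration, the following Python sketch assembles the basic-usage command line shown above. Only the flags come from the usage line; the function name, the settings dictionary, and every file and directory name are hypothetical placeholders.

```python
import shlex


def build_seekminer_command(settings, search_dset=None):
    """Assemble the basic-usage argument list as a list of strings."""
    args = [
        "SeekMiner",
        "-x", settings["dset_platform_map"],
        "-i", settings["gene_map"],
        "-q", settings["query"],
        "-P", settings["platform_dir"],
        "-p", settings["prep_dir"],
        "-n", str(settings["num_db"]),
        "-d", settings["db_dir"],
        "-Q", settings["quant"],
        "-o", settings["output_dir"],
        "-V", settings.get("weight_method", "CV"),
        "-z", settings.get("distance_measure", "z_score"),
        "-m",  # --norm_subavg
    ]
    # -D is optional: without it, all datasets are searched.
    if search_dset is not None:
        args += ["-D", search_dset]
    return args


# Hypothetical file and directory names, for illustration only.
example_settings = {
    "dset_platform_map": "dset_platform_map.txt",
    "gene_map": "gene_map.txt",
    "query": "query.txt",
    "platform_dir": "platform/",
    "prep_dir": "prep/",
    "num_db": 1000,
    "db_dir": "db/",
    "quant": "quant.txt",
    "output_dir": "results/",
}
command = build_seekminer_command(example_settings)
print(shlex.join(command))
```

Once the paths point at real files, the resulting list can be passed to subprocess.run(command, check=True).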

Weighting Datasets

SeekMiner supports the following weighting methods (-V):

Query cross-validated weighting (CV, default), where we iteratively use a subset of the query to construct a search instance to retrieve the remaining query genes. This is a form of measuring the coregulation of query genes using a cross-validation setup.
Equal weighting (EQUAL), where all datasets are weighted equally.
Order statistics integration (ORDER_STAT), which is outlined in Adler et al. (2009). This method computes a P-value statistic by comparing the ranks of correlations across datasets to the ranks that would have been generated by a null distribution (where correlations are assumed to be randomly scattered and all ranks are equally likely).

The use of -V CV is highly recommended.
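To make the idea behind CV weighting concrete, here is a toy Python sketch. It is not SeekMiner's implementation. It assumes a Leave-One-In partition (each query gene in turn is the only seed), scores how well the seed retrieves the remaining query genes using rank-biased precision (RBP) with parameter p, and takes the mean over partitions as the dataset weight. The data layout, the function names, and the averaging step are all assumptions made for illustration.

```python
def rbp(ranked_genes, relevant, p=0.99):
    """Rank-biased precision: (1 - p) * sum of p**rank over relevant hits.

    rank is 0-based, so a hit in first place contributes (1 - p).
    """
    return (1.0 - p) * sum(
        p ** rank for rank, gene in enumerate(ranked_genes) if gene in relevant
    )


def loi_cv_weight(zscores, query, p=0.99):
    """Toy Leave-One-In weight of one dataset for one query.

    zscores[a][b] is the z-score between genes a and b in this dataset
    (hypothetical in-memory layout).
    """
    partition_scores = []
    for seed in query:
        held_out = set(query) - {seed}
        candidates = [g for g in zscores[seed] if g != seed]
        ranked = sorted(candidates, key=lambda g: zscores[seed][g], reverse=True)
        partition_scores.append(rbp(ranked, held_out, p))
    # Assumption: the dataset weight is the mean score over partitions.
    return sum(partition_scores) / len(partition_scores)


query = ["A", "B"]
# Dataset where the query genes are strongly coexpressed with each other.
coregulated = {
    "A": {"B": 3.0, "C": 0.1, "D": -0.5},
    "B": {"A": 3.0, "C": 0.2, "D": 0.0},
}
# Dataset where the query genes are not coexpressed.
unrelated = {
    "A": {"B": -1.0, "C": 2.0, "D": 1.0},
    "B": {"A": -1.0, "C": 0.5, "D": 1.5},
}
weight_coregulated = loi_cv_weight(coregulated, query)
weight_unrelated = loi_cv_weight(unrelated, query)
print(weight_coregulated > weight_unrelated)
```

In this toy setup the dataset in which each query gene ranks the other first receives the larger weight, which is the behavior the coregulation intuition calls for.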

Distance Measure and Transformations

Users can select between Pearson correlations (-z pearson) and z-scores of Pearson correlations (-z z_score). Z-scores are the recommended choice because they normalize the correlation distribution to a standard normal distribution that can be compared across datasets. In addition, SeekMiner provides the following transformations on z-scores to further boost signals:

--score_cutoff. Cuts off z-scores at a specified value. Z-scores that fall below the cut-off are assigned zero.
--norm_subavg. Subtracts each gene's average z-score. This prevents highly connected genes from being constantly returned with top ranks in the ranking.
--norm_subavg_plat. Normalizes z-score by subtracting the average across the platform and dividing by its standard deviation. This is designed to handle potential platform biases on the z-scores.
--square_z. Squares the z-score. This is another way to boost the highly correlated gene-pairs.

It is highly recommended to enable --norm_subavg.
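The effect of three of these transformations on one query gene's scores can be sketched in a few lines of Python. This is an illustration only: the function name and data layout are hypothetical, and the order in which the steps are applied here (subtract average, then cut off, then square) is an assumption rather than a statement about SeekMiner's internals.

```python
def transform_scores(z_by_gene, gene_avg, subavg=True, cutoff=None, square=False):
    """Apply toy versions of --norm_subavg, --score_cutoff and --square_z.

    z_by_gene: z-score of each candidate gene against one query gene.
    gene_avg:  each candidate gene's average z-score in the dataset.
    """
    transformed = {}
    for gene, z in z_by_gene.items():
        if subavg:
            # Penalize genes that correlate highly with everything.
            z = z - gene_avg[gene]
        if cutoff is not None and z < cutoff:
            # Scores below the cut-off are assigned zero.
            z = 0.0
        if square:
            z = z * z
        transformed[gene] = z
    return transformed


# Hypothetical scores: HUB has a high raw z-score but is high on average too.
raw = {"G1": 2.5, "G2": 1.0, "HUB": 3.0}
averages = {"G1": 0.5, "G2": 0.2, "HUB": 2.8}
result = transform_scores(raw, averages, subavg=True, cutoff=1.0, square=True)
print(result)
```

Squaring without a cut-off would turn strongly negative z-scores into large positive ones, which is one reason the option listing suggests pairing --square_z with --score_cutoff.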

Search Datasets

Users may also define the datasets that they wish to use for integration in a query-specific way, using the -D argument. If this argument is absent, all datasets in the compendium will be integrated. If -D is used, the search datasets must be selected from the available datasets defined in dset_platform_map.

Output

The output files are grouped by query. For the first query (file base name 0), the final results consist of three files: 0.query, 0.dweight, 0.gscore.

The file base name (0) indicates the query index in the list.
The 0.query stores the space-delimited query gene-set in text.
The 0.dweight stores the weightings of datasets as a binary one-dimensional float vector (see SeekEvaluator for displaying a DWEIGHT extension file).
The 0.gscore stores the gene scores as a binary one-dimensional float vector (see SeekEvaluator for displaying a GSCORE extension file).
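The naming scheme above can be captured with a small Python helper, together with a reader for the text .query file. The helper names are hypothetical. The binary .dweight and .gscore files are deliberately not parsed here, because their exact byte layout is not described in this section; SeekEvaluator remains the documented way to display them.

```python
import tempfile
from pathlib import Path


def result_files(output_dir, query_index):
    """Paths of the three result files for the query at query_index."""
    out = Path(output_dir)
    return {
        ext: out / f"{query_index}.{ext}"
        for ext in ("query", "dweight", "gscore")
    }


def read_query_file(path):
    """Return the space-delimited gene-set stored in a .query file."""
    return Path(path).read_text().split()


# Demonstration on a throwaway directory with a hand-written 0.query file.
with tempfile.TemporaryDirectory() as tmp:
    files = result_files(tmp, 0)
    files["query"].write_text("10003 10002 10001\n")
    genes = read_query_file(files["query"])

print(genes)
```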

Query-independent search setting files and directories

-x dset_platform_map

Tab-delimited text file containing two columns: the dataset name and the corresponding platform name. Below are a few sample lines:

 GSE15913.GPL570.pcl  GPL570
 GSE16122.GPL2005.pcl GPL2005
 GSE16797.GPL570.pcl  GPL570
 GSE16836.GPL570.pcl  GPL570
 GSE17351.GPL570.pcl  GPL570
 GSE17537.GPL570.pcl  GPL570

Note that although the dataset name looks like a file name, it does not need to be a valid file name, as long as it properly and uniquely describes the dataset. Here, the dataset is uniquely identified by a GSE ID and a GPL ID combination. In addition, the ordering of the datasets in this file must match the order of the datasets in the CDatabaselet (i.e., the DB files).
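A minimal Python sketch of a reader for this file is given below. It assumes exactly two tab-separated fields per line, preserves the file order (since the order must match the DB files), and rejects duplicate dataset names. The function name is hypothetical.

```python
def parse_dset_platform_map(lines):
    """Return (ordered dataset names, {dataset name: platform name})."""
    datasets = []
    platform_of = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 tab-separated fields")
        name, platform = fields[0].strip(), fields[1].strip()
        if name in platform_of:
            raise ValueError(f"line {line_no}: duplicate dataset {name}")
        datasets.append(name)
        platform_of[name] = platform
    return datasets, platform_of


sample = [
    "GSE15913.GPL570.pcl\tGPL570\n",
    "GSE16122.GPL2005.pcl\tGPL2005\n",
    "GSE16797.GPL570.pcl\tGPL570\n",
]
datasets, platform_of = parse_dset_platform_map(sample)
print(datasets)
print(platform_of["GSE16122.GPL2005.pcl"])
```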

-i gene_map

Tab-delimited gene-map file. Maps each gene to an ID between 0 and N, where N is the genome size. Example:

 1    1
 2    10
 3    100
 4    1000
 5    10000
 6    100008589
 7    100009676
 8    10001
 9    10002
 10   10003
 11   100033413
 12   100033414

The ordering of the genes in this file must match the order of genes in the CDatabaselets (DB files).

-q query

The file can contain multiple queries that are listed one query per line. The genes in each query are separated by spaces. Example:

 10003 10002 10001
 634 6265

The names of the genes must be selected from the genes in the gene_map. The maximum length of a query depends on the amount of memory available on the system. It is recommended to keep each query under 100 genes.
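These constraints can be checked before launching a search. The Python sketch below reads a gene_map, assuming the first column is the numeric ID and the second is the gene name (as the example suggests), and then validates each query line against it. The function names and the warning behavior are hypothetical.

```python
import warnings


def parse_gene_map(lines):
    """Return {gene name: ID} from tab-delimited 'ID<TAB>gene' lines."""
    gene_to_id = {}
    for line in lines:
        if line.strip():
            gene_id, gene = line.rstrip("\n").split("\t")
            gene_to_id[gene.strip()] = int(gene_id)
    return gene_to_id


def parse_queries(lines, gene_to_id, recommended_max=100):
    """Return one list of gene names per non-empty query line."""
    queries = []
    for line in lines:
        genes = line.split()
        if not genes:
            continue
        index = len(queries)  # query index, as used for output file names
        unknown = [g for g in genes if g not in gene_to_id]
        if unknown:
            raise ValueError(f"query {index}: genes not in gene_map: {unknown}")
        if len(genes) >= recommended_max:
            warnings.warn(f"query {index} has {len(genes)} genes")
        queries.append(genes)
    return queries


gene_map_lines = ["8\t10001\n", "9\t10002\n", "10\t10003\n"]
gene_to_id = parse_gene_map(gene_map_lines)
queries = parse_queries(["10003 10002 10001\n"], gene_to_id)
print(queries)
```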

-D search_dset

This file defines the list of datasets to be used for the query coexpression search. The file is defined in a query-specific way. An example is provided below:

 GSE15913.GPL570.pcl GSE16122.GPL2005.pcl GSE16836.GPL570.pcl ...
 GSE14933.GPL570.pcl GSE15162.GPL2005.pcl GSE15566.GPL570.pcl ...
 ...

where each line, corresponding to a query, is a space-separated dataset list for the query. The dataset names must be selected from the file dset_platform_map.
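A consistency check for this file might look like the following Python sketch. It assumes the file has exactly one non-empty line per query, in the same order as the query file, and verifies that every dataset name appears in dset_platform_map. The function name is hypothetical.

```python
def parse_search_dsets(lines, known_datasets, num_queries):
    """Return one list of dataset names per query, after validation."""
    per_query = [line.split() for line in lines if line.strip()]
    if len(per_query) != num_queries:
        raise ValueError(
            f"expected {num_queries} lines (one per query), got {len(per_query)}"
        )
    known = set(known_datasets)
    for index, dsets in enumerate(per_query):
        missing = [d for d in dsets if d not in known]
        if missing:
            raise ValueError(f"query {index}: unknown datasets {missing}")
    return per_query


known = [
    "GSE15913.GPL570.pcl",
    "GSE16122.GPL2005.pcl",
    "GSE16836.GPL570.pcl",
]
search_lines = [
    "GSE15913.GPL570.pcl GSE16122.GPL2005.pcl\n",
    "GSE16836.GPL570.pcl\n",
]
search_dsets = parse_search_dsets(search_lines, known, num_queries=2)
print(search_dsets)
```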

-P platform_dir

Directory that contains the following 3 files:

all_platforms.gplatavg. the platform average z-scores
all_platforms.gplatstdev. the platform z-score standard deviation
all_platforms.gplatorder. the order of platforms

These binary files are generated by SeekPrep. The specification of this directory is necessary for --norm_subavg_plat.

-p prep_dir

Directory that contains the gene presence files and the gene average files:

Gene presence (GPRES files): indicates the presence/absence of genes in a dataset
Gene average (GAVG files): indicates the average z-score of each gene in a dataset

There should be one pair of these files for every dataset that is specified in dset_platform_map. Generated by SeekPrep.

-d db_dir

Directory that contains the CDatabase (all of the DB files).

-Q quant

The quant file specifies how the z-scores are binned. This is necessary for properly reading the z-scores, because the z-scores are stored as binned values on disk. This quant file is used to convert them back to z-scores when they are read from disk. Currently, the maximum number of bins supported is 255. A snapshot of the quant file is below:

 -5.00 -4.96 -4.92 -4.88 -4.84 -4.80 -4.76 -4.72 -4.68 -4.64 -4.60 -4.56 -4.52 ...

The bin boundaries are separated by spaces.
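The round trip between z-scores and binned values can be illustrated with a short Python sketch. The exact binning rule used on disk is not spelled out here, so this example assumes that a value maps to the first bin whose boundary is greater than or equal to it, that out-of-range values are clamped to the last bin, and that a bin is converted back to its boundary value. It also assumes that the 255-bin limit applies to the number of boundaries. Function names are hypothetical.

```python
import bisect


def load_quant(text):
    """Parse space-separated bin boundaries from the quant file contents."""
    edges = [float(token) for token in text.split()]
    if len(edges) > 255:
        raise ValueError("at most 255 bins are supported")
    if edges != sorted(edges):
        raise ValueError("bin boundaries must be in increasing order")
    return edges


def quantize(z, edges):
    """Index of the first boundary >= z, clamped to the last bin (assumed rule)."""
    return min(bisect.bisect_left(edges, z), len(edges) - 1)


def dequantize(bin_index, edges):
    """Convert a stored bin index back to a z-score (assumed: the boundary)."""
    return edges[bin_index]


edges = load_quant("-5.00 -4.96 -4.92 -4.88")
bin_index = quantize(-4.95, edges)
print(bin_index, dequantize(bin_index, edges))
```

Under these assumptions a z-score is recovered only to within the width of its bin, so finer boundaries give a smaller rounding error.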

-o output_dir

Directory that will contain the search results.

-u sinfo_dir

Directory that contains the SINFO files, which list a dataset's average z-score between all pairs of genes and the standard deviation. If this directory is provided, there should be one SINFO file for every dataset in dset_platform_map. Generated by SeekPrep.

Detailed Usage

package "SeekMiner"
version "1.0"
purpose "Performs cross-platform microarray query-guided search"

section "Main"
option  "dset"              x   "Input a set of datasets"
                                string typestr="filename"   yes
option  "search_dset"       D   "A set of datasets to search. If not specified, search all datasets."
                                string typestr="filename" default="NA"
option  "input"             i   "Input gene mapping"
                                string  typestr="filename"  yes
option  "query"             q   "Query gene list"
                                string typestr="filename"   yes
option  "dir_in"            d   "Database directory"
                                string  typestr="directory" yes
option  "dir_prep_in"       p   "Prep directory (containing .gavg, .gpres files)"
                                string  typestr="directory" yes
option  "dir_platform"      P   "Platform directory (containing .gplatavg, .gplatstdev, .gplatorder files)"
                                string  typestr="directory" yes
option  "dir_sinfo"         u   "Sinfo Directory (containing .sinfo files)"
                                string  typestr="directory" default="NA"
option  "dir_gvar"          U   "Gene variance directory (containing .gexpvar files)"
                                string  typestr="directory" default="NA"
option  "quant"             Q   "quant file (assuming all datasets use the same quantization)"
                                string  typestr="filename"  yes                             
option  "num_db"            n   "Number of databaselets in database"
                                int default="1000"  yes
option  "num_threads"       T   "Number of threads"
                                int default="8"
option  "per_g_required"    H   "Fraction (max 1.0) of genome required to be present in a dataset. Datasets not meeting the minimum required genes are skipped."
                                float default="0.0"

section "Dataset weighting"
option  "weighting_method"  V   "Weighting method: query cross-validated weighting (CV), equal weighting (EQUAL), order statistics weighting (ORDER_STAT), variance weighting (VAR), user-given weighting (USER), SPELL weighting (AVERAGE_Z)"
                                values="CV","EQUAL","ORDER_STAT","VAR","USER","AVERAGE_Z" default="CV"

section "Optional - Functional Network Expansion"
option  "func_db"           w   "Functional network db path"
                                string  typestr="directory"
option  "func_n"            f   "Functional network number of databaselets"
                                int default="1000"
option  "func_prep"         W   "Functional network prep & platform directory"
                                string typestr="directory"
option  "func_quant"        R   "Functional network quant file"
                                string typestr="filename"
option  "func_dset"         F   "Functional network dset-list file (1 dataset)"
                                string typestr="filename"
option  "func_logit"        l   "Functional network, integrate using logit values"
                                flag    off

section "Optional - User-given Weighting"
option  "user_weight_list"  J   "List of pre-computed dataset weight files (.dweight)"
                                string typestr="filename"

section "Optional - Random simulations"
option  "random"            S   "Generate random ranking score"
                                flag    off
option  "num_random"        t   "Number of repetitions of generating random rankings"
                                int default="10"

section "Optional - Distance matrix transformations"
option  "dist_measure"      z   "Distance measure"
                                values="pearson","z_score" default="z_score"
option  "norm_subavg"       m   "If z_score is selected, subtract each result gene's average z-score in the dataset."
                                flag    off
option  "norm_subavg_plat"  M   "If z_score is selected, subtract each query gene's average score across platforms and divide by its stdev. Performed after --norm_subavg."
                                flag    off
option  "score_cutoff"      c   "Cutoff on the gene-gene score before adding, default: no cutoff"
                                float default="-9999"
option  "square_z"          e   "If z_score is selected, take the square of the z-scores. Usually used in conjunction with --score_cutoff."
                                flag    off

section "Options for Dataset weighting"
option  "per_q_required"    C   "Fraction (max 1.0) of query required to correlate with a gene, in order to count the gene's query score. A gene may not correlate with a query gene if it is absent, or its correlation with query does not pass cut-off (specified by --score_cutoff). Use this with caution. Be careful if using with --score_cutoff."
                                float default="0.0"

section "Options for CV-based dataset weighting"
option  "CV_partition"      I   "The query partitioning method (for CV weighting): Leave-One-In, Leave-One-Out, X-Fold."
                                values="LOI","LOO","XFOLD" default="LOI"
option  "CV_fold"           X   "The number of folds (for X-fold partitioning)."
                                int default="5"
option  "CV_rbp_p"          G   "The parameter p for RBP scoring of each partition for its query gene retrieval (for CV weighting)."
                                float   default="0.99"  

section "MISC"                              
option  "is_nibble"         N   "Whether the input DB is nibble type"
                                flag    off
option  "buffer"            b   "Number of Databaselets to store in memory"
                                int default="20"
option  "output_text"       O   "Output results (gene scores and dataset weights) as text"
                                flag    off
option  "output_dir"        o   "Output directory"
                                string typestr="directory"  yes
option  "output_w_comp"     Y   "Output dataset weight components (generates .dweight_comp file)"
                                flag    off
option  "simulate_w"        E   "If equal weighting or order-statistics weighting is selected, output simulated dataset weights"
                                flag    off
option  "additional_db"     B   "Utilize a second CDatabase collection. Path to the second CDatabase's setting file."
                                string default="NA"