Inference and modelling aspects of multiple ranked lists

Speaker: Michael G. Schimek, Professor, Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria.

Abstract

In recent years there has been increasing interest in rank-based statistical methods for two reasons: (i) their robustness properties make them ideal for detecting relationships without specific distributional assumptions, which is most relevant when analysing Big Data problems, and (ii) the analytic requirements of high-throughput biotechnologies and the need to integrate their outcomes when different experiments or laboratory platforms are involved.

In this seminar we focus on ranked lists derived from high-throughput measurements. Such lists typically comprise between hundreds and tens of thousands of items (e.g. gene expression values). However, only a comparatively small subset of k top-ranked items is informative and useful. Items in the top range are typically characterised by a strong overlap of their rank positions when they are ranked by different instances of assessment. A central inference task is the identification of an overall k* for a number of ranked lists comprising the same set of items, before one can fit a consolidated data model to the obtained sublists. Inference on k* is connected to the notion of stopping rules in machine learning. We present a recent statistical approach for inference in multiple ranked lists as well as a novel tool for the graphical representation of the obtained results. The rarely considered but practically relevant case of dependencies across lists will also be covered.

Another quite demanding task is the estimation of the ‘true signals’ and the errors from multiple ranked (sub)lists. Conventional model-based approaches are not practicable because, in the data we focus on, the number of rankings is rather small compared to the lengths of the ranked lists. Maximum likelihood or moment methods cannot be applied here because it is not possible to write down a simple formula for the target function. Instead we introduce a distribution function approach.

Last but not least, we discuss the difference between a statistical model for the rank data on the one hand and stochastic data aggregation on the other, also in the light of missing rankings. We illustrate the described methods with simulated as well as omics data. Most of the discussed methods have been implemented in our R package TopKLists.
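The idea that top-ranked items overlap strongly across lists, while agreement degenerates further down, can be illustrated with a small sketch. It is written in Python with made-up toy data. It is not the inference procedure presented in the seminar and does not use the TopKLists API. The function names, the item labels and the distance parameter `delta` are assumptions introduced for illustration only.

```python
from typing import List, Sequence


def top_k_overlap(list_a: Sequence[str], list_b: Sequence[str], k: int) -> int:
    """Count the items shared by the top-k positions of two ranked lists."""
    return len(set(list_a[:k]) & set(list_b[:k]))


def concordance_indicators(
    list_a: Sequence[str], list_b: Sequence[str], delta: int
) -> List[int]:
    """Walk down list_a and emit 1 if an item's rank in list_b is within
    delta positions of its rank in list_a, else 0.

    Both lists are assumed to rank the same set of items.
    """
    rank_b = {item: pos for pos, item in enumerate(list_b)}
    return [
        1 if abs(pos - rank_b[item]) <= delta else 0
        for pos, item in enumerate(list_a)
    ]


# Hypothetical toy rankings of the same ten items by two assessors.
list_a = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
list_b = ["g2", "g1", "g3", "g4", "g9", "g10", "g5", "g8", "g6", "g7"]

# Agreement is strong for the first four positions and mostly lost afterwards.
indicators = concordance_indicators(list_a, list_b, delta=1)

# Size of the shared top-k set for every k from 1 to the list length.
overlaps = [top_k_overlap(list_a, list_b, k) for k in range(1, len(list_a) + 1)]

print(indicators)
print(overlaps)
```

In this toy example the indicator sequence consists mostly of ones at the top and mostly of zeros further down. Estimating the point k* at which such a sequence degenerates into noise is the inference problem addressed in the seminar. The sketch only makes the pattern visible and does not estimate k*.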

Keywords

Modelling of rank data, multiple ranked lists, omics data, R package, statistical inference, stochastic aggregation.

Some references

  1. Hall, P. and Schimek, M. G. (2012). Moderate deviation-based inference for random degeneration in paired rank lists. Journal of the American Statistical Association, 107, 661-672.
  2. Lin, S. and Ding, J. (2009). Integration of ranked lists via Cross Entropy Monte Carlo with applications to mRNA and microRNA studies. Biometrics, 65, 9-18.
  3. Lin, S. (2010). Space oriented rank-based data integration. Statistical Applications in Genetics and Molecular Biology, 9(1).
  4. Schimek, M. G., Mysickova, A. and Budinska, E. (2012). An inference and integration approach for the consolidation of ranked lists. Communications in Statistics - Simulation and Computation, 41(7), 1152-1166.
  5. Schimek, M. G. et al. (2015). TopKLists: a comprehensive R package for statistical inference, stochastic aggregation, and visualization of multiple omics ranked lists. Statistical Applications in Genetics and Molecular Biology, DOI 10.1515/sagmb-2014-0093.
  6. Schimek, M. G. and Svendova, V. (2015). Novel methods for the statistical analysis of multiple and repeated rankings. Proceedings of the 60th ISI World Congress, e-publication to appear.
Published Oct. 6, 2015 1:26 PM - Last modified Oct. 12, 2015 10:45 AM