High-dimensional

This project is completed and this page is archived. The last change to this page was made in 2010.

Survival models with high-dimensional data structure

Duration: 2004-2010

Summary

Many clinical disciplines still suffer from the comparatively low predictive power of specially developed risk scores. The hope is that identifying genomic and proteomic features will bring substantial progress. Here, microarray data and protein mass spectra promise further insights. The understanding of whole genomes and the development of disease-specific biomarkers should aid diagnosis, improve the performance of prognostic scores, and finally lead to new treatments. Such data are characterized by a huge number of potential predictors and typically only a few patients, which makes them difficult to analyze. Standard survival techniques, such as fitting a Cox regression model by maximizing the partial likelihood, are not directly applicable.
In this project we adapt statistical approaches that can deal with high-dimensional data structures, such as penalized estimation and boosting. These methods have been developed mostly for continuous and binary responses. Only recently have some proposals been made for right-censored event time responses, and methodological problems remain. One example is the rather fragile selection of the number of steps required for path algorithm procedures. There is also little research on modelling time-varying covariates for high-dimensional data, potentially in combination with time-varying effects on survival. We therefore start with discrete-time survival models, where time-varying covariates are easily incorporated and available techniques for binary response variables can be adapted. In a next step we develop a competitive continuous-time approach. Boosting and path algorithm techniques will be investigated for estimation.
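To illustrate why discrete-time survival models fit naturally with binary-response techniques, the following minimal sketch expands right-censored data into "person-period" rows, one per subject and interval at risk, each with a binary response. It is an illustration under stated assumptions, not the project's implementation: the function name `expand_person_period`, the row layout, and the toy data are all hypothetical.

```python
from typing import List, Sequence, Tuple

# One row per subject and interval: (subject index, interval, covariates, response).
Row = Tuple[int, int, List[float], int]


def expand_person_period(
    times: Sequence[int],
    events: Sequence[int],
    covariate_paths: Sequence[Sequence[Sequence[float]]],
) -> List[Row]:
    """Expand discrete-time survival data into binary-response rows.

    times[i]           : last interval (1, 2, ...) in which subject i was observed
    events[i]          : 1 if the event occurred in that interval, 0 if censored
    covariate_paths[i] : one covariate vector per interval at risk, so
                         time-varying covariates simply differ between rows
    """
    rows: List[Row] = []
    for subject, (t_obs, delta, path) in enumerate(
        zip(times, events, covariate_paths)
    ):
        if len(path) < t_obs:
            raise ValueError("need one covariate vector per interval at risk")
        for interval in range(1, t_obs + 1):
            # The response is 1 only in the interval where the event is observed;
            # censored subjects contribute only zeros.
            y = 1 if (delta == 1 and interval == t_obs) else 0
            rows.append((subject, interval, list(path[interval - 1]), y))
    return rows


# Toy data (hypothetical): subject 0 has an event in interval 3,
# subject 1 is censored in interval 2.
times = [3, 2]
events = [1, 0]
covariate_paths = [
    [[0.1], [0.4], [0.9]],
    [[1.2], [1.0]],
]
rows = expand_person_period(times, events, covariate_paths)
```

Any method for binary responses, for example penalized or boosted logistic regression, could then be fitted to the expanded rows, with the interval entering as an additional predictor for the baseline hazard.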
A central problem is the selection of regularization or complexity parameters. For our discrete-time survival approach, model selection criteria built on model-based estimates of the effective degrees of freedom will be adapted. For validation, we will investigate bootstrap-based estimates of the degrees of freedom. For continuous-time survival models, such degrees-of-freedom estimates are difficult to obtain, and it is important to take the right-censored data structure into account. We will focus on resampling-based estimates of prediction error that incorporate time and deal appropriately with right censoring. These estimates will then be used to select model complexity and so avoid overfitting in our flexible continuous-time survival approach. As an alternative, model selection based on false discovery rates will be investigated.
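One common way to obtain a prediction error estimate that accounts for right censoring is the Brier score with inverse-probability-of-censoring weights. The sketch below shows this for a single time point. It is an assumption that this particular measure matches what the project used; the function names `censoring_km` and `ipcw_brier` and the toy data are hypothetical, and ties between event and censoring times are handled in a simplified way.

```python
from typing import Callable, List, Sequence, Tuple


def censoring_km(
    times: Sequence[float], events: Sequence[int]
) -> Callable[..., float]:
    """Kaplan-Meier estimate G of the censoring distribution.

    Censored observations (events[i] == 0) play the role of "events" here.
    Simplification: the risk set at time u is everyone with time >= u.
    """
    censor_times = sorted({t for t, d in zip(times, events) if d == 0})
    steps: List[Tuple[float, float]] = []  # (time u, value of G from u onward)
    g = 1.0
    for u in censor_times:
        at_risk = sum(1 for t in times if t >= u)
        censored = sum(1 for t, d in zip(times, events) if t == u and d == 0)
        g *= 1.0 - censored / at_risk
        steps.append((u, g))

    def G(t: float, left_limit: bool = False) -> float:
        # left_limit=True returns G(t-), the value just before time t.
        value = 1.0
        for u, g_u in steps:
            if u < t or (u == t and not left_limit):
                value = g_u
            else:
                break
        return value

    return G


def ipcw_brier(
    times: Sequence[float],
    events: Sequence[int],
    surv_at_t: Sequence[float],
    t: float,
) -> float:
    """Censoring-weighted Brier score at time t.

    surv_at_t[i] is the model's predicted probability that subject i
    is still event-free at time t.
    """
    G = censoring_km(times, events)
    total = 0.0
    for T, d, s in zip(times, events, surv_at_t):
        if T <= t and d == 1:
            # Event observed by t: the true status is 0, weighted by 1 / G(T-).
            total += s ** 2 / G(T, left_limit=True)
        elif T > t:
            # Still at risk at t: the true status is 1, weighted by 1 / G(t).
            total += (1.0 - s) ** 2 / G(t)
        # Subjects censored before t have unknown status and get weight zero.
    return total / len(times)


# Toy data (hypothetical): four subjects, the second one censored at time 2.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 0, 1, 1]
predicted_survival = [0.5, 0.5, 0.5, 0.5]
score = ipcw_brier(times, events, predicted_survival, t=2.5)
```

To select model complexity, such a score would be computed on observations left out of each resample, for every candidate complexity value, and the value with the lowest estimated error would be chosen.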
The work in this project will be closely coordinated with the projects of our clinical research partners. In particular, a comprehensive analysis for the project “Microarray validation of cardiovascular risk factors” will be provided. Further benefit can be expected from collaboration with Time-varying and Dynamic scores.

Publications

Porzelius C, Schumacher M, Binder H. The benefit of data-based model complexity selection via prediction error curves in time-to-event data. Computational Statistics 2011; 26:293–301.
Binder H, Porzelius C, Schumacher M. An overview of techniques for linking high-dimensional molecular data to time-to-event endpoints by risk prediction models. Biom J 2011; 53(2):170–189.
Porzelius C, Schumacher M, Binder H. A general, prediction error-based criterion for selecting model complexity for high-dimensional survival models. Statist. Med. 2010; 29(7-8):830–838.
Porzelius C, Schumacher M, Binder H. Sparse regression techniques in low-dimensional survival data settings. Statistics and Computing 2010; 20(2):151–163.
Binder H, Schumacher M. Incorporating pathway information into boosting estimation of high-dimensional risk prediction models. BMC Bioinformatics 2009; 10:18.
Binder H, Allignol A, Schumacher M, Beyersmann J. Boosting for high-dimensional time-to-event data with competing risks. Bioinformatics 2009; 25(7):890–896.
Binder H, Schumacher M. Adapting prediction error estimates for biased complexity selection in high-dimensional bootstrap samples. Statistical Applications in Genetics and Molecular Biology 2008; 7(1):Article 12.
Binder H, Schumacher M. Comment on ‘Network-constrained regularization and variable selection for analysis of genomic data’. Bioinformatics 2008; 24(21):2566–2568.
Binder H, Schumacher M. Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics 2008; 9:14.

Principal investigators

Prof. Dr. Martin Schumacher (IMBI)

Prof. Dr. Jens Timmer (Physikalisches Institut, Universität Freiburg)

Researchers

Prof. Dr. Martin Schumacher (IMBI)

Prof. Dr. Jens Timmer (Physikalisches Institut, Universität Freiburg)

Dr. Harald Binder (IMBI)

Dipl. Stat. Christine Porzelius (IMBI)

Institut für Medizinische Biometrie und Statistik (IMBI)