Microsurv (II)

Survival prediction from clinico-genomic models - a comparative study.

Background

In the paper Survival prediction from clinico-genomic models - a comparative study (Bøvelstad et al. (2009), BMC Bioinformatics 10:413), we showed how clinical and genomic covariates can be combined to optimize predictions of patient survival. On this web page you will find Matlab and R implementations, with user instructions, of the seven prediction methods described in the paper. The main outputs of the programs are parameter estimates for each covariate, corresponding to the covariate's effect on survival. If you also have clinical and genomic measurements for new patients, the estimates are used to calculate a prognostic index for each patient. The programs are suitable for any application that aims to explain or predict time-to-event data from sets of covariates of both low and high dimension.

The methods aim at optimal predictions of patient survival in the setting where a set of clinical covariates as well as a set of high-dimensional genomic covariates are available for each patient. The clinical covariates are assumed to be few in number and known to have an effect on survival. The genomic covariates, on the other hand, are assumed to be of much higher dimension, and not every one of them necessarily affects survival. This implies that some form of dimension reduction should be applied to estimate their possible effects. The prediction methods are adaptations of the following seven dimension reduction and estimation techniques to the Cox proportional hazards model:

  • Univariate selection
  • Principal components regression (PCR)
  • Supervised PCR
  • Partial least squares (PLS) regression
  • Supervised PLS
  • Ridge regression
  • Lasso

The prediction models are obtained by using the clinical and the genomic variables simultaneously, while applying dimension reduction only to the high-dimensional genomic covariates. All prediction methods make use of a parameter lambda, which represents the complexity of the genomic part of the model. For univariate selection, lambda represents the number of selected genomic covariates. For PCR and PLS, it represents the number of components, i.e. linear combinations of the genomic covariates. For supervised PCR/PLS, lambda has two components: the percentage of genomic variables picked out by univariate selection and the number of PCR/PLS components. The complexity parameter for ridge regression and the lasso is the penalty parameter, which controls the amount of shrinkage. For all methods, the optimal value of lambda is found using K-fold cross-validation.
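
The cross-validation step can be sketched generically as follows. This is a minimal illustration in Python, not code from the Matlab/R package: the functions kfold_indices and choose_lambda are hypothetical, and the fit and score callbacks are placeholders, since this page does not state which cross-validation criterion the programs use.

```python
import random

def kfold_indices(n, k, seed=0):
    """Randomly partition patient indices 0..n-1 into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def choose_lambda(n, k, grid, fit, score):
    """Return the grid value with the highest total held-out score."""
    folds = kfold_indices(n, k)
    best_lam, best_total = None, float("-inf")
    for lam in grid:
        total = 0.0
        for test in folds:
            held_out = set(test)
            train = [i for i in range(n) if i not in held_out]
            model = fit(train, lam)        # fit on the other K-1 folds
            total += score(model, test)    # evaluate on the held-out fold
        if total > best_total:
            best_lam, best_total = lam, total
    return best_lam

# Toy stand-ins so the sketch runs: the "model" is just lambda itself,
# and the placeholder score peaks at lambda = 3.
toy_fit = lambda train, lam: lam
toy_score = lambda model, test: -len(test) * (model - 3) ** 2
best = choose_lambda(n=20, k=10, grid=range(0, 6), fit=toy_fit, score=toy_score)
```

In a real analysis, fit would estimate the Cox model with complexity lambda on the training patients, and score would be a predictive criterion evaluated on the held-out patients.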

Download

The first six methods listed above are implemented in Matlab, whereas the lasso is implemented in R. A Zip-file containing the required program files can be downloaded here.

User instructions

After downloading and unzipping the program package, you must prepare the data file(s). The first data file (required) must be of the form "patients times covariates", with the columns ordered as follows: "survival times", "censoring indicator", "clinical covariates", "genomic covariates". In addition, if you have clinical and genomic covariates for new patients (with unknown survival times) for whom you want to estimate prognoses, these can be included in an optional data file of the same form as the required file, but without the first two columns. Note that dimension reduction is applied only to the high-dimensional genomic part of the model, and not to the clinical covariates, which we assume to be far fewer. If you use too many clinical covariates, you may get very unstable predictions.
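
As an illustration of this column layout, the following Python sketch splits a small made-up data matrix into its four blocks. It is not part of the package; the function split_columns and all data values are hypothetical, and the 0/1 coding of the censoring indicator is a placeholder because the coding is not stated on this page.

```python
def split_columns(rows, n_clinical):
    """Split 'patients times covariates' rows into the four column blocks."""
    times = [row[0] for row in rows]                    # column 1: survival times
    status = [row[1] for row in rows]                   # column 2: censoring indicator
    clinical = [row[2:2 + n_clinical] for row in rows]  # next n_clinical columns
    genomic = [row[2 + n_clinical:] for row in rows]    # all remaining columns
    return times, status, clinical, genomic

# Toy data: 3 patients, 2 clinical covariates, 4 genomic covariates.
rows = [
    [5.2, 1, 61, 0, 0.3, -1.2, 0.8, 0.1],
    [7.9, 0, 48, 1, -0.5, 0.4, 1.1, -0.7],
    [2.4, 1, 70, 1, 1.6, 0.2, -0.9, 0.5],
]
times, status, clinical, genomic = split_columns(rows, n_clinical=2)
```

The optional file for new patients would contain only the clinical and genomic blocks, in the same column order.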
 

For all methods except lasso:
 
Open the file ''Script-Clinico.m''. This is a script file that executes one of the six prediction methods; you specify which one by editing the file. Please proceed as outlined below.
  • In line 13 in this file, please specify the number of columns containing clinical covariates.
  • In line 17, specify which prediction method to use. 1 = Univariate selection, 2 = PCR, 3 = supervised PCR, 4 = PLS, 5 = supervised PLS, 6 = Ridge regression.
  • In line 20, please decide the number of folds K to be used in the K-fold cross-validation for the genomic part of the model. The default value is K=10.
  • In line 23 and line 24, please specify the upper limits for the grid of complexity parameters that the cross-validation procedure will search through, by giving values to grid1 and grid2 (grid2 is only used for supervised PCR and supervised PLS, and should be set to 'default' for the other four methods). The lower limit (zero for all methods except ridge) corresponds to a baseline model using no genomic information. For univariate selection, grid1 represents the maximal number of genomic variables to include in the model, and the default value is approximately 15% of the number of individuals. For PCR and supervised PCR, grid1 gives the maximal number of PCR directions, with a default value equal to 15% of the number of individuals. For PLS and supervised PLS, grid1 represents the maximal number of PLS components, with a default value equal to 5. For supervised PCR and supervised PLS, grid2 is the maximum percentage (given as a number between 0 and 1) of gene variables picked out using univariate selection. Finally, for ridge regression, grid1 is the maximal value of the penalty parameter on a normalized log2 scale: grid1 = log2(max lambda/n).
  • Run the script in Matlab.
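
The default upper limits and the ridge scale described above can be illustrated with a short Python sketch. This is not code from Script-Clinico.m: the function names are hypothetical, the rounding of the 15% rule is an assumption, and n is taken to be the number of individuals.

```python
import math

def default_grid1(method, n_patients):
    """Default upper limit for grid1, using the method numbers listed above."""
    if method in (1, 2, 3):   # univariate selection, PCR, supervised PCR
        # About 15% of the individuals; the rounding rule is an assumption.
        return max(1, round(0.15 * n_patients))
    if method in (4, 5):      # PLS, supervised PLS
        return 5
    raise ValueError("no default is stated in the instructions for this method")

def ridge_grid1(max_lambda, n_patients):
    """Normalized log2 scale: grid1 = log2(max lambda / n)."""
    return math.log2(max_lambda / n_patients)

def ridge_max_lambda(grid1, n_patients):
    """Inverse of ridge_grid1: the largest penalty implied by grid1."""
    return n_patients * 2 ** grid1

n = 200
components = default_grid1(2, n)       # PCR with 200 individuals
largest_penalty = ridge_max_lambda(3, n)
```

For example, under these assumptions a ridge setting of grid1 = 3 with 200 individuals corresponds to a maximal penalty of 200 * 2^3 = 1600.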

For the lasso:

First you have to install the glmpath package of Park and Hastie (2006), which contains the coxpath function performing the lasso for the Cox proportional hazards model. Then open the file ''LassoScriptClinico.txt'', and proceed as outlined below.

  • In line 4 in "LassoScriptClinico.txt", please provide the name of the data set you want to investigate.
  • In line 8, please specify the number of columns in your data set that contain clinical covariate values.
  • In line 10, please decide the number of folds K to be used in the K-fold cross-validation. The default value is K=10.
  • In line 12, please specify the lower limit for the grid of tuning parameters that the cross-validation procedure will search through. The default value is 1.
  • In line 14, please specify the upper limit for the grid of tuning parameters that the cross-validation procedure will search through. The default value is 104.
  • Finally, in line 16, please set the number of grid points. The default value is M=100.
  • Run the script in R.

When running the program, if you find that no genomic variables join the active set even after many (>20) steps, please increase relax.lambda in line 23 (cf. the glmpath documentation for further instructions on how to use the coxpath function).

The outputs of the scripts are:

  • lambda, the optimal complexity parameter value(s),
  • kappa, the estimated coefficients for the clinical covariates (first q values) and estimated coefficients for the genomic covariates (last p values),

and, if covariate values for new patients (with unknown survival times) are given as additional input,

  • PI, prognostic indices, i.e. estimates of prognosis for each of the new patients.
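
To show how kappa and PI relate, here is a minimal Python sketch. It assumes that the prognostic index is the usual Cox linear predictor, i.e. the sum of each coefficient times the corresponding covariate value; this page does not spell out the formula, and any centering or scaling the programs apply to the covariates is not reproduced. The function prognostic_index and all numbers are hypothetical.

```python
import math

def prognostic_index(kappa, clinical, genomic):
    """Linear predictor: coefficients times covariates, clinical block first."""
    x = list(clinical) + list(genomic)
    if len(x) != len(kappa):
        raise ValueError("kappa must have q + p entries")
    return sum(k * v for k, v in zip(kappa, x))

# Toy coefficients: q = 2 clinical values first, then p = 3 genomic values.
kappa = [0.5, -0.2, 0.1, 0.0, 0.3]
pi_a = prognostic_index(kappa, [1.0, 2.0], [0.5, 4.0, -1.0])
pi_b = prognostic_index(kappa, [0.0, 1.0], [0.0, 1.0, 1.0])

# Under a Cox model, the hazard ratio between two patients is exp(PI_a - PI_b).
relative_hazard = math.exp(pi_a - pi_b)
```

Under this assumption, a higher PI corresponds to a higher estimated hazard, so the new patients can be ranked by their indices.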

Authors

The Matlab programs were written by Hege Bøvelstad (1) and Ståle Nygård (1,2) under the supervision of Ørnulf Borgan (1). 1 = Department of Mathematics, University of Oslo; 2 = Norwegian Computing Center.

    Contact information: 
Published Oct 24, 2008 02:34 PM - Last modified Jun 22, 2011 10:46 AM