Survival prediction from microarray data
In the paper Predicting survival from microarray data - a comparative study (Bøvelstad et al. (2007), Bioinformatics 23, 2080-2087), seven established methods for survival prediction from high-dimensional genomic data is compared. At this web page you will find Matlab and R implementations of these seven methods along with user instructions. The main outputs of the programs are parameter estimates for each gene, corresponding to the gene`s effect on survival. If you in addition have genomic measurements for new patients, these estimates are used to calculate a prognostic index for each patient. The programs are suitable to any application aiming at correlating and/or predicting some time-to-event using high-dimensional (e.g. genomic) data.
Due to the large number of genomic measurements (used as explanatory variables) one has to resort to some dimension reduction or parameter shrinkage estimation technique to obtain gene parameter estimates. Our prediction methods adopt the Cox proportional hazard`s model to the following seven linear regression estimation techniques:
- Univariate selection
- Forward stepwise selction
- Principal component regression (PCR)
- Supervised principal component regression (supervised PCR)
- Partial least squares (PLS) regression
- Ridge regression
All these methods make use of a tuning parameter which represents the complexity of the model. For the first two methods it represents the number of variables in the model. For PCR and PLS it represents the number of directions or linear combinations of the covariates. For supervised PCR it is bivariate, representing the percentage of variables and the number of PCR directions. The tuning parameter for ridge regression and the lasso is the penalty parameter, which controls the amount of shrinkage. The optimal tuning parameter is found using K-fold cross-validation.
The first six methods listed above are implemented in Matlab, whereas the lasso is implemented in R. A Zip-file containing the required program files can be downloaded here.
After having downloaded and unzipped the program package, you must prepare the data file(s). The first data file (required) must be on the form ''patients times genes'' (Nxp), and the columns must be ordered in the following way: ''survival times'', ''censoring indicator'', ''gene expression values''. In addition, if you have genomic measurements of new patients (with unknown survival times) who you want to estimate prognoses for, these can be included in an (optional) data file of the same form as the required file, but without the first two columns.
For all methods except the lasso:
Open the file ''MicrosurvScript.m''. This is a script file executing one of the six prediction methods, which you have to specify by editing the file. Please proceed as outlined underneath.
- In line 9 in this file, please specify which regression method to use. 1 = Univariate selection, 2 = Forward stepwise selection, 3 = PCR, 4 = supervised PCR, 5 = PLS, 6 = Ridge regression.
- In line 12, please decide the number of folds K to be used in the K-fold cross-validation. The default value is K=10.
- In line 15 and line 16, please specify the upper limits for the grid of tuning parameters that the cross-validation procedure will search through, by giving values to grid and grid2 (grid2 is only used for supervised PCR, and should be set to 'default' for the other five methods). The lower limit is set to zero for all methods, corresponding to a baseline model using no gene expression information. For the univariate selection and the forward stepwise selection, grid1 represents the maximal percentage of gene variables to include in the model, and default values are 10% and 5% of the number of individuals, respectively. For PCR and supervised PCR grid1 gives the maximal number of PCR directions, with default values equal to 10% of the number of indiviuals. Further, for supervised PCR grid2 is the maximum percentage of gene variables picked out using univariate selection. For PLS, grid1 represents the maximal number of PLS direction, with default value equal to 5. Finally, for ridge regression grid1 represents the maximal value of the penalty parameter.
- Run the script in Matlab.
For the lasso:
First you have to install the glmpath package of Park and Tibshirani (2006), which contains the coxpath function performing the lasso for the Cox proportional hazards model. Then open the file ''LassoScript.txt'', and proceed as outlined underneath.
- In line 10 in "LassoScript.txt" please set the number of folds K to be used in the K-fold cross-validation. The default value is K=10.
- In line 13 please specify the upper limit for the grid of tuning parameters that the cross-validation procedure will search through. The default value is 10^6.
- In line 15, please set the number of grid points. The default value is M=1000.
- Finally, in line 17, please specify the maximal number of iterations in the search for the optimal lambda value. Default is 1000 iterations. (C.f. the glmpath documentation for further instructions on how to use the coxpath function.)
- Run the script in R.
The outputs of the scripts are:
- beta, the estimated gene coefficients
- lambda, the optimal tuning parameter value(s),
and, if genomic variables of new patients (with unknown survival times) are given as additional input,
- PI, prognostic indices, i.e. estimates of prognosis for each of the new patients.
The Matlab programs were written by Hege Bøvelstad (1), Ståle Nygård (1) and Hege L. Størvold (2) with supervision of (in alphabetical order) Magne Aldrin (3), Ørnulf Borgan (1), Arnoldo Frigessi (4) and Ole Christian Lingjærde (2). 1=Department of Mathematics, University of Oslo; 2=Department of Informatics, University of Oslo; 3=Norwegian Computing Center; 4=(sfi)² Statistics for Innovation, Department of Biostatistics, University of Oslo.
Contact information: hegembo (a) math.uio.no