Predictive Modelling on Diabetes and Kidney Disease Data

Aug 2015 – Dec 2015

Building Predictive Models Using Publicly Available Data

My group of 5 students and I used the nhgh dataset from Vanderbilt University to build predictive models for kidney disease. The dataset had 19 variables, 3 of which were qualitative. Our response variable was SCr (Serum Creatinine), an indicator of kidney disease.


We wrote all code in R. After a descriptive analysis, we worked through the following steps:

Regression Modelling: Simple Linear Regression and Multiple Regression, accounting for interaction terms and non-linear transformations.

Re-sampling: Fitting on the entire data set, the Validation Set Approach, Leave-One-Out Cross Validation (LOOCV) and K-Fold Cross Validation. We also quantified uncertainty using the Bootstrap.

Model Selection: Forward and Backward Stepwise Selection, Ridge Regression and the Lasso, accounting for non-linearity using Polynomial Regression and Splines.
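
To illustrate the re-sampling step, here is a minimal R sketch of K-fold cross-validation used to compare a few candidate regressions. It is not the project's original code: the data are simulated, the variable names (scr, bun, ht) only mirror those in the nhgh dataset, and the helper kfold_mse is a hypothetical name introduced for this example.

```r
# Illustrative only: simulated data, not the nhgh dataset.
set.seed(1)
n <- 200
sim <- data.frame(
  bun = rnorm(n, mean = 13, sd = 5),    # made-up predictor values
  ht  = rnorm(n, mean = 168, sd = 10)
)
sim$scr <- 0.5 + 0.03 * sim$bun + 0.002 * sim$ht + rnorm(n, sd = 0.2)

# K-fold cross-validation: average mean squared error over the held-out folds.
kfold_mse <- function(formula, data, k = 10) {
  # Randomly assign each row to one of k folds of (nearly) equal size.
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  fold_mse <- sapply(seq_len(k), function(i) {
    fit  <- lm(formula, data = data[folds != i, ])        # train on k - 1 folds
    pred <- predict(fit, newdata = data[folds == i, ])    # predict the held-out fold
    mean((data[folds == i, response] - pred)^2)
  })
  mean(fold_mse)
}

cv_simple   <- kfold_mse(scr ~ bun, sim)
cv_multiple <- kfold_mse(scr ~ bun + ht, sim)
cv_poly     <- kfold_mse(scr ~ poly(bun, 2) + ht, sim)
print(c(simple = cv_simple, multiple = cv_multiple, polynomial = cv_poly))
```

Setting k = nrow(sim) turns the same helper into LOOCV, since each fold then holds exactly one observation.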



We finally arrived at a Generalized Additive Model (GAM) for our dataset. Our results show that:

  • Blood Urea Nitrogen is the main predictor associated with SCr.
  • A model with 4 variables was the best model for our data. These 4 variables are: bun (Blood Urea Nitrogen), ht (height), sexfemale and re-Non-Hispanic Black (re: race).
  • Publicly available data can be analyzed to predict healthcare outcomes, which should encourage more agencies to make de-identified patient data publicly available.
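
For readers who want to see what an additive model on these four predictors might look like, here is a hedged R sketch. It assumes simulated data with made-up effect sizes, and it fits the additive model with lm() and natural splines from the splines package rather than a dedicated GAM package. The summary above does not say which fitting approach the project used, so this shows one possible form, not the model we fitted.

```r
library(splines)  # ns(): natural cubic spline basis; ships with R

# Illustrative only: simulated data with invented effect sizes.
set.seed(2)
n <- 300
sim <- data.frame(
  bun = runif(n, 5, 40),
  ht  = rnorm(n, mean = 168, sd = 10),
  sex = factor(sample(c("male", "female"), n, replace = TRUE),
               levels = c("male", "female")),
  re  = factor(sample(c("Non-Hispanic White", "Non-Hispanic Black", "Other"),
                      n, replace = TRUE),
               levels = c("Non-Hispanic White", "Non-Hispanic Black", "Other"))
)
sim$scr <- 0.6 + 0.02 * sim$bun + 0.0003 * sim$bun^2 + 0.002 * sim$ht -
  0.15 * (sim$sex == "female") + 0.10 * (sim$re == "Non-Hispanic Black") +
  rnorm(n, sd = 0.15)

# Additive model: smooth terms for the continuous predictors,
# plain factor terms for sex and race.
fit <- lm(scr ~ ns(bun, df = 4) + ns(ht, df = 3) + sex + re, data = sim)
print(names(coef(fit)))

# Prediction for one hypothetical individual.
new_person <- data.frame(
  bun = 20,
  ht  = 170,
  sex = factor("female", levels = levels(sim$sex)),
  re  = factor("Non-Hispanic Black", levels = levels(sim$re))
)
pred <- predict(fit, newdata = new_person)
print(pred)
```

With "male" as the reference level, R names the sex coefficient sexfemale, which matches the variable naming in the results above.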