Aug 2015 – Dec 2015
Building Predictive Models Using Publicly Available Data
Our group of five students used the nhgh dataset from Vanderbilt University to build models predicting kidney disease. The dataset had 19 variables, 3 of which were qualitative. Our response variable was Serum Creatinine (SCr), an indicator of kidney disease.
We coded in the R programming language. After a descriptive analysis, we worked through the following steps:
Regression Modelling: Simple Linear Regression and Multiple Regression, including Interaction terms and non-linear transformations.
Re-sampling: fitting on the entire data set, the Validation Set Approach, Leave-One-Out Cross Validation (LOOCV) and K-Fold Cross Validation; we quantified uncertainty using the Bootstrap.
Model Selection: Forward and Backward Stepwise Selection, Ridge Regression and Lasso; we accounted for Non-Linearity using Polynomial Regression and Splines.
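As a rough illustration of the re-sampling step above, here is a minimal sketch of K-Fold Cross Validation (with LOOCV as the special case k = n) for a simple linear regression. It is written in plain Python rather than the R we used, it is not the project's code, and the data are synthetic: the helper names, the coefficients 0.5 and 0.03, and the noise level are all assumptions made up for the example.

```python
import random

def fit_simple_linear(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def k_fold_mse(xs, ys, k, seed=0):
    """Estimate test MSE by k-fold cross-validation (k == len(xs) gives LOOCV)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the training folds only, then score the held-out fold.
        a, b = fit_simple_linear([xs[i] for i in train], [ys[i] for i in train])
        squared_errors.extend((ys[i] - (a + b * xs[i])) ** 2 for i in held_out)
    return sum(squared_errors) / len(squared_errors)

# Synthetic stand-in data (NOT the nhgh dataset): a made-up linear relation
# between a predictor and a response, plus Gaussian noise.
rng = random.Random(1)
bun = [rng.uniform(5, 40) for _ in range(200)]
scr = [0.5 + 0.03 * x + rng.gauss(0, 0.1) for x in bun]

mse_10fold = k_fold_mse(bun, scr, k=10)
mse_loocv = k_fold_mse(bun, scr, k=len(bun))
```

Comparing mse_10fold with mse_loocv shows how the estimated test error changes with the number of folds; the same loop could score competing models to choose between them.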
We finally arrived at a Generalized Additive Model (GAM) for our dataset. Our results show that:
- Blood Urea Nitrogen is the main predictor associated with SCr.
- A model with 4 variables was the best model for our data: bun (Blood Urea Nitrogen), ht (height), sexfemale and re-Non Hispanic Black (re: race).
- Publicly available data can be analyzed to predict healthcare outcomes, which should encourage more agencies to make de-identified patient data publicly available.