Return of the Patient

Data Science to Prevent Diabetes Readmissions

Fall 2019

Background

Diabetes is one of the most common and costly chronic diseases. An estimated 23.1 million people in the United States are diagnosed with diabetes, at a cost of more than $245 billion per year. One particular quality measure of diabetes care is the readmission rate. Readmissions represent a failure of the health system to provide adequate support and are extremely costly. As a result, the Centers for Medicare and Medicaid Services now penalize hospitals when patients are readmitted within 30 days of discharge.

Given these policy changes, being able to identify the patients most at risk for costly readmissions has become a pressing priority for hospital administrators.

Data Source

The data come from the Center for Clinical and Translational Research at Virginia Commonwealth University and cover diabetes patients across 130 U.S. hospitals from 1999 to 2008. There are over 100,000 unique hospital admissions in the dataset, from roughly 70,000 unique patients. It includes demographic elements, such as age, gender, and race, as well as clinical attributes such as tests conducted and counts of emergency and inpatient visits. The variables most important to our analysis are described in the sections below.

Nature of the Data

The dataset we are working with is quite clean, with very few unknowns. Our main modification is to create a factor response variable, re30, that is 1 when the patient is readmitted within 30 days and 0 otherwise. In addition, we dropped the columns that served as patient ID codes, since they are unique to each observation and therefore have no bearing on prediction. The base rate of our response is 0.1116, meaning nearly 89% of patients are not readmitted within 30 days.
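The original analysis appears to have been done in R; as a rough illustration of the preprocessing step described above, here is a minimal pandas sketch. The column names and values are assumptions for illustration only (the real dataset encodes readmission as a string field).

```python
import pandas as pd

# Hypothetical toy rows -- the real dataset's readmission field is assumed
# to take values like "<30", ">30", or "NO"
df = pd.DataFrame({
    "encounter_id": [101, 102, 103, 104],        # ID column (dropped below)
    "readmitted":   ["<30", "NO", ">30", "<30"],
})

# Binary response: 1 if readmitted within 30 days, else 0
df["re30"] = (df["readmitted"] == "<30").astype(int)

# Drop the ID column -- it is unique per observation, so it carries
# no predictive signal
df = df.drop(columns=["encounter_id"])

base_rate = df["re30"].mean()
print(base_rate)  # fraction of encounters readmitted within 30 days
```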

What are the characteristics of readmitted patients?

num_medications indicates the number of medications the patient is on. We don’t see any clear patterns.

When looking at age, we see in general that the re-admittances tend to be congregated toward the center of the age range.

[Figure: distribution of num_medications by readmission status]
[Figure: distribution of age by readmission status]

Overall, these two groups have pretty similar means and spreads. This underscores why it is so difficult for doctors to be able to distinguish which patients are likely going to be readmitted. As a result, we will need to rely on more complex modeling processes to differentiate between the two.

Modeling the Data

Testing & Training Data

I divided the data into training and testing sets to make sure the models do not overfit. I randomly assigned 70% of the observations to training data and 30% to testing data. All models are built on the training data and evaluated on the testing data.
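A minimal sketch of this split, using scikit-learn on stand-in data (the original work appears to be in R; the array shapes here are arbitrary and the seed value is an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy stand-in for the cleaned dataset: 1000 rows, 5 numeric predictors
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# 70/30 split with a fixed seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=471
)
print(X_train.shape, X_test.shape)  # (700, 5) (300, 5)
```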

Logistic Regression

I built the first model using logistic regression, which lets us use a group of variables to predict a binary outcome, in this case re-admittance within 30 days. Because some categorical variables have a large number of levels and certain predictors are insignificant, I used LASSO to select a subset of variables for the logistic regression. I used cross-validation to determine the optimal λ that minimizes binomial deviance. The graph below shows the mean cross-validated deviance and its standard deviation as a function of log(λ).

[Figure: cross-validated binomial deviance vs. log(λ)]

To choose my subset, I used λ1se rather than λmin, since it selects a sparser model. One of my considerations is the interpretability of the model, and having over 100 non-zero coefficients would make it harder to explain. The variables and levels that the LASSO selected are: num_medications, number_emergency, number_inpatient, number_diagnoses, diabetesMed, disch_disp_modified, age_mod, diag1_mod, diag3_mod.
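The selection step above appears to use glmnet-style cross-validation in R. As a rough Python analogue on synthetic data, here is an L1-penalized logistic regression with the penalty strength chosen by cross-validated deviance. Everything here (data, dimensions, parameter values) is illustrative; note that scikit-learn tunes the inverse penalty C rather than λ, and picks the deviance-minimizing value rather than a deliberately sparser one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(1)
n, p = 500, 20
X = rng.normal(size=(n, p))
# Only the first 3 predictors actually matter in this toy example
logit = X[:, 0] - 1.5 * X[:, 1] + 0.8 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# L1-penalized logistic regression; the penalty is chosen by 5-fold
# cross-validation on binomial deviance (neg_log_loss), as cv.glmnet does
lasso = LogisticRegressionCV(
    Cs=20, cv=5, penalty="l1", solver="liblinear",
    scoring="neg_log_loss", random_state=0,
).fit(X, y)

# Indices of predictors with non-zero coefficients -- the selected subset
selected = np.flatnonzero(lasso.coef_[0] != 0)
print(selected)
```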

I then took those variables and fit a logistic regression on them (the relaxed LASSO). After running an Anova() test to check the significance of the predictors, I used backward elimination to remove any variables whose p-values exceeded the 0.01 level. This left me with my final logistic model: re30 ~ number_emergency + number_inpatient + number_diagnoses + insulin + diabetesMed + disch_disp_modified + age_mod + diag1_mod + diag3_mod.

Prediction on Logistic Regression

The next step is to evaluate the accuracy of our model by using it to predict our testing data. First, however, we must choose a threshold. Since logistic regression outputs a probability between 0 and 1, we need to decide the cutoff above which a patient is predicted to be readmitted. A 1/2 threshold may be too simplistic here, because we also need to take into account the real-world impact of our predictions: misidentifying a patient has different costs depending on the type of error. We assume that the cost of mislabeling a patient who will be readmitted as a 0 is twice that of mislabeling a patient who will not be readmitted as a 1.

Given this asymmetry in the error costs, I used Bayes’ Rule with unequal losses as the criterion for evaluating my models: false negatives are penalized with twice the weight of false positives. I used a loop to iterate through different threshold values and stored the weighted (Bayes) MCE for each. The resulting curve drops sharply as the threshold increases and then levels off.
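The threshold sweep described above can be sketched as follows. This is an illustrative Python version on synthetic probabilities, not the original R code; the only grounded piece is the loss structure (false negatives weighted twice as heavily as false positives).

```python
import numpy as np

def weighted_mce(y_true, p_hat, threshold, cost_fn=2.0, cost_fp=1.0):
    """Weighted misclassification error under Bayes' rule with unequal
    losses: a false negative costs twice as much as a false positive."""
    y_pred = (p_hat >= threshold).astype(int)
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return (cost_fn * fn + cost_fp * fp) / len(y_true)

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=1000)
# Toy predicted probabilities correlated with the true labels
p = np.clip(0.5 * y + rng.random(1000) * 0.5, 0.0, 1.0)

# Sweep thresholds from 0.01 to 0.99 and keep the weighted MCE for each
thresholds = np.linspace(0.01, 0.99, 99)
mces = [weighted_mce(y, p, t) for t in thresholds]
best = thresholds[int(np.argmin(mces))]
print(best, min(mces))
```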

I found the minimum MCE to be 0.21758 at a threshold of 0.31; we will use these results to compare against the other models. Using this threshold, the overall accuracy of prediction on the testing data is 0.7824.

Classification Tree

Classification trees are another way to model binary responses. The advantage of using trees over regression is their flexibility: they are generally more accurate when the data do not follow a single directional pattern. However, trees come with drawbacks, including a loss of interpretability and a risk of overfitting. Using the same training data as before, I fit a tree on all the variables with a minimum split of 20 and a cp of 0.0008. This resulted in a tree with 10 terminal nodes and 9 splits. Nodes 12, 18, and 19 have a proportion of 1s above 0.50, and therefore predict re-admittance. The most important variable in the tree is number_inpatient: the first split is at 1.5 and the second at 4.5. Other variables used in the classification tree include diag1_mod, number_emergency, diag2_mod, and diag3_mod.

[Figure: fitted classification tree]
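An rpart-style fit like the one described above can be approximated in Python as a sketch. Here minsplit maps to min_samples_split; rpart's cp has no exact scikit-learn equivalent, so cost-complexity pruning (ccp_alpha) stands in for it. The data and variable names below are synthetic stand-ins, not the real dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
n = 2000
# Toy stand-ins for the count-valued predictors
number_inpatient = rng.poisson(1.0, n)
number_emergency = rng.poisson(0.5, n)
X = np.column_stack([number_inpatient, number_emergency])
logit = 0.8 * number_inpatient + 0.5 * number_emergency - 2.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# min_samples_split plays the role of rpart's minsplit = 20;
# ccp_alpha is a rough stand-in for cp = 0.0008
tree = DecisionTreeClassifier(
    min_samples_split=20, ccp_alpha=0.0008, random_state=0
).fit(X, y)

print(tree.get_n_leaves())  # number of terminal nodes
```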

Prediction on Classification Tree

The next step is to evaluate the accuracy of the model by using it to predict testing data. Just like with the logistic regression, we must choose a threshold. Setting the prediction type to "prob" gives a probability for each patient instead of using the automatic 1/2 cutoff. I used another loop to run through the thresholds, again scoring with Bayes’ Rule with unequal losses. The resulting curve decreases in distinct steps as the threshold increases, likely because a tree predicts from discrete cuts rather than the smooth increments of logistic regression. I found the minimum MCE to be 0.21756 at a threshold of 0.21. The MCE is very close to the logistic regression result, but the threshold is lower.

[Figure: weighted MCE vs. threshold for the classification tree]

Random Forest

The final modeling method is the random forest. It builds a deep tree on each of many bootstrap samples, splitting on only mtry randomly chosen predictors at each node rather than the entire set. After growing the full set of random trees, we bag them by averaging their predictions. For our dataset, I ran randomForest with mtry = 10 and ntree = 500.
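The same configuration can be sketched in Python, where n_estimators corresponds to ntree and max_features to mtry. The data below are synthetic; this is an illustrative analogue of the R call, not the original fit.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
n, p = 1000, 20
X = rng.normal(size=(n, p))
logit = X[:, 0] - X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# mtry = 10 predictors tried at each split, ntree = 500 bootstrapped trees,
# mirroring randomForest(mtry = 10, ntree = 500)
rf = RandomForestClassifier(
    n_estimators=500, max_features=10, random_state=0
).fit(X, y)

# Bagging: class-1 probabilities averaged across the trees
p_hat = rf.predict_proba(X)[:, 1]
print(p_hat[:5])
```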

Prediction on Random Forest

I again iterated through multiple thresholds to find the one that produces the minimum MCE under Bayes’ Rule with unequal losses. This graph shows how increasing the threshold drops the MCE and then levels off. The minimum MCE is 0.21812 at a threshold of 0.47.

[Figure: weighted MCE vs. threshold for the random forest]

Final Model

Finally, we can compare the three models on threshold and MCE to choose our final model. All three give very similar misclassification errors, each weighted using Bayes’ Rule with unequal losses so that false negatives are twice as costly as false positives. Even though the single classification tree has the lowest MCE, the difference is minimal. As a result, I decided to use the logistic regression as my final model, because one of my priorities is being able to recommend clear steps to hospitals in order to save costs. Of the three models, the logistic regression is the easiest to explain and will therefore lead to more straightforward implementation plans down the line.

Model                  Threshold   Weighted MCE
Logistic regression    0.31        0.21758
Classification tree    0.21        0.21756
Random forest          0.47        0.21812

Conclusion

Results

In the logistic model, the log odds of being readmitted are, on average, positively correlated with the number of past emergency visits, inpatient visits, and diagnoses. This makes sense: the more problems a patient has had in the past, the more likely they are to have problems in the future. Having the patient’s insulin adjusted down, being prescribed diabetes medicine, and being older than 20 are each, on average, associated with an increase in the log odds of re-admittance. The diabetes-medicine effect is more likely correlation than causation: patients with more severe diabetes are both more likely to be prescribed medicine and more likely to be readmitted, so the re-admittance is probably not a direct result of the medication. A diag1_mod of 434 or 250.6 is associated with a higher likelihood of being readmitted, while patients with an unknown diag3_mod or a diag3_mod of 272 have lower log odds of being readmitted.
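To make the log-odds interpretation concrete, here is the arithmetic for a single coefficient. The value of beta below is hypothetical, chosen purely for illustration; the actual fitted coefficients live in the model output and are not reproduced here.

```python
import math

# Hypothetical coefficient for number_inpatient -- illustrative only,
# NOT a fitted value from the report's model
beta_inpatient = 0.35

# Each additional prior inpatient visit adds beta to the log odds,
# i.e. multiplies the odds of 30-day readmission by exp(beta)
odds_ratio = math.exp(beta_inpatient)
print(round(odds_ratio, 3))  # ≈ 1.419, a ~42% increase in the odds
```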

Recommendation

In order to reduce future hospital costs due to re-admittance, we recommend that doctors spend more time with patients who have had multiple hospital visits and diagnoses. It is no surprise that patients aged 60 and older are the most likely to be readmitted, so it may be worth instituting an additional waiting period and physician check at the hospital before these patients are sent home. Physicians could also pay special attention to previous diagnoses to inform current treatment, and patients who are not being discharged home should be given special care. In addition, patients who had their insulin adjusted down were more likely to be back at the hospital within a month. This could be a direct result of having the insulin dosage lowered, or due to some confounding factor. To investigate further, physicians could avoid lowering insulin dosages where clinically appropriate and see whether fewer of those patients are readmitted.