Addressing Racial Disparities in Lung Cancer Screening (DSI-SRP)
This DSI-SRP fellowship funded Andrew Gothard to work in the laboratory of Dr. Jeffery Blume in the Data Science Institute at Vanderbilt University during the summer of 2021. Andrew is a junior with majors in Mathematics and Computer Science.
The project funded by this fellowship aimed to introduce Andrew to data science techniques and the concepts in the textbook “An Introduction to Statistical Learning with Applications in R,” and to apply them by analyzing which factors are indicative of being diagnosed with a high stage of lung cancer. The predicted (outcome) variable was binary: individuals with a Sum Stage of 4+ were categorized as “High Stage.” Due to limitations in the findings, a secondary model predicting whether comorbidity index was above or below the mean was added. The predictor variables included binary variables (sex, mother lung cancer, father lung cancer, ever smoker, insured), centered continuous variables (age diagnosed, comorbidity index, last doctor visit in months, distance to cancer center), and categorical variables for race and marital status. The data were provided by the Southern Community Cohort Study.
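The primary analytical tool in this study was logistic regression. Variable selection was carried out with an elastic net, with the penalty chosen by cross validation, to improve the model. Finally, model accuracy was assessed using the AUC. The AUC, or area under the receiver operating characteristic (ROC) curve, can be loosely described as the probability that the model assigns a higher predicted risk to a randomly chosen individual with the outcome than to a randomly chosen individual without it. An AUC of 1.0 indicates perfect discrimination, and an AUC of 0.5 is no better than chance.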
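To make the ranking interpretation of the AUC concrete, here is a minimal sketch in Python. It is not the code used in the project; the function name `auc` and the toy labels and scores are hypothetical and chosen only for illustration. It compares every positive case with every negative case and counts how often the positive case receives the higher score, with ties counted as half.

```python
from itertools import product


def auc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive case
    has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = "High Stage", 0 = not; scores are predicted risks.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(auc(labels, scores))  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

A model that gives every individual the same score ties on every pair and therefore gets an AUC of exactly 0.5, which is why values near 0.5 indicate no useful discrimination.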
First, a model with only main effects was created. Next, all possible interaction terms were added to the model. Variables were then removed from this model via the elastic net. The best model according to cross validation was one with four variables: insurance coverage, the interaction between the male indicator and mother lung cancer, the interaction between comorbidity index and father lung cancer, and the interaction between insurance coverage and last doctor visit (months). The relaxed model included the main effects.
The stage predictor model had an AUC of 0.5017, and the comorbidity index model had an AUC of 0.5231. Because an AUC of 0.5 is what random guessing would achieve, both models have poor predictive power. A few variables that I wish had been available, and that could potentially explain either comorbidity or stage, include how many hours the patients work per week, how many sick days they took, and their area level.
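The step from a main-effects model to one with interaction terms, described above, can be sketched as follows. This is an illustration only, not the project's code. It assumes that "all possible interaction terms" means all two-way (pairwise) interactions, and the snake_case predictor names are hypothetical spellings of the variables listed earlier.

```python
from itertools import combinations

# Predictor names follow the project description; the spellings are hypothetical.
main_effects = [
    "sex", "mother_lung_cancer", "father_lung_cancer", "ever_smoker", "insured",
    "age_diagnosed", "comorbidity_index", "last_doctor_visit_months",
    "distance_to_cancer_center", "race", "marital_status",
]


def pairwise_interactions(names):
    """Return one 'a:b' label for every unordered pair of distinct predictors."""
    return [f"{a}:{b}" for a, b in combinations(names, 2)]


interactions = pairwise_interactions(main_effects)
print(len(main_effects), "main effects,", len(interactions), "pairwise interactions")
```

Race and marital status are counted here as single predictors. Once they are dummy-coded, the number of model columns would be larger, which is part of why a penalized method such as the elastic net is useful for pruning the candidate terms.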
In addition to receiving support through a DSI-SRP fellowship, this project was supported and facilitated by the DSI Data Science Team through their regular summer workshops and demo sessions.