
Comparing Traditional Statistics and Machine Learning Methods for Predicting Metabolic Syndrome Status from Social Factors (DSI-SRP)

Posted on Thursday, August 1, 2019 in College of Arts and Science, Completed Research, DSI-SRP, Medical Sciences, Social and Behavioral Sciences.

This DSI-SRP fellowship funded Rachel Fan to work in the laboratory of Professor Lauren Gaydosh in the Department of Medicine, Health, and Society during the summer of 2019. Rachel is an undeclared student who anticipates graduating in 2022.

The project funded by this fellowship aimed to understand the effects of social relationships and isolation on mortality relative to well-established risk factors like smoking and obesity. Recently, both loneliness and metabolic syndrome, a group of conditions that increases the risk of heart disease, stroke, and type 2 diabetes, have been called epidemics in the US. Understanding how social contexts influence health can inform public health efforts by government and community groups. However, the relationship between social variables and metabolic syndrome remains largely unexplored. Relationships between variables have traditionally been analyzed with statistical methods, but machine learning techniques have become increasingly popular for prediction.

The two objectives of the project were to investigate how well social factors predict metabolic syndrome status and to compare how this prediction can be done with traditional statistics and with machine learning methods. The source of data for this study was AddHealth, a school-based longitudinal study of a nationally representative sample of adolescents starting in the 1994-95 school year.

Logistic regression was used as the traditional statistical method. Logistic regression models the probability of a binary event, in this case having or not having metabolic syndrome. In this project, the logistic regression was used to estimate how strongly each of seven social variables increased or decreased the chance of metabolic syndrome. Random forest was the machine learning method used in this research. A random forest is a group of decision trees, and it classifies an input as having or not having metabolic syndrome based on a “vote” among the classification outputs of the individual trees. The random forest first used a subset of the data to learn how to classify individuals by metabolic syndrome status, and then it was tested on a different set of data to see how accurately it could classify them. It also generated a ranked list of the most important variables used to make the classification decision.
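To make the two ideas above concrete, here is a minimal sketch in plain Python. It is not the project's code: the function names and the coefficient values are hypothetical and chosen only for illustration. It shows how a logistic regression turns a weighted sum of variables into a probability, how a single coefficient can be read as an odds ratio, and how a forest could combine the 0/1 outputs of its trees by majority vote.

```python
import math


def predicted_probability(intercept, coefficients, features):
    """Logistic regression: convert a linear score (log-odds) into a probability."""
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in a variable."""
    return math.exp(coefficient)


def forest_vote(tree_predictions):
    """Majority vote over per-tree 0/1 predictions (a tie counts as 0)."""
    return 1 if 2 * sum(tree_predictions) > len(tree_predictions) else 0


if __name__ == "__main__":
    # Hypothetical coefficients for two made-up social variables.
    intercept, coefficients = -2.0, [0.7, -0.3]
    person = [1.0, 2.0]
    print(predicted_probability(intercept, coefficients, person))
    print(odds_ratio(coefficients[0]))  # above 1: higher odds of the outcome
    print(forest_vote([1, 0, 1, 1, 0]))  # three of five trees vote "has it"
```

A positive coefficient gives an odds ratio above 1 (the variable raises the odds of metabolic syndrome) and a negative one gives an odds ratio below 1. Real random forest implementations may average per-tree probabilities rather than count hard votes, so the voting rule here is a simplification.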

Regarding the first objective, only one of the top 25 most important variables in the random forest was directly related to social support and isolation. In the logistic regression, the impact of the selected social variables differed, and its predictive performance, as measured by the AUC metric, was similar to the random forest’s. For the second objective, differences between the two methodologies included the rationale for selecting variables, the outputs and their interpretations, and the differing error rates (e.g. false positives). One challenge of this project was that there were relatively few individuals in the dataset who had metabolic syndrome.
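The sketch below illustrates, on made-up numbers rather than AddHealth data, why AUC and error counts are reported separately when few individuals have the condition. The function names are hypothetical. AUC is computed here as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half.

```python
def auc(labels, scores):
    """Probability that a random positive case outscores a random negative case."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def confusion_counts(labels, scores, threshold=0.5):
    """Count true/false positives and negatives at a given score threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts


if __name__ == "__main__":
    # Toy imbalanced sample: 9 people without the condition, 1 with it.
    labels = [0] * 9 + [1]
    scores = [0.1] * 9 + [0.4]
    counts = confusion_counts(labels, scores)
    accuracy = (counts["tp"] + counts["tn"]) / len(labels)
    print(counts, accuracy, auc(labels, scores))
```

In this toy sample the classifier labels everyone as negative at a 0.5 threshold, so accuracy is 0.9 even though the one true case is missed, while AUC is 1.0 because the true case still has the highest score. This is why a model can look similar on AUC yet differ in its false positive and false negative rates.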

The DSI-SRP weekly meetings provided valuable opportunities to practice summarizing my research in front of an audience and to exchange advice with my fellow student researchers. Furthermore, since the program ended, I have continued to work with my mentor, Dr. Gaydosh, on a different project that applies machine learning methods to the same AddHealth dataset to investigate “deaths of despair,” which include deaths from suicide, alcohol-related liver disease, and drug overdose. The analytical and communication skills I developed through DSI-SRP have helped me in this role as well.

In addition to receiving support through a DSI-SRP fellowship, this project was supported and facilitated by the DSI Data Science Team through their regular summer workshops and demo sessions.
