Li, Thomas Z., Xu, Kaiwen, Chada, Neil C., Chen, Heidi, Knight, Michael, Antic, Sanja, Sandler, Kim L., Maldonado, Fabien, Landman, Bennett A., & Lasko, Thomas A. (2025). Curating retrospective multimodal and longitudinal data for community cohorts at risk for lung cancer. *Cancer Biomarkers: Section A of Disease Markers, 42*(1). https://doi.org/10.3233/CBM-230340
Large community health studies are valuable tools for understanding lung cancer, helping researchers explore risk factors and build models to predict who might develop the disease. To make the most of this data, a reliable method is needed to identify cases of lung cancer and lung spots known as pulmonary nodules, and to link various types of health information collected over time from electronic health records (EHRs). In this study, researchers used medical coding systems, including SNOMED and ICD codes, to create rules for identifying patients with lung cancer or pulmonary nodules in EHR data. They also applied clinical expertise to determine appropriate timeframes for gathering related health and imaging data. Using this approach, they curated three patient groups, or cohorts, with pulmonary nodules and repeated imaging records from Vanderbilt University Medical Center.
The method proved highly accurate: it correctly identified 93% of patients who had lung cancer (sensitivity) and 99.6% of patients who did not (specificity). Its predictive values were also high, meaning that a patient the rules flagged as having lung cancer, or as not having it, was usually classified correctly. This study presents an effective and scalable strategy for organizing long-term, multi-type health data about individuals at risk for lung cancer, using routinely collected information from medical records.
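To make the accuracy terms concrete, the short Python sketch below computes sensitivity, specificity, and the two predictive values from a table of counts. The counts are hypothetical, chosen only so that sensitivity and specificity match the reported 93% and 99.6%; they are not the study's data, and the predictive values they yield should not be read as the study's results.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute standard accuracy measures from confusion-matrix counts.

    tp: patients with lung cancer whom the rules flagged (true positives)
    fp: patients without lung cancer whom the rules flagged (false positives)
    fn: patients with lung cancer whom the rules missed (false negatives)
    tn: patients without lung cancer whom the rules did not flag (true negatives)
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that were found
        "specificity": tn / (tn + fp),  # share of non-cases correctly cleared
        "ppv": tp / (tp + fp),          # share of flagged patients who are true cases
        "npv": tn / (tn + fn),          # share of cleared patients who are true non-cases
    }


# Hypothetical counts (an assumption for illustration, not the study's data).
metrics = diagnostic_metrics(tp=93, fp=4, fn=7, tn=996)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity describe the rules relative to the truth, whereas the predictive values describe how far a flag can be trusted, which also depends on how common lung cancer is in the group being screened.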
Figure 1. Archives linking EHRs to imaging allowed subjects to be selected via ICD rules. Low-quality scans and data that fell outside the observation windows were excluded. VU-SPN: subjects with no cancer history prior to a solitary pulmonary nodule (SPN) code. VU-LI-SPN: subjects in VU-SPN with imaging. VU-LI-Incidence: subjects with imaging.
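The kind of rule the caption describes for VU-SPN (a nodule code with no cancer history before it) might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the study's rule set: the code lists are limited to the standard ICD categories for lung cancer (ICD-10 C34, ICD-9 162) and solitary pulmonary nodule (ICD-10-CM R91.1, ICD-9-CM 793.11), "cancer history" is narrowed to lung cancer codes, and the record layout and function names are hypothetical.

```python
from datetime import date

# Assumed code lists for illustration; the study's actual rules are not reproduced here.
CANCER_PREFIXES = {"ICD10": "C34", "ICD9": "162"}   # malignant neoplasm of bronchus and lung
SPN_CODES = {"ICD10": "R91.1", "ICD9": "793.11"}    # solitary pulmonary nodule


def is_cancer(system, code):
    """True if the code falls under the assumed lung cancer category."""
    prefix = CANCER_PREFIXES.get(system)
    return prefix is not None and code.startswith(prefix)


def is_spn(system, code):
    """True if the code is the assumed solitary pulmonary nodule code."""
    return SPN_CODES.get(system) == code


def select_spn_cohort(records):
    """Return {patient_id: first SPN date} for patients with no cancer code
    dated before their first SPN code.

    records: iterable of (patient_id, coding_system, code, date) tuples
    (a hypothetical layout).
    """
    first_spn, first_cancer = {}, {}
    for pid, system, code, when in records:
        if is_spn(system, code):
            first_spn[pid] = min(when, first_spn.get(pid, when))
        elif is_cancer(system, code):
            first_cancer[pid] = min(when, first_cancer.get(pid, when))
    return {
        pid: spn_date
        for pid, spn_date in first_spn.items()
        if pid not in first_cancer or first_cancer[pid] >= spn_date
    }


# Made-up records for four patients.
records = [
    ("A", "ICD10", "R91.1", date(2015, 3, 1)),    # nodule first ...
    ("A", "ICD10", "C34.11", date(2016, 1, 10)),  # ... cancer later: kept
    ("B", "ICD9", "162.9", date(2012, 5, 5)),     # cancer first ...
    ("B", "ICD10", "R91.1", date(2014, 2, 2)),    # ... nodule later: excluded
    ("C", "ICD10", "C34.90", date(2013, 8, 8)),   # cancer only: not in cohort
    ("D", "ICD9", "793.11", date(2010, 7, 7)),    # nodule only: kept
]
cohort = select_spn_cohort(records)
```

Keeping the date of the first nodule code for each patient gives a natural reference point from which observation windows for imaging and other health data could then be measured.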
