Identifying who has long COVID in the USA: a machine learning approach using N3C data
University of North Carolina at Chapel Hill · University of North Carolina Health Care · +8 more institutions
Abstract
Post-acute sequelae of SARS-CoV-2 infection, known as long COVID, have severely affected recovery from the COVID-19 pandemic for patients and society alike. Long COVID is characterised by evolving, heterogeneous symptoms, making it challenging to derive an unambiguous definition. Studies of electronic health records are a crucial element of the US National Institutes of Health's RECOVER Initiative, which is addressing the urgent need to understand long COVID, identify treatments, and accurately identify who has it. The last of these is the aim of this study.
Using the National COVID Cohort Collaborative's (N3C) electronic health record repository, we developed XGBoost machine learning models to identify potential patients with long COVID. We defined our base population (n=1 793 604) as any non-deceased adult patient (age ≥18 years) with either an International Classification of Diseases-10-Clinical Modification COVID-19 diagnosis code (U07.1) from an inpatient or emergency visit, or a positive SARS-CoV-2 PCR or antigen test, and for whom at least 90 days had passed since the COVID-19 index date. We examined demographics, health-care utilisation, diagnoses, and medications for 97 995 adults with COVID-19. We used data on these features and 597 patients from a long COVID clinic to train three machine learning models to identify potential long COVID among all patients with COVID-19, patients hospitalised with COVID-19, and patients who had COVID-19 but were not hospitalised. Feature importance was determined via Shapley values. We further validated the models on data from a fourth site.
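The base-population criteria above can be read as a simple inclusion filter. The following is a minimal sketch of that reading, not the authors' code: the `PatientRecord` layout, its field names, and the `in_base_population` helper are hypothetical and do not reflect the actual N3C schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical record layout for illustration only (not the N3C data model).
@dataclass
class PatientRecord:
    age_years: int
    deceased: bool
    # Visit type on which a U07.1 diagnosis code was recorded, if any,
    # e.g. "inpatient", "emergency", "outpatient"; None if no U07.1 code.
    u071_visit_type: Optional[str]
    positive_pcr_or_antigen: bool
    index_date: Optional[date]  # COVID-19 index date

QUALIFYING_VISITS = {"inpatient", "emergency"}
MIN_FOLLOW_UP = timedelta(days=90)

def in_base_population(p: PatientRecord, as_of: date) -> bool:
    """Apply the inclusion criteria as stated in the abstract."""
    # Non-deceased adults only (age >= 18 years).
    if p.deceased or p.age_years < 18:
        return False
    # Either a U07.1 code from an inpatient or emergency visit,
    # or a positive SARS-CoV-2 PCR or antigen test.
    has_qualifying_dx = p.u071_visit_type in QUALIFYING_VISITS
    if not (has_qualifying_dx or p.positive_pcr_or_antigen):
        return False
    # At least 90 days must have passed since the index date.
    if p.index_date is None:
        return False
    return as_of - p.index_date >= MIN_FOLLOW_UP
```

In this sketch, a U07.1 code recorded only at an outpatient visit does not qualify on its own; such a patient enters the base population only through a positive test.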
Citation impact
- FWCI: 32.68
- Percentile: 100%
- References: 28
Authors (23)
- Emily Pfaff (corresponding author)
University of North Carolina at Chapel Hill, University of North Carolina Health Care
- Andrew T. Girvin
- Tellen D. Bennett
University of Colorado Anschutz Medical Campus
- Abhishek Bhatia
University of North Carolina at Chapel Hill
- Ian M. Brooks
Personalis (United States), University of Colorado Anschutz Medical Campus
Topics & keywords
- Coronavirus disease 2019 (COVID-19)
- Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
- Artificial intelligence
- 2019-20 coronavirus outbreak
- Computer science
- Machine learning
- Data science
- Virology
- Good health and well-being