Authors
Jacob Koskimaki, Jenny Hu, Yiduo Zhang, Jose Mena, Nehanda Jones, Elizabeth
Lipschultz, Vivek Prabhakar Vaidya, Gabriel Altay, Vance Andrei Erese, Krishna Kumar
Swaminathan, Emma Mendonca, Tarun Dutt, Kuldeep Singh, Tian King, Vinay Phani
Santosh Lakkimsetty, Hussein Al-Olimat, Brittany Manning, George Anthony
Komatsoulis, Simon Chu, Jeff Ottens
Background: Much of the information describing a patient’s cancer treatment remains in unstructured text in electronic health records and is not recorded in discrete data fields. Complete and accurate data are essential for quality-of-care improvement and for research studies on de-identified patient records. Accessing this high-value content often requires extensive manual curation.
Methods: AstraZeneca, CancerLinQ, ConcertAI, and Tempus have developed a natural language processing (NLP)-assisted process to improve clinical cohort selection for targeted curation efforts. Development of hybrid machine-learning models included text classification, named entity recognition, relation extraction, and false-positive removal. A subset of nearly 60,000 lung cancer cases was included from the CancerLinQ database, which comprises multiple source EHR systems. NLP models extracted EGFR status, stage, histology, radiation therapy, surgical resection, and oral medications. Based on the results, cases were selected for additional manual curation, in which curators confirmed the findings in the NLP-processed data.
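As an illustration of how NLP-derived elements might drive case selection, the following is a minimal sketch that filters cases into cohort 1A (NSCLC, stage I to III, EGFR+, complete resection). The `NlpCase` record, its field names, and the sample cases are hypothetical and are not taken from the study's actual pipeline or data.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class NlpCase:
    """Hypothetical container for the NLP-derived elements of one case."""
    case_id: str
    cancer_type: str             # "NSCLC" or "SCLC"
    stage: Optional[str]         # "I", "II", "III", "IV", or None if not found
    egfr_status: Optional[str]   # "positive", "wild-type", or None if unknown
    complete_resection: bool


def is_cohort_1a(case: NlpCase) -> bool:
    """Cohort 1A criteria: NSCLC, stage I-III, EGFR+, complete resection."""
    return (
        case.cancer_type == "NSCLC"
        and case.stage in {"I", "II", "III"}
        and case.egfr_status == "positive"
        and case.complete_resection
    )


def select_cohort(cases: List[NlpCase],
                  predicate: Callable[[NlpCase], bool]) -> List[NlpCase]:
    """Keep only the cases whose extracted elements satisfy the criteria."""
    return [case for case in cases if predicate(case)]


# Made-up example cases, for illustration only.
cases = [
    NlpCase("a", "NSCLC", "II", "positive", True),
    NlpCase("b", "NSCLC", "IV", "positive", True),   # stage IV: excluded
    NlpCase("c", "NSCLC", "I", None, True),          # EGFR unknown: excluded
    NlpCase("d", "SCLC", "III", "positive", False),  # not NSCLC: excluded
]

cohort_1a_ids = [case.case_id for case in select_cohort(cases, is_cohort_1a)]
print(cohort_1a_ids)
```

Cases selected this way would still go to curators for confirmation, as described above; the filter only narrows the candidate pool.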
Results: NLP methods improved cohort identification. Across cohorts, the proportion of cases successfully returned with the NLP method ranged from 75.2% to 96.5%, well above that achieved with more general case selection criteria based on limited structured data. For all cohorts combined, 84.2% of the cases sent out for curation after NLP-assisted selection were returned with curated content (Table). Each cohort contained a range of NLP-derived elements for curators to review further. In comparison, more general case selection criteria yielded a total of 3,878 cases returned out of 41,186 lung cancer cases sent for curation, a success rate of only 9.6%.
| Cohort | Cohort description | Number of cases available from NLP-assisted identification methods | Number of cases sent to Tempus and ConcertAI for curation | Number of cases returned to CancerLinQ with curated content | Percent of successfully curated cases |
|---|---|---|---|---|---|
| 1A | NSCLC, stage I, II, III, EGFR+, complete resection | 408 | 408 | 341 | 83.60% |
| 1B | NSCLC, non-squamous, stage I, II, III, EGFR wild-type/unknown, complete resection | 4313 | 1500 | 1285 | 85.70% |
| 2A | NSCLC, stage III, unresectable, curative radiation to the chest total dose ≥ 50 Gy, did receive Imfinzi | 852 | 620 | 466 | 75.20% |
| 2B | NSCLC, stage III, unresectable, curative radiation to the chest total dose ≥ 50 Gy, did not receive Imfinzi | 3050 | 750 | 724 | 96.50% |
| 3 | SCLC, received Imfinzi or Tecentriq | 559 | 500 | 402 | 80.40% |
| 4 | NSCLC, received Tagrisso as first line treatment | 971 | 812 | 647 | 79.70% |
| Total: | | 10153 | 4590 | 3865 | |
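The percentages in the table can be reproduced as cases returned divided by cases sent. The short sketch below recomputes the per-cohort and combined rates from the table's counts; the variable and function names are illustrative only.

```python
# (cases sent for curation, cases returned with curated content), from the table
cohorts = {
    "1A": (408, 341),
    "1B": (1500, 1285),
    "2A": (620, 466),
    "2B": (750, 724),
    "3": (500, 402),
    "4": (812, 647),
}


def return_rate(sent: int, returned: int) -> float:
    """Percent of sent cases that came back with curated content."""
    return 100.0 * returned / sent


# Per-cohort rates, rounded to one decimal place as in the table.
rates = {name: round(return_rate(sent, returned), 1)
         for name, (sent, returned) in cohorts.items()}

# Combined rate over all cohorts.
total_sent = sum(sent for sent, _ in cohorts.values())
total_returned = sum(returned for _, returned in cohorts.values())
overall_rate = round(return_rate(total_sent, total_returned), 1)

for name, rate in rates.items():
    print(f"Cohort {name}: {rate}%")
print(f"All cohorts: {total_returned}/{total_sent} = {overall_rate}%")
```

The combined figure is computed from the pooled counts, not as an average of the per-cohort percentages, so larger cohorts weigh more heavily.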
Conclusions: NLP-driven case selection of six distinct, complex lung cancer cohorts resulted in an order-of-magnitude improvement in eligibility over candidate selection using structured EHR data alone. This study demonstrates that NLP-assisted approaches can significantly improve efficiency in curating unstructured health data.