My research centers on developing and evaluating foundational natural language processing (NLP) methods, particularly large language models (LLMs), to convert complex data, such as electronic health records (EHRs), into actionable insights for improving decision-making in healthcare and beyond. I explore broader questions of how both humans and machines understand and utilize language, aiming to develop systems that not only enhance decision-making across various domains but also contribute to a future where artificial intelligence (AI) is safe, trustworthy, and reliably aligned with human needs.
Publications
Gao, Yanjun, Skatje Myers, Shan Chen, Dmitriy Dligach, Timothy A. Miller, Danielle Bitterman, Matthew Churpek, and Majid Afshar. "When Raw Data Prevails: Are Large Language Model Embeddings Effective in Numerical Data Representation for Medical Machine Learning Applications?" Findings of Empirical Methods in Natural Language Processing (EMNLP 2024).
Gao Y, Mahajan D, Uzuner Ö, Yetisgen M. Clinical natural language processing for secondary uses. J Biomed Inform. 2024 Feb;150:104596. PubMed PMID: 38278312
Majid Afshar, Yanjun Gao, Graham Wills, Jason Wang, Matthew M Churpek, Christa J Westenberger, David T Kunstman, Joel E Gordon, Cherodeep Goswami, Frank J Liao, Brian Patterson, Prompt engineering with a large language model to assist providers in responding to patient inquiries: a real-time implementation in the electronic health record, JAMIA Open, Volume 7, Issue 3, October 2024, ooae080, https://doi.org/10.1093/jamiaopen/ooae080
Xin Chen, Hanxian Huang, Yanjun Gao, Yi Wang, Jishen Zhao, and Ke Ding. 2024. Learning to Maximize Mutual Information for Chain-of-Thought Distillation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 6857–6868, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Afshar M, Gao Y, Gupta D, Croxford E, Demner-Fushman D. On the role of the UMLS in supporting diagnosis generation proposed by Large Language Models. Journal of Biomedical Informatics. 2024 Aug 13:104707.
Gao Y, Dligach D, Miller T, Churpek MM, Uzuner O, Afshar M. Progress Note Understanding - Assessment and Plan Reasoning: Overview of the 2022 N2C2 Track 3 shared task. J Biomed Inform. 2023 Jun;142:104346. PubMed PMID: 37061012
Zhou W, Dligach D, Afshar M, Gao Y, Miller TA. Improving the Transferability of Clinical Note Section Classification Models with BERT and Large Language Model Ensembles. Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:125-130. PubMed PMID: 37786810
Gao Y, Dligach D, Miller T, Churpek MM, Afshar M. Overview of the Problem List Summarization (ProbSum) 2023 Shared Task on Summarizing Patients' Active Diagnoses and Problems from Electronic Health Record Progress Notes. Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:461-467. PubMed PMID: 37583489
Sharma B, Gao Y, Miller T, Churpek MM, Afshar M, Dligach D. Multi-Task Training with In-Domain Language Models for Diagnostic Reasoning. Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023(ClinicalNLP):78-85. PubMed PMID: 37492270
Zhou W, Yetisgen M, Afshar M, Gao Y, Savova G, Miller TA. Improving model transferability for clinical note section classification models using continued pretraining. J Am Med Inform Assoc. 2023 Dec 22;31(1):89-97. PubMed PMID: 37725927
Gao Y, Dligach D, Christensen L, Tesch S, Laffin R, Xu D, Miller T, Uzuner O, Churpek MM, Afshar M. A scoping review of publicly available language tasks in clinical natural language processing. J Am Med Inform Assoc. 2022 Sep 12;29(10):1797-1806. PubMed PMID: 35923088
Yetisgen M, Uzuner O, Gao Y, Mahajan D. Call for papers: Special issue on clinical natural language processing for secondary use applications. J Biomed Inform. 2022 Sep;133:104152. PubMed PMID: 35985622
Gao Y, Miller T, Xu D, Dligach D, Churpek MM, Afshar M. Summarizing Patients' Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models. Proc Int Conf Comput Ling. 2022 Oct;2022:2979-2991. PubMed PMID: 36268128
Gao Y, Dligach D, Miller T, Tesch S, Laffin R, Churpek MM, Afshar M. Hierarchical Annotation for Building A Suite of Clinical Natural Language Processing Tasks: Progress Note Understanding. LREC Int Conf Lang Resour Eval. 2022 Jun;2022:5484-5493. PubMed PMID: 35939277