Yanjun Gao, PhD

Assistant Professor, Biomedical Informatics

Graduate School
  • PhD, Pennsylvania State University - University Park Campus (2021)
Languages
English
Department
Biomedical Informatics

Professional Titles

  • Assistant Professor

Research Interests

My research centers on developing and evaluating foundational natural language processing (NLP) methods, particularly large language models (LLMs), to convert complex data, such as electronic health records (EHRs), into actionable insights for improving decision-making in healthcare and beyond. I also explore broader questions of how humans and machines understand and use language, with the aim of developing systems that enhance decision-making across domains and contribute to a future where artificial intelligence (AI) is safe, trustworthy, and reliably aligned with human needs.

Publications

  • Gao Y, Mahajan D, Uzuner Ö, Yetisgen M. Clinical natural language processing for secondary uses. J Biomed Inform. 2024 Feb;150:104596. PubMed PMID: 38278312
  • Gao Y, Dligach D, Miller T, Churpek MM, Uzuner O, Afshar M. Progress Note Understanding - Assessment and Plan Reasoning: Overview of the 2022 N2C2 Track 3 shared task. J Biomed Inform. 2023 Jun;142:104346. PubMed PMID: 37061012
  • Zhou W, Dligach D, Afshar M, Gao Y, Miller TA. Improving the Transferability of Clinical Note Section Classification Models with BERT and Large Language Model Ensembles. Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:125-130. PubMed PMID: 37786810
  • Gao Y, Dligach D, Miller T, Churpek MM, Afshar M. Overview of the Problem List Summarization (ProbSum) 2023 Shared Task on Summarizing Patients' Active Diagnoses and Problems from Electronic Health Record Progress Notes. Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:461-467. PubMed PMID: 37583489
  • Gao Y, Myers S, Chen S, Dligach D, Miller TA, Bitterman D, Churpek M, Afshar M. When Raw Data Prevails: Are Large Language Model Embeddings Effective in Numerical Data Representation for Medical Machine Learning Applications? Findings of Empirical Methods in Natural Language Processing (EMNLP 2024).
  • Sharma B, Gao Y, Miller T, Churpek MM, Afshar M, Dligach D. Multi-Task Training with In-Domain Language Models for Diagnostic Reasoning. Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023(ClinicalNLP):78-85. PubMed PMID: 37492270
  • Zhou W, Yetisgen M, Afshar M, Gao Y, Savova G, Miller TA. Improving model transferability for clinical note section classification models using continued pretraining. J Am Med Inform Assoc. 2023 Dec 22;31(1):89-97. PubMed PMID: 37725927
  • Gao Y, Dligach D, Christensen L, Tesch S, Laffin R, Xu D, Miller T, Uzuner O, Churpek MM, Afshar M. A scoping review of publicly available language tasks in clinical natural language processing. J Am Med Inform Assoc. 2022 Sep 12;29(10):1797-1806. PubMed PMID: 35923088
  • Yetisgen M, Uzuner O, Gao Y, Mahajan D. Call for papers: Special issue on clinical natural language processing for secondary use applications. J Biomed Inform. 2022 Sep;133:104152. PubMed PMID: 35985622
  • Gao Y, Miller T, Xu D, Dligach D, Churpek MM, Afshar M. Summarizing Patients' Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models. Proc Int Conf Comput Ling. 2022 Oct;2022:2979-2991. PubMed PMID: 36268128
  • Gao Y, Dligach D, Miller T, Tesch S, Laffin R, Churpek MM, Afshar M. Hierarchical Annotation for Building A Suite of Clinical Natural Language Processing Tasks: Progress Note Understanding. LREC Int Conf Lang Resour Eval. 2022 Jun;2022:5484-5493. PubMed PMID: 35939277
  • Afshar M, Gao Y, Wills G, Wang J, Churpek MM, Westenberger CJ, Kunstman DT, Gordon JE, Goswami C, Liao FJ, Patterson B. Prompt engineering with a large language model to assist providers in responding to patient inquiries: a real-time implementation in the electronic health record. JAMIA Open. 2024 Oct;7(3):ooae080. https://doi.org/10.1093/jamiaopen/ooae080
  • Chen X, Huang H, Gao Y, Wang Y, Zhao J, Ding K. Learning to Maximize Mutual Information for Chain-of-Thought Distillation. Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics. 2024:6857-6868.
  • Afshar M, Gao Y, Gupta D, Croxford E, Demner-Fushman D. On the role of the UMLS in supporting diagnosis generation proposed by Large Language Models. J Biomed Inform. 2024 Aug 13:104707.