Yanjun Gao, PhD

Graduate School

PhD, Pennsylvania State University - University Park Campus (2021)

Languages

English

Department

Biomedical Informatics

Professional Titles

Assistant Professor

Research Interests

My research centers on developing and evaluating foundational natural language processing (NLP) methods, particularly large language models (LLMs), to convert complex data, such as electronic health records (EHRs), into actionable insights for improving decision-making in healthcare and beyond. I explore broader questions of how both humans and machines understand and utilize language, aiming to develop systems that not only enhance decision-making across various domains but also contribute to a future where artificial intelligence (AI) is safe, trustworthy, and reliably aligned with human needs.

Publications

Gao, Yanjun, Skatje Myers, Shan Chen, Dmitriy Dligach, Timothy A. Miller, Danielle Bitterman, Matthew Churpek, and Majid Afshar. "When Raw Data Prevails: Are Large Language Model Embeddings Effective in Numerical Data Representation for Medical Machine Learning Applications?." Findings of Empirical Methods in Natural Language Processing (EMNLP 2024).
Gao Y, Mahajan D, Uzuner Ö, Yetisgen M. Clinical natural language processing for secondary uses. J Biomed Inform. 2024 Feb;150:104596. PubMed PMID: 38278312
Majid Afshar, Yanjun Gao, Graham Wills, Jason Wang, Matthew M Churpek, Christa J Westenberger, David T Kunstman, Joel E Gordon, Cherodeep Goswami, Frank J Liao, Brian Patterson, Prompt engineering with a large language model to assist providers in responding to patient inquiries: a real-time implementation in the electronic health record, JAMIA Open, Volume 7, Issue 3, October 2024, ooae080, https://doi.org/10.1093/jamiaopen/ooae080
Xin Chen, Hanxian Huang, Yanjun Gao, Yi Wang, Jishen Zhao, and Ke Ding. 2024. Learning to Maximize Mutual Information for Chain-of-Thought Distillation. In Findings of the Association for Computational Linguistics ACL 2024, pages 6857–6868, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Afshar M, Gao Y, Gupta D, Croxford E, Demner-Fushman D. On the role of the UMLS in supporting diagnosis generation proposed by Large Language Models. Journal of Biomedical Informatics. 2024 Aug 13:104707.
Afshar M, Gao Y, Wills G, Wang J, Churpek MM, Westenberger CJ, Kunstman DT, Gordon JE, Goswami C, Liao FJ, Patterson B. Prompt engineering with a large language model to assist providers in responding to patient inquiries: a real-time implementation in the electronic health record. JAMIA Open. 2024 Oct;7(3):ooae080. PubMed PMID: 39166170
Afshar M, Gao Y, Gupta D, Croxford E, Demner-Fushman D. On the role of the UMLS in supporting diagnosis generation proposed by Large Language Models. J Biomed Inform. 2024 Sep;157:104707. PubMed PMID: 39142598
Croxford E, Gao Y, Patterson B, To D, Tesch S, Dligach D, Mayampurath A, Churpek MM, Afshar M. Development of a Human Evaluation Framework and Correlation with Automated Metrics for Natural Language Generation of Medical Diagnoses. medRxiv. 2024 Apr 9. PubMed PMID: 38562730
Croxford, E., Gao, Y., Pellegrino, N. et al. Current and future state of evaluation of large language models for medical summarization tasks. npj Health Syst. 2, 6 (2025). https://doi.org/10.1038/s44401-024-00011-2
Gao Y, Myers S, Chen S, Dligach D, Miller T, Bitterman D, Churpek M, Afshar M. When Raw Data Prevails: Are Large Language Model Embeddings Effective in Numerical Data Representation for Medical Machine Learning Applications? In Findings of the Association for Computational Linguistics: EMNLP 2024. 2024 Nov (pp. 5414-5428).
Li R, Gao Y. Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions. arXiv preprint arXiv:2405.03205. 2024 May 6.
Chen X, Huang H, Gao Y, Wang Y, Zhao J, Ding K. Learning to Maximize Mutual Information for Chain-of-Thought Distillation. In Findings of the Association for Computational Linguistics: ACL 2024. 2024 Aug (pp. 6857-6868).
Zhou W, Yetisgen M, Afshar M, Gao Y, Savova G, Miller TA. Improving model transferability for clinical note section classification models using continued pretraining. Journal of the American Medical Informatics Association. 2024 Jan 1;31(1):89-97.
Eslami B, Afshar M, Tootooni MS, Miller T, Churpek M, Gao Y, Dligach D. Toward Digital Twins in the Intensive Care Unit: A Medication Management Case Study. medRxiv. 2024 Dec 28:2024-12.
Chen S, Gallifant J, Guevara M, Gao Y, Afshar M, Miller T, Dligach D, Bitterman DS. Improving Clinical NLP Performance through Language Model-Generated Synthetic Clinical Data. arXiv preprint arXiv:2403.19511. 2024 Mar 28
Gao Y, Dligach D, Miller T, Churpek MM, Uzuner O, Afshar M. Progress Note Understanding - Assessment and Plan Reasoning: Overview of the 2022 N2C2 Track 3 shared task. J Biomed Inform. 2023 Jun;142:104346. PubMed PMID: 37061012
Zhou W, Dligach D, Afshar M, Gao Y, Miller TA. Improving the Transferability of Clinical Note Section Classification Models with BERT and Large Language Model Ensembles. Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:125-130. PubMed PMID: 37786810
Gao Y, Dligach D, Miller T, Churpek MM, Afshar M. Overview of the Problem List Summarization (ProbSum) 2023 Shared Task on Summarizing Patients' Active Diagnoses and Problems from Electronic Health Record Progress Notes. Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:461-467. PubMed PMID: 37583489
Sharma B, Gao Y, Miller T, Churpek MM, Afshar M, Dligach D. Multi-Task Training with In-Domain Language Models for Diagnostic Reasoning. Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023(ClinicalNLP):78-85. PubMed PMID: 37492270
Zhou W, Yetisgen M, Afshar M, Gao Y, Savova G, Miller TA. Improving model transferability for clinical note section classification models using continued pretraining. J Am Med Inform Assoc. 2023 Dec 22;31(1):89-97. PubMed PMID: 37725927
Gao Y, Dligach D, Christensen L, Tesch S, Laffin R, Xu D, Miller T, Uzuner O, Churpek MM, Afshar M. A scoping review of publicly available language tasks in clinical natural language processing. J Am Med Inform Assoc. 2022 Sep 12;29(10):1797-1806. PubMed PMID: 35923088
Yetisgen M, Uzuner O, Gao Y, Mahajan D. Call for papers: Special issue on clinical natural language processing for secondary use applications. J Biomed Inform. 2022 Sep;133:104152. PubMed PMID: 35985622
Gao Y, Miller T, Xu D, Dligach D, Churpek MM, Afshar M. Summarizing Patients' Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models. Proc Int Conf Comput Ling. 2022 Oct;2022:2979-2991. PubMed PMID: 36268128
Gao Y, Dligach D, Miller T, Tesch S, Laffin R, Churpek MM, Afshar M. Hierarchical Annotation for Building A Suite of Clinical Natural Language Processing Tasks: Progress Note Understanding. LREC Int Conf Lang Resour Eval. 2022 Jun;2022:5484-5493. PubMed PMID: 35939277

School of Medicine

CU Anschutz

Fitzsimons Building