TY - GEN
T1 - Comparative Fine-Tuning of GPT-2 on Question Answering and Dialogue Datasets for Medical Text Generation
AU - Nhkum, Caleb
AU - Rahman, Mohammad Masudur
AU - Ahmed, Tanvir
AU - Kabir, Md Faisal
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
PY - 2026
Y1 - 2026
AB - Fine-tuning large language models (LLMs) on medical datasets presents significant opportunities for developing reliable and informative AI-driven health applications. This research investigates how dataset structure (formatted question-answer (QA) pairs versus conversational doctor-patient dialogues) influences the effectiveness of a GPT-2-based generative model. Models trained on each dataset were evaluated using established NLP metrics (BLEU, ROUGE-1, ROUGE-L, BERTScore) and qualitative evaluations covering sentiment alignment, factual consistency (assessed via natural language inference), and readability. The results indicate that the QA-trained model outperforms the dialogue-trained model in semantic accuracy and sentiment alignment, while the dialogue-trained model produced marginally more readable responses. However, both models exhibited notably low factual entailment scores, highlighting an essential area for improvement. These findings emphasize the importance of careful dataset selection and model assessment strategies in clinical NLP. They also suggest promising directions for enhancing factual accuracy, domain specificity, and explanatory capabilities in future research.
UR - https://www.scopus.com/pages/publications/105027101441
UR - https://www.scopus.com/pages/publications/105027101441#tab=citedBy
U2 - 10.1007/978-3-032-08977-9_38
DO - 10.1007/978-3-032-08977-9_38
M3 - Conference contribution
AN - SCOPUS:105027101441
SN - 9783032089762
T3 - Communications in Computer and Information Science
SP - 575
EP - 590
BT - SEET - Software Engineering for Emerging Technologies - 1st International Conference, SEET 2025, Proceedings
A2 - Hussain, Shahid
A2 - Khan, Arif Ali
A2 - Abdul Basit Ur Rahim, Muhammad
A2 - Khan, Saif Ur Rehman
PB - Springer Science and Business Media Deutschland GmbH
T2 - 1st International Conference on Software Engineering of Emerging Technologies, SEET 2025
Y2 - 11 August 2025 through 12 August 2025
ER -