TY - JOUR
T1 - Ensemble Transfer Learning on Augmented Domain Resources for Oncological Named Entity Recognition in Chinese Clinical Records
AU - Zhou, Meifeng
AU - Tan, Jindian
AU - Yang, Song
AU - Wang, Haixia
AU - Wang, Lin
AU - Xiao, Zhifeng
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2023
Y1 - 2023
N2 - Biomedical Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP) and can help mine knowledge from massive clinical and diagnostic records. However, the biomedical NER task often suffers from a low-resource training setting due to the high cost of human annotation, limiting the capability of traditional NER models. In this study, we propose a two-stage learning pipeline to tackle the oncological NER task in the Chinese language, a typical task lacking training resources. In the first stage, two base models pre-trained by Word to Vector (Word2Vec) and Bidirectional Encoder Representations from Transformers (BERT) are fine-tuned to obtain domain-specific word embeddings that serve as the input for the downstream NER task. In the second stage, we feed the word embeddings into a neural network that consists of a Bidirectional Long Short-Term Memory network (BiLSTM) and a linear-chain Conditional Random Field (CRF) for end-task training. Meanwhile, we utilize a substitution-based generative model for data augmentation (DA), aiming to enhance the quantity and diversity of the training data. Experiments show that our proposed learning pipeline demonstrates superior performance compared to other model alternatives under a low-resource setting. Specifically, results show that the proposed fine-tuning strategy, when conducted on an augmented domain resource, can effectively incorporate rich domain knowledge into the final NER model, showing great potential for boosting a model's predictive power with limited training data.
AB - Biomedical Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP) and can help mine knowledge from massive clinical and diagnostic records. However, the biomedical NER task often suffers from a low-resource training setting due to the high cost of human annotation, limiting the capability of traditional NER models. In this study, we propose a two-stage learning pipeline to tackle the oncological NER task in the Chinese language, a typical task lacking training resources. In the first stage, two base models pre-trained by Word to Vector (Word2Vec) and Bidirectional Encoder Representations from Transformers (BERT) are fine-tuned to obtain domain-specific word embeddings that serve as the input for the downstream NER task. In the second stage, we feed the word embeddings into a neural network that consists of a Bidirectional Long Short-Term Memory network (BiLSTM) and a linear-chain Conditional Random Field (CRF) for end-task training. Meanwhile, we utilize a substitution-based generative model for data augmentation (DA), aiming to enhance the quantity and diversity of the training data. Experiments show that our proposed learning pipeline demonstrates superior performance compared to other model alternatives under a low-resource setting. Specifically, results show that the proposed fine-tuning strategy, when conducted on an augmented domain resource, can effectively incorporate rich domain knowledge into the final NER model, showing great potential for boosting a model's predictive power with limited training data.
UR - http://www.scopus.com/inward/record.url?scp=85166358862&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85166358862&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2023.3299824
DO - 10.1109/ACCESS.2023.3299824
M3 - Article
AN - SCOPUS:85166358862
SN - 2169-3536
VL - 11
SP - 80416
EP - 80428
JO - IEEE Access
JF - IEEE Access
ER -