TY - JOUR
T1 - UE-NER-2025: A GPT-Based Approach to Multi-Lingual Named Entity Recognition on Urdu and English
T2 - IEEE Access
AU - Ahmad, Muhammad
AU - Farid, Humaira
AU - Ameer, Iqra
AU - Ullah, Fida
AU - Muzamil, Muhammad
AU - Jalal, Muhammad
AU - Hamza, Ameer
AU - Batyrshin, Ildar
AU - Sidorov, Grigori
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2025
Y1 - 2025
N2 - Named Entity Recognition (NER) is a fundamental task that identifies and classifies entities from unstructured text into predefined categories. Although textual data continues to grow and span diverse linguistic communities, NER is rarely studied as a multilingual task, particularly for low-resource languages. While many researchers have focused on named entity identification in high-resource languages, few efforts have addressed NER for the Urdu script, primarily due to a lack of resources and annotated datasets. Furthermore, previous research has mostly concentrated on monolingual techniques, leaving significant gaps in addressing multilingual challenges, especially for Urdu. To fill this gap, this study makes four key contributions. First, we created a unique multilingual dataset (UE-NER-2025) sourced from Twitter, containing 182,411 tokens and 8 annotated entity types. Second, we applied two techniques that are new to the UE-NER-2025 dataset: 1) a joint multilingual approach and 2) a joint translation-based approach. Third, we conducted 30 experiments using 5-fold cross-validation, combining traditional supervised learning with token-based feature extraction, deep learning with pre-trained word embeddings such as FastText and GloVe, and transfer learning models with contextual embeddings, to evaluate their effectiveness in enhancing NER performance for both English and Urdu, particularly under the challenges of low-resource and morphologically rich languages. Finally, we performed a statistical analysis of our top-performing models to determine whether the differences in performance were statistically significant or occurred by chance. Our transformer-based language model (XLM-RoBERTa-base) achieved strong performance compared to traditional supervised learning models, with improvements of 3.99% in the English translation-based approach, 3.72% in the multilingual approach, and 2.32% in the Urdu translation-based approach over the Random Forest (RF) baseline (Urdu = 0.927, English = 0.9258, multilingual = 0.9272).
AB - Named Entity Recognition (NER) is a fundamental task that identifies and classifies entities from unstructured text into predefined categories. Although textual data continues to grow and span diverse linguistic communities, NER is rarely studied as a multilingual task, particularly for low-resource languages. While many researchers have focused on named entity identification in high-resource languages, few efforts have addressed NER for the Urdu script, primarily due to a lack of resources and annotated datasets. Furthermore, previous research has mostly concentrated on monolingual techniques, leaving significant gaps in addressing multilingual challenges, especially for Urdu. To fill this gap, this study makes four key contributions. First, we created a unique multilingual dataset (UE-NER-2025) sourced from Twitter, containing 182,411 tokens and 8 annotated entity types. Second, we applied two techniques that are new to the UE-NER-2025 dataset: 1) a joint multilingual approach and 2) a joint translation-based approach. Third, we conducted 30 experiments using 5-fold cross-validation, combining traditional supervised learning with token-based feature extraction, deep learning with pre-trained word embeddings such as FastText and GloVe, and transfer learning models with contextual embeddings, to evaluate their effectiveness in enhancing NER performance for both English and Urdu, particularly under the challenges of low-resource and morphologically rich languages. Finally, we performed a statistical analysis of our top-performing models to determine whether the differences in performance were statistically significant or occurred by chance. Our transformer-based language model (XLM-RoBERTa-base) achieved strong performance compared to traditional supervised learning models, with improvements of 3.99% in the English translation-based approach, 3.72% in the multilingual approach, and 2.32% in the Urdu translation-based approach over the Random Forest (RF) baseline (Urdu = 0.927, English = 0.9258, multilingual = 0.9272).
UR - https://www.scopus.com/pages/publications/105008665245
UR - https://www.scopus.com/inward/citedby.url?scp=105008665245&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2025.3579289
DO - 10.1109/ACCESS.2025.3579289
M3 - Article
AN - SCOPUS:105008665245
SN - 2169-3536
VL - 13
SP - 111175
EP - 111186
JO - IEEE Access
JF - IEEE Access
ER -