UE-NER-2025: A GPT-Based Approach to Multi-Lingual Named Entity Recognition on Urdu and English

Muhammad Ahmad, Humaira Farid, Iqra Ameer, Fida Ullah, Muhammad Muzamil, Muhammad Jalal, Ameer Hamza, Ildar Batyrshin, Grigori Sidorov

Research output: Contribution to journal › Article › peer-review


Abstract

Named Entity Recognition (NER) is a fundamental task that identifies and classifies entities from unstructured text into predefined categories. Although textual data continues to grow across diverse linguistic communities, NER is rarely studied as a multilingual task, particularly for low-resource languages. While many researchers have focused on named entity identification in high-resource languages, only a few research efforts have addressed NER for the Urdu script, primarily due to a lack of resources and annotated datasets. Furthermore, previous research has mostly concentrated on monolingual techniques, leaving significant gaps in addressing multilingual challenges, especially for the Urdu language. To fill this gap, this study makes four key contributions. First, we created a unique multilingual dataset (UE-NER-2025) sourced from Twitter, which contains 182,411 tokens annotated with 8 distinct entity types. Second, we applied two techniques that are relatively new for this setting to the UE-NER-2025 dataset: 1) a joint multilingual approach and 2) a joint translation-based approach. Third, we conducted 30 different experiments using 5-fold cross-validation, combining traditional supervised learning with token-based feature extraction, deep learning with pre-trained word embeddings such as FastText and GloVe, and advanced transfer learning models using contextual embeddings, to evaluate their effectiveness in enhancing NER performance for both English and Urdu, particularly addressing the challenges of low-resource and morphologically rich languages. Finally, we performed statistical analysis on our top-performing models to determine whether the differences in performance were statistically significant or occurred by chance. Based on the analysis of the results, our transformer-based language model (XLM-RoBERTa-base) achieved strong performance compared to traditional supervised learning models.
We observed a performance improvement of 3.99% in the English translation-based approach, 3.72% in the multilingual approach, and 2.32% in the Urdu translation-based approach over traditional supervised learning (RF baseline: Urdu = 0.927, English = 0.9258, multilingual = 0.9272).
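The abstract does not state the annotation scheme for the 8 entity types, but NER corpora of this kind are commonly encoded with token-level BIO tags. As an illustration only, the sketch below converts BIO-tagged tokens into entity spans, the step an NER evaluator typically performs before scoring; all tokens and tag names here are hypothetical, not drawn from UE-NER-2025:

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token/BIO-tag lists into (entity_type, start, end) spans.

    `end` is exclusive. Assumes tags of the form "B-PER", "I-PER", "O".
    """
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:          # close any open entity
                spans.append((etype, start, i))
            start, etype = i, tag[2:]      # open a new entity
        elif tag.startswith("I-") and etype == tag[2:]:
            continue                       # still inside the current entity
        else:                              # "O", or an I- tag that doesn't match
            if start is not None:
                spans.append((etype, start, i))
            start, etype = None, None
    if start is not None:                  # entity runs to the end of the sentence
        spans.append((etype, start, len(tags)))
    return spans

# Hypothetical example: "Imran Khan visited Lahore"
tokens = ["Imran", "Khan", "visited", "Lahore"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))  # [('PER', 0, 2), ('LOC', 3, 4)]
```

Span-level extraction like this is what makes entity-level precision/recall/F1 (as opposed to per-token accuracy) computable for both English and Urdu text.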
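The statistical analysis the authors describe (testing whether model differences across folds are significant or due to chance) is commonly done with a paired t-test over per-fold scores. A minimal standard-library sketch, with invented per-fold scores that are NOT the paper's results:

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic over matched per-fold scores (e.g. from 5-fold CV)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance of diffs
    return mean / math.sqrt(var / n)

# Hypothetical per-fold scores for two models (illustrative values only)
xlmr = [0.962, 0.958, 0.965, 0.960, 0.959]
rf   = [0.928, 0.925, 0.930, 0.926, 0.927]

t = paired_t_statistic(xlmr, rf)
# With 5 folds, df = 4; |t| > 2.776 rejects equal means at the 5% level (two-tailed)
print(t, abs(t) > 2.776)
```

Pairing by fold controls for fold-to-fold difficulty, which is why it is preferred over an unpaired test when both models are evaluated on the same cross-validation splits.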

Original language: English (US)
Pages (from-to): 111175-111186
Number of pages: 12
Journal: IEEE Access
Volume: 13
DOIs
State: Published - 2025

All Science Journal Classification (ASJC) codes

  • General Computer Science
  • General Materials Science
  • General Engineering

