Abstract
Social media often lets users engage in conversation with limited accountability. Users can share their opinions and ideology, especially on public content, and occasionally do so in offensive language. Such language may encourage hate crimes or cause mental harm to the targeted individuals or groups, so it is important to detect offensive comments on social media platforms. Most studies focus on offensive comments on a single platform, even though offensive language is observed across multiple platforms. In this paper, we therefore introduce and make publicly available a new dialectal Arabic news comment dataset collected from multiple social media platforms, including Twitter, Facebook, and YouTube. We follow a two-step crowd-annotator selection procedure for annotating a language that is underrepresented on a crowdsourcing platform. Furthermore, we analyze the distinctive lexical content, along with the use of emojis, in offensive comments. We train and evaluate classifiers using the annotated multi-platform dataset along with other publicly available data. Our results highlight the importance of a multi-platform dataset for (a) cross-platform, (b) cross-domain, and (c) cross-dialect generalization of classifier performance.
| Original language | English (US) |
|---|---|
| Title of host publication | LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings |
| Editors | Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis |
| Publisher | European Language Resources Association (ELRA) |
| Pages | 6203-6212 |
| Number of pages | 10 |
| ISBN (Electronic) | 9791095546344 |
| State | Published - 2020 |
| Event | 12th International Conference on Language Resources and Evaluation, LREC 2020 - Marseille, France Duration: May 11 2020 → May 16 2020 |
Publication series
| Name | LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings |
|---|---|
Conference
| Conference | 12th International Conference on Language Resources and Evaluation, LREC 2020 |
|---|---|
| Country/Territory | France |
| City | Marseille |
| Period | 5/11/20 → 5/16/20 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 16 Peace, Justice and Strong Institutions
All Science Journal Classification (ASJC) codes
- Language and Linguistics
- Education
- Library and Information Sciences
- Linguistics and Language
Fingerprint
Dive into the research topics of 'A multi-platform arabic news comment dataset for offensive language detection'. Together they form a unique fingerprint.