TY - GEN
T1 - The Causal News Corpus
T2 - 13th International Conference on Language Resources and Evaluation Conference, LREC 2022
AU - Tan, Fiona Anting
AU - Hürriyetoğlu, Ali
AU - Caselli, Tommaso
AU - Oostdijk, Nelleke
AU - Nomoto, Tadashi
AU - Hettiarachchi, Hansi
AU - Ameer, Iqra
AU - Uca, Onur
AU - Liza, Farhana Ferdousi
AU - Hu, Tiancheng
N1 - Publisher Copyright:
© European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0.
PY - 2022
Y1 - 2022
N2 - Despite the importance of understanding causality, corpora addressing causal relations are limited. There is a discrepancy between existing annotation guidelines of event causality and conventional causality corpora that focus more on linguistics. Many guidelines restrict themselves to include only explicit relations or clause-based arguments. Therefore, we propose an annotation schema for event causality that addresses these concerns. We annotated 3,559 event sentences from protest event news with labels on whether it contains causal relations or not. Our corpus is known as the Causal News Corpus (CNC). A neural network built upon a state-of-the-art pre-trained language model performed well with 81.20% F1 score on test set, and 83.46% in 5-folds cross-validation. CNC is transferable across two external corpora: CausalTimeBank (CTB) and Penn Discourse Treebank (PDTB). Leveraging each of these external datasets for training, we achieved up to approximately 64% F1 on the CNC test set without additional fine-tuning. CNC also served as an effective training and pre-training dataset for the two external corpora. Lastly, we demonstrate the difficulty of our task to the layman in a crowd-sourced annotation exercise. Our annotated corpus is publicly available, providing a valuable resource for causal text mining researchers.
AB - Despite the importance of understanding causality, corpora addressing causal relations are limited. There is a discrepancy between existing annotation guidelines of event causality and conventional causality corpora that focus more on linguistics. Many guidelines restrict themselves to include only explicit relations or clause-based arguments. Therefore, we propose an annotation schema for event causality that addresses these concerns. We annotated 3,559 event sentences from protest event news with labels on whether it contains causal relations or not. Our corpus is known as the Causal News Corpus (CNC). A neural network built upon a state-of-the-art pre-trained language model performed well with 81.20% F1 score on test set, and 83.46% in 5-folds cross-validation. CNC is transferable across two external corpora: CausalTimeBank (CTB) and Penn Discourse Treebank (PDTB). Leveraging each of these external datasets for training, we achieved up to approximately 64% F1 on the CNC test set without additional fine-tuning. CNC also served as an effective training and pre-training dataset for the two external corpora. Lastly, we demonstrate the difficulty of our task to the layman in a crowd-sourced annotation exercise. Our annotated corpus is publicly available, providing a valuable resource for causal text mining researchers.
UR - http://www.scopus.com/inward/record.url?scp=85136573341&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85136573341&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85136573341
T3 - 2022 Language Resources and Evaluation Conference, LREC 2022
SP - 2298
EP - 2310
BT - 2022 Language Resources and Evaluation Conference, LREC 2022
A2 - Calzolari, Nicoletta
A2 - Bechet, Frederic
A2 - Blache, Philippe
A2 - Choukri, Khalid
A2 - Cieri, Christopher
A2 - Declerck, Thierry
A2 - Goggi, Sara
A2 - Isahara, Hitoshi
A2 - Maegaard, Bente
A2 - Mariani, Joseph
A2 - Mazo, Helene
A2 - Odijk, Jan
A2 - Piperidis, Stelios
PB - European Language Resources Association (ELRA)
Y2 - 20 June 2022 through 25 June 2022
ER -