TY - GEN
T1 - TOPFORMER
T2 - 27th European Conference on Artificial Intelligence, ECAI 2024
AU - Uchendu, Adaku
AU - Le, Thai
AU - Lee, Dongwon
N1 - Publisher Copyright:
© 2024 The Authors.
PY - 2024/10/16
Y1 - 2024/10/16
N2 - Recent advances in Large Language Models (LLMs) have enabled the generation of open-ended, high-quality texts that are non-trivial to distinguish from human-written texts. We refer to such LLM-generated texts as deepfake texts. There are currently over 72K text generation models in the Hugging Face model repository. As such, users with malicious intent can easily use these open-sourced LLMs to generate harmful texts and dis/misinformation at scale. To mitigate this problem, a computational method to determine whether a given text is a deepfake text or not is desired, i.e., a Turing Test (TT). In particular, in this work, we investigate the more general version of the problem, known as Authorship Attribution (AA), in a multi-class setting, i.e., not only determining whether a given text is a deepfake text but also pinpointing which LLM is the author. We propose TOPFORMER, which improves existing AA solutions by capturing additional linguistic patterns in deepfake texts through a Topological Data Analysis (TDA) layer added to a Transformer-based model. We show the benefits of having a TDA layer when dealing with imbalanced and multi-style datasets by extracting TDA features from the reshaped pooled_output of our backbone as input. The Transformer-based model captures contextual representations (i.e., semantic and syntactic linguistic features), while TDA captures the shape and structure of the data (i.e., linguistic structures). Finally, TOPFORMER outperforms all baselines on all three datasets, achieving up to a 7% increase in Macro F1 score. Our code and datasets are available at: https://github.com/AdaUchendu/topformer.
UR - http://www.scopus.com/inward/record.url?scp=85213399323&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85213399323&partnerID=8YFLogxK
DO - 10.3233/FAIA240647
M3 - Conference contribution
AN - SCOPUS:85213399323
T3 - Frontiers in Artificial Intelligence and Applications
SP - 1446
EP - 1454
BT - ECAI 2024 - 27th European Conference on Artificial Intelligence, Including 13th Conference on Prestigious Applications of Intelligent Systems, PAIS 2024, Proceedings
A2 - Endriss, Ulle
A2 - Melo, Francisco S.
A2 - Bach, Kerstin
A2 - Bugarin-Diz, Alberto
A2 - Alonso-Moral, Jose M.
A2 - Barro, Senen
A2 - Heintz, Fredrik
PB - IOS Press BV
Y2 - 19 October 2024 through 24 October 2024
ER -