TY - GEN
T1 - DART: Open-Domain Structured Data Record to Text Generation
T2 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021
AU - Nan, Linyong
AU - Radev, Dragomir
AU - Zhang, Rui
AU - Rau, Amrit
AU - Sivaprasad, Abhinand
AU - Hsieh, Chiachun
AU - Tang, Xiangru
AU - Vyas, Aadit
AU - Verma, Neha
AU - Krishna, Pranav
AU - Liu, Yangxiaokang
AU - Irwanto, Nadia
AU - Pan, Jessica
AU - Rahman, Faiaz
AU - Zaidi, Ahmad
AU - Mutuma, Mutethia
AU - Tarabar, Yasin
AU - Gupta, Ankit
AU - Yu, Tao
AU - Tan, Yi Chern
AU - Lin, Xi Victoria
AU - Xiong, Caiming
AU - Socher, Richard
AU - Rajani, Nazneen Fatema
N1 - Publisher Copyright:
© 2021 Association for Computational Linguistics.
PY - 2021
Y1 - 2021
N2 - We present DART, an open domain structured DAta-Record-to-Text generation dataset with over 82k instances (DARTs). Data-to-text annotation can be a costly process, especially when dealing with tables, which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure for extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merges heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimal post-editing. We present a systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.
UR - http://www.scopus.com/inward/record.url?scp=85123437253&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123437253&partnerID=8YFLogxK
U2 - 10.18653/v1/2021.naacl-main.467
DO - 10.18653/v1/2021.naacl-main.467
M3 - Conference contribution
AN - SCOPUS:85123437253
T3 - NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference
SP - 432
EP - 447
BT - NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
Y2 - 6 June 2021 through 11 June 2021
ER -