Supervised and unsupervised methods for robust separation of section titles and prose text in web documents

Abhijith Athreya Mysore Gopinath, Shomir Wilson, Norman Sadeh

Research output: Chapter in Book/Report/Conference proceedingConference contribution

16 Scopus citations

Abstract

The text in many web documents is organized into a hierarchy of section titles and corresponding prose content, a structure which provides potentially exploitable information on discourse structure and topicality. However, this organization is generally discarded during text collection, and collecting it is not straightforward: the same visual organization can be implemented in a myriad of different ways in the underlying HTML. To remedy this, we present a flexible system for automatically extracting the hierarchical section titles and prose organization of web documents irrespective of differences in HTML representation. This system uses features from syntax, semantics, discourse and markup to build two models which classify HTML text into section titles and prose text. When tested on three different domains of web text, our domain-independent system achieves an overall precision of 0.82 and a recall of 0.98. The domain-dependent variation produces very high precision (0.99) at the expense of recall (0.75). These results exhibit a robust level of accuracy suitable for enhancing question answering, information extraction, and summarization.1

Original languageEnglish (US)
Title of host publicationProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
EditorsEllen Riloff, David Chiang, Julia Hockenmaier, Jun'ichi Tsujii
PublisherAssociation for Computational Linguistics
Pages850-855
Number of pages6
ISBN (Electronic)9781948087841
StatePublished - 2018
Event2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018 - Brussels, Belgium
Duration: Oct 31 2018Nov 4 2018

Publication series

NameProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018

Conference

Conference2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
Country/TerritoryBelgium
CityBrussels
Period10/31/1811/4/18

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Fingerprint

Dive into the research topics of 'Supervised and unsupervised methods for robust separation of section titles and prose text in web documents'. Together they form a unique fingerprint.

Cite this