SciBERTSUM: Extractive Summarization for Scientific Documents

Athar Sefid, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

3 Scopus citations

Abstract

The summarization literature focuses largely on news articles. The articles in the CNN-DailyMail dataset are relatively short, averaging about 30 sentences per document. We introduce SciBERTSUM, a summarization framework designed for long documents such as scientific papers with more than 500 sentences. SciBERTSUM extends BERTSUM to long documents by 1) adding a section embedding layer that includes section information in the sentence vector and 2) applying a sparse attention mechanism in which each sentence attends locally to nearby sentences and only a small number of sentences attend globally to all other sentences. We used slides generated by the authors of scientific papers as reference summaries, since they contain the technical details from the paper. The results show the superiority of our model in terms of ROUGE scores. (The code is available at https://github.com/atharsefid/SciBERTSUM ).
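
The sparse attention pattern described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that); it is a minimal pure-Python illustration that assumes a fixed local window and a hand-picked list of global sentences. The function name `sparse_attention_mask` and the parameters `window` and `global_indices` are hypothetical and chosen only for this example.

```python
def sparse_attention_mask(num_sentences, window, global_indices):
    """Build a boolean mask where mask[i][j] is True when sentence i
    may attend to sentence j.

    Assumptions (not taken from the paper):
    - "nearby" means at most `window` positions away;
    - the global sentences are given explicitly in `global_indices`;
    - only the direction stated in the abstract is modeled: global
      sentences attend to all sentences. Whether the other sentences
      attend back to the global ones is left out of this sketch.
    """
    global_set = set(global_indices)
    mask = []
    for i in range(num_sentences):
        row = []
        for j in range(num_sentences):
            is_local = abs(i - j) <= window   # local neighbourhood
            is_global = i in global_set       # global sentence sees everything
            row.append(is_local or is_global)
        mask.append(row)
    return mask


if __name__ == "__main__":
    mask = sparse_attention_mask(num_sentences=6, window=1, global_indices=[0])
    for row in mask:
        # Print 'x' for an allowed attention pair and '.' otherwise.
        print("".join("x" if allowed else "." for allowed in row))
```

With a fixed window, the number of allowed pairs grows roughly linearly with the number of sentences rather than quadratically, which is the usual motivation for this kind of pattern on documents with hundreds of sentences.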

Original language: English (US)
Title of host publication: Document Analysis Systems - 15th IAPR International Workshop, DAS 2022, Proceedings
Editors: Seiichi Uchida, Elisa Barney, Véronique Eglin
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 688-701
Number of pages: 14
ISBN (Print): 9783031065545
DOIs
State: Published - 2022
Event: 15th IAPR International Workshop on Document Analysis Systems, DAS 2022 - La Rochelle, France
Duration: May 22 2022 to May 25 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13237 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 15th IAPR International Workshop on Document Analysis Systems, DAS 2022
Country/Territory: France
City: La Rochelle
Period: 5/22/22 to 5/25/22

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science
