Automatic Pseudocode Extraction at Scale

Levent Toksoz, Gang Tan, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Pseudocode in a scholarly paper provides a concise way to express the algorithms implemented therein. Pseudocode can also be thought of as an intermediary representation that helps bridge the gap between programming languages and natural languages. Access to a large collection of pseudocode can provide benefits ranging from enhancing algorithmic understanding and facilitating further algorithmic design to empowering NLP or computer vision based models for tasks such as automated code generation and optical character recognition (OCR). We have created a large pseudocode collection by extracting around 320,000 pseudocode examples from arXiv papers. This process involved scanning over 2.2 million scholarly papers, 1,000 of which were manually inspected and labeled. Our approach combines an extraction mechanism tailored to optimize coverage with a validation mechanism based on random sampling that checks the accuracy and reliability of the extraction, given the inherent heterogeneity of the collection. In addition, we offer insights into common pseudocode structures, supported by clustering and statistical analyses. Notably, these analyses indicate an exponential-like growth in the usage of pseudocode, highlighting its increasing significance.
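The abstract does not describe how the random-sampling validation is computed, so the following is only an illustrative sketch of one common way to do it: draw a uniform random sample of extracted items for manual labeling, then report the observed accuracy with a Wilson score interval. The function names `sample_for_review` and `wilson_interval`, the fixed seed, and the choice of the Wilson interval are all assumptions made for illustration, not the authors' method.

```python
import math
import random


def sample_for_review(item_ids, k, seed=0):
    """Draw k distinct items uniformly at random for manual labeling.

    Hypothetical helper: the seed only makes the sample reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(list(item_ids), k)


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    Assumption: each manual label is a yes/no judgment on whether an
    extracted item really is pseudocode.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= correct <= n:
        raise ValueError("correct must lie between 0 and n")
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

As a usage example, `sample_for_review(range(320_000), 1000)` would pick 1,000 item indices to label, and `wilson_interval(correct, 1000)` would turn the number of items judged correct into an interval estimate for the accuracy over the whole collection. The interval is meaningful only if the sample is drawn uniformly from the full set of extracted items.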

Original language: English (US)
Title of host publication: Proceedings - 2024 IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 264-269
Number of pages: 6
ISBN (Electronic): 9798350351187
State: Published - 2024
Event: 25th IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2024 - San Jose, United States
Duration: Aug 7 2024 to Aug 9 2024

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition
  • Information Systems
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality
