
SoAC and SoACer: A Sector-Based Corpus and LLM-Based Framework for Sectoral Website Classification

  • Shahriar Shayesteh
  • Mukund Srinath
  • Lee Matheson
  • Lu Xian
  • Sinjoy Saha
  • C. Lee Giles
  • Shomir Wilson

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

One approach to understanding the vastness and complexity of the web is to categorize websites into sectors that reflect the specific industries or domains in which they operate. However, existing website classification approaches often struggle to handle the noisy, unstructured, and lengthy nature of web content, and current datasets lack a universal sector classification labeling system specifically designed for the web. To address these issues, we introduce SoAC (Sector of Activity Corpus), a large-scale corpus comprising 195,495 websites categorized into 10 broad sectors tailored for web content, which serves as the benchmark for evaluating our proposed classification framework, SoACer (Sector of Activity Classifier). Building on this resource, SoACer is a novel end-to-end classification framework that first fetches website information, then incorporates extractive summarization to condense noisy and lengthy content into a concise representation, and finally employs large language model (LLM) embeddings (Llama3-8B) combined with a classification head to achieve accurate sectoral prediction. Through extensive experiments, including ablation studies and detailed error analysis, we demonstrate that SoACer achieves an overall accuracy of 72.6% on our proposed SoAC dataset. Our ablation study confirms that extractive summarization not only reduces computational overhead but also enhances classification performance, while our error analysis reveals meaningful sector overlaps that underscore the need for multi-label and hierarchical classification frameworks. These findings provide a robust foundation for future exploration of advanced classification techniques that better capture the complex nature of modern website content.
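
The three stages named in the abstract (obtain page text, condense it with extractive summarization, then embed and classify) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the fetch step is omitted and the page text is passed in directly, the summarizer is a simple word-frequency heuristic, the `embed` function is a hashed bag-of-words stand-in for Llama3-8B embeddings, the classification head is an untrained linear layer with softmax, and the `sector_N` labels are hypothetical placeholders for the corpus's 10 sectors.

```python
import math
import re
import zlib
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, max_sentences=3):
    """Keep the highest-scoring sentences, preserving their original order.

    Assumed heuristic: a sentence's score is the mean document-level
    frequency of its words. The paper's summarizer may differ.
    """
    sentences = split_sentences(text)
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    freq = Counter(w for toks in tokenized for w in toks)
    scores = [
        sum(freq[w] for w in toks) / len(toks) if toks else 0.0
        for toks in tokenized
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(top[:max_sentences])
    return " ".join(sentences[i] for i in keep)


def embed(text, dim=64):
    """Stand-in embedding: L2-normalized hashed bag-of-words.

    A real system would replace this with LLM embeddings.
    """
    vec = [0.0] * dim
    for w in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(w.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def classify(embedding, weights, biases, labels):
    """Linear classification head followed by a numerically stable softmax."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


if __name__ == "__main__":
    page_text = "Cats purr. Dogs bark loudly. Cats and cats and cats nap. Birds sing."
    summary = extractive_summary(page_text, max_sentences=2)
    labels = [f"sector_{i}" for i in range(10)]  # hypothetical placeholder names
    dim = 64
    weights = [[0.0] * dim for _ in labels]      # untrained head, illustration only
    biases = [0.0] * len(labels)
    print(summary)
    print(classify(embed(summary, dim), weights, biases, labels)[0])
```

Summarizing before embedding means the embedding model sees a short input regardless of page length, which is consistent with the abstract's claim that summarization reduces computational overhead.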

Original language: English (US)
Title of host publication: DocEng 2025 - Proceedings of the 2025 ACM Symposium on Document Engineering
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9798400713514
State: Published - Aug 27 2025
Event: 25th ACM Symposium on Document Engineering, DocEng 2025 - Nottingham, United Kingdom
Duration: Sep 2 2025 to Sep 5 2025

Publication series

Name: DocEng 2025 - Proceedings of the 2025 ACM Symposium on Document Engineering

Conference

Conference: 25th ACM Symposium on Document Engineering, DocEng 2025
Country/Territory: United Kingdom
City: Nottingham
Period: 9/2/25 to 9/5/25

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Information Systems
  • Software
