TY - GEN
T1 - SoAC and SoACer
T2 - 25th ACM Symposium on Document Engineering, DocEng 2025
AU - Shayesteh, Shahriar
AU - Srinath, Mukund
AU - Matheson, Lee
AU - Xian, Lu
AU - Saha, Sinjoy
AU - Giles, C. Lee
AU - Wilson, Shomir
N1 - Publisher Copyright: © 2025 ACM.
PY - 2025/8/27
Y1 - 2025/8/27
N2 - One approach to understanding the vastness and complexity of the web is to categorize websites into sectors that reflect the specific industries or domains in which they operate. However, existing website classification approaches often struggle to handle the noisy, unstructured, and lengthy nature of web content, and current datasets lack a universal sector classification labeling system specifically designed for the web. To address these issues, we introduce SoAC (Sector of Activity Corpus), a large-scale corpus comprising 195,495 websites categorized into 10 broad sectors tailored for web content, which serves as the benchmark for evaluating our proposed classification framework, SoACer (Sector of Activity Classifier). Building on this resource, SoACer is a novel end-to-end classification framework that first fetches website information, then incorporates extractive summarization to condense noisy and lengthy content into a concise representation, and finally employs large language model (LLM) embeddings (Llama3-8B) combined with a classification head to achieve accurate sectoral prediction. Through extensive experiments, including ablation studies and detailed error analysis, we demonstrate that SoACer achieves an overall accuracy of 72.6% on our proposed SoAC dataset. Our ablation study confirms that extractive summarization not only reduces computational overhead but also enhances classification performance, while our error analysis reveals meaningful sector overlaps that underscore the need for multi-label and hierarchical classification frameworks. These findings provide a robust foundation for future exploration of advanced classification techniques that better capture the complex nature of modern website content.
AB - One approach to understanding the vastness and complexity of the web is to categorize websites into sectors that reflect the specific industries or domains in which they operate. However, existing website classification approaches often struggle to handle the noisy, unstructured, and lengthy nature of web content, and current datasets lack a universal sector classification labeling system specifically designed for the web. To address these issues, we introduce SoAC (Sector of Activity Corpus), a large-scale corpus comprising 195,495 websites categorized into 10 broad sectors tailored for web content, which serves as the benchmark for evaluating our proposed classification framework, SoACer (Sector of Activity Classifier). Building on this resource, SoACer is a novel end-to-end classification framework that first fetches website information, then incorporates extractive summarization to condense noisy and lengthy content into a concise representation, and finally employs large language model (LLM) embeddings (Llama3-8B) combined with a classification head to achieve accurate sectoral prediction. Through extensive experiments, including ablation studies and detailed error analysis, we demonstrate that SoACer achieves an overall accuracy of 72.6% on our proposed SoAC dataset. Our ablation study confirms that extractive summarization not only reduces computational overhead but also enhances classification performance, while our error analysis reveals meaningful sector overlaps that underscore the need for multi-label and hierarchical classification frameworks. These findings provide a robust foundation for future exploration of advanced classification techniques that better capture the complex nature of modern website content.
UR - https://www.scopus.com/pages/publications/105015674087
UR - https://www.scopus.com/pages/publications/105015674087#tab=citedBy
U2 - 10.1145/3704268.3742691
DO - 10.1145/3704268.3742691
M3 - Conference contribution
AN - SCOPUS:105015674087
T3 - DocEng 2025 - Proceedings of the 2025 ACM Symposium on Document Engineering
BT - DocEng 2025 - Proceedings of the 2025 ACM Symposium on Document Engineering
PB - Association for Computing Machinery, Inc
Y2 - 2 September 2025 through 5 September 2025
ER -