TY - GEN
T1 - Layout-Aware OCR for Black Digital Archives with Unsupervised Evaluation
AU - Beyene, Fitsum Sileshi
AU - Dancy, Christopher L.
N1 - Publisher Copyright:
©2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Despite their cultural and historical significance, Black digital archives continue to be a structurally underrepresented area in AI research and infrastructure. This is especially evident in efforts to digitize historical Black newspapers, where inconsistent typography, visual degradation, and limited annotated layout data hinder accurate transcription, despite the availability of various systems that claim to handle optical character recognition (OCR) well. In this short paper, we present a layout-aware OCR pipeline tailored for Black newspaper archives and introduce an unsupervised evaluation framework suited to low-resource archival contexts. Our approach integrates synthetic layout generation, model pretraining on augmented data, and a fusion of state-of-the-art You Only Look Once (YOLO) detectors. We used three annotation-free evaluation metrics, the Semantic Coherence Score (SCS), Region Entropy (RE), and Textual Redundancy Score (TRS), which quantify linguistic fluency, informational diversity, and redundancy across OCR regions. Our evaluation on a 400-page dataset from ten Black newspaper titles demonstrates that layout-aware OCR improves structural diversity and reduces redundancy compared to full-page baselines, with modest trade-offs in coherence. Our results highlight the importance of respecting cultural layout logic in AI-driven document understanding and lay the foundation for future community-driven and ethically grounded archival AI systems.
AB - Despite their cultural and historical significance, Black digital archives continue to be a structurally underrepresented area in AI research and infrastructure. This is especially evident in efforts to digitize historical Black newspapers, where inconsistent typography, visual degradation, and limited annotated layout data hinder accurate transcription, despite the availability of various systems that claim to handle optical character recognition (OCR) well. In this short paper, we present a layout-aware OCR pipeline tailored for Black newspaper archives and introduce an unsupervised evaluation framework suited to low-resource archival contexts. Our approach integrates synthetic layout generation, model pretraining on augmented data, and a fusion of state-of-the-art You Only Look Once (YOLO) detectors. We used three annotation-free evaluation metrics, the Semantic Coherence Score (SCS), Region Entropy (RE), and Textual Redundancy Score (TRS), which quantify linguistic fluency, informational diversity, and redundancy across OCR regions. Our evaluation on a 400-page dataset from ten Black newspaper titles demonstrates that layout-aware OCR improves structural diversity and reduces redundancy compared to full-page baselines, with modest trade-offs in coherence. Our results highlight the importance of respecting cultural layout logic in AI-driven document understanding and lay the foundation for future community-driven and ethically grounded archival AI systems.
UR - https://www.scopus.com/pages/publications/105030544206
UR - https://www.scopus.com/pages/publications/105030544206#tab=citedBy
U2 - 10.1109/ISTAS65609.2025.11269615
DO - 10.1109/ISTAS65609.2025.11269615
M3 - Conference contribution
AN - SCOPUS:105030544206
T3 - International Symposium on Technology and Society, Proceedings
BT - 2025 IEEE International Symposium on Technology and Society, ISTAS 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 IEEE International Symposium on Technology and Society, ISTAS 2025
Y2 - 10 September 2025 through 12 September 2025
ER -