Privacy Lost and Found: An Investigation at Scale of Web Privacy Policy Availability

Mukund Srinath, Soundarya Sundareswara, Pranav Venkit, C. Lee Giles, Shomir Wilson

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Legal jurisdictions around the world require organisations to post privacy policies on their websites. However, in spite of laws such as GDPR and CCPA reinforcing this requirement, organisations sometimes do not comply, and a variety of semi-compliant failure modes exist. To investigate the landscape of web privacy policies, we crawl the privacy policies from 7 million organisation websites with the goal of identifying when policies are unavailable. We conduct a large-scale investigation of the availability of privacy policies and identify potential reasons for unavailability, such as dead links, documents with empty content, documents that consist solely of placeholder text, and documents unavailable in the specific languages offered by their respective websites. We estimate the frequencies of these failure modes and the overall unavailability of privacy policies on the web, and find that privacy policy URLs are available on only 34% of websites. Further, 1.37% of these URLs are broken links, and 1.23% of the valid links lead to pages without a policy. To enable investigation of privacy policies at scale, we also use the capture-recapture technique to estimate the total number of English-language privacy policies on the web and the distribution of these documents across top-level domains and sectors of commerce. We estimate the lower bound on the number of English-language privacy policies to be around 3 million. Finally, we release the CoLIPPs Corpus containing around 600k policies and their metadata, consisting of policy URL, length, readability, sector of commerce, and policy crawl date.
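
The capture-recapture idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which estimator or sampling scheme the authors used, so the two-sample Lincoln-Petersen form and Chapman's correction below are assumptions, and the function names and toy URLs are hypothetical, not taken from the paper.

```python
from typing import Iterable, Tuple


def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Two-sample population estimate: N is approximately n1 * n2 / m.

    n1: number of distinct items in the first sample
    n2: number of distinct items in the second sample
    m:  number of items that appear in both samples
    """
    if m <= 0:
        raise ValueError("at least one item must appear in both samples")
    return n1 * n2 / m


def chapman(n1: int, n2: int, m: int) -> float:
    """Chapman's variant, which is less biased when the overlap is small."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


def estimate_population(sample_a: Iterable[str],
                        sample_b: Iterable[str]) -> Tuple[float, float]:
    """Estimate the population size from two independent samples.

    Returns (Lincoln-Petersen estimate, Chapman estimate).
    """
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    return (lincoln_petersen(len(a), len(b), overlap),
            chapman(len(a), len(b), overlap))


if __name__ == "__main__":
    # Hypothetical toy data: two small crawls that share two policy URLs.
    crawl_1 = ["a.example/privacy", "b.example/privacy",
               "c.example/privacy", "d.example/privacy"]
    crawl_2 = ["c.example/privacy", "d.example/privacy",
               "e.example/privacy"]
    print(estimate_population(crawl_1, crawl_2))
```

The estimate relies on the two samples being drawn independently from the same population. When that assumption is only partly met, the result is better read as a rough figure than as an exact count.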

Original language: English (US)
Title of host publication: DocEng 2023 - Proceedings of the 2023 ACM Symposium on Document Engineering
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9798400700279
State: Published - Aug 22 2023
Event: 2023 ACM Symposium on Document Engineering, DocEng 2023 - Limerick, Ireland
Duration: Aug 22 2023 to Aug 25 2023

Publication series

Name: DocEng 2023 - Proceedings of the 2023 ACM Symposium on Document Engineering

Conference

Conference: 2023 ACM Symposium on Document Engineering, DocEng 2023
Country/Territory: Ireland
City: Limerick
Period: 8/22/23 to 8/25/23

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Information Systems
  • Software
