The creation and analysis of a Website privacy policy corpus

Shomir Wilson, Florian Schaub, Aswarth Abhilash Dara, Frederick Liu, Sushain Cherivirala, Pedro Giovanni Leon, Mads Schaarup Andersen, Sebastian Zimmeck, Kanthashree Mysore Sathyendra, N. Cameron Russell, Thomas B. Norton, Eduard Hovy, Joel Reidenberg, Norman Sadeh

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

239 Scopus citations

Abstract

Website privacy policies are often ignored by Internet users, because these documents tend to be long and difficult to understand. However, the significance of privacy policies greatly exceeds the attention paid to them: these documents are binding legal agreements between website operators and their users, and their opaqueness is a challenge not only to Internet users but also to policy regulators. One proposed alternative to the status quo is to automate or semi-automate the extraction of salient details from privacy policy text, using a combination of crowdsourcing, natural language processing, and machine learning. However, there has been a relative dearth of datasets appropriate for identifying data practices in privacy policies. To remedy this problem, we introduce a corpus of 115 privacy policies (267K words) with manual annotations for 23K fine-grained data practices. We describe the process of using skilled annotators and a purpose-built annotation tool to produce the data. We provide findings based on a census of the annotations and show results toward automating the annotation procedure. Finally, we describe challenges and opportunities for the research community to use this corpus to advance research in both privacy and language technologies.

Original language: English (US)
Title of host publication: 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Long Papers
Publisher: Association for Computational Linguistics (ACL)
Pages: 1330-1340
Number of pages: 11
ISBN (Electronic): 9781510827585
State: Published - 2016
Event: 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Berlin, Germany
Duration: Aug 7 2016 to Aug 12 2016

Publication series

Name: 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Long Papers
Volume: 3

Other

Other: 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016
Country/Territory: Germany
City: Berlin
Period: 8/7/16 to 8/12/16

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Linguistics and Language
