TY - GEN
T1 - The creation and analysis of a website privacy policy corpus
AU - Wilson, Shomir
AU - Schaub, Florian
AU - Dara, Aswarth Abhilash
AU - Liu, Frederick
AU - Cherivirala, Sushain
AU - Leon, Pedro Giovanni
AU - Andersen, Mads Schaarup
AU - Zimmeck, Sebastian
AU - Sathyendra, Kanthashree Mysore
AU - Russell, N. Cameron
AU - Norton, Thomas B.
AU - Hovy, Eduard
AU - Reidenberg, Joel
AU - Sadeh, Norman
N1 - Funding Information:
This work is funded by the National Science Foundation under grants CNS-1330596 and CNS-1330214. The authors would like to acknowledge the law students at Fordham University and the University of Pittsburgh who worked as annotators to make this corpus possible. The authors also wish to acknowledge all members of the Usable Privacy Policy Project (www.usableprivacy.org) for their contributions.
PY - 2016
Y1 - 2016
AB - Website privacy policies are often ignored by Internet users, because these documents tend to be long and difficult to understand. However, the significance of privacy policies greatly exceeds the attention paid to them: these documents are binding legal agreements between website operators and their users, and their opaqueness is a challenge not only to Internet users but also to policy regulators. One proposed alternative to the status quo is to automate or semi-automate the extraction of salient details from privacy policy text, using a combination of crowdsourcing, natural language processing, and machine learning. However, there has been a relative dearth of datasets appropriate for identifying data practices in privacy policies. To remedy this problem, we introduce a corpus of 115 privacy policies (267K words) with manual annotations for 23K fine-grained data practices. We describe the process of using skilled annotators and a purpose-built annotation tool to produce the data. We provide findings based on a census of the annotations and show results toward automating the annotation procedure. Finally, we describe challenges and opportunities for the research community to use this corpus to advance research in both privacy and language technologies.
UR - http://www.scopus.com/inward/record.url?scp=85011915291&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85011915291&partnerID=8YFLogxK
U2 - 10.18653/v1/p16-1126
DO - 10.18653/v1/p16-1126
M3 - Conference contribution
AN - SCOPUS:85011915291
T3 - 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Long Papers
SP - 1330
EP - 1340
BT - 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Long Papers
PB - Association for Computational Linguistics (ACL)
T2 - 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016
Y2 - 7 August 2016 through 12 August 2016
ER -