CS Table: Privacy, Anonymity, and Big Data in the Social Sciences

On Friday, 26 September 2014, at CS Table, we will consider some recent ethical issues with the use of "Big Data" in social sciences research, including data from xMOOCs (massive open online courses). Our readings will include a short article from The Atlantic on the recent Facebook controversy and a CACM article on uses of xMOOC data.

Sara M. Watson. Data Science: What the Facebook Controversy is Really About. The Atlantic. July 1, 2014. Available online at http://www.theatlantic.com/technology/archive/2014/07/data-science-what-the-facebook-controversy-is-really-about/373770/.

Facebook has always “manipulated” the results shown in its users’ News Feeds by filtering and personalizing for relevance. But this weekend, the social giant seemed to cross a line, when it announced that it engineered emotional responses two years ago in an “emotional contagion” experiment, published in the Proceedings of the National Academy of Sciences (PNAS).

Since then, critics have examined many facets of the experiment, including its design, methodology, approval process, and ethics. Each of these tacks tacitly accepts something important, though: the validity of Facebook’s science and scholarship. There is a more fundamental question in all this: What does it mean when we call proprietary data research data science?

As a society, we haven't fully established how we ought to think about data science in practice. It's time to start hashing that out.

Jon P. Daries, Justin Reich, Jim Waldo, Elise M. Young, Jonathan Whittinghill, Andrew Dean Ho, Daniel Thomas Seaton, and Isaac Chuang. 2014. Privacy, anonymity, and big data in the social sciences. Commun. ACM 57, 9 (September 2014), 56-63. DOI=10.1145/2643132 http://doi.acm.org/10.1145/2643132.

Open data has tremendous potential for science, but, in human subjects research, there is a tension between privacy and releasing high-quality open data. Federal law governing student privacy and the release of student records suggests that anonymizing student data protects student privacy. Guided by this standard, we de-identified and released a data set from 16 MOOCs (massive open online courses) from MITx and HarvardX on the edX platform. In this article, we show that these and other de-identification procedures necessitate changes to data sets that threaten replication and extension of baseline analyses. To balance student privacy and the benefits of open data, we suggest focusing on protecting privacy without anonymizing data by instead expanding policies that compel researchers to uphold the privacy of the subjects in open data sets. If we want to have high-quality social science research and also protect the privacy of human subjects, we must eventually have trust in researchers. Otherwise, we'll always have the strict tradeoff between anonymity and science illustrated here.

Printed copies of the readings are available next to Science 3821.

