February 13, 2020
Gary King and Nathaniel Persily
We are excited to announce that Social Science One and Facebook have completed, and are now making available to academic researchers, one of the largest social science datasets ever constructed. We processed approximately an exabyte (a quintillion bytes, or a billion gigabytes) of raw data from the platform. The dataset itself contains more than 10 trillion numbers summarizing information about 38 million URLs shared publicly on Facebook more than 100 times worldwide between January 1, 2017 and July 31, 2019. It also includes characteristics of the URLs (such as the countries in which they were shared and whether they were fact-checked or flagged by users as hate speech) and aggregated data on the types of people who viewed, shared, liked, reacted to, shared without viewing, and otherwise interacted with these links. This dataset enables social scientists to study some of the most important questions of our time about the effects of social media on democracy and elections, with information to which they have never before had access. The full codebook for the dataset is here.
When Facebook originally agreed to make data available to academics through a structure we developed (King and Persily, 2019, GaryKing.org/partnerships) and Mark Zuckerberg testified about our idea before Congress, we thought this day would take about two months of work; it has taken twenty. Since the original Request for Proposals was announced, we have been able to approve large numbers of researchers, and we continue to do so. When this project began, we thought the political and legal aspects of our job were over, and we merely needed to identify, prepare, and document data for researchers with our Facebook counterparts. In fact, most of the last twenty months has involved negotiating with Facebook over their increasingly conservative views of privacy and the law, trying to get different groups within the company on the same page, and watching Facebook build an information security and data privacy infrastructure adequate to share data with academics.
The difficult lessons we learned in producing this dataset may be useful for other platforms, governments, and academics pursuing many types of data sharing projects. It turned out that Facebook’s legal, engineering, and data science infrastructures were not prepared for a data sharing initiative of the magnitude we jointly envisioned. Since then, it has taken dozens of employees countless hours to build all that is necessary for data sharing with independent academic researchers.
Along the way, we also facilitated access to the CrowdTangle and Ad Library APIs, which researchers are finding useful. We have also begun to develop with Facebook a survey protocol in which respondents to several prominent academic surveys will have an opportunity to give consent to enable researchers to simultaneously analyze their survey responses with information about their activities on Facebook.
The greatest barrier we have faced concerned Facebook’s interpretation of the relevant privacy restrictions contained in the General Data Protection Regulation (GDPR) from the European Union and the consent decree they operate under with the Federal Trade Commission. They sometimes take the position that those restrictions prevent researchers from analyzing individual level data, even if de-identified or aggregated. We disagree with these legal interpretations, and we think, in particular, that recent guidance from the European Data Protection Supervisor on research under the GDPR supports a more permissive interpretation with respect to academic data sharing for the public good. However, we are not the ones who have had to pay a five billion dollar fine in the wake of the Cambridge Analytica scandal, and we would not be on the hook if our legal interpretation did not win the day in court or with regulators. So while we disagree with the hard line Facebook has taken on privacy and academic data sharing, we understand the legal context in which these arguments are made. At the same time, we encourage both the European Commission and the FTC to create clearly defined safe harbors specifically for research on social media data, and even to mandate that these companies share privacy-protected data with independent academics under a broad regulatory regime aimed at transparency. Only then will the legal equities shift in favor of independent analysis of the political phenomena taking place on the platforms that govern an unprecedented amount of the world’s political communication and social interaction.
To facilitate data access while complying with Facebook’s interpretation of the applicable privacy laws, we agreed to move forward with a regime of “differential privacy” as applied to the URLs dataset. Differential privacy describes a suite of technologies that introduce statistical noise and censoring into datasets (or into results from those datasets) in order to prevent reidentification of any given individual who may be represented in the data. We had thought that aggregating data at the URL level instead of providing individual level data, vetting researchers through our process, and monitoring and auditing all analyses would provide the necessary privacy protection. But after over a year of discussion, it became clear that only applying differential privacy would be a legally acceptable path forward for Facebook. We disagreed with Facebook’s legal view, but the technology is now being used by the U.S. Census Bureau and by leading technology companies, and regulators seem to find it acceptable for sharing data in ways they would not otherwise allow. We think of differential privacy as a technological solution to a political problem, just as the organizational structure we proposed for this project is an innovation in constitutional design that solved a different political problem.
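To make the idea concrete, here is a minimal sketch of the most common differentially private primitive, the Laplace mechanism, applied to a counting query. This is our illustration of the general technique, not Facebook's actual implementation; the function name, parameters, and the example count are hypothetical.

```python
import numpy as np

def laplace_mechanism(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return an epsilon-differentially private version of true_count.

    Adding noise drawn from Laplace(0, sensitivity / epsilon) satisfies
    epsilon-differential privacy for a counting query in which any one
    person can change the count by at most `sensitivity`. Smaller epsilon
    means more noise and stronger privacy.
    """
    rng = rng if rng is not None else np.random.default_rng()
    scale = sensitivity / epsilon
    return true_count + rng.laplace(loc=0.0, scale=scale)

# Hypothetical example: release the number of shares of one URL.
noisy_shares = laplace_mechanism(true_count=1234, epsilon=0.5)
```

The noise is unbiased (mean zero), so a single released count is close to the truth on average, but any individual's contribution is hidden in the randomness.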
Recognizing that differential privacy represented a way to surmount the roadblocks to data access we had experienced, we undertook a substantial research program to build new methodologies that enable researchers to make discoveries with differentially private data, to fix the methodological challenges this technology poses for social scientists, and to reduce the risks it may pose to society at large. Differential privacy works by censoring certain values in the data and adding specially calibrated noise to statistical results or data cell values. This appropriately obscures the actions of any individual who may be in the data. However, from a statistical point of view, censoring and noise are the same as selection bias and measurement error bias, both serious statistical issues. It makes no sense to go through all this effort to provide data to researchers only to have researchers (and society at large) misled into drawing the wrong conclusions about the effects of social media on elections and democracy. We thus set out to solve these statistical problems and have now released two papers intended to help, along with open source software (see Evans and King, 2020, GaryKing.org/dp; Evans, King, Schwenzfeier, and Thakurta, 2019, GaryKing.org/dpd). Facebook is now implementing these methods in ways that will scale to the massively sized datasets we are releasing today.
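The measurement-error point can be seen in a toy simulation (our sketch, not the estimators in the papers above): a naive variance estimate computed from noise-injected counts is biased upward by exactly the known variance of the injected noise, so subtracting that known quantity removes the bias. All numbers below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" URL-level counts, which researchers never see.
true_counts = rng.poisson(lam=50, size=100_000).astype(float)

# Privacy-protecting Laplace noise with a publicly known scale b;
# its variance is 2 * b**2.
noise_scale = 2.0
noisy = true_counts + rng.laplace(0.0, noise_scale, size=true_counts.size)

naive_var = noisy.var()                            # biased upward by 2*b**2
corrected_var = naive_var - 2 * noise_scale ** 2   # moment correction
```

Because the noise distribution is public, this kind of correction is possible; uncorrected analyses would systematically overstate variation, one of the biases the released methods are designed to prevent.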
This enables the release of the URLs dataset in ways that protect individual privacy along with methods and software that protect researchers and society from biased conclusions. And our organizational design continues to protect us all, because Facebook has abided by its promise to let us approve researchers for data access without any involvement from the company or any pre-publication approval of academic research findings.
The privacy protective procedures instituted mean that researchers will not be able to learn about any individual or their actions, and small groups will also be obscured in the data, which may make certain valid research questions impossible. Most conclusions drawn will be more uncertain than if researchers had access to the original data, but the original data is not on offer. In most situations, analyzing the data we are releasing with the methods we are making available will be statistically equivalent to having a large sample rather than the entire dataset, a situation to which researchers are already accustomed.
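One way to see the "large sample" intuition is through an effective sample size calculation (our simplified illustration, assuming independent Laplace noise on each value, not the exact accounting used for this dataset): averaging n noisy values yields the same precision as averaging a somewhat smaller number of clean values.

```python
def effective_sample_size(n, data_var, noise_scale):
    """Effective sample size when averaging n values, each carrying
    independent Laplace(0, noise_scale) noise (variance 2 * noise_scale**2).

    The noisy mean has variance (data_var + noise_var) / n, which equals
    the variance of a clean mean computed over n_eff observations.
    """
    noise_var = 2.0 * noise_scale ** 2
    return n * data_var / (data_var + noise_var)

# Hypothetical: a million rows, data variance 4, noise scale 1 -> the noisy
# data carry about as much information as ~667,000 clean observations.
n_eff = effective_sample_size(1_000_000, data_var=4.0, noise_scale=1.0)
```

When the signal variance dwarfs the noise variance, as it typically does for aggregates over many users, little information is lost.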
We are hopeful that the dataset provided will enable the social science research community to gain unprecedented insights into behavior and communication on social media in ways that will benefit the public quite broadly. There are variables in this dataset, such as URL exposure data, that have never been provided by an internet company for academic research. We are confident that the researchers who have already been awarded access through our process, as well as those who now apply and gain access through our procedures at Social Science One, will publish important results, critical for understanding life, democracy, elections, and modern communication. We have recommended to Facebook that all research teams previously approved but delayed be given immediate access to this dataset. Others who have applied will receive access too. If you have a great idea for the analysis of this dataset, please see our RFP and send us a proposal.