The era of “big data” has transformed our understanding of the human world, making it possible for researchers to study billions of users at once, while at the same time making it impossible for other researchers to replicate their work. The ever-larger datasets that increasingly define modern quantitative social science are controlled by an ever-shrinking number of researchers who have exclusive access to the digital riches of our modern world. Facebook’s new academic research initiative with Social Science One was supposed to fix all of this, granting researchers across the world access to the private data of Facebook’s two billion users. The unanswered question is how initiatives like Social Science One will address the replication crisis.
The growing “replication crisis” in academia refers to the inability to reproduce the findings of published papers either due to access restrictions that prevent others from being able to examine the underlying data or due to the data and methods outlined in the paper not producing the results described by the researchers.
Solving the replication crisis requires that researchers commit to making the data that underlies their studies available to other academics where legally and ethically possible. This can be as simple as posting a giant ZIP file to their faculty website or as sophisticated as providing an advanced interactive data mart capable of customized extracts and interactive online exploration and analysis.
Yet, in spite of all of the calls for researchers to make their data more open, few scholars actually release the datasets undergirding their work, preferring to keep the data confidential so they can exclusively publish with it. Others might not want to spend the effort required to document and box up the data to make it available or may not want to risk that other researchers could spot errors in their data and call their findings into question.
Even the most prominent researchers who publicly call on others to release their data do not always do so themselves. Take a recent paper by a member of Science’s Board of Reviewing Editors who had previously criticized work from the commercial sector as “not meet[ing] emerging community standards” for its failure to provide replication datasets. When asked to provide a replication dataset for one of his own studies, or even answer a few basic methodology questions about it, he refused despite multiple requests to him and his institution. It seems even the loudest and most senior voices touting open replication datasets, and criticizing others for not providing them, are not open to providing replication data for their own studies.
Social Science One was supposed to fix all of this, creating a centralized entity that would make large Facebook datasets available to academic researchers, while also ensuring the availability of those datasets for replication analyses.
However, despite touting replication as a central tenet of its mission, the organization has released almost no detail on how precisely it intends to support that mission.
Its inaugural dataset explicitly mentions that due to legal restrictions it cannot include data from deleted user accounts. This raises the question of what happens over time as users delete their accounts. Will Social Science One regenerate its datasets on a daily basis to remove data from deleted accounts, or will Facebook treat those datasets as a special exemption from its public promises to delete all data it holds from users who delete their accounts?
Neither Social Science One nor Facebook would comment, despite repeated requests.
Given that the large coordinated bot networks used in state-sponsored disinformation campaigns are precisely the kinds of accounts that would simultaneously be of great interest to researchers and also delete their accounts in unison ahead of detection, how Social Science One handles deleted account data is far from an idle question.
Facebook’s history of creatively describing its data access partners suggests the most likely scenario: Facebook’s promise that it deletes all data from deleted user accounts extends only to Facebook itself, not to outside entities like Social Science One that it provides data to. In this way, Facebook could keep its promise to delete user data from its own site, while still allowing Social Science One to archive and republish all of the private data from those deleted accounts, preventing Facebook’s users from having any ability at all to exclude their data from being accessed by researchers across the world.
Of course, the creation and maintenance of replication data is only half of the replication equation. The far more important question is who will actually be able to access these replication datasets to attempt to reproduce and verify research that comes out of the Social Science One initiative.
All researchers wishing to use Social Science One Facebook datasets must submit proposals to specific RFPs and undergo a full peer review process to be selected and receive a research data and funding award. The initiative has touted these awards as extremely competitive, meaning that while Social Science One will increase the availability of Facebook data to researchers, the datasets will nonetheless remain off limits to the majority of the world’s researchers.
When it comes to the replication process, however, Social Science One has remained nearly entirely silent.
Replication studies tend to be extremely costly in time and resources, with little payoff for those who conduct them. Few journals will accept a paper that simply reports that it ran the same code on the same data as a previous paper and got the same results. A replication attempt that fails, and is able to prove that the original paper’s findings are wrong, is also likely to struggle to find a publication venue. At best it might lead to a retraction of the original paper after a long and toxic struggle, but it often leads to public attacks, retaliation, and even derailed careers for the replicators who dared to stand up to a luminary of the field. Just skim the pages of Retraction Watch for a depressing education on how bad the retraction process can get.
This means that replication studies are more often the domain of graduate students attempting to master a new technique they hope to apply in their own work, using an existing study as a tutorial to make sure they understand how to use the data or methods properly. Indeed, any number of recent high-profile replication failures, from inadvertent spreadsheet errors to outright fraud, have been uncovered by graduate students working on their own research.
What does this have to do with Social Science One? It means that the initiative’s replication review process must take into account the fact that many of the requests it will receive for replication access will come from graduate students and potentially even independent data researchers outside the academic community.
This creates a unique challenge in that researchers proposing original research must undergo a rigorous and highly competitive peer review process, but those requesting access to the exact same datasets to conduct replication studies must be subject to a far less rigorous screening process to ensure maximal access.
Making matters more complicated is the fact that replication studies don’t necessarily mean running the exact same code over the exact same dataset. A replication study might instead run additional statistical tests or use an entirely different technique or algorithmic workflow from the original authors to test whether it yields the same results. This means that replication proposals may themselves constitute original research that would offer what amounts to a backdoor around the peer review process for those whose work is highly similar to previous studies.
On the other hand, what happens if Social Science One adopts a replication policy that is too exclusive? If it only permits replication requests from faculty at a small set of elite schools, requires that faculty submit on behalf of their students, requires that faculty continually review and certify oversight of the replication efforts of their students or bars replication requests from non-academic researchers, that would pose grave challenges to the replication pipeline.
At present it is unclear just who precisely would even qualify to submit a replication request to be granted access to the dataset used in a study supported by Social Science One. It is also unclear what checks and balances will ensure that Social Science One does not unduly deny replication requests that might undermine the findings of high-profile studies, requests that could draw negative publicity to Social Science One and potentially endanger its funding or data partnerships.
Indeed, that paper by the member of Science’s Board of Reviewing Editors, who it turns out is also a committee member of Social Science One, drew particular attention to these exact issues in the current replication landscape. Studies published by large companies may not often come with replication datasets, but select researchers are frequently granted access to perform replication research to help the company ensure that its methods and understandings are sound. The problem is that only a very small cadre of researchers is able to access these replication datasets, rather than any researcher anywhere in the world being able to send an email and request access.
How will this play out in Social Science One? Will the initiative heavily limit who is permitted to conduct replication attempts of its supported studies, or will it permit anyone anywhere in the world to attempt to replicate any study it supports? If it imposes any kind of limit on who can conduct replication research, then how is it any different from the very corporate initiatives the academic world has condemned?
If the differential privacy, data audit logging, and other security and privacy measures Social Science One has so heavily promoted are as good as it promises, then it should have few safety concerns about granting any researcher in the world access to replicate one of its supported studies. If it is unwilling to grant open access to its datasets because of safety or security concerns, then that would suggest an undue level of trust in those researchers it does approve and would send a clear message that even it does not believe its current measures can adequately safeguard user data.
Given that Social Science One has so prominently cited replication as a central tenet of its mission, one would assume the initiative has devoted substantial time and resources to developing an initial replication framework and has already written preliminary policies governing how it reviews replication requests, even if those policies are still in draft form. At the very least, one would assume the initiative would commit to an open replication framework that maximizes the number of people permitted to conduct replication attempts of the studies it supports.
Is Social Science One willing to release a preliminary draft or sketch of its replication policies, or at least comment in general terms on how it views replication? Would it be willing to go so far as to publicly commit to guaranteeing that any researcher affiliated with an accredited academic institution will be permitted to conduct a replication attempt of any study supported by Social Science One?
Surprisingly, repeated requests for comment on these questions to both the Social Science Research Council (SSRC), which helps steward Social Science One, and its public relations agency went unanswered.
It is concerning, to say the least, that Social Science One is not willing to offer the most basic commitment to open replication of the research it supports. While publicly touting its focus on replication and claiming it will facilitate, for the first time, open replication of large social media studies, the organization is surprisingly unwilling to actually speak about the details of that commitment. After all, if replication were truly a central part of Social Science One’s focus, the organization would be only too eager to speak at length about its replication policies, rather than meet even the most basic questions about them with utter silence.
In the end, it is unfortunate but not unexpected that Social Science One, which promised to usher in a new era in research transparency, would not answer even the most basic questions regarding its stance on replication. Even while touting open replication and calling on others to release their datasets, few academics are willing to open their own data to outside inspection, including at least one of Social Science One’s own committee members. As with the rest of the academic world from which it comes, it seems Social Science One’s focus is on what it can do with all of our data, rather than how to safeguard our safety and privacy, ensure ethical compliance and meet its promises of replication and access.