Research with Leaked Data

By Patrick L. Warren

In the past several years, there have been a number of high-profile public releases of private data, including WikiLeaks' diplomatic cables, Edward Snowden's NSA documents, emails at Sony Entertainment, the Panama Papers, and the user databases at Ashley Madison, OK Cupid, and Patreon. I am not a lawyer, so I'm not going to address the legality of releasing these data, or even the legality of using the data for research purposes (maybe some of my colleagues would like to address it?). I am also not a philosopher, so you'll need to look elsewhere for an analysis of the morality of using these data. I'm an economist, so I want to look at the economics.

This is not just a theoretical question. Leaked data have already been used successfully in a number of studies, and economists seem particularly open to their use. At SIOE, a few years back, Ekaterina Zhuravskaya presented some work based on leaked data on Russian banking transactions (see, this, also). My colleague, Sergey Mityakov, has a number of papers with Serguey Braguinsky (1,2) using leaked data on income and car registrations, also in Russia.

Scholars in some other fields have seemed less willing to work with leaked data. The Wikileaks cables, for example, have NOT been widely used to test theories of international relations (an exception). Gabriel Michael has documented this hesitance, noting "International Studies Quarterly (ISQ), a major international relations journal, has adopted a provisional policy against handling manuscripts that make use of leaked documents if such use could be construed as mishandling classified material...The trend at conferences has shifted toward less engagement. In 2012, the first year following the complete and unredacted release of WikiLeaks’ diplomatic cables, the annual meeting of the ISA hosted 16 papers mentioning leaks, although only about half of these actually dealt with the content of the leaks. By contrast, in 2013, no papers dealing with leaks were presented, and in 2014, only two were presented, and only one of these made use of leaked information."

So what's up? Let's think about costs and benefits of using these leaked data, both private and social.

We can explain a good part of the variation in the usage of leaked data by looking at the researchers' private costs and benefits. Leaked Russian administrative data are both cheap and easily accessible. The risk of prosecution seems quite low, as the data have been floating around publicly for over ten years, and (according to the authors) "Russian government officials are aware of the usage of these data by researchers and journalists and have publicly discussed policy-relevant conclusions of the analyses based on such data." Finally, there are no great formal substitutes for these data. In Europe and, increasingly, in the U.S., access to administrative data has been flourishing. But formal access to these sorts of data in the developing or transitional context is likely much more difficult. Furthermore, in contexts where the formal institutions are weak, a researcher may worry that the formal alternative has been selectively edited in biased ways.

The WikiLeaks case probably has both higher costs and lower benefits. On the cost side, there is a certain ambiguity about the penalties one might face. A State Department official was nearly fired for (among other things) linking to WikiLeaks cables. Courts have ruled that the leaked cables are still classified, so formal and informal penalties remain possible. Furthermore, the benefits may be relatively small. We already have large caches of declassified cables from the 1970s, so the WikiLeaks cables may have close substitutes for testing overarching theories of international relations. Finally, as noted above, some IR journals have explicit policies against accepting manuscripts that make use of these cables, further undermining the benefits of such research.

So, looking at the more recent batch of hacks/leaks, I think we can make some guesses about their future uses. The Panama Papers seem a lot like the Russian data to me: low risk and high benefit, relative to alternatives, so I expect that we will see them being used a lot in research. The Snowden leaks, on the other hand, seem more like the WikiLeaks cables, so I don't expect much output. The more interesting cases are the more titillating private-sector leaks, OK Cupid (which was not quite a leak) and Ashley Madison. We don't have much to go on, but I bet the Ashley Madison leak will see more use than the OK Cupid database, because there are better (licit) substitutes available in the latter case. A research partnership with any large dating site would provide a close substitute (and such partnerships have been arranged many times). In contrast, there is only one major website that caters to people looking to have affairs, and secrecy is so important that a research partnership might be difficult to arrange.

The last question is whether there are wedges between the private and social costs and benefits of using leaked data. To be honest, I have a hard time thinking of many. There are surely costs borne by people who have their privacy violated by leaks, but I don't see how an individual whose information is already leaked to the public is further hurt by researchers making use of that information, especially in a non-personalized statistical way. Similarly, there is a general concern that transparency affects ex-ante incentives to have optimal discussion/decisionmaking (see Prat (2005)), but I doubt that putting leaks to a scientific purpose has much effect above and beyond the effect of media reporting on the leaks. Finally, there is a chance that the use of illicit leaks could encourage hacking/leaking. This seems like the biggest risk, but it is limited in the context of publicly leaked information. Any given researcher has a weak incentive to fund the provision of publicly leaked information, which he or she would have to share with the rest of the research community. This free-riding effect, combined with the penalties that would surely fall on any researcher who is found to have funded a hack, makes the risk of this sort of activity seem extremely limited.

I'd love to hear others' thoughts, or a more formal analysis, of the research use of leaked data. It's an issue of growing importance, and one that would be worth understanding.