Executives are worried about data collection and governance in the wake of the Facebook fallout. The uproar over data harvesting and the lack of regulation in the United States is causing many organizations to review best practices for data security and privacy. But the path toward better data security controls remains unclear.
“We don’t want to say, ‘Never combine data,’” cautioned Jay Jacobs, founding partner and security data scientist at the Cyentia Institute. “There is a great deal of power in being able to combine data sources.”
Jacobs has spent his career aggregating and analyzing data to find trends, patterns and countermeasures to aid the security community. Namely, who is attacking whom, why and how? The security data scientist co-founded the Virginia research firm with Wade Baker, a professor at Virginia Tech. Both men are highly regarded for their work on the Verizon Data Breach Investigations Report, data-driven research that analyzes anonymized breach data in an attempt to offer insights into attackers, their actions and tactics, and industry trends.
In 2014, Jacobs co-authored a book with Bob Rudis on data science and information security, titled Data-Driven Security: Analysis, Visualization and Dashboards. Here, Jacobs talks with Marcus Ranum about data-security controls, the Facebook privacy breach and who is ultimately responsible when there is “so much data flowing everyplace.”
Information security professionals have been warning for the last 20 years that privacy protections are important. Now (surprise!) everyone is suddenly discovering that maybe Facebook, Twitter and who knows what else are not the best custodians of our data. What’s going on?
Jay Jacobs: Back in the day, the discussion was more about anonymizing and controlling de-anonymization. There are a few historical instances of people combining data sets to basically de-anonymize the people in the database. It’s a huge challenge.
Coming from security, we have this idea that we can block things: When you close a port on your firewall, it’s 100% closed. But with data, we have no such controls. There are very legitimate reasons to buy data sets, and there are no data security controls that say, ‘When I give you this data, I want you to use it for this specific purpose.’ The only controls are contractual and legal — and they’re relatively effective. If you give your data to some company and say, ‘Use it for this purpose,’ and you have evidence that they don’t, you have all sorts of recourse built into the contract to go after them.
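To make the de-anonymization risk Jacobs describes concrete, here is a minimal sketch of how two data sets can be combined to re-identify people. Everything in it is hypothetical: the records, the field names and the choice of ZIP code, birth year and sex as the shared attributes are assumptions made for illustration, not details from any real incident.

```python
from typing import Dict, List, Tuple

# Hypothetical "anonymized" release: names removed, other attributes kept.
anonymized: List[dict] = [
    {"zip": "24060", "birth_year": 1975, "sex": "F", "sensitive": "record A"},
    {"zip": "24060", "birth_year": 1982, "sex": "M", "sensitive": "record B"},
    {"zip": "22030", "birth_year": 1975, "sex": "F", "sensitive": "record C"},
]

# Hypothetical second data set that carries names and the same attributes.
public: List[dict] = [
    {"name": "Alice Example", "zip": "24060", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "24060", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "zip": "22030", "birth_year": 1975, "sex": "F"},
    {"name": "Dana Example", "zip": "22030", "birth_year": 1975, "sex": "F"},
]

# Attributes present in both data sets (assumed for this sketch).
SHARED_FIELDS: Tuple[str, ...] = ("zip", "birth_year", "sex")


def key(record: dict) -> tuple:
    """Build the join key from the shared attributes."""
    return tuple(record[field] for field in SHARED_FIELDS)


def link(anonymized_rows: List[dict], named_rows: List[dict]) -> Dict[str, str]:
    """Return {name: sensitive value} for rows that match exactly one person."""
    names_by_key: Dict[tuple, List[str]] = {}
    for person in named_rows:
        names_by_key.setdefault(key(person), []).append(person["name"])

    matches: Dict[str, str] = {}
    for row in anonymized_rows:
        names = names_by_key.get(key(row), [])
        if len(names) == 1:  # a unique match means the row is re-identified
            matches[names[0]] = row["sensitive"]
    return matches


reidentified = link(anonymized, public)
```

In this toy data, two of the three "anonymous" rows map back to a single name. The third stays ambiguous only because two people happen to share the same attributes.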
Is part of what’s going on a corporate reaction to those data security controls? There are companies that are responsible for collecting other companies’ data now: The FBI has outsourced the process of collecting data under warrant or subpoena.
Jacobs: I’m sure that when they engage in that contract and give that third party directions to go harvest data, in theory they should be following legal practices and have restrictions on what they can do with that data. ‘We want you to gather this, but you are not allowed to use it for any other purpose.’ The challenge is, when you look at something like the Facebook and Cambridge Analytica situation, Facebook may not have had enforcement around that API that they were offering at the time. They had an acceptable use [policy] or terms of service, but it wasn’t being enforced or monitored.
A lot of trust was being placed in getting users to click on terms of service. There is a fairly small connection between two pieces of data — who is friends with whom — and when that small connection is exploited to build larger data sets, you get unexpected consequences. It’s like when Strava [a fitness-tracking app] released its heat-map capability, and suddenly people could locate American bases in Syria.
Jacobs: There is a good aspect to that, too. We don’t want to say, ‘Never combine data.’ There is a great deal of power in being able to combine data sources. If we can look at breach data, for example, and understand what companies are running when they get breached, we can better understand our ecosystem. There are a lot of good things we can do, but obviously there are some bad aspects, too.
When you did the Verizon Data Breach Investigations Report, did you have some kind of process you followed to carefully analyze all your data, to make sure information was not going to leak? I assume you couldn’t accidentally leak who had which breach?
Jacobs: The data we collected [was] anonymized. Often, we had no idea who the customers were ourselves. But if you’re looking at a retail company with revenues in the billions, there are only one or two companies it can be.
If we talked about a breach in 2015 with over 100 million records, there’s no way to anonymize that. We had to be careful about what we chose, how we chose it, and how we talked about it.
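The "only one or two companies it can be" problem can be checked mechanically before anything is published. The sketch below is one possible approach, not the process the report's authors used: it counts how many rows share each combination of identifying attributes and flags combinations that occur fewer than k times. The sample rows, the revenue bands and the value of k are all assumptions.

```python
from collections import Counter
from typing import List, Sequence


def risky_rows(rows: List[dict], fields: Sequence[str], k: int) -> List[dict]:
    """Return rows whose combination of `fields` appears fewer than k times."""
    def combo(row: dict) -> tuple:
        return tuple(row[f] for f in fields)

    counts = Counter(combo(row) for row in rows)
    return [row for row in rows if counts[combo(row)] < k]


def revenue_band(revenue_usd: int) -> str:
    """Coarsen an exact revenue figure into a broad band (bands are assumed)."""
    if revenue_usd >= 1_000_000_000:
        return "$1B+"
    if revenue_usd >= 100_000_000:
        return "$100M-$1B"
    return "under $100M"


# Hypothetical breach records; no real company is represented here.
rows = [
    {"industry": "retail", "revenue_usd": 5_200_000_000},
    {"industry": "retail", "revenue_usd": 722_000_000},
    {"industry": "retail", "revenue_usd": 310_000_000},
    {"industry": "finance", "revenue_usd": 150_000_000},
    {"industry": "finance", "revenue_usd": 480_000_000},
]
for row in rows:
    row["revenue_band"] = revenue_band(row["revenue_usd"])

# Exact revenue makes every row unique, so every row is risky.
risky_exact = risky_rows(rows, ("industry", "revenue_usd"), k=2)

# Banding helps, but the lone billion-dollar retailer still stands out.
risky_banded = risky_rows(rows, ("industry", "revenue_band"), k=2)
```

Coarsening the revenue figure hides most rows in a crowd, yet the single very large retailer remains identifiable. That is the case where the only safe options are to generalize further or to leave the row out.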
I guess there are a lot of ways of mapping that sort of information back. If you have a company that says it has $722 million in revenue, that’s the same number that the public relations team uses for everything. It will turn up right away in Google searches, if it’s exact enough.
Jacobs: It’s pretty trivial to search the internet and see ‘Who had a breach of 155,000 credit cards last year?’ It’s not just at the high end.
For data like the Facebook data, there’s a natural defense in depth — someone can’t scrape a website the size of Facebook. Without the API that Facebook appears to have offered, anyone trying to collect that data would have had problems of scale, and they would have looked like a denial-of-service attack.
I was talking with the security team of a hospitality industry company that is constantly being scraped by competitors who want to validate prices. To them, it is a denial-of-service attack. And the competitors are not at all concerned with behaving within their terms of service. I wonder if Facebook may have provided the API at least in part defensively, in order to keep people from trying to spider them.
The big story, as far as I can see, is that there are businesses that exist to sell information. This is not news to information security practitioners — we have been telling people to watch out for this for a long time.
Jacobs: There have got to be data security controls, and scalability is always going to be a control. Someone who wants to scrape data out of Facebook is going to hit rate limiters, and that means they’d need a large infrastructure of sock puppet accounts in order to do all of the searches.
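As an illustration of the rate limiting Jacobs mentions, here is a minimal token-bucket sketch with one bucket per account. It is not a description of how Facebook or any other service enforces limits. The class names, the rate and the capacity are assumptions, and the clock is injectable only so the behavior can be tested without waiting.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allow short bursts up to `capacity`, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        """Spend one token if available; otherwise reject the request."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerAccountLimiter:
    """Keep an independent bucket for every account identifier."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, account_id: str) -> bool:
        bucket = self.buckets.get(account_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[account_id] = bucket
        return bucket.allow()
```

Because each account gets its own bucket, a scraper's throughput grows with the number of accounts it controls. That is why a per-account limit pushes a large-scale scraper toward the sock puppet infrastructure described above.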
There are very legitimate reasons for screen scraping and writing robot clients. For example, you might be trying to get data that some government is sharing, and they only have it available in a certain way on some website. There are good practices for when you do that: You put your email address in the user-agent string, and you try to be as conscious as possible about not hammering someone’s server. There are firewalls that have pretty good rate limiters and [distributed denial-of-service] protections that kick in if you don’t play nice.
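The two good practices named here, an email address in the user-agent string and not hammering the server, can be sketched with Python's standard library. The contact address, the client name and the minimum interval between requests are placeholders chosen for illustration. The clock and sleep functions are injectable only so the pacing logic can be tested.

```python
import time
import urllib.request
from typing import Callable, Optional

CONTACT_EMAIL = "researcher@example.org"  # placeholder address
USER_AGENT = f"open-data-fetcher/0.1 (contact: {CONTACT_EMAIL})"  # assumed name


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_sec: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval_sec
        self.clock = clock
        self.sleep = sleep
        self.last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least `min_interval` has passed since the last call."""
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


def build_request(url: str) -> urllib.request.Request:
    """Attach a user-agent that tells the site operator whom to contact."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url: str, throttle: Throttle, timeout: float = 30.0) -> bytes:
    """Fetch one page, pausing first so requests stay spaced out."""
    throttle.wait()
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

A caller would create one `Throttle` per target site, for example `Throttle(2.0)`, and pass it to every `fetch` call for that site so the pacing applies across the whole crawl.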
At the hospitality company I mentioned earlier, the security team set the system up to give the scrapers, once they were detected, the wrong answers. If someone has a robot that is trying to figure out your pricing so they can undercut it, you start feeding it numbers that are slightly off.
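The interview gives no detail on how that company produced its "slightly off" numbers, so the following is only one way such a response could be sketched. The function names, the 3% ceiling and the idea of deriving the skew from a hash of the client identifier are all assumptions. Detecting the scraper in the first place is outside the scope of the sketch.

```python
import hashlib


def decoy_price(true_price_cents: int, client_id: str,
                max_skew_pct: float = 3.0) -> int:
    """Return a price skewed by at most `max_skew_pct`, stable per client."""
    seed = f"{client_id}:{true_price_cents}".encode("utf-8")
    digest = hashlib.sha256(seed).digest()
    # Map the first four bytes of the hash to a fraction in [-1, 1].
    fraction = int.from_bytes(digest[:4], "big") / 0xFFFFFFFF * 2 - 1
    skew = fraction * max_skew_pct / 100
    return round(true_price_cents * (1 + skew))


def quote(true_price_cents: int, client_id: str,
          flagged_as_scraper: bool) -> int:
    """Serve the real price to normal clients and a decoy to flagged ones."""
    if flagged_as_scraper:
        return decoy_price(true_price_cents, client_id)
    return true_price_cents
```

Deriving the skew from a hash rather than a random number means a flagged client sees the same wrong price every time it asks, which makes the substitution harder to spot by repeating a query.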
It sounds to me like it’s a hard problem you have to think about. What should organizations do about this?
Jacobs: If a company is in the data business, and Facebook is, then the onus is on them. They need to have the contracts and agreements in place and some element of making sure that things are OK with their data. That’s where this particular news story starts — there was so much data flowing everyplace that the whole thing fell down.
They are gathering personal information, and regulation eventually will begin to frame what is acceptable and what is not. The thing that was exceptional about Cambridge Analytica was that they were trying to find the vulnerabilities in the human psyche. With the amount of information that they had, they were able to find out that ‘people who like this type of article are more prone to suggestion in this direction.’ If you want to talk about regulation, the question is to what extent companies should collect that data in the first place.
At the same time, there’s a lot of good and interesting research that can come from data like that. You don’t want to stifle innovation and scientific progress, but with all of that data that Facebook has, the bad may completely outweigh the good.