What is web scraping and why is it on the rise for social networks?

LinkedIn became the latest victim of malicious web scraping this week (pic courtesy Stanislau Palaukou/Shutterstock)

LinkedIn became the latest victim of malicious web scraping this week, when 700m user records were lifted from the professional social networking site and posted on the dark web. The incident reflects a growing trend for scraping information from social networks and e-commerce sites to use or sell, which has been enabled by the rise of advanced artificial intelligence. But web scraping has legitimate uses too, which makes it difficult for regulators to combat.

The incident is the second time LinkedIn has been targeted this year – in April, scraped data from 500m LinkedIn users was offered for sale online. Elsewhere, Facebook deemed its recent data breach, which resulted in 530m user records appearing on the dark web, to be a result of web scraping rather than a leak, and earlier this year Alibaba had more than a billion records scraped by a rival e-commerce company from its Chinese shopping portal, Taobao.

LinkedIn maintains its innocence, saying that following analysis it was confident the incident “was not a LinkedIn data breach”, instead concluding that “the dataset includes information scraped from LinkedIn and other sources.” It could be in line for further problems, as it was reported on Wednesday that another 88,000 user records, comprising information about business holders taken from their profiles, had been posted on the dark web.

What is web scraping and why is it a problem?

Web scraping, sometimes called web crawling, is the process of extracting data from websites, usually with automated tools, because that data may be of some value. Its legitimate uses include price comparison websites, which use scraping techniques to gather alternative prices for the same deal. “Data is really valuable, but for the same reason that it’s valuable in the legitimate economy, it’s also valuable for criminals,” says David Emm, lead researcher at security company Kaspersky.
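
To make the price comparison use case concrete, here is a minimal sketch in Python of what a scraper does: it reads a page's HTML and pulls out the fields it cares about. The sample markup, the class names "name" and "price", and the helper names scrape_prices and cheapest are all assumptions made up for illustration, not taken from any real site or tool. A real scraper would also need to fetch pages over the network and respect the site's terms of service.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a retailer's product listing page.
SAMPLE_PAGE = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">24.99 GBP</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">31.50 GBP</span></li>
</ul>
"""


class PriceScraper(HTMLParser):
    """Collects (name, price) pairs from spans with class "name" or "price"."""

    def __init__(self):
        super().__init__()
        self._field = None    # field held by the <span> currently open, if any
        self._current = {}    # fields gathered for the product being read
        self.products = []    # finished (name, price) tuples

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a span we are interested in.
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and "name" in self._current and "price" in self._current:
            # The price text looks like "24.99 GBP"; keep the numeric part.
            amount = float(self._current["price"].split()[0])
            self.products.append((self._current["name"], amount))
            self._current = {}


def scrape_prices(page):
    """Return a list of (name, price) tuples found in the HTML string."""
    parser = PriceScraper()
    parser.feed(page)
    parser.close()
    return parser.products


def cheapest(products):
    """Return the (name, price) tuple with the lowest price."""
    return min(products, key=lambda product: product[1])


if __name__ == "__main__":
    found = scrape_prices(SAMPLE_PAGE)
    print(found)
    print(cheapest(found))
```

Run across many retailers' pages, the same approach yields a comparison table; pointed at public user profiles instead of product listings, it yields the kind of bulk dataset described in this article.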

Indeed, the technique has caught the attention of hackers, with social networking sites and e-commerce platforms proving particularly attractive targets, as much of their user information is in the public domain. The process of scraping can include anything from manual measures such as taking a picture of the screen on a phone to using an army of bots to find and duplicate millions of records.

The latter technique is more problematic for businesses. “Somebody taking a screenshot of a website is hard to scale,” says Han Veldwijk, CEO of security company Skopos.ai. “The moment you can apply that to millions of users it’s a different problem.” Though the data available for scraping may not be valuable in isolation, it can cause potentially big problems for businesses, as threat groups can use it to mount phishing or ransomware cyberattacks based on employee information they’ve received.

Why isn’t it being legislated against?

Eradicating web scraping altogether would eliminate its legitimate uses, explains Emm. “It’s hard to see how you could abolish it without seriously harming certain parts of the economy,” he says. This, coupled with the fact that much of the data being lifted is already in the public domain, makes it very difficult to regulate properly. “It’s difficult to say ‘we could put a stop to [web scraping] completely’ any more than you could say ‘let’s abandon the sale of sharp knives because we know they’re used for criminal behaviour’,” Emm adds.

This is not to say that regulating it is impossible, but this responsibility lies with the web platforms, which do not appear to be under any pressure to instigate such regulation. “It’s not in [the platforms’] interest to regulate themselves,” says Bart Willemsen, privacy and AI analyst at Gartner. “The moment you process data, whether it’s on the dark web or in the open, if it concerns a person then that person is probably going to be impacted.”

In its statement, LinkedIn said: “Scraping data from LinkedIn is a violation of our terms of service and we are constantly working to ensure our members’ privacy is protected.”

Why is this risk increasing?

Emm says there has “definitely been a spike” in malicious scraping, and attributes this to the fact that the tools used for this sort of activity are becoming widely available and dropping in price. “Machine learning capabilities are improving all the time, and this is giving people the ability to grab data and organise it in a constructive way,” he says.

The only way to avoid such issues is to keep as little data as possible in the public domain, says Emm. But without wider changes to online behaviours, the problem is likely to persist. “We all get used to having free email and free web browsers and don’t think about the fact that actually, the price that we’re paying is the data we’re handling,” he says. “We would need to be comfortable with the fact that maybe we need to give up some of the advantages that are involved in [free] networking with people online.”