Web Scraping - Data Collection or Illegal Activity?

Web Scraping Defined

We've all heard the term "web scraping" but what is this thing and why should we really care about it? Web scraping refers to an application that is programmed to simulate human web surfing by accessing websites on behalf of its "user" and collecting large amounts of data that would typically be difficult for the end user to access. Web scrapers process the unstructured or semi-structured data pages of targeted websites and convert the data into a structured format. Once the data is in a structured format, the user can extract or manipulate the data with ease. Web scraping is very similar to web indexing (used by most search engines), but the end motivation is typically much different. Whereas web indexing is used to help make search engines more efficient, web scraping is typically used for different reasons like change detection, market research, data monitoring, and in some cases, theft.

Why Web Scrape?

There are lots of reasons people (or companies) want to scrape websites, and there are tons of web scraping applications available today. A quick Internet search will yield numerous web scraping tools written in just about any programming language you prefer. In today's information-hungry environment, individuals and companies alike are willing to go to great lengths to gather information about all sorts of topics. Imagine a company that would really like to gather some market research on one of their leading competitors...might they be tempted to invoke a web scraper that gathers all the information for them? Or, what if someone wanted to find a vulnerable site that allowed otherwise not-so-free downloads? Or, maybe a less than honest person might want to find a list of account numbers on a site that failed to properly secure them. The list goes on and on.

I should mention that web scraping is not always a bad thing. Some websites allow web scraping, but many do not. It's important to know what a website allows and prohibits before you scrape it.

The Problem With Web Scraping

Web scraping rides a fine line between collecting information and stealing information. Most websites have a copyright disclosure statement that legally protects their website information. It's up to the reader/user/scraper to read these disclosure statements and follow along legally and ethically. In fact, the F5.com website presents the following copyright disclosure: "All content included on this site, such as text, graphics, logos, button icons, images, audio clips, and software, including the compilation thereof (meaning the collection, arrangement, and assembly), is the property of F5 Networks, Inc., or its content and software suppliers, except as may be stated otherwise, and is protected by U.S. and international copyright laws." It goes on to say, "We reserve the right to make changes to our site and these disclaimers, terms, and conditions at any time."

So, scraper beware! There have been many court cases where web scraping turned into felony offenses. One case involved an online activist who scraped the MIT website and ultimately downloaded millions of academic articles. This guy is now free on bond, but faces dozens of years in prison and $1 million if convicted. Another case involves a real estate company who illegally scraped listings and photos from a competitor in an attempt to gain a lead in the market. Then, there's the case of a regional software company that was convicted of illegally scraping a major database company's websites in order to gain a competitive edge. The software company had to pay a $20 million fine and the guilty scraper is serving three years probation. Finally, there's the case of a medical website that hosted sensitive patient information. In this case, several patients had posted personal drug listings and other private information on closed forums located on the medical website. The website was scraped by a media-research firm, and all this information was suddenly public.

While many illegal web scrapers have been caught by the authorities, many more have never been caught and still run loose on websites around the world. As you can see, it's increasingly important to guard against this activity. After all, the information on your website belongs to you, and you don't want anyone else taking it without your permission.

The Good News

As we've noted, web scraping is a real problem for many companies today. The good news is that F5 has web scraping protection built into the Application Security Manager (ASM) of its BIG-IP product family. As you can see in the screenshot below, the ASM provides web scraping protection against bots, session opening anomalies, session transaction anomalies, and IP address whitelisting.

The bot detection works with clients that accept cookies and process JavaScript. It counts the client's page consumption speed and declares a client as a bot if a certain number of page changes happen within a given time interval. The session opening anomaly spots web scrapers that do not accept cookies or process JavaScript. It counts the number of sessions opened during a given time interval and declares the client as a scraper if the maximum threshold is exceeded. The session transaction anomaly detects valid sessions that visit the site much more than other clients. This defense is looking at a bigger picture and it blocks sessions that exceed a calculated baseline number that is derived from a current session table. The IP address whitelist allows known friendly bots and crawlers (i.e. Google, Bing, Yahoo, Ask, etc), and this list can be populated as needed to fit the needs of your organization.

I won't go into all the details here because I'll have some future articles that dive into the details of how the ASM protects against these types of web scraping capabilities. But, suffice it to say, ASM does a great job of protecting your website against the problem of web scraping.

I'm sure as you studied the screenshot above you also noticed lots of other protection capabilities the ASM provides...brute force attack prevention, customized attack signatures, Denial of Service protection, etc. You might be wondering how it does all that stuff as well. Give us a little feedback on the topics you would like to see, and we'll start posting some targeted tech tips for you!

Thanks for reading this introductory web scraping article...and, be sure to come back for the deeper look into how the ASM is configured to handle this problem. For more information, check out this video from Peter Silva where he discusses ASM botnet and web scraping defense.

Published May 16, 2013

Version 1.0