Web Scraping: Pitfalls and Proactive Best Practices

If your business relies on insights from reviews, comments, or other aggregated data, chances are you are tempted to engage in a little (or a lot of) web scraping. Web scraping is the automated extraction of data from a website for use by a party other than the site's owner; it is sometimes called "web harvesting" or "web data extraction." The scraped information offers a snapshot of who and what users like and dislike, where users are densely concentrated and where they are scarce, and many other data points.

 

Many websites that act as platforms for users to express their preferences, especially those that require users to create an account and operate through it (Yelp, TripAdvisor, Facebook, and LinkedIn, to name a few), do not permit scraping. Their terms of use or terms of service typically include clauses forbidding robots, crawlers, and other automated tools from extracting data from their pages.

 

Scraping raises several legal and ethical issues that you may confront if you hope to leverage insights gained from scraped, aggregated data. The most common areas of risk and exposure are:

 

  • COPYRIGHT: aside from how the "fair use" standard applies to scraped data, more traditional copyright issues arise in photos, videos, and other visual content.

 

  • TRADEMARK: logos, brand names, company names, taglines, and even certain sounds can all be subject to trademark protection, so take care to display them properly, without alteration or infringement.

 

  • COMPUTER FRAUD AND ABUSE ACT: sharing private business or personal information could expose you to claims under the Computer Fraud and Abuse Act. Invest in protocols that prevent data breaches and mitigate their damage as much as possible. Complying with international privacy regulations through your privacy policy and terms of service, and working with certifying authorities (e.g., TRUSTe), are effective ways to bolster your privacy practices and help shield you from liability arising from collected scraped data.

 


Try to avoid scraping if you can: the law here is treacherous, and on nearly every issue courts have come down on both sides, sometimes splitting the difference down the middle. There have been several high-profile cases against scrapers initiated by the likes of eBay and Facebook, and although the law is still very much grey, web scraping could expose you to both civil and criminal claims. The outcomes of several key cases, together with current terms of service, suggest the following "best practices":

 

  • The legality of scraping publicly available personal information is still open to interpretation; avoid it if possible.

 

  • A content creator’s content/data/information is often protected by copyright. It is a well-established tenet of copyright law that facts themselves cannot be copyrighted, but a particular selection and arrangement of facts can be, so scraping can still expose the scraper to infringement claims. Scrapers may want to rework scraped information as much as possible to reduce the risk of copyright infringement claims.

 

  • In copyright law, “fair use” permits third parties to use copyright-protected works in limited ways. The more analysis you perform on the scraped data, the more your use tends toward fair use; the more you merely “excerpt” it (using snippets of the data as is, without really changing anything), the more likely you are to open yourself up to copyright infringement. Creating your own classifications and fields, and reorganizing the scraped data into a structure different from the one the robot delivered, pushes your use toward the “transformative,” giving you a stronger claim of fair use under copyright law (a rough illustration appears after this list).

 

  • Scraping data, even when allowed, should not burden the originating site’s functionality; see the rate-limiting sketch after this list.

 

  • Use information provided freely through the official API of each platform whose data you want. Pulling data through the API does not fall under the “scraping” prohibitions in the platforms’ terms; instead, you are free to use the information the API provides, subject to whatever limitations, restrictions, and terms are attached to the API itself (a sketch follows below).
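
To make the “transformative” point above concrete, here is a minimal sketch, in Python, of summarizing scraped review text into derived statistics rather than republishing excerpts. The records, field names, and sample data are hypothetical, and nothing here is legal advice; it only illustrates the difference between aggregating and excerpting.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical records, shaped the way a scraper might have collected them.
raw_reviews = [
    {"business": "Cafe Aurora", "rating": 4, "text": "Great espresso, slow service."},
    {"business": "Cafe Aurora", "rating": 2, "text": "Slow service, cold food."},
    {"business": "Harbor Grill", "rating": 5, "text": "Fantastic view and friendly staff."},
]

def summarize(reviews):
    """Reduce raw review text to aggregate statistics (counts, averages,
    frequently mentioned terms) instead of storing or displaying the
    original wording."""
    by_business = defaultdict(list)
    for r in reviews:
        by_business[r["business"]].append(r)

    summary = {}
    for name, items in by_business.items():
        words = Counter(
            w.strip(".,").lower() for r in items for w in r["text"].split()
        )
        summary[name] = {
            "review_count": len(items),
            "average_rating": round(mean(r["rating"] for r in items), 2),
            "common_terms": [w for w, _ in words.most_common(3)],
        }
    return summary

print(summarize(raw_reviews))
```

The derived numbers and term frequencies are your own work product; the original sentences never leave the pipeline.

On not burdening the originating site: assuming you have determined that scraping a given site is permitted at all, a fetcher can at minimum honor robots.txt, identify itself, and pause between requests. The sketch below uses Python’s standard urllib.robotparser together with the third-party requests library; the URL, user-agent string, and delay are placeholder assumptions, not recommendations for any particular site.

```python
import time
import urllib.robotparser
import requests

USER_AGENT = "example-research-bot/0.1 (contact@example.com)"  # placeholder identity
DELAY_SECONDS = 5  # conservative pause between requests

def polite_get(url, robots):
    """Fetch a page only if robots.txt allows it, and rate-limit the request."""
    if not robots.can_fetch(USER_AGENT, url):
        raise PermissionError(f"robots.txt disallows fetching {url}")
    time.sleep(DELAY_SECONDS)  # spread requests out instead of hammering the site
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)

robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()
response = polite_get("https://example.com/some/page", robots)
print(response.status_code)
```

And on preferring official APIs: the sketch below shows the general shape of an authenticated API request. The base URL, endpoint path, parameters, and token are hypothetical placeholders; the real values come from each platform’s own API documentation, and the data you receive remains subject to that API’s terms and rate limits.

```python
import requests

API_BASE = "https://api.example-platform.com/v1"  # placeholder base URL
API_TOKEN = "YOUR_API_TOKEN"                      # token issued by the platform

def fetch_reviews(business_id, limit=20):
    """Request review data from the platform's documented API, authenticated
    with the platform-issued token, instead of parsing HTML pages."""
    response = requests.get(
        f"{API_BASE}/businesses/{business_id}/reviews",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        params={"limit": limit},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

# Example call (only works against a real, documented endpoint):
# reviews = fetch_reviews("some-business-id")
```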
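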
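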