Web-crawling: Legal Issues



In the era of Big Data, the repository of information which exists online and grows every day is essential to building intelligent tools that can map and analyse the wealth of data available. Most developers rely on special software robots, called spiders or bots, to pull information from the world wide web. This process is termed web-crawling.

A crawler, in a nutshell, is a program that visits web sites and reads their pages and other information in order to create entries for a search engine or data index. All the principal search engines use such a program, which is also known as a “spider” or a “bot”. Crawlers are typically programmed to visit sites that have been submitted by their owners as new or updated; entire sites or specific pages can be selectively visited and indexed. Crawlers apparently gained the name because they crawl through a site a page at a time, following the links to other pages until all pages have been read. Under the Standard for Robot Exclusion (SRE), a crawler is supposed to ask each server which files should be excluded from being indexed. A well-behaved crawler also does not go through firewalls and employs a special algorithm for waiting between successive server requests so as not to affect response time for other users.
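The two polite-crawling behaviours described above, honouring robots exclusion rules and pausing between successive requests, can be sketched using Python's standard-library `urllib.robotparser`. This is a minimal illustration, not a complete crawler; the example rules and URLs are hypothetical.

```python
import time
import urllib.robotparser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a site's robots exclusion rules before fetching a page."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the robots.txt contents
    return parser.can_fetch(user_agent, url)

class PoliteCrawler:
    """Waits between successive requests so other users' response time
    is not affected (the delay value here is an arbitrary example)."""
    def __init__(self, delay_seconds: float = 1.0):
        self.delay = delay_seconds
        self.last_request = 0.0

    def wait(self):
        # Sleep only for however much of the delay window remains.
        elapsed = time.time() - self.last_request
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request = time.time()

# Hypothetical robots.txt disallowing one directory for all user agents:
rules = "User-agent: *\nDisallow: /private/\n"
print(allowed(rules, "mybot", "http://example.com/page.html"))       # True
print(allowed(rules, "mybot", "http://example.com/private/a.html"))  # False
```

In practice, a crawler would download `robots.txt` from the target site (e.g. with `RobotFileParser.set_url` and `read`) rather than parse a hard-coded string, and would call `wait()` before each request.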

Web-crawlers and similar tools have been gathering online content for a long time now. The Internet Archive, a non-profit digital library which archives historical versions of publicly accessible web pages, has used web-crawling tools since the mid-1990s. Currently, web-crawlers are used to aggregate online content of varying nature, often without the permission or involvement of the crawled website.

The conflict with regard to web-crawlers arises between the interests of website owners, who want to protect and profit from their content, and the interests of those who seek to gather and use that content for other purposes. In the recent past, web-crawling techniques have come under a cloud of legal disputes arising from a multitude of theories, ranging from copyright infringement to breach of contract (such as website terms of use), trespass to chattels, and specific statutes prohibiting unauthorised access to a computer system or website.

The most prominent of these theories is copyright infringement. Courts have looked at issues such as whether the copying is momentary, whether the information extracted is factual, and the effect on the market value of the copyrighted material. On the limited jurisprudence available in the domain of web-crawling, the fundamental question is whether the object of copyright protection, which is to secure a fair return for an author's creative labour, would be served by prohibiting the instances of web-crawling.

Another prominent theory under which web-crawling is challenged is breach of contract with respect to the terms of use for websites. These terms often prohibit access to or use of the website by web-crawlers or similar tools, and courts have recognised causes of action for breach of contract based on the use of web-crawling or scraping tools in violation of such provisions. Such cases have often turned on the precise language of the terms of use, and it is difficult to postulate a general position of law from the case law.

The legal understanding relating to web-crawling is still at a nascent stage, with very limited jurisprudence or legislation on the subject. However, a holistic reading of the judgments and laws does suggest a few simple guidelines: the enforceability of the contractual provisions in place, whether the whole process can fall within the definition of ‘fair dealing’, and in what manner it thwarts the content creators’ interests. Another significant issue one may want to examine is whether the conduct in question is prohibited under the Computer Fraud and Abuse Act, which prohibits unauthorised access to a protected computer or server.