In the era of Big Data, the repository of information that exists online, and grows every day, is essential to building intelligent tools that can map and analyse the wealth of data available. Most developers rely on special software robots, called spiders or bots, to pull information from the World Wide Web. This process is termed web crawling.
A crawler, in a nutshell, is a program that visits websites and reads their pages and other information in order to create entries for a search engine or data index. All the principal search engines use such a program, which is also known as a “spider” or a “bot”. Crawlers are typically programmed to visit sites that have been submitted by their owners as new or updated. Entire sites or specific pages can be selectively visited and indexed. Crawlers apparently gained the name because they crawl through a site a page at a time, following the links to other pages on the site until all pages have been read. Under the Standard for Robot Exclusion (SRE), crawlers are supposed to ask each server which files should be excluded from being indexed. A crawler that follows the standard does not go through firewalls, and it employs a special algorithm for waiting between successive server requests so as not to affect response time for other users.
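The behaviour described above (follow links page by page, ask the server what is excluded, and wait between requests) can be illustrated with a minimal Python sketch. It is not how any particular search engine works. It assumes the exclusion rules arrive as the text of a robots.txt file, it replaces the "special algorithm" for waiting with a plain fixed delay, and the names `crawl_site`, `fetch`, `example-bot` and `example.test` are hypothetical, chosen only for illustration.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_site(start_url, fetch, robots_txt, user_agent="example-bot",
               delay=1.0, sleep=time.sleep):
    """Breadth-first crawl of a single site that honours robots.txt.

    fetch(url) -> str is a hypothetical callable supplied by the caller
    (for example a thin wrapper around an HTTP client).
    robots_txt is the text of the site's robots.txt file.
    Returns the list of URLs that were actually fetched, in order.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []

    while queue:
        url = queue.popleft()
        if not rules.can_fetch(user_agent, url):
            continue  # the site owner has excluded this path
        if visited:
            sleep(delay)  # fixed pause between successive requests (assumption)
        html = fetch(url)
        visited.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # resolve and drop fragment
            # Stay on the starting site and never queue a page twice.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing `fetch` and `sleep` in as parameters keeps the sketch free of network calls, so it can be tried against an in-memory dictionary of pages before being pointed at a real site.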
Web crawlers and similar tools have been gathering online content for a long time now. The Internet Archive, a non-profit digital library which archives historical versions of publicly accessible web pages, has used web crawling tools since the mid-1990s. Currently, web crawlers are used to aggregate online content of varying nature, often without the permission or involvement of the crawled website.
Crawling without permission has been challenged under several legal theories, the most prominent of which is copyright infringement. The courts have looked at issues like whether the copying is momentary, whether the information extracted is factual, and the effect on the market value of the copyrighted material. If we look at the limited jurisprudence available in the domain of web crawling, the fundamental question is whether the object of copyright protection, which is to secure a fair return for an author’s creative labour, is served by prohibiting the web crawling in question.
The legal understanding of web crawling is still at a nascent stage, with very limited jurisprudence or legislation on the subject. However, a holistic reading of the judgments and laws does suggest a few simple guidelines: whether the contractual provisions in place are enforceable, whether the whole process can fall within the definition of ‘fair dealing’, and the manner in which it thwarts the content creators’ interests. Another significant issue one may want to look at is whether the purported crawling is prohibited under the Computer Fraud and Abuse Act, which prohibits unauthorised access to a protected computer or server.