Monday, 27 April 2009

Excluding Webwise Using Robots.txt


How are robots.txt files handled by Webwise?

The Webwise system observes the rules that a website sets for the Googlebot, Slurp (Yahoo! agent) and "*" (any robot) user agents. Where a website’s robots.txt file disallows any of these user agents, Webwise will not profile the relevant URL. As an example, the following robots.txt text will prevent profiling of all pages on a site:

user-agent: *
disallow: /

The following example will prevent profiling of URLs in a directory named "images":

user-agent: Slurp
disallow: /images
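The rule described above (a URL is not profiled if it is disallowed for Googlebot, Slurp or "*") can be sketched in Python. This is only an illustration: it uses the standard library's `urllib.robotparser`, whose matching behaviour is not necessarily identical to Webwise's own parser, and the names `may_profile` and `OBSERVED_AGENTS` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The three user agents the text says Webwise observes.
OBSERVED_AGENTS = ("Googlebot", "Slurp", "*")


def may_profile(robots_txt: str, url: str) -> bool:
    """Return False if the rules disallow the URL for any observed agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return all(parser.can_fetch(agent, url) for agent in OBSERVED_AGENTS)


# The two example files from the text, one directive per line.
BLOCK_ALL = "user-agent: *\ndisallow: /"
BLOCK_IMAGES = "user-agent: Slurp\ndisallow: /images"

print(may_profile(BLOCK_ALL, "http://www.domain.com/index.html"))        # False
print(may_profile(BLOCK_IMAGES, "http://www.domain.com/images/logo.png"))  # False
print(may_profile(BLOCK_IMAGES, "http://www.domain.com/index.html"))     # True
```

Note that the second file names only Slurp, yet under the "any of these user agents" rule that is enough to stop the images directory being profiled.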

The system will request the robots.txt file from the root of the host, e.g. www.domain.com/robots.txt, and will follow up to 5 redirects when doing so. The URL will be marked as allowed for profiling if no robots.txt file is returned, if an HTTP error is returned, if the returned file is not in single-byte ASCII (ISO-8859-x) format, or if the file is larger than 50 Kbytes.
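The fetch conditions above can be written down as a small decision function. The sketch below does no network access; it only decides whether an already-fetched response would count as a usable robots.txt file. Several details are assumptions rather than anything Webwise documents: "50Kbytes" is read as 50 * 1024 bytes, any status other than 200 is treated as "no file or an HTTP error", exceeding 5 redirects is treated as a failed fetch, and the single-byte check is a rough heuristic (no NUL bytes, no UTF-16/32 byte-order mark). All function names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit

MAX_REDIRECTS = 5           # stated in the text
MAX_SIZE_BYTES = 50 * 1024  # assumption: "50Kbytes" read as 50 * 1024 bytes


def robots_url(page_url: str) -> str:
    """Build the robots.txt URL at the root of the page's host."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def looks_single_byte(body: bytes) -> bool:
    """Heuristic (assumption): reject UTF-16/32 BOMs and NUL bytes."""
    has_wide_bom = body.startswith((b"\xff\xfe", b"\xfe\xff"))
    return not has_wide_bom and b"\x00" not in body


def robots_file_is_usable(status: Optional[int], body: bytes, redirects: int) -> bool:
    """Return True if the fetched file should be parsed at all.

    When this returns False, the text says the URL is marked as
    allowed for profiling.
    """
    if status != 200:              # assumption: covers "no file" and HTTP errors
        return False
    if redirects > MAX_REDIRECTS:  # assumption: too many redirects = failed fetch
        return False
    if len(body) > MAX_SIZE_BYTES:
        return False
    return looks_single_byte(body)
```

For instance, `robots_url("http://www.domain.com/a/b.html")` gives `http://www.domain.com/robots.txt`, and a 404 response makes `robots_file_is_usable` return `False`, which under the rules above means the page is allowed for profiling.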

Website owners should note the following aspects of the Webwise system’s interpretation of robots.txt files:
  • Malformed robots.txt files will result in the URL being disallowed for profiling.
  • Any of the well-established line-termination tokens is interpreted as a newline, i.e. DOS (CRLF), UNIX (LF) and old-style Mac OS (CR) line endings. Blank lines produced by consecutive line endings are ignored.
  • Web-encoded URLs are decoded and handled as normal.
  • Mixed-case text within the robots.txt file is converted to lower case before processing.
  • The system does not support Google extensions to the robots.txt standard.
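Three of the points in the list above (line endings, web-encoded URLs and capitalisation) amount to a normalisation pass over the file. A minimal sketch of such a pass follows. It is not Webwise's actual parser: the function name `normalise_rules` is hypothetical, stripping "#" comments follows the general robots.txt convention rather than anything stated here, and treating a line without a colon as malformed is an assumption.

```python
import re
from urllib.parse import unquote


def normalise_rules(text: str) -> list:
    """Return (field, value) pairs after normalising a robots.txt file."""
    rules = []
    # DOS (CRLF), old-style Mac OS (CR) and UNIX (LF) endings all end a line.
    for raw in re.split(r"\r\n|\r|\n", text):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # blank lines from repeated line endings are ignored
        field, sep, value = line.partition(":")
        if not sep:
            # Assumption: a directive without a colon counts as malformed.
            raise ValueError(f"malformed line: {raw!r}")
        # Decode web-encoded characters, then fold everything to lower case.
        rules.append((field.strip().lower(), unquote(value.strip()).lower()))
    return rules


sample = "User-Agent: Slurp\r\nDisallow: /Images%20Dir\r\r\n"
print(normalise_rules(sample))
# [('user-agent', 'slurp'), ('disallow', '/images dir')]
```

A caller following the first bullet would treat the `ValueError` as a malformed file and disallow the URL for profiling.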


phormfree. Copyright 2008 All Rights Reserved