The Webwise system observes the rules that a website sets for the Googlebot, Slurp (the Yahoo! agent) and "*" (any robot) user agents. Where a website’s robots.txt file disallows a URL for any of these user agents, Webwise will not profile that URL. As an example, the following robots.txt text will prevent profiling of all pages on a site:
    user-agent: *
    disallow: /
The following example will prevent profiling of the contents of a directory named "images":
    user-agent: Slurp
    disallow: /images
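The rule described above can be sketched in Python. This is only an illustration, not Webwise's actual code: the `rules` mapping, the function name `is_profiling_allowed` and the use of simple path-prefix matching are all assumptions made for the example.

```python
from typing import Dict, List

# User agents whose rules Webwise observes, per the text above.
OBSERVED_AGENTS = ("googlebot", "slurp", "*")


def is_profiling_allowed(rules: Dict[str, List[str]], path: str) -> bool:
    """Return False if any observed user agent disallows `path`.

    `rules` is a hypothetical, already-parsed form of robots.txt: it maps a
    lower-cased user-agent name to its list of disallowed path prefixes.
    Prefix matching is an assumption of this sketch.
    """
    for agent in OBSERVED_AGENTS:
        for prefix in rules.get(agent, []):
            # An empty disallow value conventionally disallows nothing.
            if prefix and path.startswith(prefix):
                return False
    return True


site_wide = {"*": ["/"]}              # first example above
images_only = {"slurp": ["/images"]}  # second example above

print(is_profiling_allowed(site_wide, "/index.html"))      # False
print(is_profiling_allowed(images_only, "/images/a.png"))  # False
print(is_profiling_allowed(images_only, "/about.html"))    # True
```

Note that a rule addressed only to Slurp is enough to block profiling here, because a disallow for any one of the three observed agents counts.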
The system requests the robots.txt file from the root of the host, e.g. www.domain.com/robots.txt, and follows up to 5 redirects when doing so. The URL will be marked as allowed for profiling if no robots.txt file is returned, if an HTTP error is returned, if the returned file is not in a single-byte encoding (ASCII or ISO-8859-x), or if the file is larger than 50 KB.
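The fallback conditions above can be sketched as a small decision function. This is a hedged illustration under stated assumptions: `FetchResult`, `robots_url`, `looks_single_byte` and `allowed_by_default` are hypothetical names, "50 KB" is taken to mean 50 * 1024 bytes, any final status other than 200 is treated as "no file or an HTTP error", and the encoding check is a crude heuristic because the text does not say how the encoding is detected. Following redirects is left to whatever HTTP client performs the fetch.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

MAX_SIZE_BYTES = 50 * 1024  # assumption: 1 KB = 1024 bytes


@dataclass
class FetchResult:
    """Hypothetical summary of a robots.txt fetch, after any redirects."""
    status: Optional[int]  # final HTTP status, or None if no response arrived
    body: bytes = b""


def robots_url(page_url: str) -> str:
    """Build the robots.txt URL at the root of the page's host."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def looks_single_byte(body: bytes) -> bool:
    """Crude heuristic (an assumption): NUL bytes suggest UTF-16 or UTF-32."""
    return b"\x00" not in body


def allowed_by_default(result: FetchResult) -> bool:
    """Return True when the fetch outcome alone marks the URL as allowed."""
    if result.status != 200:  # covers None, 404, 500 and similar
        return True
    if len(result.body) > MAX_SIZE_BYTES:
        return True
    if not looks_single_byte(result.body):
        return True
    return False  # the file is usable, so its rules decide


print(robots_url("http://www.domain.com/shop/item.html?id=7"))
# http://www.domain.com/robots.txt
```

When `allowed_by_default` returns False, the file still has to be parsed and its rules applied before a URL is profiled.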
Website owners should note the following aspects of the Webwise system’s interpretation of robots.txt files:
- Malformed robots.txt files will result in the URL being disallowed for profiling.
- Any of the well-established line terminators is interpreted as a newline, i.e. DOS (CRLF), UNIX (LF) and old-style Mac OS (CR). Consecutive line terminators are collapsed, so blank lines are ignored.
- URL-encoded (percent-encoded) paths are decoded and then handled as normal.
- Mixed capitalisation within the robots.txt file is converted to lower case before processing.
- The system does not support Google extensions to the robots.txt standard.
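The interpretation rules listed above can be sketched as a small parser. This is one possible reading, not Webwise's implementation: `parse_robots` and `MalformedRobotsError` are hypothetical names, "malformed" is assumed to mean a line without a `field: value` shape or a disallow line with no preceding user-agent, `#` comments are stripped as in the usual robots.txt convention, and unsupported fields are simply skipped because the text does not say how they are treated.

```python
import re
from typing import Dict, List
from urllib.parse import unquote


class MalformedRobotsError(ValueError):
    """Hypothetical error; a caller would treat it as 'disallowed'."""


def parse_robots(text: str) -> Dict[str, List[str]]:
    """Map each lower-cased user agent to its decoded disallow prefixes."""
    rules: Dict[str, List[str]] = {}
    current_agents: List[str] = []
    last_was_agent = False
    # Accept DOS (CRLF), UNIX (LF) and old Mac OS (CR) line endings.
    for raw in re.split(r"\r\n|\r|\n", text):
        # Drop comments, surrounding whitespace and capitalisation.
        line = raw.split("#", 1)[0].strip().lower()
        if not line:
            continue  # blank lines from repeated terminators are ignored
        if ":" not in line:
            raise MalformedRobotsError(raw)
        field, value = (part.strip() for part in line.split(":", 1))
        if field == "user-agent":
            if not last_was_agent:
                current_agents = []  # a new group of agents starts here
            current_agents.append(value)
            rules.setdefault(value, [])
            last_was_agent = True
        elif field == "disallow":
            if not current_agents:
                raise MalformedRobotsError(raw)
            for agent in current_agents:
                rules[agent].append(unquote(value))  # decode %xx escapes
            last_was_agent = False
        else:
            last_was_agent = False  # unsupported field: skipped (assumption)
    return rules


sample = "User-Agent: *\r\n\r\nDisallow: /Private%20Files\rUSER-AGENT: Slurp\nDisallow: /images\n"
print(parse_robots(sample))
# {'*': ['/private files'], 'slurp': ['/images']}
```

Because the whole line is lower-cased, a caller comparing URL paths against these prefixes would need to lower-case the paths as well; whether Webwise does so is not stated in the text.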

