xsmeral.semnet.crawler.util
Class RobotsPolicy

java.lang.Object
  extended by xsmeral.semnet.crawler.util.RobotsPolicy

public class RobotsPolicy
extends Object

Represents the site crawling policy defined by the Robots Exclusion Protocol for one host. Provides methods for checking URIs against the policy.
This implementation supports the non-standard but widely used extensions Allow, Crawl-delay, and wildcards in URIs.
The parser is lenient, ignoring non-matching lines and unknown fields.
A more specific rule overrides a less specific one (if a rule exists for a specific user agent, it overrides the rule for *).
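
For illustration, a minimal usage sketch based on the API documented below; the host URL, user agent name, and URI are hypothetical:

    import java.net.URL;
    import xsmeral.semnet.crawler.util.RobotsPolicy;

    public class RobotsPolicyDemo {
        public static void main(String[] args) throws Exception {
            // The policy is loaded from http://example.org/robots.txt on construction
            RobotsPolicy policy = new RobotsPolicy(new URL("http://example.org/"), "MyCrawler");

            boolean ok = policy.allows("/articles/2010/"); // check a relative URI
            float delaySec = policy.getCrawlDelay();       // advertised Crawl-delay, in seconds
        }
    }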


Constructor Summary
RobotsPolicy(URL host, String userAgent)
          Calls load for the specified host and user agent.
 
Method Summary
 boolean allows(String relativeUri)
          Checks whether the given relative URI is allowed by this host's robots policy.
 boolean allowsAll()
          Checks whether this policy allows all URLs for this user agent.
 boolean disallows(String relativeUri)
          Complementary to allows.
 float getCrawlDelay()
          Returns the crawl delay in seconds.
 int getCrawlDelayMillis()
          Returns the crawl delay in milliseconds.
 void load(URL host)
          Tries to load robots.txt at the specified host.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

RobotsPolicy

public RobotsPolicy(URL host,
                    String userAgent)
Calls load for the specified host and user agent.

Parameters:
host - The host to get the policy for
userAgent - The user agent whose rules are looked up
Method Detail

load

public final void load(URL host)
Tries to load robots.txt at the specified host.
If the file doesn't exist, is empty, or is otherwise malformed, the policy is considered to allow all URLs for all user agents.

Parameters:
host - The host to load the policy from
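
A sketch of the documented fallback behavior (host name hypothetical): when no usable robots.txt is found, the policy ends up in an allow-all state:

    URL host = new URL("http://no-robots.example.org/");
    RobotsPolicy policy = new RobotsPolicy(host, "MyCrawler");
    policy.load(host); // re-reads robots.txt; on failure the policy allows everything
    if (policy.allowsAll()) {
        // either no usable robots.txt, or it imposes no restrictions for this user agent
    }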

allows

public boolean allows(String relativeUri)
Checks whether the given relative URI is allowed by this host's robots policy.

Parameters:
relativeUri - A URI relative to the host
Returns:
true if the URI is allowed by this host's robots policy
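
For illustration, given a hypothetical robots.txt such as:

    User-agent: *
    Disallow: /private/
    Allow: /private/public/
    Crawl-delay: 2

a policy loaded for that host would be expected to answer as sketched below, assuming (as in common implementations of the Allow extension) that a more specific Allow takes precedence over a broader Disallow:

    policy.allows("/index.html");       // true:  no matching Disallow rule
    policy.allows("/private/secret");   // false: matches Disallow: /private/
    policy.allows("/private/public/a"); // true:  the more specific Allow applies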

disallows

public boolean disallows(String relativeUri)
Complementary to allows.

Parameters:
relativeUri - A URI relative to the host
Returns:
true if the URI is NOT allowed by this host's robots policy

allowsAll

public boolean allowsAll()
Checks whether this policy allows all URLs for this user agent.

Returns:
true if all URLs are allowed for this user agent
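
For example, allowsAll can short-circuit per-URI filtering; a sketch, assuming uris is a mutable List<String>:

    if (!policy.allowsAll()) {
        uris.removeIf(policy::disallows); // drop URIs the policy forbids
    }
    // otherwise every URI is permitted and no filtering is needed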

getCrawlDelay

public float getCrawlDelay()
Returns the crawl delay in seconds.
The crawl delay is the minimum amount of time a crawler should wait between any two consecutive requests to the same host.

Returns:
The crawl delay in seconds

getCrawlDelayMillis

public int getCrawlDelayMillis()
Returns the crawl delay in milliseconds.

Returns:
The crawl delay in milliseconds
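
Together with allows, the crawl delay supports polite crawling; a sketch, where fetch is a hypothetical helper and not part of this API:

    import java.net.URL;
    import java.util.List;
    import xsmeral.semnet.crawler.util.RobotsPolicy;

    class PoliteCrawler {
        static void crawl(RobotsPolicy policy, URL host, List<String> uris)
                throws InterruptedException {
            for (String uri : uris) {
                if (policy.allows(uri)) {
                    fetch(host, uri);
                    // wait at least the advertised delay before the next request
                    Thread.sleep(policy.getCrawlDelayMillis());
                }
            }
        }

        static void fetch(URL host, String uri) { /* hypothetical fetch */ }
    }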