xsmeral.semnet.crawler.util
Class CharsetDetector

java.lang.Object
  extended by xsmeral.semnet.crawler.util.CharsetDetector

public class CharsetDetector
extends Object

Provides methods for detecting the character set of HTML content.


Constructor Summary
CharsetDetector()
           
 
Method Summary
static String detectCharset(String url)
          Convenience method, calls detectCharset(new URL(url)).
static String detectCharset(URL url)
          Tries to find the charset of the HTML content at the specified URL.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CharsetDetector

public CharsetDetector()
Method Detail

detectCharset

public static String detectCharset(String url)
                            throws MalformedURLException
Convenience method, calls detectCharset(new URL(url)).

Throws:
MalformedURLException - if the given string cannot be parsed as a URL
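
The exception originates in the new URL(url) call that this method delegates to. A minimal sketch of which strings that constructor accepts, using only java.net.URL; the class UrlCheckSketch and the helper parses are hypothetical names for illustration and are not part of this package:

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UrlCheckSketch {
    // Hypothetical helper: reports whether new URL(s) succeeds for the given string.
    static boolean parses(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Has a known protocol, so the constructor accepts it.
        System.out.println(parses("http://example.com/index.html"));
        // No protocol, so the constructor throws MalformedURLException.
        System.out.println(parses("example.com/index.html"));
    }
}
```

Callers that receive URLs from untrusted input may prefer to catch the exception at the call site rather than pre-validate, since the constructor is the authoritative check.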

detectCharset

public static String detectCharset(URL url)
Tries to find the charset of the HTML content at the specified URL.

It looks in the following places, in order, and returns the first result found:

  1. The Content-Type HTTP header
  2. The HTML tag <meta http-equiv="Content-Type" content="..." />, which should contain the same directive as the corresponding HTTP header
  3. If both of the above fail, juniversalchardet is used to guess the character set

Parameters:
url - The URL to connect to
Returns:
The first charset found, or null if the charset cannot be found or guessed
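
The lookup order described above can be sketched as follows. This is not the actual implementation of detectCharset: it is an illustrative sketch that works on an already-fetched Content-Type header value and HTML string, and the class CharsetLookupSketch and its method names are hypothetical. The third step (guessing with juniversalchardet) is omitted because it depends on an external library. The meta-tag pattern assumes the http-equiv attribute appears before content.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetLookupSketch {
    // Matches the charset parameter of a Content-Type value,
    // e.g. "text/html; charset=UTF-8".
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // Matches <meta http-equiv="Content-Type" content="..."> and captures
    // the value of the content attribute (assumes this attribute order).
    private static final Pattern META_CONTENT_TYPE = Pattern.compile(
            "<meta\\s+http-equiv\\s*=\\s*[\"']?Content-Type[\"']?\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Step 1: extract the charset from a Content-Type header value, or null.
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Step 2: extract the charset from the equivalent <meta> tag, or null.
    static String fromMetaTag(String html) {
        if (html == null) {
            return null;
        }
        Matcher m = META_CONTENT_TYPE.matcher(html);
        return m.find() ? fromContentType(m.group(1)) : null;
    }

    // Tries the header first, then the meta tag; returns the first result found.
    // A full implementation would fall back to a statistical guess here.
    static String detect(String contentTypeHeader, String html) {
        String charset = fromContentType(contentTypeHeader);
        return charset != null ? charset : fromMetaTag(html);
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=windows-1250\" /></head></html>";
        System.out.println(detect("text/html; charset=UTF-8", html)); // header wins
        System.out.println(detect("text/html", html));                // falls back to meta tag
        System.out.println(detect(null, "<html></html>"));            // nothing found
    }
}
```

Checking the header before the document body means the common case needs no HTML parsing at all, which matches the order listed above.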