xsmeral.semnet.crawler.util
Class CharsetDetector
java.lang.Object
xsmeral.semnet.crawler.util.CharsetDetector
public class CharsetDetector
- extends Object
Provides methods for detecting the character set of HTML content.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
CharsetDetector
public CharsetDetector()
detectCharset
public static String detectCharset(String url)
throws MalformedURLException
- Convenience method, calls detectCharset(new URL(url)).
- Throws:
MalformedURLException - if the given string cannot be parsed as a URL
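Because this overload delegates to new URL(url), it fails in the same cases as that constructor. The sketch below is not part of this class; it uses only the JDK, and the name UrlParsingSketch and its helper isWellFormed are hypothetical. It illustrates which strings a caller can expect to be rejected before any connection is attempted.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlParsingSketch {

    /**
     * Returns true if new URL(spec) accepts the string, false if it throws
     * MalformedURLException. No network connection is made here.
     */
    @SuppressWarnings("deprecation") // URL(String) is deprecated since Java 20; used because the documented method calls it
    static boolean isWellFormed(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            // This is the exception detectCharset(String) is documented to propagate.
            return false;
        }
    }

    public static void main(String[] args) {
        // A string with a known protocol is accepted.
        System.out.println(isWellFormed("http://example.com/index.html")); // true
        // A string without a protocol is rejected.
        System.out.println(isWellFormed("example.com/index.html"));        // false
    }
}
```

A caller of detectCharset(String) would typically wrap the call in a similar try/catch, or validate the string first.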
detectCharset
public static String detectCharset(URL url)
- Tries to find the charset of the HTML content at the specified URL.
It looks in the following places, in order, and returns the first result found:
- The Content-Type HTTP header
- The HTML tag <meta http-equiv="Content-Type" content="..." />, which should contain the same directive as the corresponding HTTP header
- If the two previous lookups fail, the juniversalchardet library is used to guess the character set
- Parameters:
url - The URL to connect to
- Returns:
The first result found, or null if the charset cannot be found or guessed
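The first two lookups in the order described above can be sketched with the JDK alone. This is an illustration under stated assumptions, not the actual implementation of this class: the name CharsetLookupSketch, its methods, and the regular expressions are all hypothetical, the inputs are passed in as strings rather than fetched from a URL, and the third step (guessing with juniversalchardet) is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetLookupSketch {

    // Matches "charset=..." inside a Content-Type value, with optional quotes.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"'/>]+)", Pattern.CASE_INSENSITIVE);

    // Matches <meta http-equiv="Content-Type" content="..."> and captures the content attribute.
    private static final Pattern META_CONTENT_TYPE = Pattern.compile(
            "<meta\\s+http-equiv\\s*=\\s*[\"']?Content-Type[\"']?\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Step 1: extract the charset from a Content-Type header value, or return null. */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Step 2: extract the charset from the meta tag in the HTML, or return null. */
    static String fromMetaTag(String html) {
        if (html == null) {
            return null;
        }
        Matcher m = META_CONTENT_TYPE.matcher(html);
        return m.find() ? fromContentType(m.group(1)) : null;
    }

    /** Returns the first result found, or null. Statistical guessing (step 3) is omitted. */
    static String detect(String contentTypeHeader, String html) {
        String charset = fromContentType(contentTypeHeader);
        if (charset == null) {
            charset = fromMetaTag(html);
        }
        return charset;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=windows-1250\" /></head></html>";
        System.out.println(detect("text/html; charset=UTF-8", html)); // UTF-8 (header wins)
        System.out.println(detect("text/html", html));                // windows-1250 (meta tag)
        System.out.println(detect(null, "<html></html>"));            // null (nothing found)
    }
}
```

The sketch returns the header value whenever it is present, mirroring the documented "first result found" rule. The regular expression for the meta tag assumes http-equiv appears before content, which real pages do not guarantee.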