public class EncodingDetector
extends java.lang.Object
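Broadly this encompasses two functions, which are distinctly separate:
1. Auto-detecting a set of "clues" from the input text.
2. Taking a set of clues and making a "best guess" as to the "real" encoding.
A caller will often have some extra information about what the encoding might be (e.g. from the HTTP header or HTML meta-tags, often wrong but still potentially useful clues). The types of clues may differ from caller to caller. Thus a typical calling sequence is:
- Run step (1) to generate a set of auto-detected clues (autoDetectClues);
- Combine these clues with the caller-dependent "extra clues" available (addClue);
- Run step (2) to decide on a best-guess encoding (guessEncoding).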
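A calling sequence of this shape can be illustrated with a small self-contained model. The class `ClueSketch` below is a hypothetical stand-in written for this example, not Nutch's `EncodingDetector`: it needs no `Configuration` or `WebPage`, the value `-1` for `NO_THRESHOLD` is an assumption (this page does not state it), and the selection rule in `guess` is an assumed simplification rather than the documented behavior of `guessEncoding`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for illustration only; not Nutch's EncodingDetector. */
final class ClueSketch {
    /** Assumed sentinel for "this clue carries no confidence value". */
    static final int NO_THRESHOLD = -1;

    private static final class Clue {
        final String value;
        final String source;
        final int confidence;

        Clue(String value, String source, int confidence) {
            this.value = value;
            this.source = source;
            this.confidence = confidence;
        }
    }

    private final int minConfidence;
    private final List<Clue> clues = new ArrayList<>();

    ClueSketch(int minConfidence) {
        this.minConfidence = minConfidence;
    }

    /** Records a clue together with a confidence score. */
    void addClue(String value, String source, int confidence) {
        if (value == null || value.isEmpty()) {
            return; // nothing useful to record
        }
        clues.add(new Clue(value, source, confidence));
    }

    /** Records a clue that has no confidence score. */
    void addClue(String value, String source) {
        addClue(value, source, NO_THRESHOLD);
    }

    void clearClues() {
        clues.clear();
    }

    /**
     * Assumed rule: the first clue whose confidence reaches the threshold wins;
     * otherwise the first unscored clue; otherwise the caller's default.
     */
    String guess(String defaultValue) {
        String fallback = defaultValue;
        boolean fallbackFromClue = false;
        for (Clue clue : clues) {
            if (minConfidence >= 0 && clue.confidence >= minConfidence) {
                return clue.value;
            }
            if (clue.confidence == NO_THRESHOLD && !fallbackFromClue) {
                fallback = clue.value;
                fallbackFromClue = true;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        ClueSketch sketch = new ClueSketch(50);
        sketch.addClue("windows-1252", "detect", 30); // step 1: weak auto-detected clue
        sketch.addClue("utf-8", "header");            // caller-supplied extra clue
        System.out.println(sketch.guess("iso-8859-1")); // step 2: prints utf-8
    }
}
```

With the real class, the corresponding calls would be `autoDetectClues`, `addClue`, and `guessEncoding`, followed by `clearClues` before reusing the detector for another page.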
Modifier and Type | Field and Description |
---|---|
static org.apache.avro.util.Utf8 | CONTENT_TYPE_UTF8 |
static java.lang.String | MIN_CONFIDENCE_KEY |
static int | NO_THRESHOLD |
Constructor and Description |
---|
EncodingDetector(Configuration conf) |
Modifier and Type | Method and Description |
---|---|
void | addClue(java.lang.String value, java.lang.String source) |
void | addClue(java.lang.String value, java.lang.String source, int confidence) |
void | autoDetectClues(WebPage page, boolean filter) |
void | clearClues() Clears all clues. |
java.lang.String | guessEncoding(WebPage page, java.lang.String defaultValue) Guess the encoding with the previously specified list of clues. |
static java.lang.String | parseCharacterEncoding(java.lang.CharSequence contentTypeUtf8) Parse the character encoding from the specified content type header. |
static java.lang.String | resolveEncodingAlias(java.lang.String encoding) |
public static final org.apache.avro.util.Utf8 CONTENT_TYPE_UTF8
public static final int NO_THRESHOLD
public static final java.lang.String MIN_CONFIDENCE_KEY
public EncodingDetector(Configuration conf)
public void autoDetectClues(WebPage page, boolean filter)
public void addClue(java.lang.String value, java.lang.String source, int confidence)
public void addClue(java.lang.String value, java.lang.String source)
public java.lang.String guessEncoding(WebPage page, java.lang.String defaultValue)
defaultValue - Default encoding to return if no encoding can be detected with enough confidence. Note that this will not be normalized with resolveEncodingAlias(java.lang.String).
public void clearClues()
public static java.lang.String resolveEncodingAlias(java.lang.String encoding)
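public static java.lang.String parseCharacterEncoding(java.lang.CharSequence contentTypeUtf8)
Parse the character encoding from the specified content type header. If the content type is null, or there is no explicit character encoding, null is returned. This method was copied from org.apache.catalina.util.RequestUtil, which is licensed under the Apache License, Version 2.0 (the "License").
contentTypeUtf8 - the content type header to parse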
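As a rough illustration of this kind of header parsing, the following standalone sketch extracts the `charset` parameter from a content type value. It is not the Nutch or Tomcat source: the class name `ContentTypeCharsetSketch` and the method `parseCharset` are hypothetical, and details such as case-insensitive matching and quote stripping are assumptions about reasonable behavior.

```java
/** Standalone sketch for illustration; not the Nutch or Tomcat implementation. */
final class ContentTypeCharsetSketch {
    private static final String KEY = "charset=";

    /** Returns the charset parameter of a content type value, or null if absent. */
    static String parseCharset(CharSequence contentType) {
        if (contentType == null) {
            return null;
        }
        String header = contentType.toString();

        // Find "charset=" without regard to case.
        int start = -1;
        for (int i = 0; i + KEY.length() <= header.length(); i++) {
            if (header.regionMatches(true, i, KEY, 0, KEY.length())) {
                start = i + KEY.length();
                break;
            }
        }
        if (start < 0) {
            return null; // no explicit character encoding
        }

        // The value runs up to the next parameter separator, if any.
        String encoding = header.substring(start);
        int end = encoding.indexOf(';');
        if (end >= 0) {
            encoding = encoding.substring(0, end);
        }
        encoding = encoding.trim();

        // Strip one pair of surrounding double quotes.
        if (encoding.length() > 2 && encoding.startsWith("\"") && encoding.endsWith("\"")) {
            encoding = encoding.substring(1, encoding.length() - 1).trim();
        }
        return encoding.isEmpty() ? null : encoding;
    }

    public static void main(String[] args) {
        System.out.println(parseCharset("text/html; charset=UTF-8"));
        System.out.println(parseCharset("text/html"));
    }
}
```

The sketch returns the value exactly as written in the header; mapping it to a canonical name is a separate step, which in this class is the role of resolveEncodingAlias(java.lang.String).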
Copyright © 2019 The Apache Software Foundation