java.lang.Object
  nu.validator.htmlparser.sax.HtmlParser
public class HtmlParser
This class implements an HTML5 parser that exposes data through the SAX2 interface.
By default, when using the constructor without arguments, this parser treats XML 1.0-incompatible infosets as fatal errors in order to adhere to the SAX2 API contract strictly. This corresponds to FATAL as the general XML violation policy. To make the parser support non-conforming HTML fully per the HTML 5 spec, at the cost of potentially violating the SAX2 API contract, set the general XML violation policy to ALLOW. Handling all input without fatal errors and without violating the SAX2 API contract is possible by setting the general XML violation policy to ALTER_INFOSET. This makes the parser non-conforming but is probably the most useful setting for most applications.
By default, this parser doesn't do true streaming but buffers everything first. The parser can be made truly streaming by calling setStreamabilityViolationPolicy(XmlViolationPolicy.FATAL). This has the consequence that errors that require non-streamable recovery are treated as fatal.
By default, in order to make the parse events emulate the parse events for a DTDless XML document, the parser does not report the doctype through LexicalHandler. Doctype reporting through LexicalHandler can be turned on by calling setReportingDoctype(true).
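Because the parser exposes its data through the SAX2 interface, a ContentHandler written against org.xml.sax can consume its events. The following is a minimal sketch under stated assumptions: the class ElementCounter and its helper methods are hypothetical names, and the demo uses the JDK's built-in XML reader as a stand-in so that it runs without extra jars. With the htmlparser jar on the classpath, an HtmlParser instance should be usable wherever the XMLReader argument is passed, since the class implements XMLReader.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: counts startElement events delivered by any XMLReader.
public class ElementCounter extends DefaultHandler {
    private int elements;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements++;
    }

    // Registers a fresh counter as the ContentHandler, parses, and returns the count.
    public static int countElements(XMLReader reader, String markup)
            throws IOException, SAXException {
        ElementCounter counter = new ElementCounter();
        reader.setContentHandler(counter);
        reader.parse(new InputSource(new StringReader(markup)));
        return counter.elements;
    }

    // Stand-in reader from the JDK (an assumption made so the sketch is runnable);
    // it only accepts well-formed XML, unlike an HTML parser.
    public static XMLReader standInReader() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        String markup = "<html><head><title>t</title></head><body><p>x</p></body></html>";
        System.out.println(countElements(standInReader(), markup));
    }
}
```

The handler code does not depend on which XMLReader produced the events, so only the line that creates the reader would need to change when switching parsers.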
Constructor Summary | |
---|---|
HtmlParser() | Instantiates the parser with a fatal XML violation policy. |
HtmlParser(XmlViolationPolicy xmlPolicy) | Instantiates the parser with a specific XML violation policy. |
Method Summary | | |
---|---|---|
void | addCharacterHandler(CharacterHandler characterHandler) | |
XmlViolationPolicy | getBogusXmlnsPolicy() | Returns the bogusXmlnsPolicy. |
XmlViolationPolicy | getCommentPolicy() | Returns the commentPolicy. |
ContentHandler | getContentHandler() | |
XmlViolationPolicy | getContentNonXmlCharPolicy() | Returns the contentNonXmlCharPolicy. |
XmlViolationPolicy | getContentSpacePolicy() | Returns the contentSpacePolicy. |
DoctypeExpectation | getDoctypeExpectation() | Returns the doctype expectation. |
Locator | getDocumentLocator() | Returns the Locator during parse. |
DocumentModeHandler | getDocumentModeHandler() | Returns the document mode handler. |
DTDHandler | getDTDHandler() | |
EntityResolver | getEntityResolver() | |
ErrorHandler | getErrorHandler() | |
boolean | getFeature(String name) | Exposes the configuration of the emulated XML parser as well as boolean-valued configuration without using non-XMLReader getters directly. |
LexicalHandler | getLexicalHandler() | Returns the lexicalHandler. |
XmlViolationPolicy | getNamePolicy() | The policy for non-NCName element and attribute names. |
Object | getProperty(String name) | Allows XMLReader-level access to non-boolean valued getters. |
XmlViolationPolicy | getStreamabilityViolationPolicy() | Returns the streamabilityViolationPolicy. |
XmlViolationPolicy | getXmlnsPolicy() | Returns the xmlnsPolicy. |
boolean | isCheckingNormalization() | Indicates whether NFC normalization of source is being checked. |
boolean | isHtml4ModeCompatibleWithXhtml1Schemata() | Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value. |
boolean | isMappingLangToXmlLang() | Whether lang is mapped to xml:lang. |
boolean | isReportingDoctype() | Returns the reportingDoctype. |
boolean | isScriptingEnabled() | Whether the parser considers scripting to be enabled for noscript treatment. |
private void | lazyInit() | This class wraps different tree builders depending on configuration. |
void | parse(InputSource input) | |
void | parse(String systemId) | |
void | parseFragment(InputSource input, String context) | Parses a fragment. |
void | setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy) | Sets the policy for forbidden xmlns attributes. |
void | setCheckingNormalization(boolean enable) | Toggles the checking of the NFC normalization of source. |
void | setCommentPolicy(XmlViolationPolicy commentPolicy) | Sets the policy for consecutive hyphens in comments. |
void | setContentHandler(ContentHandler handler) | |
void | setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy) | Sets the policy for non-XML characters except white space. |
void | setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy) | Sets the policy for non-XML white space. |
void | setDoctypeExpectation(DoctypeExpectation doctypeExpectation) | Sets the doctype expectation. |
void | setDocumentModeHandler(DocumentModeHandler documentModeHandler) | Sets the document mode handler. |
void | setDTDHandler(DTDHandler handler) | |
void | setEntityResolver(EntityResolver resolver) | |
void | setErrorHandler(ErrorHandler handler) | |
void | setFeature(String name, boolean value) | Sets a boolean feature without having to use non-XMLReader setters directly. |
void | setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata) | Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value. |
void | setLexicalHandler(LexicalHandler handler) | Sets the lexical handler. |
void | setMappingLangToXmlLang(boolean mappingLangToXmlLang) | Whether lang is mapped to xml:lang. |
void | setNamePolicy(XmlViolationPolicy namePolicy) | The policy for non-NCName element and attribute names. |
void | setProperty(String name, Object value) | Sets a non-boolean property without having to use non-XMLReader setters directly. |
void | setReportingDoctype(boolean reportingDoctype) | |
void | setScriptingEnabled(boolean scriptingEnabled) | Sets whether the parser considers scripting to be enabled for noscript treatment. |
void | setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy) | Sets the streamabilityViolationPolicy. |
void | setTreeBuilderErrorHandlerOverride(ErrorHandler handler) | Deprecated. For Validator.nu internal use. |
void | setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy) | Whether the xmlns attribute on the root element is passed through. |
void | setXmlPolicy(XmlViolationPolicy xmlPolicy) | This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go. |
private void | tokenize(InputSource is) | |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
private Tokenizer tokenizer
private TreeBuilder<?> treeBuilder
private SAXStreamer saxStreamer
private SAXTreeBuilder saxTreeBuilder
private ContentHandler contentHandler
private LexicalHandler lexicalHandler
private DTDHandler dtdHandler
private EntityResolver entityResolver
private ErrorHandler errorHandler
private DocumentModeHandler documentModeHandler
private DoctypeExpectation doctypeExpectation
private boolean checkingNormalization
private boolean scriptingEnabled
private final List<CharacterHandler> characterHandlers
private XmlViolationPolicy contentSpacePolicy
private XmlViolationPolicy contentNonXmlCharPolicy
private XmlViolationPolicy commentPolicy
private XmlViolationPolicy namePolicy
private XmlViolationPolicy streamabilityViolationPolicy
private boolean html4ModeCompatibleWithXhtml1Schemata
private boolean mappingLangToXmlLang
private XmlViolationPolicy xmlnsPolicy
private XmlViolationPolicy bogusXmlnsPolicy
private boolean reportingDoctype
private ErrorHandler treeBuilderErrorHandler
Constructor Detail |
---|
public HtmlParser()
public HtmlParser(XmlViolationPolicy xmlPolicy)
Parameters: xmlPolicy - the policy
Method Detail |
---|
private void lazyInit()
public ContentHandler getContentHandler()
Specified by: getContentHandler in interface XMLReader
See Also: XMLReader.getContentHandler()
public DTDHandler getDTDHandler()
Specified by: getDTDHandler in interface XMLReader
See Also: XMLReader.getDTDHandler()
public EntityResolver getEntityResolver()
Specified by: getEntityResolver in interface XMLReader
See Also: XMLReader.getEntityResolver()
public ErrorHandler getErrorHandler()
Specified by: getErrorHandler in interface XMLReader
See Also: XMLReader.getErrorHandler()
public boolean getFeature(String name) throws SAXNotRecognizedException, SAXNotSupportedException
Exposes the configuration of the emulated XML parser as well as boolean-valued configuration without using non-XMLReader getters directly.

Feature | Value |
---|---|
http://xml.org/sax/features/external-general-entities | false |
http://xml.org/sax/features/external-parameter-entities | false |
http://xml.org/sax/features/is-standalone | true |
http://xml.org/sax/features/lexical-handler/parameter-entities | false |
http://xml.org/sax/features/namespaces | true |
http://xml.org/sax/features/namespace-prefixes | false |
http://xml.org/sax/features/resolve-dtd-uris | true |
http://xml.org/sax/features/string-interning | false |
http://xml.org/sax/features/unicode-normalization-checking | isCheckingNormalization |
http://xml.org/sax/features/use-attributes2 | false |
http://xml.org/sax/features/use-locator2 | false |
http://xml.org/sax/features/use-entity-resolver2 | false |
http://xml.org/sax/features/validation | false |
http://xml.org/sax/features/xmlns-uris | false |
http://xml.org/sax/features/xml-1.1 | false |
http://validator.nu/features/html4-mode-compatible-with-xhtml1-schemata | isHtml4ModeCompatibleWithXhtml1Schemata |
http://validator.nu/features/mapping-lang-to-xml-lang | isMappingLangToXmlLang |
http://validator.nu/features/scripting-enabled | isScriptingEnabled |

Specified by: getFeature in interface XMLReader
Parameters: name - feature URI string
Throws: SAXNotRecognizedException, SAXNotSupportedException
See Also: XMLReader.getFeature(java.lang.String)
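Reading features through the generic XMLReader interface lets calling code inspect a parser without depending on its concrete getters. The sketch below is illustrative only: FeatureSnapshot and its methods are hypothetical names, and the demo queries the JDK's built-in reader as a stand-in, which is not expected to recognize the validator.nu feature URIs. An HtmlParser instance could be passed to snapshot() instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper: records the state of several feature URIs on one reader.
public class FeatureSnapshot {

    // Maps each URI to "true", "false", "not recognized" or "not supported".
    public static Map<String, String> snapshot(XMLReader reader, String... uris) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String uri : uris) {
            try {
                result.put(uri, Boolean.toString(reader.getFeature(uri)));
            } catch (SAXNotRecognizedException e) {
                result.put(uri, "not recognized");
            } catch (SAXNotSupportedException e) {
                result.put(uri, "not supported");
            }
        }
        return result;
    }

    // Stand-in reader from the JDK, used only so that the sketch runs as-is.
    public static XMLReader jdkReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> state = snapshot(jdkReader(),
                "http://xml.org/sax/features/namespaces",
                "http://validator.nu/features/scripting-enabled");
        state.forEach((uri, value) -> System.out.println(uri + " = " + value));
    }
}
```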
public Object getProperty(String name) throws SAXNotRecognizedException, SAXNotSupportedException
Allows XMLReader-level access to non-boolean valued getters.
The properties are mapped as follows:

Property | Value |
---|---|
http://xml.org/sax/properties/document-xml-version | "1.0" |
http://xml.org/sax/properties/lexical-handler | getLexicalHandler |
http://validator.nu/properties/content-space-policy | getContentSpacePolicy |
http://validator.nu/properties/content-non-xml-char-policy | getContentNonXmlCharPolicy |
http://validator.nu/properties/comment-policy | getCommentPolicy |
http://validator.nu/properties/xmlns-policy | getXmlnsPolicy |
http://validator.nu/properties/name-policy | getNamePolicy |
http://validator.nu/properties/streamability-violation-policy | getStreamabilityViolationPolicy |
http://validator.nu/properties/document-mode-handler | getDocumentModeHandler |
http://validator.nu/properties/doctype-expectation | getDoctypeExpectation |
http://xml.org/sax/features/unicode-normalization-checking | |

Specified by: getProperty in interface XMLReader
Parameters: name - property URI string
Throws: SAXNotRecognizedException, SAXNotSupportedException
See Also: XMLReader.getProperty(java.lang.String)
public void parse(InputSource input) throws IOException, SAXException
Specified by: parse in interface XMLReader
Throws: IOException, SAXException
See Also: XMLReader.parse(org.xml.sax.InputSource)
public void parseFragment(InputSource input, String context) throws IOException, SAXException
Parameters: input - the input to parse
context - the name of the context element
Throws: IOException, SAXException
private void tokenize(InputSource is) throws SAXException, IOException, MalformedURLException
Parameters: is -
Throws: SAXException, IOException, MalformedURLException
public void parse(String systemId) throws IOException, SAXException
Specified by: parse in interface XMLReader
Throws: IOException, SAXException
See Also: XMLReader.parse(java.lang.String)
public void setContentHandler(ContentHandler handler)
Specified by: setContentHandler in interface XMLReader
See Also: XMLReader.setContentHandler(org.xml.sax.ContentHandler)
public void setLexicalHandler(LexicalHandler handler)
Parameters: handler - the handler.
public void setDTDHandler(DTDHandler handler)
Specified by: setDTDHandler in interface XMLReader
See Also: XMLReader.setDTDHandler(org.xml.sax.DTDHandler)
public void setEntityResolver(EntityResolver resolver)
Specified by: setEntityResolver in interface XMLReader
See Also: XMLReader.setEntityResolver(org.xml.sax.EntityResolver)
public void setErrorHandler(ErrorHandler handler)
Specified by: setErrorHandler in interface XMLReader
See Also: XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)
public void setTreeBuilderErrorHandlerOverride(ErrorHandler handler)
See Also: XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)
public void setFeature(String name, boolean value) throws SAXNotRecognizedException, SAXNotSupportedException
Sets a boolean feature without having to use non-XMLReader setters directly.
The supported features are:

Feature | Setter |
---|---|
http://xml.org/sax/features/unicode-normalization-checking | setCheckingNormalization |
http://validator.nu/features/html4-mode-compatible-with-xhtml1-schemata | setHtml4ModeCompatibleWithXhtml1Schemata |
http://validator.nu/features/mapping-lang-to-xml-lang | setMappingLangToXmlLang |
http://validator.nu/features/scripting-enabled | setScriptingEnabled |

Specified by: setFeature in interface XMLReader
Throws: SAXNotRecognizedException, SAXNotSupportedException
See Also: XMLReader.setFeature(java.lang.String, boolean)
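Code that must work with more than one XMLReader implementation may want to attempt a feature setting and carry on if the reader rejects it. The following sketch shows one way to do that; FeatureToggle and trySetFeature are hypothetical names, and the JDK reader is used as a stand-in that is assumed not to know the validator.nu feature URIs.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper: sets a feature by URI and reports whether the reader accepted it.
public class FeatureToggle {

    // Returns true if setFeature succeeded, false if the URI was rejected.
    public static boolean trySetFeature(XMLReader reader, String uri, boolean value) {
        try {
            reader.setFeature(uri, value);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    // Stand-in reader from the JDK, used only so that the sketch runs as-is.
    public static XMLReader jdkReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = jdkReader();
        String scripting = "http://validator.nu/features/scripting-enabled";
        System.out.println(scripting + " accepted: " + trySetFeature(reader, scripting, true));
    }
}
```

Passing an HtmlParser as the reader would route the same call to the setter listed for that feature URI in the table above.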
public void setProperty(String name, Object value) throws SAXNotRecognizedException, SAXNotSupportedException
Sets a non-boolean property without having to use non-XMLReader setters directly.

Property | Setter |
---|---|
http://xml.org/sax/properties/lexical-handler | setLexicalHandler |
http://validator.nu/properties/content-space-policy | setContentSpacePolicy |
http://validator.nu/properties/content-non-xml-char-policy | setContentNonXmlCharPolicy |
http://validator.nu/properties/comment-policy | setCommentPolicy |
http://validator.nu/properties/xmlns-policy | setXmlnsPolicy |
http://validator.nu/properties/name-policy | setNamePolicy |
http://validator.nu/properties/streamability-violation-policy | setStreamabilityViolationPolicy |
http://validator.nu/properties/document-mode-handler | setDocumentModeHandler |
http://validator.nu/properties/doctype-expectation | setDoctypeExpectation |
http://validator.nu/properties/xml-policy | setXmlPolicy |

Specified by: setProperty in interface XMLReader
Throws: SAXNotRecognizedException, SAXNotSupportedException
See Also: XMLReader.setProperty(java.lang.String, java.lang.Object)
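The lexical-handler property is the standard SAX2 way to register a LexicalHandler without calling a parser-specific setter. As a rough sketch, the hypothetical CommentCollector class below registers itself through setProperty and gathers comment text. The JDK's XML reader is used as a stand-in so the code runs without the htmlparser jar; passing an HtmlParser instead is an untested assumption based on the property mapping listed above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper: DefaultHandler2 implements both ContentHandler and LexicalHandler.
public class CommentCollector extends DefaultHandler2 {
    private final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    // Registers the collector via the SAX2 property URI and returns the comments seen.
    public static List<String> collect(XMLReader reader, String markup) throws Exception {
        CommentCollector collector = new CommentCollector();
        reader.setContentHandler(collector);
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", collector);
        reader.parse(new InputSource(new StringReader(markup)));
        return collector.comments;
    }

    // Stand-in reader from the JDK, used only so that the sketch runs as-is.
    public static XMLReader jdkReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect(jdkReader(), "<r><!--one--><a/><!--two--></r>"));
    }
}
```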
public boolean isCheckingNormalization()
Returns: true if NFC normalization of source is being checked.
See Also: Tokenizer.isCheckingNormalization()
public void setCheckingNormalization(boolean enable)
Parameters: enable - true to check normalization
See Also: Tokenizer.setCheckingNormalization(boolean)
public void setCommentPolicy(XmlViolationPolicy commentPolicy)
Parameters: commentPolicy - the policy
See Also: Tokenizer.setCommentPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
public void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
Parameters: contentNonXmlCharPolicy - the policy
See Also: Tokenizer.setContentNonXmlCharPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
public void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
Parameters: contentSpacePolicy - the policy
See Also: Tokenizer.setContentSpacePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
public boolean isScriptingEnabled()
Returns: true if enabled
See Also: TreeBuilder.isScriptingEnabled()
public void setScriptingEnabled(boolean scriptingEnabled)
Parameters: scriptingEnabled - true to enable
See Also: TreeBuilder.setScriptingEnabled(boolean)
public DoctypeExpectation getDoctypeExpectation()
public void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
Parameters: doctypeExpectation - the doctypeExpectation to set
See Also: TreeBuilder.setDoctypeExpectation(nu.validator.htmlparser.common.DoctypeExpectation)
public DocumentModeHandler getDocumentModeHandler()
public void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
Parameters: documentModeHandler - the documentModeHandler to set
See Also: TreeBuilder.setDocumentModeHandler(nu.validator.htmlparser.common.DocumentModeHandler)
public XmlViolationPolicy getStreamabilityViolationPolicy()
public void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy)
Parameters: streamabilityViolationPolicy - the streamabilityViolationPolicy to set
public void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
Parameters: html4ModeCompatibleWithXhtml1Schemata -
public Locator getDocumentLocator()
Returns the Locator during parse.
Returns: Locator
public boolean isHtml4ModeCompatibleWithXhtml1Schemata()
public void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
Whether lang is mapped to xml:lang.
Parameters: mappingLangToXmlLang -
See Also: Tokenizer.setMappingLangToXmlLang(boolean)
public boolean isMappingLangToXmlLang()
Whether lang is mapped to xml:lang.
public void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy)
Whether the xmlns attribute on the root element is passed through. (FATAL not allowed.)
Parameters: xmlnsPolicy -
See Also: Tokenizer.setXmlnsPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
public XmlViolationPolicy getXmlnsPolicy()
public LexicalHandler getLexicalHandler()
public XmlViolationPolicy getCommentPolicy()
public XmlViolationPolicy getContentNonXmlCharPolicy()
public XmlViolationPolicy getContentSpacePolicy()
public void setReportingDoctype(boolean reportingDoctype)
Parameters: reportingDoctype -
See Also: TreeBuilder.setReportingDoctype(boolean)
public boolean isReportingDoctype()
public void setNamePolicy(XmlViolationPolicy namePolicy)
Parameters: namePolicy -
See Also: Tokenizer.setNamePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
public void setXmlPolicy(XmlViolationPolicy xmlPolicy)
Parameters: xmlPolicy -
public XmlViolationPolicy getNamePolicy()
public void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
Sets the policy for forbidden xmlns attributes.
Parameters: bogusXmlnsPolicy - the policy
See Also: Tokenizer.setBogusXmlnsPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
public XmlViolationPolicy getBogusXmlnsPolicy()
public void addCharacterHandler(CharacterHandler characterHandler)