nu.validator.htmlparser.sax
Class HtmlParser

java.lang.Object
  extended by nu.validator.htmlparser.sax.HtmlParser
All Implemented Interfaces:
org.xml.sax.XMLReader
Direct Known Subclasses:
InfosetCoercingHtmlParser

public class HtmlParser
extends java.lang.Object
implements org.xml.sax.XMLReader

This class implements an HTML5 parser that exposes data through the SAX2 interface.

By default, when using the constructor without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. To make the parser support non-conforming HTML fully per the HTML 5 spec, at the cost of potentially violating the SAX2 API contract, set the general XML violation policy to ALLOW. To treat XML 1.0 infoset violations as fatal, set the general XML violation policy to FATAL.

By default, this parser doesn't do true streaming but buffers everything first. The parser can be made truly streaming by calling setStreamabilityViolationPolicy(XmlViolationPolicy.FATAL). This has the consequence that errors that require non-streamable recovery are treated as fatal.

By default, in order to make the parse events emulate the parse events for a DTDless XML document, the parser does not report the doctype through LexicalHandler. Doctype reporting through LexicalHandler can be turned on by calling setReportingDoctype(true).
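
Because this class implements org.xml.sax.XMLReader, any SAX2 ContentHandler can be attached to it. The following is a minimal sketch, not part of the library: ElementNameCollector, collect and newStandInReader are hypothetical names, and the JDK's built-in namespace-aware XML reader is used only as a stand-in so the sketch runs without the htmlparser jar.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: records the local name of every start tag. */
final class ElementNameCollector extends DefaultHandler {
    private final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        names.add(localName);
    }

    /** Parses the markup with the given reader and returns element names in document order. */
    static List<String> collect(XMLReader reader, String markup)
            throws IOException, SAXException {
        ElementNameCollector handler = new ElementNameCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.names;
    }

    /** Stand-in reader (assumption): the JDK XML parser with namespace processing on. */
    static XMLReader newStandInReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }
}
```

With the htmlparser jar on the classpath, passing new nu.validator.htmlparser.sax.HtmlParser() as the reader argument should work the same way, since collect relies only on the XMLReader interface.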

Version:
$Id$
Author:
hsivonen

Constructor Summary
HtmlParser()
          Instantiates the parser with a fatal XML violation policy.
HtmlParser(XmlViolationPolicy xmlPolicy)
          Instantiates the parser with a specific XML violation policy.
 
Method Summary
 void addCharacterHandler(CharacterHandler characterHandler)
           
 XmlViolationPolicy getBogusXmlnsPolicy()
          Deprecated.  
 XmlViolationPolicy getCommentPolicy()
          Returns the commentPolicy.
 org.xml.sax.ContentHandler getContentHandler()
           
 XmlViolationPolicy getContentNonXmlCharPolicy()
          Returns the contentNonXmlCharPolicy.
 XmlViolationPolicy getContentSpacePolicy()
          Returns the contentSpacePolicy.
 DoctypeExpectation getDoctypeExpectation()
          Returns the doctype expectation.
 org.xml.sax.Locator getDocumentLocator()
          Returns the Locator during parse.
 DocumentModeHandler getDocumentModeHandler()
          Returns the document mode handler.
 org.xml.sax.DTDHandler getDTDHandler()
           
 org.xml.sax.EntityResolver getEntityResolver()
           
 org.xml.sax.ErrorHandler getErrorHandler()
           
 boolean getFeature(java.lang.String name)
          Exposes the configuration of the emulated XML parser as well as boolean-valued configuration without using non-XMLReader getters directly.
 Heuristics getHeuristics()
           
 org.xml.sax.ext.LexicalHandler getLexicalHandler()
          Returns the lexicalHandler.
 XmlViolationPolicy getNamePolicy()
          The policy for non-NCName element and attribute names.
 java.lang.Object getProperty(java.lang.String name)
          Allows XMLReader-level access to non-boolean valued getters.
 XmlViolationPolicy getStreamabilityViolationPolicy()
          Returns the streamabilityViolationPolicy.
 XmlViolationPolicy getXmlnsPolicy()
          Returns the xmlnsPolicy.
 boolean isCheckingNormalization()
          Indicates whether NFC normalization of source is being checked.
 boolean isHtml4ModeCompatibleWithXhtml1Schemata()
          Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
 boolean isMappingLangToXmlLang()
          Whether lang is mapped to xml:lang.
 boolean isReportingDoctype()
          Returns the reportingDoctype.
 boolean isScriptingEnabled()
          Whether the parser considers scripting to be enabled for noscript treatment.
 void parse(org.xml.sax.InputSource input)
           
 void parse(java.lang.String systemId)
           
 void parseFragment(org.xml.sax.InputSource input, java.lang.String context)
          Parses a fragment.
 void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
          Deprecated.  
 void setCheckingNormalization(boolean enable)
          Toggles the checking of the NFC normalization of source.
 void setCommentPolicy(XmlViolationPolicy commentPolicy)
          Sets the policy for consecutive hyphens in comments.
 void setContentHandler(org.xml.sax.ContentHandler handler)
           
 void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
          Sets the policy for non-XML characters except white space.
 void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
          Sets the policy for non-XML white space.
 void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
          Sets the doctype expectation.
 void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
          Sets the document mode handler.
 void setDTDHandler(org.xml.sax.DTDHandler handler)
           
 void setEntityResolver(org.xml.sax.EntityResolver resolver)
           
 void setErrorHandler(org.xml.sax.ErrorHandler handler)
           
 void setErrorProfile(java.util.HashMap<java.lang.String,java.lang.String> errorProfileMap)
           
 void setFeature(java.lang.String name, boolean value)
          Sets a boolean feature without having to use non-XMLReader setters directly.
 void setHeuristics(Heuristics heuristics)
          Sets the encoding sniffing heuristics.
 void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
          Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
 void setLexicalHandler(org.xml.sax.ext.LexicalHandler handler)
          Sets the lexical handler.
 void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
          Whether lang is mapped to xml:lang.
 void setNamePolicy(XmlViolationPolicy namePolicy)
          The policy for non-NCName element and attribute names.
 void setProperty(java.lang.String name, java.lang.Object value)
          Sets a non-boolean property without having to use non-XMLReader setters directly.
 void setReportingDoctype(boolean reportingDoctype)
           
 void setScriptingEnabled(boolean scriptingEnabled)
          Sets whether the parser considers scripting to be enabled for noscript treatment.
 void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy)
          Sets the streamabilityViolationPolicy.
 void setTransitionHandler(TransitionHandler handler)
           
 void setTreeBuilderErrorHandlerOverride(org.xml.sax.ErrorHandler handler)
          Deprecated. For Validator.nu internal use
 void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy)
          Whether the xmlns attribute on the root element is passed through.
 void setXmlPolicy(XmlViolationPolicy xmlPolicy)
          This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlParser

public HtmlParser()
Instantiates the parser with a fatal XML violation policy.


HtmlParser

public HtmlParser(XmlViolationPolicy xmlPolicy)
Instantiates the parser with a specific XML violation policy.

Parameters:
xmlPolicy - the policy
Method Detail

getContentHandler

public org.xml.sax.ContentHandler getContentHandler()
Specified by:
getContentHandler in interface org.xml.sax.XMLReader
See Also:
XMLReader.getContentHandler()

getDTDHandler

public org.xml.sax.DTDHandler getDTDHandler()
Specified by:
getDTDHandler in interface org.xml.sax.XMLReader
See Also:
XMLReader.getDTDHandler()

getEntityResolver

public org.xml.sax.EntityResolver getEntityResolver()
Specified by:
getEntityResolver in interface org.xml.sax.XMLReader
See Also:
XMLReader.getEntityResolver()

getErrorHandler

public org.xml.sax.ErrorHandler getErrorHandler()
Specified by:
getErrorHandler in interface org.xml.sax.XMLReader
See Also:
XMLReader.getErrorHandler()

getFeature

public boolean getFeature(java.lang.String name)
                   throws org.xml.sax.SAXNotRecognizedException,
                          org.xml.sax.SAXNotSupportedException
Exposes the configuration of the emulated XML parser as well as boolean-valued configuration without using non-XMLReader getters directly.

The features are mapped as follows:

http://xml.org/sax/features/external-general-entities
false
http://xml.org/sax/features/external-parameter-entities
false
http://xml.org/sax/features/is-standalone
true
http://xml.org/sax/features/lexical-handler/parameter-entities
false
http://xml.org/sax/features/namespaces
true
http://xml.org/sax/features/namespace-prefixes
false
http://xml.org/sax/features/resolve-dtd-uris
true
http://xml.org/sax/features/string-interning
false
http://xml.org/sax/features/unicode-normalization-checking
isCheckingNormalization
http://xml.org/sax/features/use-attributes2
false
http://xml.org/sax/features/use-locator2
false
http://xml.org/sax/features/use-entity-resolver2
false
http://xml.org/sax/features/validation
false
http://xml.org/sax/features/xmlns-uris
false
http://xml.org/sax/features/xml-1.1
false
http://validator.nu/features/html4-mode-compatible-with-xhtml1-schemata
isHtml4ModeCompatibleWithXhtml1Schemata
http://validator.nu/features/mapping-lang-to-xml-lang
isMappingLangToXmlLang
http://validator.nu/features/scripting-enabled
isScriptingEnabled

Specified by:
getFeature in interface org.xml.sax.XMLReader
Parameters:
name - feature URI string
Returns:
a value per the list above
Throws:
org.xml.sax.SAXNotRecognizedException
org.xml.sax.SAXNotSupportedException
See Also:
XMLReader.getFeature(java.lang.String)
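
Code written against the generic XMLReader interface may be handed a reader that does not know the validator.nu feature URIs listed above. The following is a minimal sketch of a defensive lookup; FeatureProbe and featureOrDefault are hypothetical names, not part of this API.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

/** Hypothetical helper for reading boolean features by URI. */
final class FeatureProbe {
    // Feature URI taken from the list above.
    static final String SCRIPTING_ENABLED =
            "http://validator.nu/features/scripting-enabled";

    /**
     * Returns the reader's value for the feature, or the fallback when the
     * reader does not recognize or support the URI.
     */
    static boolean featureOrDefault(XMLReader reader, String uri, boolean fallback) {
        try {
            return reader.getFeature(uri);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return fallback;
        }
    }
}
```

For an HtmlParser instance, the list above indicates that SCRIPTING_ENABLED maps to isScriptingEnabled, so the fallback would only be used with other XMLReader implementations.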

getProperty

public java.lang.Object getProperty(java.lang.String name)
                             throws org.xml.sax.SAXNotRecognizedException,
                                    org.xml.sax.SAXNotSupportedException
Allows XMLReader-level access to non-boolean valued getters.

The properties are mapped as follows:

http://xml.org/sax/properties/document-xml-version
"1.0"
http://xml.org/sax/properties/lexical-handler
getLexicalHandler
http://validator.nu/properties/content-space-policy
getContentSpacePolicy
http://validator.nu/properties/content-non-xml-char-policy
getContentNonXmlCharPolicy
http://validator.nu/properties/comment-policy
getCommentPolicy
http://validator.nu/properties/xmlns-policy
getXmlnsPolicy
http://validator.nu/properties/name-policy
getNamePolicy
http://validator.nu/properties/streamability-violation-policy
getStreamabilityViolationPolicy
http://validator.nu/properties/document-mode-handler
getDocumentModeHandler
http://validator.nu/properties/doctype-expectation
getDoctypeExpectation
http://xml.org/sax/features/unicode-normalization-checking

Specified by:
getProperty in interface org.xml.sax.XMLReader
Parameters:
name - property URI string
Returns:
a value per the list above
Throws:
org.xml.sax.SAXNotRecognizedException
org.xml.sax.SAXNotSupportedException
See Also:
XMLReader.getProperty(java.lang.String)

parse

public void parse(org.xml.sax.InputSource input)
           throws java.io.IOException,
                  org.xml.sax.SAXException
Specified by:
parse in interface org.xml.sax.XMLReader
Throws:
java.io.IOException
org.xml.sax.SAXException
See Also:
XMLReader.parse(org.xml.sax.InputSource)

parseFragment

public void parseFragment(org.xml.sax.InputSource input,
                          java.lang.String context)
                   throws java.io.IOException,
                          org.xml.sax.SAXException
Parses a fragment.

Parameters:
input - the input to parse
context - the name of the context element
Throws:
java.io.IOException
org.xml.sax.SAXException

parse

public void parse(java.lang.String systemId)
           throws java.io.IOException,
                  org.xml.sax.SAXException
Specified by:
parse in interface org.xml.sax.XMLReader
Throws:
java.io.IOException
org.xml.sax.SAXException
See Also:
XMLReader.parse(java.lang.String)

setContentHandler

public void setContentHandler(org.xml.sax.ContentHandler handler)
Specified by:
setContentHandler in interface org.xml.sax.XMLReader
See Also:
XMLReader.setContentHandler(org.xml.sax.ContentHandler)

setLexicalHandler

public void setLexicalHandler(org.xml.sax.ext.LexicalHandler handler)
Sets the lexical handler.

Parameters:
handler - the handler

setDTDHandler

public void setDTDHandler(org.xml.sax.DTDHandler handler)
Specified by:
setDTDHandler in interface org.xml.sax.XMLReader
See Also:
XMLReader.setDTDHandler(org.xml.sax.DTDHandler)

setEntityResolver

public void setEntityResolver(org.xml.sax.EntityResolver resolver)
Specified by:
setEntityResolver in interface org.xml.sax.XMLReader
See Also:
XMLReader.setEntityResolver(org.xml.sax.EntityResolver)

setErrorHandler

public void setErrorHandler(org.xml.sax.ErrorHandler handler)
Specified by:
setErrorHandler in interface org.xml.sax.XMLReader
See Also:
XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)

setTransitionHandler

public void setTransitionHandler(TransitionHandler handler)

setTreeBuilderErrorHandlerOverride

public void setTreeBuilderErrorHandlerOverride(org.xml.sax.ErrorHandler handler)
Deprecated. For Validator.nu internal use

See Also:
XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)

setFeature

public void setFeature(java.lang.String name,
                       boolean value)
                throws org.xml.sax.SAXNotRecognizedException,
                       org.xml.sax.SAXNotSupportedException
Sets a boolean feature without having to use non-XMLReader setters directly.

The supported features are:

http://xml.org/sax/features/unicode-normalization-checking
setCheckingNormalization
http://validator.nu/features/html4-mode-compatible-with-xhtml1-schemata
setHtml4ModeCompatibleWithXhtml1Schemata
http://validator.nu/features/mapping-lang-to-xml-lang
setMappingLangToXmlLang
http://validator.nu/features/scripting-enabled
setScriptingEnabled

Specified by:
setFeature in interface org.xml.sax.XMLReader
Throws:
org.xml.sax.SAXNotRecognizedException
org.xml.sax.SAXNotSupportedException
See Also:
XMLReader.setFeature(java.lang.String, boolean)
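
The feature URIs above let callers configure the parser while holding only an XMLReader reference. The following is a minimal sketch under that assumption; FeatureSetter and trySetFeature are hypothetical names, and the boolean return value is a convention of this sketch rather than of the API.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

/** Hypothetical helper for setting boolean features by URI. */
final class FeatureSetter {
    // Feature URIs taken from the list above.
    static final String SCRIPTING_ENABLED =
            "http://validator.nu/features/scripting-enabled";
    static final String MAPPING_LANG_TO_XML_LANG =
            "http://validator.nu/features/mapping-lang-to-xml-lang";

    /**
     * Tries to set the feature and reports whether the reader accepted it.
     * Returns false instead of throwing when the URI is unknown or the
     * requested value is unsupported.
     */
    static boolean trySetFeature(XMLReader reader, String uri, boolean value) {
        try {
            reader.setFeature(uri, value);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }
}
```

A caller might use FeatureSetter.trySetFeature(reader, FeatureSetter.SCRIPTING_ENABLED, true) and fall back to default behavior when the result is false, which keeps the same code path usable with readers other than HtmlParser.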

setProperty

public void setProperty(java.lang.String name,
                        java.lang.Object value)
                 throws org.xml.sax.SAXNotRecognizedException,
                        org.xml.sax.SAXNotSupportedException
Sets a non-boolean property without having to use non-XMLReader setters directly.

The properties are mapped as follows:

http://xml.org/sax/properties/lexical-handler
setLexicalHandler
http://validator.nu/properties/content-space-policy
setContentSpacePolicy
http://validator.nu/properties/content-non-xml-char-policy
setContentNonXmlCharPolicy
http://validator.nu/properties/comment-policy
setCommentPolicy
http://validator.nu/properties/xmlns-policy
setXmlnsPolicy
http://validator.nu/properties/name-policy
setNamePolicy
http://validator.nu/properties/streamability-violation-policy
setStreamabilityViolationPolicy
http://validator.nu/properties/document-mode-handler
setDocumentModeHandler
http://validator.nu/properties/doctype-expectation
setDoctypeExpectation
http://validator.nu/properties/xml-policy
setXmlPolicy

Specified by:
setProperty in interface org.xml.sax.XMLReader
Throws:
org.xml.sax.SAXNotRecognizedException
org.xml.sax.SAXNotSupportedException
See Also:
XMLReader.setProperty(java.lang.String, java.lang.Object)
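
As one illustration of the mapping above, a LexicalHandler can be registered through the standard lexical-handler property URI instead of calling setLexicalHandler. The following is a minimal sketch; CommentCollector and collect are hypothetical names, and the handler only records comment text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical helper: gathers comment text reported through LexicalHandler. */
final class CommentCollector extends DefaultHandler2 {
    // Property URI taken from the list above.
    static final String LEXICAL_HANDLER =
            "http://xml.org/sax/properties/lexical-handler";

    private final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    /** Registers the collector via setProperty, parses, and returns the comments seen. */
    static List<String> collect(XMLReader reader, String markup) throws Exception {
        CommentCollector handler = new CommentCollector();
        reader.setContentHandler(handler);
        reader.setProperty(LEXICAL_HANDLER, handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.comments;
    }
}
```

The validator.nu policy properties in the list can be set with the same setProperty call; they expect the corresponding XmlViolationPolicy, DocumentModeHandler or DoctypeExpectation value as the second argument.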

isCheckingNormalization

public boolean isCheckingNormalization()
Indicates whether NFC normalization of source is being checked.

Returns:
true if NFC normalization of source is being checked.
See Also:
nu.validator.htmlparser.impl.Tokenizer#isCheckingNormalization()

setCheckingNormalization

public void setCheckingNormalization(boolean enable)
Toggles the checking of the NFC normalization of source.

Parameters:
enable - true to check normalization
See Also:
nu.validator.htmlparser.impl.Tokenizer#setCheckingNormalization(boolean)

setCommentPolicy

public void setCommentPolicy(XmlViolationPolicy commentPolicy)
Sets the policy for consecutive hyphens in comments.

Parameters:
commentPolicy - the policy
See Also:
Tokenizer.setCommentPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setContentNonXmlCharPolicy

public void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
Sets the policy for non-XML characters except white space.

Parameters:
contentNonXmlCharPolicy - the policy
See Also:
Tokenizer.setContentNonXmlCharPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setContentSpacePolicy

public void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
Sets the policy for non-XML white space.

Parameters:
contentSpacePolicy - the policy
See Also:
Tokenizer.setContentSpacePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

isScriptingEnabled

public boolean isScriptingEnabled()
Whether the parser considers scripting to be enabled for noscript treatment.

Returns:
true if enabled
See Also:
TreeBuilder.isScriptingEnabled()

setScriptingEnabled

public void setScriptingEnabled(boolean scriptingEnabled)
Sets whether the parser considers scripting to be enabled for noscript treatment.

Parameters:
scriptingEnabled - true to enable
See Also:
TreeBuilder.setScriptingEnabled(boolean)

getDoctypeExpectation

public DoctypeExpectation getDoctypeExpectation()
Returns the doctype expectation.

Returns:
the doctypeExpectation

setDoctypeExpectation

public void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
Sets the doctype expectation.

Parameters:
doctypeExpectation - the doctypeExpectation to set
See Also:
TreeBuilder.setDoctypeExpectation(nu.validator.htmlparser.common.DoctypeExpectation)

getDocumentModeHandler

public DocumentModeHandler getDocumentModeHandler()
Returns the document mode handler.

Returns:
the documentModeHandler

setDocumentModeHandler

public void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
Sets the document mode handler.

Parameters:
documentModeHandler - the documentModeHandler to set
See Also:
TreeBuilder.setDocumentModeHandler(nu.validator.htmlparser.common.DocumentModeHandler)

getStreamabilityViolationPolicy

public XmlViolationPolicy getStreamabilityViolationPolicy()
Returns the streamabilityViolationPolicy.

Returns:
the streamabilityViolationPolicy

setStreamabilityViolationPolicy

public void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy)
Sets the streamabilityViolationPolicy.

Parameters:
streamabilityViolationPolicy - the streamabilityViolationPolicy to set

setHtml4ModeCompatibleWithXhtml1Schemata

public void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.

Parameters:
html4ModeCompatibleWithXhtml1Schemata - true to report boolean attributes with the name repeated in the value

getDocumentLocator

public org.xml.sax.Locator getDocumentLocator()
Returns the Locator during parse.

Returns:
the Locator

isHtml4ModeCompatibleWithXhtml1Schemata

public boolean isHtml4ModeCompatibleWithXhtml1Schemata()
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.

Returns:
the html4ModeCompatibleWithXhtml1Schemata

setMappingLangToXmlLang

public void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
Whether lang is mapped to xml:lang.

Parameters:
mappingLangToXmlLang - true to map lang to xml:lang
See Also:
Tokenizer.setMappingLangToXmlLang(boolean)

isMappingLangToXmlLang

public boolean isMappingLangToXmlLang()
Whether lang is mapped to xml:lang.

Returns:
the mappingLangToXmlLang

setXmlnsPolicy

public void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy)
Whether the xmlns attribute on the root element is passed through. (FATAL not allowed.)

Parameters:
xmlnsPolicy - the policy (FATAL not allowed)
See Also:
Tokenizer.setXmlnsPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

getXmlnsPolicy

public XmlViolationPolicy getXmlnsPolicy()
Returns the xmlnsPolicy.

Returns:
the xmlnsPolicy

getLexicalHandler

public org.xml.sax.ext.LexicalHandler getLexicalHandler()
Returns the lexicalHandler.

Returns:
the lexicalHandler

getCommentPolicy

public XmlViolationPolicy getCommentPolicy()
Returns the commentPolicy.

Returns:
the commentPolicy

getContentNonXmlCharPolicy

public XmlViolationPolicy getContentNonXmlCharPolicy()
Returns the contentNonXmlCharPolicy.

Returns:
the contentNonXmlCharPolicy

getContentSpacePolicy

public XmlViolationPolicy getContentSpacePolicy()
Returns the contentSpacePolicy.

Returns:
the contentSpacePolicy

setReportingDoctype

public void setReportingDoctype(boolean reportingDoctype)
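Sets whether the doctype is reported through LexicalHandler.

Parameters:
reportingDoctype - true to report the doctype through LexicalHandler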
See Also:
TreeBuilder.setReportingDoctype(boolean)

isReportingDoctype

public boolean isReportingDoctype()
Returns the reportingDoctype.

Returns:
the reportingDoctype

setErrorProfile

public void setErrorProfile(java.util.HashMap<java.lang.String,java.lang.String> errorProfileMap)
Parameters:
errorProfileMap - the error profile map
See Also:
nu.validator.htmlparser.impl.errorReportingTokenizer#setErrorProfile(set)

setNamePolicy

public void setNamePolicy(XmlViolationPolicy namePolicy)
The policy for non-NCName element and attribute names.

Parameters:
namePolicy - the policy
See Also:
Tokenizer.setNamePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setHeuristics

public void setHeuristics(Heuristics heuristics)
Sets the encoding sniffing heuristics.

Parameters:
heuristics - the heuristics to set
See Also:
nu.validator.htmlparser.impl.Tokenizer#setHeuristics(nu.validator.htmlparser.common.Heuristics)

getHeuristics

public Heuristics getHeuristics()

setXmlPolicy

public void setXmlPolicy(XmlViolationPolicy xmlPolicy)
This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go. This does not affect the streamability policy or doctype reporting.

Parameters:
xmlPolicy - the policy

getNamePolicy

public XmlViolationPolicy getNamePolicy()
The policy for non-NCName element and attribute names.

Returns:
the namePolicy

setBogusXmlnsPolicy

public void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
Deprecated. 

Does nothing.


getBogusXmlnsPolicy

public XmlViolationPolicy getBogusXmlnsPolicy()
Deprecated. 

Returns XmlViolationPolicy.ALTER_INFOSET.

Returns:
XmlViolationPolicy.ALTER_INFOSET

addCharacterHandler

public void addCharacterHandler(CharacterHandler characterHandler)