nu.validator.htmlparser.sax
Class HtmlParser

java.lang.Object
  extended by nu.validator.htmlparser.sax.HtmlParser
All Implemented Interfaces:
XMLReader

public class HtmlParser
extends Object
implements XMLReader

This class implements an HTML5 parser that exposes data through the SAX2 interface.

By default, when constructed without arguments, this parser treats XML 1.0-incompatible infosets as fatal errors in order to adhere strictly to the SAX2 API contract. This corresponds to FATAL as the general XML violation policy. To make the parser support non-conforming HTML fully per the HTML 5 spec, at the cost of potentially violating the SAX2 API contract, set the general XML violation policy to ALLOW. To handle all input without fatal errors and without violating the SAX2 API contract, set the general XML violation policy to ALTER_INFOSET. This makes the parser non-conforming but is probably the most useful setting for most applications.

By default, this parser doesn't do true streaming but buffers everything first. The parser can be made truly streaming by calling setStreamabilityViolationPolicy(XmlViolationPolicy.FATAL). This has the consequence that errors that require non-streamable recovery are treated as fatal.

By default, in order to make the parse events emulate the parse events for a DTDless XML document, the parser does not report the doctype through LexicalHandler. Doctype reporting through LexicalHandler can be turned on by calling setReportingDoctype(true).

Version:
$Id: HtmlParser.java 161 2007-10-02 09:10:00Z hsivonen $
Author:
hsivonen

Field Summary
private  XmlViolationPolicy bogusXmlnsPolicy
           
private  List<CharacterHandler> characterHandlers
           
private  boolean checkingNormalization
           
private  XmlViolationPolicy commentPolicy
           
private  ContentHandler contentHandler
           
private  XmlViolationPolicy contentNonXmlCharPolicy
           
private  XmlViolationPolicy contentSpacePolicy
           
private  DoctypeExpectation doctypeExpectation
           
private  DocumentModeHandler documentModeHandler
           
private  DTDHandler dtdHandler
           
private  EntityResolver entityResolver
           
private  ErrorHandler errorHandler
           
private  boolean html4ModeCompatibleWithXhtml1Schemata
           
private  LexicalHandler lexicalHandler
           
private  boolean mappingLangToXmlLang
           
private  XmlViolationPolicy namePolicy
           
private  boolean reportingDoctype
           
private  SAXStreamer saxStreamer
           
private  SAXTreeBuilder saxTreeBuilder
           
private  boolean scriptingEnabled
           
private  XmlViolationPolicy streamabilityViolationPolicy
           
private  Tokenizer tokenizer
           
private  TreeBuilder<?> treeBuilder
           
private  ErrorHandler treeBuilderErrorHandler
           
private  XmlViolationPolicy xmlnsPolicy
           
 
Constructor Summary
HtmlParser()
          Instantiates the parser with a fatal XML violation policy.
HtmlParser(XmlViolationPolicy xmlPolicy)
          Instantiates the parser with a specific XML violation policy.
 
Method Summary
 void addCharacterHandler(CharacterHandler characterHandler)
           
 XmlViolationPolicy getBogusXmlnsPolicy()
          Returns the bogusXmlnsPolicy.
 XmlViolationPolicy getCommentPolicy()
          Returns the commentPolicy.
 ContentHandler getContentHandler()
           
 XmlViolationPolicy getContentNonXmlCharPolicy()
          Returns the contentNonXmlCharPolicy.
 XmlViolationPolicy getContentSpacePolicy()
          Returns the contentSpacePolicy.
 DoctypeExpectation getDoctypeExpectation()
          Returns the doctype expectation.
 Locator getDocumentLocator()
          Returns the Locator during parse.
 DocumentModeHandler getDocumentModeHandler()
          Returns the document mode handler.
 DTDHandler getDTDHandler()
           
 EntityResolver getEntityResolver()
           
 ErrorHandler getErrorHandler()
           
 boolean getFeature(String name)
          Exposes the configuration of the emulated XML parser as well as boolean-valued configuration without using non-XMLReader getters directly.
 LexicalHandler getLexicalHandler()
          Returns the lexicalHandler.
 XmlViolationPolicy getNamePolicy()
          Returns the policy for non-NCName element and attribute names.
 Object getProperty(String name)
          Allows XMLReader-level access to non-boolean valued getters.
 XmlViolationPolicy getStreamabilityViolationPolicy()
          Returns the streamabilityViolationPolicy.
 XmlViolationPolicy getXmlnsPolicy()
          Returns the xmlnsPolicy.
 boolean isCheckingNormalization()
          Indicates whether NFC normalization of source is being checked.
 boolean isHtml4ModeCompatibleWithXhtml1Schemata()
          Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
 boolean isMappingLangToXmlLang()
          Whether lang is mapped to xml:lang.
 boolean isReportingDoctype()
          Returns the reportingDoctype.
 boolean isScriptingEnabled()
          Whether the parser considers scripting to be enabled for noscript treatment.
private  void lazyInit()
          This class wraps different tree builders depending on configuration.
 void parse(InputSource input)
           
 void parse(String systemId)
           
 void parseFragment(InputSource input, String context)
          Parses a fragment.
 void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
          Sets the policy for forbidden xmlns attributes.
 void setCheckingNormalization(boolean enable)
          Toggles the checking of the NFC normalization of source.
 void setCommentPolicy(XmlViolationPolicy commentPolicy)
          Sets the policy for consecutive hyphens in comments.
 void setContentHandler(ContentHandler handler)
           
 void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
          Sets the policy for non-XML characters except white space.
 void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
          Sets the policy for non-XML white space.
 void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
          Sets the doctype expectation.
 void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
          Sets the document mode handler.
 void setDTDHandler(DTDHandler handler)
           
 void setEntityResolver(EntityResolver resolver)
           
 void setErrorHandler(ErrorHandler handler)
           
 void setFeature(String name, boolean value)
          Sets a boolean feature without having to use non-XMLReader setters directly.
 void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
          Sets whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
 void setLexicalHandler(LexicalHandler handler)
          Sets the lexical handler.
 void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
          Sets whether lang is mapped to xml:lang.
 void setNamePolicy(XmlViolationPolicy namePolicy)
          Sets the policy for non-NCName element and attribute names.
 void setProperty(String name, Object value)
          Sets a non-boolean property without having to use non-XMLReader setters directly.
 void setReportingDoctype(boolean reportingDoctype)
           
 void setScriptingEnabled(boolean scriptingEnabled)
          Sets whether the parser considers scripting to be enabled for noscript treatment.
 void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy)
          Sets the streamabilityViolationPolicy.
 void setTreeBuilderErrorHandlerOverride(ErrorHandler handler)
          Deprecated. For Validator.nu internal use
 void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy)
          Sets whether the xmlns attribute on the root element is passed through.
 void setXmlPolicy(XmlViolationPolicy xmlPolicy)
          This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go.
private  void tokenize(InputSource is)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

tokenizer

private Tokenizer tokenizer

treeBuilder

private TreeBuilder<?> treeBuilder

saxStreamer

private SAXStreamer saxStreamer

saxTreeBuilder

private SAXTreeBuilder saxTreeBuilder

contentHandler

private ContentHandler contentHandler

lexicalHandler

private LexicalHandler lexicalHandler

dtdHandler

private DTDHandler dtdHandler

entityResolver

private EntityResolver entityResolver

errorHandler

private ErrorHandler errorHandler

documentModeHandler

private DocumentModeHandler documentModeHandler

doctypeExpectation

private DoctypeExpectation doctypeExpectation

checkingNormalization

private boolean checkingNormalization

scriptingEnabled

private boolean scriptingEnabled

characterHandlers

private final List<CharacterHandler> characterHandlers

contentSpacePolicy

private XmlViolationPolicy contentSpacePolicy

contentNonXmlCharPolicy

private XmlViolationPolicy contentNonXmlCharPolicy

commentPolicy

private XmlViolationPolicy commentPolicy

namePolicy

private XmlViolationPolicy namePolicy

streamabilityViolationPolicy

private XmlViolationPolicy streamabilityViolationPolicy

html4ModeCompatibleWithXhtml1Schemata

private boolean html4ModeCompatibleWithXhtml1Schemata

mappingLangToXmlLang

private boolean mappingLangToXmlLang

xmlnsPolicy

private XmlViolationPolicy xmlnsPolicy

bogusXmlnsPolicy

private XmlViolationPolicy bogusXmlnsPolicy

reportingDoctype

private boolean reportingDoctype

treeBuilderErrorHandler

private ErrorHandler treeBuilderErrorHandler
Constructor Detail

HtmlParser

public HtmlParser()
Instantiates the parser with a fatal XML violation policy.


HtmlParser

public HtmlParser(XmlViolationPolicy xmlPolicy)
Instantiates the parser with a specific XML violation policy.

Parameters:
xmlPolicy - the policy
Method Detail

lazyInit

private void lazyInit()
This class wraps different tree builders depending on configuration. This method does the work of hiding this from the user of the class.


getContentHandler

public ContentHandler getContentHandler()
Specified by:
getContentHandler in interface XMLReader
See Also:
XMLReader.getContentHandler()

getDTDHandler

public DTDHandler getDTDHandler()
Specified by:
getDTDHandler in interface XMLReader
See Also:
XMLReader.getDTDHandler()

getEntityResolver

public EntityResolver getEntityResolver()
Specified by:
getEntityResolver in interface XMLReader
See Also:
XMLReader.getEntityResolver()

getErrorHandler

public ErrorHandler getErrorHandler()
Specified by:
getErrorHandler in interface XMLReader
See Also:
XMLReader.getErrorHandler()

getFeature

public boolean getFeature(String name)
                   throws SAXNotRecognizedException,
                          SAXNotSupportedException
Exposes the configuration of the emulated XML parser as well as boolean-valued configuration without using non-XMLReader getters directly.

The features are mapped as follows:

http://xml.org/sax/features/external-general-entities
    false
http://xml.org/sax/features/external-parameter-entities
    false
http://xml.org/sax/features/is-standalone
    true
http://xml.org/sax/features/lexical-handler/parameter-entities
    false
http://xml.org/sax/features/namespaces
    true
http://xml.org/sax/features/namespace-prefixes
    false
http://xml.org/sax/features/resolve-dtd-uris
    true
http://xml.org/sax/features/string-interning
    false
http://xml.org/sax/features/unicode-normalization-checking
    isCheckingNormalization
http://xml.org/sax/features/use-attributes2
    false
http://xml.org/sax/features/use-locator2
    false
http://xml.org/sax/features/use-entity-resolver2
    false
http://xml.org/sax/features/validation
    false
http://xml.org/sax/features/xmlns-uris
    false
http://xml.org/sax/features/xml-1.1
    false
http://validator.nu/features/html4-mode-compatible-with-xhtml1-schemata
    isHtml4ModeCompatibleWithXhtml1Schemata
http://validator.nu/features/mapping-lang-to-xml-lang
    isMappingLangToXmlLang
http://validator.nu/features/scripting-enabled
    isScriptingEnabled

Specified by:
getFeature in interface XMLReader
Parameters:
name - feature URI string
Returns:
a value per the list above
Throws:
SAXNotRecognizedException
SAXNotSupportedException
See Also:
XMLReader.getFeature(java.lang.String)
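
Because getFeature can throw SAXNotRecognizedException or SAXNotSupportedException, callers that probe features generically may want to handle both. The following is a minimal sketch written against the XMLReader interface only. The class name FeatureProbe and its helper methods are hypothetical, and the JDK's built-in SAX parser is used as a stand-in reader (an assumption made so the sketch runs without this library on the classpath). An HtmlParser instance could be passed to describe() in its place.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: probes XMLReader features without propagating exceptions. */
class FeatureProbe {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String SCRIPTING_ENABLED = "http://validator.nu/features/scripting-enabled";

    /** Reads a feature, reporting a rejected URI as text instead of an exception. */
    static String describe(XMLReader reader, String featureUri) {
        try {
            return Boolean.toString(reader.getFeature(featureUri));
        } catch (SAXNotRecognizedException e) {
            return "not recognized";
        } catch (SAXNotSupportedException e) {
            return "not supported";
        }
    }

    /** Stand-in XMLReader from the JDK (an assumption; not part of this library). */
    static XMLReader jdkReader() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = jdkReader(); // an HtmlParser instance could be used here instead
        System.out.println(NAMESPACES + " = " + describe(reader, NAMESPACES));
        System.out.println(SCRIPTING_ENABLED + " = " + describe(reader, SCRIPTING_ENABLED));
    }
}
```

With the stand-in reader, the validator.nu feature URI is expected to be rejected, since it is specific to this parser.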

getProperty

public Object getProperty(String name)
                   throws SAXNotRecognizedException,
                          SAXNotSupportedException
Allows XMLReader-level access to non-boolean valued getters.

The properties are mapped as follows:

http://xml.org/sax/properties/document-xml-version
    "1.0"
http://xml.org/sax/properties/lexical-handler
    getLexicalHandler
http://validator.nu/properties/content-space-policy
    getContentSpacePolicy
http://validator.nu/properties/content-non-xml-char-policy
    getContentNonXmlCharPolicy
http://validator.nu/properties/comment-policy
    getCommentPolicy
http://validator.nu/properties/xmlns-policy
    getXmlnsPolicy
http://validator.nu/properties/name-policy
    getNamePolicy
http://validator.nu/properties/streamability-violation-policy
    getStreamabilityViolationPolicy
http://validator.nu/properties/document-mode-handler
    getDocumentModeHandler
http://validator.nu/properties/doctype-expectation
    getDoctypeExpectation
http://xml.org/sax/features/unicode-normalization-checking

Specified by:
getProperty in interface XMLReader
Parameters:
name - property URI string
Returns:
a value per the list above
Throws:
SAXNotRecognizedException
SAXNotSupportedException
See Also:
XMLReader.getProperty(java.lang.String)

parse

public void parse(InputSource input)
           throws IOException,
                  SAXException
Specified by:
parse in interface XMLReader
Throws:
IOException
SAXException
See Also:
XMLReader.parse(org.xml.sax.InputSource)
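
Since this class implements XMLReader, a parse is driven the usual SAX way: set a ContentHandler, then call parse with an InputSource. The sketch below illustrates that flow. ElementNameCollector and its methods are hypothetical names, and the JDK's built-in SAX parser stands in for the reader (an assumption, so that the sketch runs without this library). An HtmlParser instance could be passed to collect() instead, in which case the input would not need to be well-formed XML.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects element local names in document order. */
class ElementNameCollector extends DefaultHandler {
    final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        names.add(localName);
    }

    /** Runs any XMLReader over the markup and returns the element names seen. */
    static List<String> collect(XMLReader reader, String markup)
            throws IOException, SAXException {
        ElementNameCollector handler = new ElementNameCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.names;
    }

    /** Stand-in XMLReader from the JDK (an assumption; not part of this library). */
    static XMLReader jdkReader() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so localName is populated
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        // prints [html, body, p]
        System.out.println(collect(jdkReader(), "<html><body><p>Hello</p></body></html>"));
    }
}
```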

parseFragment

public void parseFragment(InputSource input,
                          String context)
                   throws IOException,
                          SAXException
Parses a fragment.

Parameters:
input - the input to parse
context - the name of the context element
Throws:
IOException
SAXException

tokenize

private void tokenize(InputSource is)
               throws SAXException,
                      IOException,
                      MalformedURLException
Parameters:
is - the input source to tokenize
Throws:
SAXException
IOException
MalformedURLException

parse

public void parse(String systemId)
           throws IOException,
                  SAXException
Specified by:
parse in interface XMLReader
Throws:
IOException
SAXException
See Also:
XMLReader.parse(java.lang.String)

setContentHandler

public void setContentHandler(ContentHandler handler)
Specified by:
setContentHandler in interface XMLReader
See Also:
XMLReader.setContentHandler(org.xml.sax.ContentHandler)

setLexicalHandler

public void setLexicalHandler(LexicalHandler handler)
Sets the lexical handler.

Parameters:
handler - the handler

setDTDHandler

public void setDTDHandler(DTDHandler handler)
Specified by:
setDTDHandler in interface XMLReader
See Also:
XMLReader.setDTDHandler(org.xml.sax.DTDHandler)

setEntityResolver

public void setEntityResolver(EntityResolver resolver)
Specified by:
setEntityResolver in interface XMLReader
See Also:
XMLReader.setEntityResolver(org.xml.sax.EntityResolver)

setErrorHandler

public void setErrorHandler(ErrorHandler handler)
Specified by:
setErrorHandler in interface XMLReader
See Also:
XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)
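
Under the SAX contract, problems are reported to the ErrorHandler as warnings, errors, or fatal errors, so an application that wants to inspect them needs its own handler. The following is a minimal sketch of such a handler. CollectingErrorHandler and its helper methods are hypothetical names, and the JDK's built-in SAX parser is used as a stand-in reader (an assumption, so that the sketch runs without this library). The same handler could be passed to setErrorHandler on an HtmlParser instance.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

/** Hypothetical handler: records every reported problem; only fatal errors stop the parse. */
class CollectingErrorHandler implements ErrorHandler {
    final List<String> messages = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        messages.add("warning: " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) {
        messages.add("error: " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        messages.add("fatal: " + e.getMessage());
        throw e; // stop the parse at the first fatal error
    }

    /** Returns true if the reader got through the markup without a fatal error. */
    static boolean parses(XMLReader reader, String markup, CollectingErrorHandler handler)
            throws IOException {
        reader.setErrorHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    /** Stand-in XMLReader from the JDK (an assumption; not part of this library). */
    static XMLReader jdkReader() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        boolean ok = parses(jdkReader(), "<p><b>text</p>", handler); // mismatched tags
        System.out.println("parsed cleanly: " + ok);
        handler.messages.forEach(System.out::println);
    }
}
```

With the stand-in XML reader, the mismatched tags are a fatal error. How an HtmlParser instance reports the same input depends on the configured policies described in the class overview.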

setTreeBuilderErrorHandlerOverride

public void setTreeBuilderErrorHandlerOverride(ErrorHandler handler)
Deprecated. For Validator.nu internal use

See Also:
XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)

setFeature

public void setFeature(String name,
                       boolean value)
                throws SAXNotRecognizedException,
                       SAXNotSupportedException
Sets a boolean feature without having to use non-XMLReader setters directly.

The supported features are:

http://xml.org/sax/features/unicode-normalization-checking
    setCheckingNormalization
http://validator.nu/features/html4-mode-compatible-with-xhtml1-schemata
    setHtml4ModeCompatibleWithXhtml1Schemata
http://validator.nu/features/mapping-lang-to-xml-lang
    setMappingLangToXmlLang
http://validator.nu/features/scripting-enabled
    setScriptingEnabled

Specified by:
setFeature in interface XMLReader
Throws:
SAXNotRecognizedException
SAXNotSupportedException
See Also:
XMLReader.setFeature(java.lang.String, boolean)

setProperty

public void setProperty(String name,
                        Object value)
                 throws SAXNotRecognizedException,
                        SAXNotSupportedException
Sets a non-boolean property without having to use non-XMLReader setters directly.

The supported properties are:

http://xml.org/sax/properties/lexical-handler
    setLexicalHandler
http://validator.nu/properties/content-space-policy
    setContentSpacePolicy
http://validator.nu/properties/content-non-xml-char-policy
    setContentNonXmlCharPolicy
http://validator.nu/properties/comment-policy
    setCommentPolicy
http://validator.nu/properties/xmlns-policy
    setXmlnsPolicy
http://validator.nu/properties/name-policy
    setNamePolicy
http://validator.nu/properties/streamability-violation-policy
    setStreamabilityViolationPolicy
http://validator.nu/properties/document-mode-handler
    setDocumentModeHandler
http://validator.nu/properties/doctype-expectation
    setDoctypeExpectation
http://validator.nu/properties/xml-policy
    setXmlPolicy

Specified by:
setProperty in interface XMLReader
Throws:
SAXNotRecognizedException
SAXNotSupportedException
See Also:
XMLReader.setProperty(java.lang.String, java.lang.Object)
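
The lexical-handler property is the XMLReader-level route to what setLexicalHandler does directly. The sketch below registers a LexicalHandler through setProperty and logs doctype and comment events. LexicalEventLog and its methods are hypothetical names, and the JDK's built-in SAX parser is used as a stand-in reader (an assumption, so that the sketch runs without this library). With an HtmlParser instance, the class overview indicates that the doctype event is reported only after setReportingDoctype(true) has been called.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical handler: logs doctype and comment events from a LexicalHandler. */
class LexicalEventLog extends DefaultHandler2 {
    static final String LEXICAL_HANDLER = "http://xml.org/sax/properties/lexical-handler";

    final List<String> events = new ArrayList<>();

    @Override
    public void startDTD(String name, String publicId, String systemId) {
        events.add("doctype:" + name);
    }

    @Override
    public void comment(char[] ch, int start, int length) {
        events.add("comment:" + new String(ch, start, length).trim());
    }

    /** Registers the log through setProperty, parses, and returns the events seen. */
    static List<String> record(XMLReader reader, String markup)
            throws IOException, SAXException {
        LexicalEventLog log = new LexicalEventLog();
        reader.setContentHandler(log);
        reader.setProperty(LEXICAL_HANDLER, log);
        reader.parse(new InputSource(new StringReader(markup)));
        return log.events;
    }

    /** Stand-in XMLReader from the JDK (an assumption; not part of this library). */
    static XMLReader jdkReader() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        String markup = "<!DOCTYPE html><html><!-- note --><body/></html>";
        // prints [doctype:html, comment:note]
        System.out.println(record(jdkReader(), markup));
    }
}
```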

isCheckingNormalization

public boolean isCheckingNormalization()
Indicates whether NFC normalization of source is being checked.

Returns:
true if NFC normalization of source is being checked.
See Also:
Tokenizer.isCheckingNormalization()

setCheckingNormalization

public void setCheckingNormalization(boolean enable)
Toggles the checking of the NFC normalization of source.

Parameters:
enable - true to check normalization
See Also:
Tokenizer.setCheckingNormalization(boolean)

setCommentPolicy

public void setCommentPolicy(XmlViolationPolicy commentPolicy)
Sets the policy for consecutive hyphens in comments.

Parameters:
commentPolicy - the policy
See Also:
Tokenizer.setCommentPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setContentNonXmlCharPolicy

public void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
Sets the policy for non-XML characters except white space.

Parameters:
contentNonXmlCharPolicy - the policy
See Also:
Tokenizer.setContentNonXmlCharPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setContentSpacePolicy

public void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
Sets the policy for non-XML white space.

Parameters:
contentSpacePolicy - the policy
See Also:
Tokenizer.setContentSpacePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

isScriptingEnabled

public boolean isScriptingEnabled()
Whether the parser considers scripting to be enabled for noscript treatment.

Returns:
true if enabled
See Also:
TreeBuilder.isScriptingEnabled()

setScriptingEnabled

public void setScriptingEnabled(boolean scriptingEnabled)
Sets whether the parser considers scripting to be enabled for noscript treatment.

Parameters:
scriptingEnabled - true to enable
See Also:
TreeBuilder.setScriptingEnabled(boolean)

getDoctypeExpectation

public DoctypeExpectation getDoctypeExpectation()
Returns the doctype expectation.

Returns:
the doctypeExpectation

setDoctypeExpectation

public void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
Sets the doctype expectation.

Parameters:
doctypeExpectation - the doctypeExpectation to set
See Also:
TreeBuilder.setDoctypeExpectation(nu.validator.htmlparser.common.DoctypeExpectation)

getDocumentModeHandler

public DocumentModeHandler getDocumentModeHandler()
Returns the document mode handler.

Returns:
the documentModeHandler

setDocumentModeHandler

public void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
Sets the document mode handler.

Parameters:
documentModeHandler - the documentModeHandler to set
See Also:
TreeBuilder.setDocumentModeHandler(nu.validator.htmlparser.common.DocumentModeHandler)

getStreamabilityViolationPolicy

public XmlViolationPolicy getStreamabilityViolationPolicy()
Returns the streamabilityViolationPolicy.

Returns:
the streamabilityViolationPolicy

setStreamabilityViolationPolicy

public void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy)
Sets the streamabilityViolationPolicy.

Parameters:
streamabilityViolationPolicy - the streamabilityViolationPolicy to set

setHtml4ModeCompatibleWithXhtml1Schemata

public void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
Sets whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.

Parameters:
html4ModeCompatibleWithXhtml1Schemata - true to repeat the attribute name in the value

getDocumentLocator

public Locator getDocumentLocator()
Returns the Locator during parse.

Returns:
the Locator

isHtml4ModeCompatibleWithXhtml1Schemata

public boolean isHtml4ModeCompatibleWithXhtml1Schemata()
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.

Returns:
the html4ModeCompatibleWithXhtml1Schemata

setMappingLangToXmlLang

public void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
Sets whether lang is mapped to xml:lang.

Parameters:
mappingLangToXmlLang - true to map lang to xml:lang
See Also:
Tokenizer.setMappingLangToXmlLang(boolean)

isMappingLangToXmlLang

public boolean isMappingLangToXmlLang()
Whether lang is mapped to xml:lang.

Returns:
the mappingLangToXmlLang

setXmlnsPolicy

public void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy)
Sets whether the xmlns attribute on the root element is passed through. (FATAL not allowed.)

Parameters:
xmlnsPolicy - the policy
See Also:
Tokenizer.setXmlnsPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

getXmlnsPolicy

public XmlViolationPolicy getXmlnsPolicy()
Returns the xmlnsPolicy.

Returns:
the xmlnsPolicy

getLexicalHandler

public LexicalHandler getLexicalHandler()
Returns the lexicalHandler.

Returns:
the lexicalHandler

getCommentPolicy

public XmlViolationPolicy getCommentPolicy()
Returns the commentPolicy.

Returns:
the commentPolicy

getContentNonXmlCharPolicy

public XmlViolationPolicy getContentNonXmlCharPolicy()
Returns the contentNonXmlCharPolicy.

Returns:
the contentNonXmlCharPolicy

getContentSpacePolicy

public XmlViolationPolicy getContentSpacePolicy()
Returns the contentSpacePolicy.

Returns:
the contentSpacePolicy

setReportingDoctype

public void setReportingDoctype(boolean reportingDoctype)
Sets whether the doctype is reported through LexicalHandler.

Parameters:
reportingDoctype - true to report the doctype
See Also:
TreeBuilder.setReportingDoctype(boolean)

isReportingDoctype

public boolean isReportingDoctype()
Returns the reportingDoctype.

Returns:
the reportingDoctype

setNamePolicy

public void setNamePolicy(XmlViolationPolicy namePolicy)
Sets the policy for non-NCName element and attribute names.

Parameters:
namePolicy - the policy
See Also:
Tokenizer.setNamePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setXmlPolicy

public void setXmlPolicy(XmlViolationPolicy xmlPolicy)
This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go. This does not affect the streamability policy or doctype reporting.

Parameters:
xmlPolicy - the policy

getNamePolicy

public XmlViolationPolicy getNamePolicy()
Returns the policy for non-NCName element and attribute names.

Returns:
the namePolicy

setBogusXmlnsPolicy

public void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
Sets the policy for forbidden xmlns attributes.

Parameters:
bogusXmlnsPolicy - the policy
See Also:
Tokenizer.setBogusXmlnsPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

getBogusXmlnsPolicy

public XmlViolationPolicy getBogusXmlnsPolicy()
Returns the bogusXmlnsPolicy.

Returns:
the bogusXmlnsPolicy

addCharacterHandler

public void addCharacterHandler(CharacterHandler characterHandler)