nu.validator.htmlparser.dom
Class HtmlDocumentBuilder

java.lang.Object
  extended by javax.xml.parsers.DocumentBuilder
      extended by nu.validator.htmlparser.dom.HtmlDocumentBuilder

public class HtmlDocumentBuilder
extends javax.xml.parsers.DocumentBuilder

This class implements an HTML5 parser that exposes data through the DOM interface.

By default, when constructed without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. To make the parser support non-conforming HTML fully per the HTML5 specification, at the risk of violating the SAX2 API contract, set the general XML violation policy to ALLOW. Note that ALLOW does not work with a standard DOM implementation. To treat XML 1.0 infoset violations as fatal, set the general XML violation policy to FATAL.

The doctype is not represented in the tree.

The document mode is stored on the document node as user data: a DocumentMode object under the key nu.validator.document-mode.

The form pointer is also stored as user data with the key nu.validator.form-pointer.
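
As a minimal sketch of reading these values back, the helper below looks up both keys with the standard DOM Level 3 method Node.getUserData(String). The class name ParserUserData and its method names are hypothetical, and the sketch assumes that the form pointer is attached to the document node in the same way as the document mode. It returns plain Object so that it compiles without the parser on the classpath; with the parser available, the document-mode value can be cast to DocumentMode.

```java
import org.w3c.dom.Document;

// Hypothetical helper; only the two key strings come from the documentation above.
final class ParserUserData {
    static final String DOCUMENT_MODE_KEY = "nu.validator.document-mode";
    static final String FORM_POINTER_KEY = "nu.validator.form-pointer";

    private ParserUserData() {}

    // Returns the document-mode object, or null if the key was never set.
    static Object documentMode(Document document) {
        return document.getUserData(DOCUMENT_MODE_KEY);
    }

    // Assumption: the form pointer is stored on the document node as well.
    static Object formPointer(Document document) {
        return document.getUserData(FORM_POINTER_KEY);
    }
}
```

Both methods return null when the key is absent, so callers should check the result before casting.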

Version:
$Id$
Author:
hsivonen

Constructor Summary
HtmlDocumentBuilder()
          Instantiates the document builder with the JAXP DOM implementation and the infoset-altering XML violation policy.
HtmlDocumentBuilder(org.w3c.dom.DOMImplementation implementation)
          Instantiates the document builder with a specific DOM implementation and the infoset-altering XML violation policy.
HtmlDocumentBuilder(org.w3c.dom.DOMImplementation implementation, XmlViolationPolicy xmlPolicy)
          Instantiates the document builder with a specific DOM implementation and XML violation policy.
HtmlDocumentBuilder(XmlViolationPolicy xmlPolicy)
          Instantiates the document builder with the JAXP DOM implementation and a specific XML violation policy.
 
Method Summary
 void addCharacterHandler(CharacterHandler characterHandler)
           
 XmlViolationPolicy getBogusXmlnsPolicy()
          Deprecated.  
 XmlViolationPolicy getCommentPolicy()
          Returns the commentPolicy.
 XmlViolationPolicy getContentNonXmlCharPolicy()
          Returns the contentNonXmlCharPolicy.
 XmlViolationPolicy getContentSpacePolicy()
          Returns the contentSpacePolicy.
 DoctypeExpectation getDoctypeExpectation()
          Returns the doctype expectation.
 org.xml.sax.Locator getDocumentLocator()
          Returns the Locator during parse.
 DocumentModeHandler getDocumentModeHandler()
          Returns the document mode handler.
 org.w3c.dom.DOMImplementation getDOMImplementation()
          Returns the DOM implementation.
 Heuristics getHeuristics()
           
 XmlViolationPolicy getNamePolicy()
          Returns the policy for non-NCName element and attribute names.
 XmlViolationPolicy getStreamabilityViolationPolicy()
          Returns the streamabilityViolationPolicy.
 XmlViolationPolicy getXmlnsPolicy()
          Returns the xmlnsPolicy.
 boolean isCheckingNormalization()
          Indicates whether NFC normalization of source is being checked.
 boolean isHtml4ModeCompatibleWithXhtml1Schemata()
          Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
 boolean isMappingLangToXmlLang()
          Whether lang is mapped to xml:lang.
 boolean isNamespaceAware()
          Returns true.
 boolean isReportingDoctype()
          Returns the reportingDoctype.
 boolean isScriptingEnabled()
          Whether the parser considers scripting to be enabled for noscript treatment.
 boolean isValidating()
          Returns false.
 org.w3c.dom.Document newDocument()
          For API compatibility.
 org.w3c.dom.Document parse(org.xml.sax.InputSource is)
          Parses a document from a SAX InputSource.
 org.w3c.dom.DocumentFragment parseFragment(org.xml.sax.InputSource is, java.lang.String context)
          Parses a document fragment from a SAX InputSource.
 void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
          Deprecated.  
 void setCheckingNormalization(boolean enable)
          Toggles the checking of the NFC normalization of source.
 void setCommentPolicy(XmlViolationPolicy commentPolicy)
          Sets the policy for consecutive hyphens in comments.
 void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
          Sets the policy for non-XML characters except white space.
 void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
          Sets the policy for non-XML white space.
 void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
          Sets the doctype expectation.
 void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
          Sets the document mode handler.
 void setEntityResolver(org.xml.sax.EntityResolver resolver)
          Sets the entity resolver for URI-only inputs.
 void setErrorHandler(org.xml.sax.ErrorHandler errorHandler)
          Sets the error handler.
 void setHeuristics(Heuristics heuristics)
          Sets the encoding sniffing heuristics.
 void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
          Sets whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
 void setIgnoringComments(boolean ignoreComments)
          Sets whether comment nodes appear in the tree.
 void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
          Sets whether lang is mapped to xml:lang.
 void setNamePolicy(XmlViolationPolicy namePolicy)
          Sets the policy for non-NCName element and attribute names.
 void setReportingDoctype(boolean reportingDoctype)
           
 void setScriptingEnabled(boolean scriptingEnabled)
          Sets whether the parser considers scripting to be enabled for noscript treatment.
 void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy)
          Sets the streamabilityViolationPolicy.
 void setTransitionHander(TransitionHandler handler)
           
 void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy)
          Sets whether the xmlns attribute on the root element is passed through.
 void setXmlPolicy(XmlViolationPolicy xmlPolicy)
          This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go.
 
Methods inherited from class javax.xml.parsers.DocumentBuilder
getSchema, isXIncludeAware, parse, parse, parse, parse, reset
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlDocumentBuilder

public HtmlDocumentBuilder(org.w3c.dom.DOMImplementation implementation,
                           XmlViolationPolicy xmlPolicy)
Instantiates the document builder with a specific DOM implementation and XML violation policy.

Parameters:
implementation - the DOM implementation
xmlPolicy - the policy

HtmlDocumentBuilder

public HtmlDocumentBuilder(org.w3c.dom.DOMImplementation implementation)
Instantiates the document builder with a specific DOM implementation and the infoset-altering XML violation policy.

Parameters:
implementation - the DOM implementation

HtmlDocumentBuilder

public HtmlDocumentBuilder()
Instantiates the document builder with the JAXP DOM implementation and the infoset-altering XML violation policy.


HtmlDocumentBuilder

public HtmlDocumentBuilder(XmlViolationPolicy xmlPolicy)
Instantiates the document builder with the JAXP DOM implementation and a specific XML violation policy.

Parameters:
xmlPolicy - the policy

Method Detail

getDOMImplementation

public org.w3c.dom.DOMImplementation getDOMImplementation()
Returns the DOM implementation.

Specified by:
getDOMImplementation in class javax.xml.parsers.DocumentBuilder
Returns:
the DOM implementation
See Also:
DocumentBuilder.getDOMImplementation()

isNamespaceAware

public boolean isNamespaceAware()
Returns true.

Specified by:
isNamespaceAware in class javax.xml.parsers.DocumentBuilder
Returns:
true
See Also:
DocumentBuilder.isNamespaceAware()
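
Because the builder reports itself as namespace-aware, lookups on the resulting tree are safer with the namespace-aware DOM methods. The following is a small sketch, not part of this API: the class name NamespaceAwareLookup is hypothetical, and the wildcard "*" is used so that the code makes no assumption about which namespace URI the parser assigns to elements.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for querying a namespace-aware DOM tree.
final class NamespaceAwareLookup {
    private NamespaceAwareLookup() {}

    // Counts elements with the given local name in any namespace
    // ("*" is the DOM wildcard for the namespace URI argument).
    static int countByLocalName(Document document, String localName) {
        NodeList matches = document.getElementsByTagNameNS("*", localName);
        return matches.getLength();
    }
}
```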

isValidating

public boolean isValidating()
Returns false.

Specified by:
isValidating in class javax.xml.parsers.DocumentBuilder
Returns:
false
See Also:
DocumentBuilder.isValidating()

newDocument

public org.w3c.dom.Document newDocument()
For API compatibility.

Specified by:
newDocument in class javax.xml.parsers.DocumentBuilder
See Also:
DocumentBuilder.newDocument()

parse

public org.w3c.dom.Document parse(org.xml.sax.InputSource is)
                           throws org.xml.sax.SAXException,
                                  java.io.IOException
Parses a document from a SAX InputSource.

Specified by:
parse in class javax.xml.parsers.DocumentBuilder
Parameters:
is - the source
Returns:
the parsed document
Throws:
org.xml.sax.SAXException - if parsing fails
java.io.IOException - if an I/O error occurs
See Also:
DocumentBuilder.parse(org.xml.sax.InputSource)
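
As an illustration of calling this method, the sketch below wraps an in-memory string in a SAX InputSource and hands it to a builder. The class name ParseFromString is hypothetical. The helper is typed against javax.xml.parsers.DocumentBuilder, the superclass of HtmlDocumentBuilder, so it compiles without the parser on the classpath; in practice an HtmlDocumentBuilder instance would be passed as the first argument.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper; pass e.g. new HtmlDocumentBuilder() as the builder.
final class ParseFromString {
    private ParseFromString() {}

    // Parses markup held in a String. A character stream is used, so the
    // builder does not need to detect the byte encoding.
    static Document parse(DocumentBuilder builder, String markup)
            throws SAXException, IOException {
        InputSource source = new InputSource(new StringReader(markup));
        return builder.parse(source);
    }
}
```

Supplying a Reader rather than a byte stream sidesteps encoding detection, which is convenient for tests and for markup that is already decoded.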

parseFragment

public org.w3c.dom.DocumentFragment parseFragment(org.xml.sax.InputSource is,
                                                  java.lang.String context)
                                           throws java.io.IOException,
                                                  org.xml.sax.SAXException
Parses a document fragment from a SAX InputSource.

Parameters:
is - the source
context - the context element name
Returns:
the parsed document fragment
Throws:
org.xml.sax.SAXException - if parsing fails
java.io.IOException - if an I/O error occurs

setEntityResolver

public void setEntityResolver(org.xml.sax.EntityResolver resolver)
Sets the entity resolver for URI-only inputs.

Specified by:
setEntityResolver in class javax.xml.parsers.DocumentBuilder
Parameters:
resolver - the resolver
See Also:
DocumentBuilder.setEntityResolver(org.xml.sax.EntityResolver)
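
A resolver is an ordinary org.xml.sax.EntityResolver. The sketch below serves known system identifiers from an in-memory map, which can be handy in tests. The class name InMemoryEntityResolver is hypothetical, and reading "URI-only inputs" as InputSource objects that carry a system ID but no stream or reader is an assumption. Returning null follows the general SAX convention of deferring to default resolution; how this builder treats a null result is not stated here.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver that answers from a fixed map of system ID to content.
final class InMemoryEntityResolver implements EntityResolver {
    private final Map<String, String> contentBySystemId;

    InMemoryEntityResolver(Map<String, String> contentBySystemId) {
        // Defensive copy so later changes to the caller's map have no effect.
        this.contentBySystemId = new HashMap<>(contentBySystemId);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String content = contentBySystemId.get(systemId);
        if (content == null) {
            return null; // unknown ID: defer to default resolution (SAX convention)
        }
        InputSource source = new InputSource(new StringReader(content));
        source.setSystemId(systemId); // keep the original ID for error reporting
        return source;
    }
}
```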

setErrorHandler

public void setErrorHandler(org.xml.sax.ErrorHandler errorHandler)
Sets the error handler.

Specified by:
setErrorHandler in class javax.xml.parsers.DocumentBuilder
Parameters:
errorHandler - the handler
See Also:
DocumentBuilder.setErrorHandler(org.xml.sax.ErrorHandler)
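
The handler is a standard org.xml.sax.ErrorHandler. As a sketch under that assumption only, the class below records every reported problem with its position and rethrows fatal errors so that parsing stops. The class name CollectingErrorHandler and the message format are hypothetical and not part of this API.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical handler that collects parser messages for later inspection.
final class CollectingErrorHandler implements ErrorHandler {
    private final List<String> messages = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        record("warning", e);
    }

    @Override
    public void error(SAXParseException e) {
        record("error", e);
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXParseException {
        record("fatal", e);
        throw e; // abort the parse on fatal errors
    }

    // Line and column are -1 when the parser supplied no location.
    private void record(String level, SAXParseException e) {
        messages.add(level + " at " + e.getLineNumber() + ":"
                + e.getColumnNumber() + ": " + e.getMessage());
    }

    List<String> messages() {
        return messages;
    }
}
```

An instance would be passed to setErrorHandler before calling parse, and messages() inspected afterwards.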

setTransitionHander

public void setTransitionHander(TransitionHandler handler)

isCheckingNormalization

public boolean isCheckingNormalization()
Indicates whether NFC normalization of source is being checked.

Returns:
true if NFC normalization of source is being checked.
See Also:
nu.validator.htmlparser.impl.Tokenizer#isCheckingNormalization()

setCheckingNormalization

public void setCheckingNormalization(boolean enable)
Toggles the checking of the NFC normalization of source.

Parameters:
enable - true to check normalization
See Also:
nu.validator.htmlparser.impl.Tokenizer#setCheckingNormalization(boolean)

setCommentPolicy

public void setCommentPolicy(XmlViolationPolicy commentPolicy)
Sets the policy for consecutive hyphens in comments.

Parameters:
commentPolicy - the policy
See Also:
Tokenizer.setCommentPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setContentNonXmlCharPolicy

public void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
Sets the policy for non-XML characters except white space.

Parameters:
contentNonXmlCharPolicy - the policy
See Also:
Tokenizer.setContentNonXmlCharPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setContentSpacePolicy

public void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
Sets the policy for non-XML white space.

Parameters:
contentSpacePolicy - the policy
See Also:
Tokenizer.setContentSpacePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

isScriptingEnabled

public boolean isScriptingEnabled()
Whether the parser considers scripting to be enabled for noscript treatment.

Returns:
true if enabled
See Also:
TreeBuilder.isScriptingEnabled()

setScriptingEnabled

public void setScriptingEnabled(boolean scriptingEnabled)
Sets whether the parser considers scripting to be enabled for noscript treatment.

Parameters:
scriptingEnabled - true to enable
See Also:
TreeBuilder.setScriptingEnabled(boolean)

getDoctypeExpectation

public DoctypeExpectation getDoctypeExpectation()
Returns the doctype expectation.

Returns:
the doctypeExpectation

setDoctypeExpectation

public void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
Sets the doctype expectation.

Parameters:
doctypeExpectation - the doctypeExpectation to set
See Also:
TreeBuilder.setDoctypeExpectation(nu.validator.htmlparser.common.DoctypeExpectation)

getDocumentModeHandler

public DocumentModeHandler getDocumentModeHandler()
Returns the document mode handler.

Returns:
the documentModeHandler

setDocumentModeHandler

public void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
Sets the document mode handler.

Parameters:
documentModeHandler - the documentModeHandler to set
See Also:
TreeBuilder.setDocumentModeHandler(nu.validator.htmlparser.common.DocumentModeHandler)

getStreamabilityViolationPolicy

public XmlViolationPolicy getStreamabilityViolationPolicy()
Returns the streamabilityViolationPolicy.

Returns:
the streamabilityViolationPolicy

setStreamabilityViolationPolicy

public void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy)
Sets the streamabilityViolationPolicy.

Parameters:
streamabilityViolationPolicy - the streamabilityViolationPolicy to set

setHtml4ModeCompatibleWithXhtml1Schemata

public void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
Sets whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.

Parameters:
html4ModeCompatibleWithXhtml1Schemata - true to repeat the attribute name in the value

getDocumentLocator

public org.xml.sax.Locator getDocumentLocator()
Returns the Locator during parse.

Returns:
the Locator

isHtml4ModeCompatibleWithXhtml1Schemata

public boolean isHtml4ModeCompatibleWithXhtml1Schemata()
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.

Returns:
the html4ModeCompatibleWithXhtml1Schemata

setMappingLangToXmlLang

public void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
Sets whether lang is mapped to xml:lang.

Parameters:
mappingLangToXmlLang - true to map lang to xml:lang
See Also:
Tokenizer.setMappingLangToXmlLang(boolean)

isMappingLangToXmlLang

public boolean isMappingLangToXmlLang()
Whether lang is mapped to xml:lang.

Returns:
the mappingLangToXmlLang

setXmlnsPolicy

public void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy)
Sets whether the xmlns attribute on the root element is passed through. (FATAL not allowed.)

Parameters:
xmlnsPolicy - the policy (FATAL is not allowed)
See Also:
Tokenizer.setXmlnsPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

getXmlnsPolicy

public XmlViolationPolicy getXmlnsPolicy()
Returns the xmlnsPolicy.

Returns:
the xmlnsPolicy

getCommentPolicy

public XmlViolationPolicy getCommentPolicy()
Returns the commentPolicy.

Returns:
the commentPolicy

getContentNonXmlCharPolicy

public XmlViolationPolicy getContentNonXmlCharPolicy()
Returns the contentNonXmlCharPolicy.

Returns:
the contentNonXmlCharPolicy

getContentSpacePolicy

public XmlViolationPolicy getContentSpacePolicy()
Returns the contentSpacePolicy.

Returns:
the contentSpacePolicy

setReportingDoctype

public void setReportingDoctype(boolean reportingDoctype)
Parameters:
reportingDoctype -
See Also:
TreeBuilder.setReportingDoctype(boolean)

isReportingDoctype

public boolean isReportingDoctype()
Returns the reportingDoctype.

Returns:
the reportingDoctype

setNamePolicy

public void setNamePolicy(XmlViolationPolicy namePolicy)
Sets the policy for non-NCName element and attribute names.

Parameters:
namePolicy - the policy
See Also:
Tokenizer.setNamePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setHeuristics

public void setHeuristics(Heuristics heuristics)
Sets the encoding sniffing heuristics.

Parameters:
heuristics - the heuristics to set
See Also:
nu.validator.htmlparser.impl.Tokenizer#setHeuristics(nu.validator.htmlparser.common.Heuristics)

getHeuristics

public Heuristics getHeuristics()

setXmlPolicy

public void setXmlPolicy(XmlViolationPolicy xmlPolicy)
This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go. This does not affect the streamability policy or doctype reporting.

Parameters:
xmlPolicy - the policy

getNamePolicy

public XmlViolationPolicy getNamePolicy()
Returns the policy for non-NCName element and attribute names.

Returns:
the namePolicy

setBogusXmlnsPolicy

public void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
Deprecated. 

Does nothing.


getBogusXmlnsPolicy

public XmlViolationPolicy getBogusXmlnsPolicy()
Deprecated. 

Returns XmlViolationPolicy.ALTER_INFOSET.

Returns:
XmlViolationPolicy.ALTER_INFOSET

addCharacterHandler

public void addCharacterHandler(CharacterHandler characterHandler)

setIgnoringComments

public void setIgnoringComments(boolean ignoreComments)
Sets whether comment nodes appear in the tree.

Parameters:
ignoreComments - true to ignore comments
See Also:
TreeBuilder.setIgnoringComments(boolean)