nu.validator.htmlparser.dom
Class HtmlDocumentBuilder

java.lang.Object
  extended by javax.xml.parsers.DocumentBuilder
      extended by nu.validator.htmlparser.dom.HtmlDocumentBuilder

public class HtmlDocumentBuilder
extends DocumentBuilder

This class implements an HTML5 parser that exposes data through the DOM interface.

By default (that is, when the constructor without arguments is used), this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. To make the parser support non-conforming HTML fully per the HTML 5 spec, at the cost of potentially violating the SAX2 API contract, set the general XML violation policy to ALLOW. This does not work with a standard DOM implementation. To treat XML 1.0 infoset violations as fatal, set the general XML violation policy to FATAL.
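
One reason ALLOW does not fit a standard DOM implementation is that standard DOM factory methods reject names that are not XML names. The following is a minimal sketch that uses only the JDK's JAXP DOM, not this parser; the class name DomNameCheck and the sample names are illustrative assumptions.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

/** Hypothetical helper: probes which element names the JDK's DOM accepts. */
final class DomNameCheck {

    /** Returns true if createElement on a JAXP-created document accepts the name. */
    static boolean acceptsElementName(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            doc.createElement(name);
            return true;
        } catch (DOMException e) {
            // The DOM specification prescribes INVALID_CHARACTER_ERR for non-XML names.
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // An ordinary HTML element name.
        System.out.println("div accepted: " + acceptsElementName("div"));
        // A name containing a quotation mark, which is not an XML name.
        System.out.println("a\"b accepted: " + acceptsElementName("a\"b"));
    }
}
```

Under ALTER_INFOSET the parser is described as coercing such input into an XML 1.0-compatible form before it reaches the DOM; how exactly it rewrites a given name is not shown here.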

The doctype is not represented in the tree.

The document mode is stored on the document node as user data: a DocumentMode object under the key nu.validator.document-mode.

The form pointer is also stored as user data with the key nu.validator.form-pointer.
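
The two user data keys above can be read back through the standard DOM Level 3 getUserData method. The sketch below is a hedged illustration: the class ParserUserData and its method names are hypothetical, it assumes the form pointer is stored on the node passed in, and main uses a plain JAXP document with a stand-in value rather than a document produced by this parser.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

/** Hypothetical helper for reading the user data keys named in the class description. */
final class ParserUserData {
    static final String DOCUMENT_MODE_KEY = "nu.validator.document-mode";
    static final String FORM_POINTER_KEY = "nu.validator.form-pointer";

    /** Returns the value stored under the document mode key, or null if absent. */
    static Object documentMode(Document doc) {
        return doc.getUserData(DOCUMENT_MODE_KEY);
    }

    /** Returns the value stored under the form pointer key on the given node, or null if absent. */
    static Object formPointer(Node node) {
        return node.getUserData(FORM_POINTER_KEY);
    }

    /** Creates an empty JAXP document; used only for demonstration. */
    static Document emptyDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = emptyDocument();
        // Stand-in value; the parser is documented to store a DocumentMode object here.
        doc.setUserData(DOCUMENT_MODE_KEY, "stand-in", null);
        System.out.println("document mode: " + documentMode(doc));
        System.out.println("form pointer: " + formPointer(doc));
    }
}
```

With a document returned by parse(InputSource), the value from documentMode would be cast to DocumentMode by the caller.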

Version:
$Id: HtmlDocumentBuilder.java 463 2008-10-03 11:46:38Z hsivonen $
Author:
hsivonen

Field Summary
private  DOMTreeBuilder domTreeBuilder
          The tree builder.
private  EntityResolver entityResolver
          The entity resolver.
private  DOMImplementation implementation
          The DOM impl.
private  Driver tokenizer
          The tokenizer.
 
Constructor Summary
HtmlDocumentBuilder()
          Instantiates the document builder with the JAXP DOM implementation and the infoset-altering XML violation policy.
HtmlDocumentBuilder(DOMImplementation implementation)
          Instantiates the document builder with a specific DOM implementation and the infoset-altering XML violation policy.
HtmlDocumentBuilder(DOMImplementation implementation, XmlViolationPolicy xmlPolicy)
          Instantiates the document builder with a specific DOM implementation and XML violation policy.
HtmlDocumentBuilder(XmlViolationPolicy xmlPolicy)
          Instantiates the document builder with the JAXP DOM implementation and a specific XML violation policy.
 
Method Summary
 DOMImplementation getDOMImplementation()
          Returns the DOM implementation.
 boolean isNamespaceAware()
          Returns true.
 boolean isValidating()
          Returns false.
private static DOMImplementation jaxpDOMImplementation()
          Returns the JAXP DOM implementation.
 Document newDocument()
          For API compatibility.
 Document parse(InputSource is)
          Parses a document from a SAX InputSource.
 DocumentFragment parseFragment(InputSource is, String context)
          Parses a document fragment from a SAX InputSource.
 void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
          Deprecated.  
 void setCheckingNormalization(boolean enable)
          Toggles the checking of the NFC normalization of source.
 void setCommentPolicy(XmlViolationPolicy commentPolicy)
          Sets the policy for consecutive hyphens in comments.
 void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
          Sets the policy for non-XML characters except white space.
 void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
          Sets the policy for non-XML white space.
 void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
          Sets the doctype expectation.
 void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
          Sets the document mode handler.
 void setEntityResolver(EntityResolver resolver)
          Sets the entity resolver for URI-only inputs.
 void setErrorHandler(ErrorHandler errorHandler)
          Sets the error handler.
 void setHeuristics(Heuristics heuristics)
          Sets the encoding sniffing heuristics.
 void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
          Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
 void setIgnoringComments(boolean ignoreComments)
          Sets whether comment nodes appear in the tree.
 void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
          Whether to map the HTML lang attribute to xml:lang.
 void setNamePolicy(XmlViolationPolicy namePolicy)
          Sets the policy for dealing with names that aren't XML 1.0 4th ed. plus Namespaces NCNames.
 void setScriptingEnabled(boolean scriptingEnabled)
          Sets whether the parser considers scripting to be enabled for noscript treatment.
 void setXmlPolicy(XmlViolationPolicy xmlPolicy)
          This is a catch-all convenience method for setting name, content space, content non-XML char and comment policies in one go.
private  void tokenize(InputSource is)
          Tokenizes the input source.
 
Methods inherited from class javax.xml.parsers.DocumentBuilder
getSchema, isXIncludeAware, parse, parse, parse, parse, reset
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

tokenizer

private final Driver tokenizer
The tokenizer.


domTreeBuilder

private final DOMTreeBuilder domTreeBuilder
The tree builder.


implementation

private final DOMImplementation implementation
The DOM impl.


entityResolver

private EntityResolver entityResolver
The entity resolver.

Constructor Detail

HtmlDocumentBuilder

public HtmlDocumentBuilder(DOMImplementation implementation,
                           XmlViolationPolicy xmlPolicy)
Instantiates the document builder with a specific DOM implementation and XML violation policy.

Parameters:
implementation - the DOM implementation
xmlPolicy - the policy

HtmlDocumentBuilder

public HtmlDocumentBuilder(DOMImplementation implementation)
Instantiates the document builder with a specific DOM implementation and the infoset-altering XML violation policy.

Parameters:
implementation - the DOM implementation

HtmlDocumentBuilder

public HtmlDocumentBuilder()
Instantiates the document builder with the JAXP DOM implementation and the infoset-altering XML violation policy.


HtmlDocumentBuilder

public HtmlDocumentBuilder(XmlViolationPolicy xmlPolicy)
Instantiates the document builder with the JAXP DOM implementation and a specific XML violation policy.

Parameters:
xmlPolicy - the policy

Method Detail

jaxpDOMImplementation

private static DOMImplementation jaxpDOMImplementation()
Returns the JAXP DOM implementation.

Returns:
the JAXP DOM implementation

getDOMImplementation

public DOMImplementation getDOMImplementation()
Returns the DOM implementation.

Specified by:
getDOMImplementation in class DocumentBuilder
Returns:
the DOM implementation
See Also:
DocumentBuilder.getDOMImplementation()

isNamespaceAware

public boolean isNamespaceAware()
Returns true.

Specified by:
isNamespaceAware in class DocumentBuilder
Returns:
true
See Also:
DocumentBuilder.isNamespaceAware()

isValidating

public boolean isValidating()
Returns false.

Specified by:
isValidating in class DocumentBuilder
Returns:
false
See Also:
DocumentBuilder.isValidating()

newDocument

public Document newDocument()
For API compatibility.

Specified by:
newDocument in class DocumentBuilder
See Also:
DocumentBuilder.newDocument()

parse

public Document parse(InputSource is)
               throws SAXException,
                      IOException
Parses a document from a SAX InputSource.

Specified by:
parse in class DocumentBuilder
Parameters:
is - the source
Returns:
the parsed document
Throws:
SAXException - if parsing fails
IOException - if an I/O error occurs while reading the source
See Also:
DocumentBuilder.parse(org.xml.sax.InputSource)
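
Because this class extends DocumentBuilder, code written against the JAXP type can drive it. The following is a minimal sketch, not taken from the library: the class ParseFromString and its methods are hypothetical, and main uses the JDK's own XML builder so that the example runs without this library on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: parses in-memory markup with any JAXP DocumentBuilder. */
final class ParseFromString {

    /** Wraps the markup in a character-stream InputSource and parses it. */
    static Document parse(DocumentBuilder builder, String markup) {
        InputSource is = new InputSource(new StringReader(markup));
        try {
            return builder.parse(is);
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    /** Returns the JDK's namespace-aware XML builder, used here only as a stand-in. */
    static DocumentBuilder jaxpBuilder() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // With the HTML parser on the classpath, the builder argument could instead be
        // new nu.validator.htmlparser.dom.HtmlDocumentBuilder().
        Document doc = parse(jaxpBuilder(), "<p>Hello</p>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

The stand-in XML builder requires well-formed input; passing an HtmlDocumentBuilder instead is what would allow HTML input such as unclosed tags.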

parseFragment

public DocumentFragment parseFragment(InputSource is,
                                      String context)
                               throws IOException,
                                      SAXException
Parses a document fragment from a SAX InputSource.

Parameters:
is - the source
context - the context element name
Returns:
the parsed document fragment
Throws:
SAXException - if parsing fails
IOException - if an I/O error occurs while reading the source

setEntityResolver

public void setEntityResolver(EntityResolver resolver)
Sets the entity resolver for URI-only inputs.

Specified by:
setEntityResolver in class DocumentBuilder
Parameters:
resolver - the resolver
See Also:
DocumentBuilder.setEntityResolver(org.xml.sax.EntityResolver)
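
A "URI-only" input is read here as an InputSource that carries a system ID but no byte or character stream; that reading is an assumption. Under it, a resolver can supply the content for known system IDs. The sketch below uses only the standard SAX EntityResolver interface; the class InMemoryResolver and the urn:example system ID are hypothetical.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical resolver that serves registered system IDs from memory. */
final class InMemoryResolver implements EntityResolver {
    private final Map<String, String> contentBySystemId = new HashMap<>();

    /** Registers content for a system ID and returns this resolver for chaining. */
    InMemoryResolver add(String systemId, String content) {
        contentBySystemId.put(systemId, content);
        return this;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String content = contentBySystemId.get(systemId);
        if (content == null) {
            // Per the SAX contract, null leaves resolution to the caller's default behavior.
            return null;
        }
        InputSource source = new InputSource(new StringReader(content));
        source.setSystemId(systemId);
        return source;
    }

    public static void main(String[] args) {
        InMemoryResolver resolver = new InMemoryResolver()
                .add("urn:example:page", "<!DOCTYPE html><p>Hi");
        System.out.println(resolver.resolveEntity(null, "urn:example:page") != null);
        System.out.println(resolver.resolveEntity(null, "urn:example:other") != null);
    }
}
```

Such a resolver would be installed with setEntityResolver before calling parse with an InputSource that has only its system ID set.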

setErrorHandler

public void setErrorHandler(ErrorHandler errorHandler)
Sets the error handler.

Specified by:
setErrorHandler in class DocumentBuilder
Parameters:
errorHandler - the handler
See Also:
DocumentBuilder.setErrorHandler(org.xml.sax.ErrorHandler)
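
An error handler is an ordinary SAX ErrorHandler. As a hedged illustration, the sketch below collects reported problems in a list instead of printing or rethrowing them; the class CollectingErrorHandler is hypothetical, and which events this parser reports at which severity is not specified here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

/** Hypothetical handler that records every reported problem as a string. */
final class CollectingErrorHandler implements ErrorHandler {
    private final List<String> messages = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        record("warning", e);
    }

    @Override
    public void error(SAXParseException e) {
        record("error", e);
    }

    @Override
    public void fatalError(SAXParseException e) {
        // Recorded only; a stricter handler could rethrow the exception here.
        record("fatal", e);
    }

    private void record(String severity, SAXParseException e) {
        messages.add(severity + ": " + e.getMessage());
    }

    /** Returns a read-only view of the recorded messages. */
    List<String> messages() {
        return Collections.unmodifiableList(messages);
    }

    public static void main(String[] args) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        // Simulated report; during a real parse the parser would call these methods.
        handler.error(new SAXParseException("sample problem", null));
        System.out.println(handler.messages());
    }
}
```

The handler would be passed to setErrorHandler before parsing and inspected afterwards through messages().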

setIgnoringComments

public void setIgnoringComments(boolean ignoreComments)
Sets whether comment nodes appear in the tree.

Parameters:
ignoreComments - true to ignore comments
See Also:
TreeBuilder.setIgnoringComments(boolean)

setScriptingEnabled

public void setScriptingEnabled(boolean scriptingEnabled)
Sets whether the parser considers scripting to be enabled for noscript treatment.

Parameters:
scriptingEnabled - true to enable
See Also:
TreeBuilder.setScriptingEnabled(boolean)

setCheckingNormalization

public void setCheckingNormalization(boolean enable)
Toggles the checking of the NFC normalization of source.

Parameters:
enable - true to check normalization
See Also:
Tokenizer.setCheckingNormalization(boolean)
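
To illustrate what NFC normalization means for source text, the sketch below uses the JDK's java.text.Normalizer. It is not the parser's own check, whose details are internal; the class NfcCheck is hypothetical.

```java
import java.text.Normalizer;

/** Hypothetical helper showing what "in NFC" means for a string. */
final class NfcCheck {

    /** Returns true if the text is already in Unicode Normalization Form C. */
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Precomposed U+00E9 (e with acute) is in NFC.
        System.out.println(isNfc("\u00e9")); // prints "true"
        // "e" followed by combining acute U+0301 is not; NFC would compose it to U+00E9.
        System.out.println(isNfc("e\u0301")); // prints "false"
    }
}
```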

setCommentPolicy

public void setCommentPolicy(XmlViolationPolicy commentPolicy)
Sets the policy for consecutive hyphens in comments.

Parameters:
commentPolicy - the policy
See Also:
Tokenizer.setCommentPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setContentNonXmlCharPolicy

public void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
Sets the policy for non-XML characters except white space.

Parameters:
contentNonXmlCharPolicy - the policy
See Also:
Tokenizer.setContentNonXmlCharPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
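
For orientation, the characters XML 1.0 permits in content are given by its Char production. The sketch below implements that production directly. It is an illustration of the XML rule only: the class Xml10Chars is hypothetical, and exactly which characters the parser handles under this policy, as opposed to the content space policy, is not specified here.

```java
/** Hypothetical helper implementing the XML 1.0 "Char" production. */
final class Xml10Chars {

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Returns the index of the first non-XML character (or lone surrogate), or -1 if none. */
    static int firstNonXmlChar(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (!isXml10Char(codePoint)) {
                return i;
            }
            i += Character.charCount(codePoint);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char('A'));            // prints "true"
        // Form feed is white space in HTML but is not an XML 1.0 Char.
        System.out.println(isXml10Char(0x0C));           // prints "false"
        System.out.println(firstNonXmlChar("ab\u0001c")); // prints "2"
    }
}
```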

setContentSpacePolicy

public void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
Sets the policy for non-XML white space.

Parameters:
contentSpacePolicy - the policy
See Also:
Tokenizer.setContentSpacePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setHtml4ModeCompatibleWithXhtml1Schemata

public void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.

Parameters:
html4ModeCompatibleWithXhtml1Schemata - true to report boolean attributes with the name repeated in the value

setMappingLangToXmlLang

public void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
Whether to map the HTML lang attribute to xml:lang.

Parameters:
mappingLangToXmlLang - true to map lang to xml:lang
See Also:
Tokenizer.setMappingLangToXmlLang(boolean)

setNamePolicy

public void setNamePolicy(XmlViolationPolicy namePolicy)
Sets the policy for dealing with names that aren't XML 1.0 4th ed. plus Namespaces NCNames.

Parameters:
namePolicy - the policy
See Also:
Tokenizer.setNamePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setXmlPolicy

public void setXmlPolicy(XmlViolationPolicy xmlPolicy)
This is a catch-all convenience method for setting name, content space, content non-XML char and comment policies in one go.

Parameters:
xmlPolicy - the policy

setBogusXmlnsPolicy

public void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
Deprecated. 

Does nothing.


setDoctypeExpectation

public void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
Sets the doctype expectation.

Parameters:
doctypeExpectation - the doctypeExpectation to set
See Also:
TreeBuilder.setDoctypeExpectation(nu.validator.htmlparser.common.DoctypeExpectation)

setDocumentModeHandler

public void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
Sets the document mode handler.

Parameters:
documentModeHandler - the document mode handler to set
See Also:
TreeBuilder.setDocumentModeHandler(nu.validator.htmlparser.common.DocumentModeHandler)

setHeuristics

public void setHeuristics(Heuristics heuristics)
Sets the encoding sniffing heuristics.

Parameters:
heuristics - the heuristics to set
See Also:
Tokenizer.setHeuristics(nu.validator.htmlparser.common.Heuristics)

tokenize

private void tokenize(InputSource is)
               throws SAXException,
                      IOException,
                      MalformedURLException
Tokenizes the input source.

Parameters:
is - the source
Throws:
SAXException - if tokenization fails
IOException - if an I/O error occurs while reading the source
MalformedURLException - if the system ID is malformed and the entity resolver is null