nu.validator.htmlparser.dom
Class HtmlDocumentBuilder

java.lang.Object
  extended by javax.xml.parsers.DocumentBuilder
      extended by nu.validator.htmlparser.dom.HtmlDocumentBuilder

public class HtmlDocumentBuilder
extends DocumentBuilder

This class implements an HTML5 parser that exposes data through the DOM interface.

By default, when constructed without arguments, this parser treats XML 1.0-incompatible infosets as fatal errors. This corresponds to FATAL as the general XML violation policy. To make the parser support non-conforming HTML fully per the HTML 5 spec, at the cost of potentially violating the DOM API contract, set the general XML violation policy to ALLOW. This does not work with a standard DOM implementation. To handle all input without fatal errors and without violating the DOM API contract, set the general XML violation policy to ALTER_INFOSET. This makes the parser non-conforming but is probably the most useful setting for most applications.

The doctype is not represented in the tree.

The document mode is represented as a DocumentMode object stored as user data on the document node under the key nu.validator.document-mode.

The form pointer is also stored as user data with the key nu.validator.form-pointer.
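
As a rough illustration of reading these values back, the sketch below uses the standard DOM Level 3 Node.getUserData call with the two keys named above. The class ParserUserData and its helper methods are hypothetical names introduced here. The text above does not say which node carries the form pointer, so that helper accepts any Node. The main method attaches a stand-in string to a plain JAXP document only so the lookup can be exercised without this parser.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class ParserUserData {
    // Keys as given in the class description.
    static final String DOCUMENT_MODE_KEY = "nu.validator.document-mode";
    static final String FORM_POINTER_KEY = "nu.validator.form-pointer";

    /** Returns the object stored under the document-mode key, or null if none is attached. */
    static Object documentMode(Document doc) {
        return doc.getUserData(DOCUMENT_MODE_KEY);
    }

    /** Returns the object stored under the form-pointer key on the given node, or null. */
    static Object formPointer(Node node) {
        return node.getUserData(FORM_POINTER_KEY);
    }

    /** Creates an empty document with the default JAXP builder (stand-in for a parsed one). */
    static Document emptyDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no JAXP DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        Document doc = emptyDocument();
        // Stand-in value; a document built by this parser would carry a DocumentMode object.
        doc.setUserData(DOCUMENT_MODE_KEY, "stand-in mode", null);
        System.out.println(documentMode(doc));
        System.out.println(formPointer(doc));
    }
}
```

With a document produced by this builder, the value returned by documentMode would presumably be cast to DocumentMode before use.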

Version:
$Id: HtmlDocumentBuilder.java 153 2007-09-11 07:41:33Z hsivonen $
Author:
hsivonen

Field Summary
private  DOMTreeBuilder domTreeBuilder
           
private  EntityResolver entityResolver
           
private  DOMImplementation implementation
           
private  Tokenizer tokenizer
           
 
Constructor Summary
HtmlDocumentBuilder()
          Instantiates the document builder with the JAXP DOM implementation and fatal XML violation policy.
HtmlDocumentBuilder(DOMImplementation implementation)
          Instantiates the document builder with a specific DOM implementation and fatal XML violation policy.
HtmlDocumentBuilder(DOMImplementation implementation, XmlViolationPolicy xmlPolicy)
          Instantiates the document builder with a specific DOM implementation and XML violation policy.
HtmlDocumentBuilder(XmlViolationPolicy xmlPolicy)
          Instantiates the document builder with the JAXP DOM implementation and a specific XML violation policy.
 
Method Summary
 DOMImplementation getDOMImplementation()
          Returns the DOM implementation.
 boolean isNamespaceAware()
          Returns true.
 boolean isValidating()
          Returns false.
private static DOMImplementation jaxpDOMImplementation()
           
 Document newDocument()
          For API compatibility.
 Document parse(InputSource is)
          Parses a document from a SAX InputSource.
 DocumentFragment parseFragment(InputSource is, String context)
          Parses a document fragment from a SAX InputSource.
 void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
          Sets the policy for forbidden xmlns attributes.
 void setCheckingNormalization(boolean enable)
          Toggles the checking of the NFC normalization of source.
 void setCommentPolicy(XmlViolationPolicy commentPolicy)
          Sets the policy for consecutive hyphens in comments.
 void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
          Sets the policy for non-XML characters except white space.
 void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
          Sets the policy for non-XML white space.
 void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
          Sets the doctype expectation.
 void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
          Sets the document mode handler.
 void setEntityResolver(EntityResolver resolver)
          Sets the entity resolver for URI-only inputs.
 void setErrorHandler(ErrorHandler errorHandler)
           
 void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
          Sets whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
 void setIgnoringComments(boolean ignoreComments)
          Sets whether comment nodes appear in the tree.
 void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
           
 void setNamePolicy(XmlViolationPolicy namePolicy)
           
 void setScriptingEnabled(boolean scriptingEnabled)
          Sets whether the parser considers scripting to be enabled for noscript treatment.
 void setXmlPolicy(XmlViolationPolicy xmlPolicy)
          This is a catch-all convenience method for setting name, content space, content non-XML char and comment policies in one go.
private  void tokenize(InputSource is)
           
 
Methods inherited from class javax.xml.parsers.DocumentBuilder
getSchema, isXIncludeAware, parse, parse, parse, parse, reset
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

tokenizer

private final Tokenizer tokenizer

domTreeBuilder

private final DOMTreeBuilder domTreeBuilder

implementation

private final DOMImplementation implementation

entityResolver

private EntityResolver entityResolver

Constructor Detail

HtmlDocumentBuilder

public HtmlDocumentBuilder(DOMImplementation implementation,
                           XmlViolationPolicy xmlPolicy)
Instantiates the document builder with a specific DOM implementation and XML violation policy.

Parameters:
implementation - the DOM implementation
xmlPolicy - the policy

HtmlDocumentBuilder

public HtmlDocumentBuilder(DOMImplementation implementation)
Instantiates the document builder with a specific DOM implementation and fatal XML violation policy.

Parameters:
implementation - the DOM implementation

HtmlDocumentBuilder

public HtmlDocumentBuilder()
Instantiates the document builder with the JAXP DOM implementation and fatal XML violation policy.


HtmlDocumentBuilder

public HtmlDocumentBuilder(XmlViolationPolicy xmlPolicy)
Instantiates the document builder with the JAXP DOM implementation and a specific XML violation policy.

Parameters:
xmlPolicy - the policy

Method Detail

jaxpDOMImplementation

private static DOMImplementation jaxpDOMImplementation()
Returns:
the JAXP DOM implementation

getDOMImplementation

public DOMImplementation getDOMImplementation()
Returns the DOM implementation.

Specified by:
getDOMImplementation in class DocumentBuilder
Returns:
the DOM implementation
See Also:
DocumentBuilder.getDOMImplementation()

isNamespaceAware

public boolean isNamespaceAware()
Returns true.

Specified by:
isNamespaceAware in class DocumentBuilder
Returns:
true
See Also:
DocumentBuilder.isNamespaceAware()

isValidating

public boolean isValidating()
Returns false.

Specified by:
isValidating in class DocumentBuilder
Returns:
false
See Also:
DocumentBuilder.isValidating()

newDocument

public Document newDocument()
For API compatibility.

Specified by:
newDocument in class DocumentBuilder
See Also:
DocumentBuilder.newDocument()

parse

public Document parse(InputSource is)
               throws SAXException,
                      IOException
Parses a document from a SAX InputSource.

Specified by:
parse in class DocumentBuilder
Parameters:
is - the source
Returns:
the document
Throws:
SAXException
IOException
See Also:
DocumentBuilder.parse(org.xml.sax.InputSource)
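
Because this class extends DocumentBuilder, code written against the JAXP base class can drive it. The following is a minimal sketch of parsing markup held in a string by wrapping it in a character-stream InputSource. The class ParseFromString and its methods are hypothetical names. The main method uses the default JAXP builder as a stand-in so the sketch runs without this library, and checked exceptions are wrapped only to keep the example short.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ParseFromString {
    /** Wraps the markup in a character-stream InputSource and hands it to the builder. */
    static Document parse(DocumentBuilder builder, String markup) {
        try {
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (SAXException | IOException e) {
            // Wrapped for brevity; real code would likely propagate or handle these.
            throw new IllegalStateException("parse failed", e);
        }
    }

    /** Default JAXP builder, used here only as a stand-in for HtmlDocumentBuilder. */
    static DocumentBuilder jaxpBuilder() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no JAXP parser available", e);
        }
    }

    public static void main(String[] args) {
        // The stand-in XML parser needs well-formed input.
        Document doc = parse(jaxpBuilder(), "<html><body><p>Hello</p></body></html>");
        System.out.println(doc.getDocumentElement().getTagName()); // prints "html"
    }
}
```

With this library on the classpath, the same helper would presumably be called as parse(new HtmlDocumentBuilder(XmlViolationPolicy.ALTER_INFOSET), markup), using the constructor listed in the Constructor Summary.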

parseFragment

public DocumentFragment parseFragment(InputSource is,
                                      String context)
                               throws IOException,
                                      SAXException
Parses a document fragment from a SAX InputSource.

Parameters:
is - the source
context - the context element name
Returns:
the document fragment
Throws:
IOException
SAXException
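
A returned DocumentFragment is typically attached to an existing tree. The sketch below shows one way to do that with standard DOM calls only. The class FragmentAttach and its methods are hypothetical names. The source does not state which document owns the fragment returned by parseFragment, so the helper imports the fragment when the owner documents differ. The main method builds a fragment by hand as a stand-in for a parsed one.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class FragmentAttach {
    /** Appends the fragment's children to target, importing them first if the owner documents differ. */
    static void append(Element target, DocumentFragment fragment) {
        Document owner = target.getOwnerDocument();
        Node toInsert = fragment.getOwnerDocument() == owner
                ? fragment
                : owner.importNode(fragment, true); // deep copy into the target's document
        // Appending a fragment inserts its children, not the fragment node itself.
        target.appendChild(toInsert);
    }

    /** Creates an empty document with the default JAXP builder. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no JAXP DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        Document page = newDocument();
        Element div = page.createElement("div");
        page.appendChild(div);

        // Hand-built stand-in for a fragment returned by parseFragment.
        Document other = newDocument();
        DocumentFragment fragment = other.createDocumentFragment();
        fragment.appendChild(other.createElement("p"));
        fragment.appendChild(other.createElement("p"));

        append(div, fragment);
        System.out.println(div.getChildNodes().getLength()); // prints 2
    }
}
```

Note that importNode copies nodes, so any user data attached to the original fragment's nodes is not carried over unless a UserDataHandler is registered.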

tokenize

private void tokenize(InputSource is)
               throws SAXException,
                      IOException,
                      MalformedURLException
Parameters:
is - the source
Throws:
SAXException
IOException
MalformedURLException

setEntityResolver

public void setEntityResolver(EntityResolver resolver)
Sets the entity resolver for URI-only inputs.

Specified by:
setEntityResolver in class DocumentBuilder
Parameters:
resolver - the resolver
See Also:
DocumentBuilder.setEntityResolver(org.xml.sax.EntityResolver)
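
For illustration, a resolver that serves markup from memory for known system identifiers could look like the sketch below. It implements the standard SAX EntityResolver interface. The class name InMemoryResolver and its put method are hypothetical. In SAX, returning null asks the caller to open the URI itself; how this parser reacts to a null result is not stated here, so treat that branch as an assumption.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class InMemoryResolver implements EntityResolver {
    private final Map<String, String> pages = new HashMap<>();

    /** Registers markup to be served for the given system identifier. */
    InMemoryResolver put(String systemId, String markup) {
        pages.put(systemId, markup);
        return this;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String markup = pages.get(systemId);
        if (markup == null) {
            // SAX convention: null means "no special handling, use the default".
            return null;
        }
        InputSource source = new InputSource(new StringReader(markup));
        source.setSystemId(systemId); // keep the identifier for error reporting
        return source;
    }

    public static void main(String[] args) {
        InMemoryResolver resolver = new InMemoryResolver()
                .put("http://example.invalid/page.html", "<p>Hello</p>");
        InputSource hit = resolver.resolveEntity(null, "http://example.invalid/page.html");
        System.out.println(hit.getSystemId());
        System.out.println(resolver.resolveEntity(null, "http://example.invalid/other.html"));
    }
}
```

Such a resolver would be passed to setEntityResolver before calling parse with an InputSource that carries only a system identifier.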

setErrorHandler

public void setErrorHandler(ErrorHandler errorHandler)
Specified by:
setErrorHandler in class DocumentBuilder
See Also:
DocumentBuilder.setErrorHandler(org.xml.sax.ErrorHandler)
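
This method has no description of its own, so as a hedged sketch of what a caller might pass in, here is a handler that records each report instead of printing it. It implements the standard SAX ErrorHandler interface. The class name CollectingErrorHandler is hypothetical, and rethrowing fatal errors is a design choice of this sketch rather than something the source prescribes.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

class CollectingErrorHandler implements ErrorHandler {
    /** Recorded reports, in the order they arrived. */
    final List<String> messages = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        record("warning", e);
    }

    @Override
    public void error(SAXParseException e) {
        record("error", e);
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXParseException {
        record("fatal", e);
        throw e; // stop parsing on fatal errors
    }

    private void record(String level, SAXParseException e) {
        messages.add(level + " at " + e.getLineNumber() + ":" + e.getColumnNumber()
                + ": " + e.getMessage());
    }

    public static void main(String[] args) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        // Hand-made reports; a parser would normally supply these.
        handler.warning(new SAXParseException("sample warning", null, null, 1, 5));
        handler.error(new SAXParseException("sample error", null, null, 3, 7));
        handler.messages.forEach(System.out::println);
    }
}
```

After a parse, the messages list can be inspected to decide whether the input was acceptable.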

setIgnoringComments

public void setIgnoringComments(boolean ignoreComments)
Sets whether comment nodes appear in the tree.

Parameters:
ignoreComments - true to ignore comments
See Also:
TreeBuilder.setIgnoringComments(boolean)

setScriptingEnabled

public void setScriptingEnabled(boolean scriptingEnabled)
Sets whether the parser considers scripting to be enabled for noscript treatment.

Parameters:
scriptingEnabled - true to enable
See Also:
TreeBuilder.setScriptingEnabled(boolean)

setCheckingNormalization

public void setCheckingNormalization(boolean enable)
Toggles the checking of the NFC normalization of source.

Parameters:
enable - true to check normalization
See Also:
Tokenizer.setCheckingNormalization(boolean)

setCommentPolicy

public void setCommentPolicy(XmlViolationPolicy commentPolicy)
Sets the policy for consecutive hyphens in comments.

Parameters:
commentPolicy - the policy
See Also:
Tokenizer.setCommentPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setContentNonXmlCharPolicy

public void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
Sets the policy for non-XML characters except white space.

Parameters:
contentNonXmlCharPolicy - the policy
See Also:
Tokenizer.setContentNonXmlCharPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setContentSpacePolicy

public void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
Sets the policy for non-XML white space.

Parameters:
contentSpacePolicy - the policy
See Also:
Tokenizer.setContentSpacePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setHtml4ModeCompatibleWithXhtml1Schemata

public void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
Sets whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.

Parameters:
html4ModeCompatibleWithXhtml1Schemata - true to repeat the attribute name in the value

setMappingLangToXmlLang

public void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
Parameters:
mappingLangToXmlLang - true to map lang to xml:lang
See Also:
Tokenizer.setMappingLangToXmlLang(boolean)

setNamePolicy

public void setNamePolicy(XmlViolationPolicy namePolicy)
Parameters:
namePolicy - the policy
See Also:
Tokenizer.setNamePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

setXmlPolicy

public void setXmlPolicy(XmlViolationPolicy xmlPolicy)
This is a catch-all convenience method for setting name, content space, content non-XML char and comment policies in one go.

Parameters:
xmlPolicy - the policy

setDoctypeExpectation

public void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
Sets the doctype expectation.

Parameters:
doctypeExpectation - the doctypeExpectation to set
See Also:
TreeBuilder.setDoctypeExpectation(nu.validator.htmlparser.common.DoctypeExpectation)

setDocumentModeHandler

public void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
Sets the document mode handler.

Parameters:
documentModeHandler - the handler
See Also:
TreeBuilder.setDocumentModeHandler(nu.validator.htmlparser.common.DocumentModeHandler)

setBogusXmlnsPolicy

public void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
Sets the policy for forbidden xmlns attributes.

Parameters:
bogusXmlnsPolicy - the policy
See Also:
Tokenizer.setBogusXmlnsPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)