nu.validator.htmlparser.xom
Class HtmlBuilder

java.lang.Object
  extended by nu.xom.Builder
      extended by nu.validator.htmlparser.xom.HtmlBuilder

public class HtmlBuilder
extends nu.xom.Builder

This class implements an HTML5 parser that exposes data through the XOM interface.

By default, when using the constructor without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. It is possible to treat XML 1.0 infoset violations as fatal by setting the general XML violation policy to FATAL.

The doctype is not represented in the tree.

The document mode is represented via the Mode interface on the Document node if the node implements that interface (this depends on the node factory in use).

The form pointer is stored if the node factory supports storing it.

This package has its own node factory class because the official XOM node factory may return multiple nodes instead of one, which would confuse the assumptions of the DOM-oriented HTML5 parsing algorithm.

Author:
hsivonen

Constructor Summary
HtmlBuilder()
          Constructor with default node factory and fatal XML violation policy.
HtmlBuilder(SimpleNodeFactory nodeFactory)
          Constructor with given node factory and fatal XML violation policy.
HtmlBuilder(SimpleNodeFactory nodeFactory, XmlViolationPolicy xmlPolicy)
          Constructor with given node factory and given XML violation policy.
HtmlBuilder(XmlViolationPolicy xmlPolicy)
          Constructor with default node factory and given XML violation policy.
 
Method Summary
 void addCharacterHandler(CharacterHandler characterHandler)
           
 nu.xom.Document build(java.io.File file)
          Parse from File.
 nu.xom.Document build(org.xml.sax.InputSource is)
          Parse from SAX InputSource.
 nu.xom.Document build(java.io.InputStream stream)
          Parse from InputStream.
 nu.xom.Document build(java.io.InputStream stream, java.lang.String uri)
          Parse from InputStream.
 nu.xom.Document build(java.io.Reader stream)
          Parse from Reader.
 nu.xom.Document build(java.io.Reader stream, java.lang.String uri)
          Parse from Reader.
 nu.xom.Document build(java.lang.String uri)
          Parse from URI.
 nu.xom.Document build(java.lang.String content, java.lang.String uri)
          Parse from String.
 nu.xom.Nodes buildFragment(org.xml.sax.InputSource is, java.lang.String context)
          Parse a fragment from SAX InputSource.
 XmlViolationPolicy getBogusXmlnsPolicy()
          Deprecated.  
 XmlViolationPolicy getCommentPolicy()
          Returns the commentPolicy.
 XmlViolationPolicy getContentNonXmlCharPolicy()
          Returns the contentNonXmlCharPolicy.
 XmlViolationPolicy getContentSpacePolicy()
          Returns the contentSpacePolicy.
 DoctypeExpectation getDoctypeExpectation()
          Returns the doctype expectation.
 org.xml.sax.Locator getDocumentLocator()
          Returns the Locator during parse.
 DocumentModeHandler getDocumentModeHandler()
          Returns the document mode handler.
 Heuristics getHeuristics()
           
 XmlViolationPolicy getNamePolicy()
          The policy for non-NCName element and attribute names.
 SimpleNodeFactory getSimpleNodeFactory()
          Gets the node factory.
 XmlViolationPolicy getStreamabilityViolationPolicy()
          Returns the streamabilityViolationPolicy.
 XmlViolationPolicy getXmlnsPolicy()
          Returns the xmlnsPolicy.
 boolean isCheckingNormalization()
          Indicates whether NFC normalization of source is being checked.
 boolean isHtml4ModeCompatibleWithXhtml1Schemata()
          Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
 boolean isMappingLangToXmlLang()
          Whether lang is mapped to xml:lang.
 boolean isReportingDoctype()
          Returns the reportingDoctype.
 boolean isScriptingEnabled()
          Whether the parser considers scripting to be enabled for noscript treatment.
 void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
          Deprecated.  
 void setCheckingNormalization(boolean enable)
          Toggles the checking of the NFC normalization of source.
 void setCommentPolicy(XmlViolationPolicy commentPolicy)
          Sets the policy for consecutive hyphens in comments.
 void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
          Sets the policy for non-XML characters except white space.
 void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
          Sets the policy for non-XML white space.
 void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
          Sets the doctype expectation.
 void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
          Sets the document mode handler.
 void setEntityResolver(org.xml.sax.EntityResolver resolver)
           
 void setErrorHandler(org.xml.sax.ErrorHandler handler)
           
 void setHeuristics(Heuristics heuristics)
          Sets the encoding sniffing heuristics.
 void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
          Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.
 void setIgnoringComments(boolean ignoreComments)
          Sets whether comment nodes appear in the tree.
 void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
          Whether lang is mapped to xml:lang.
 void setNamePolicy(XmlViolationPolicy namePolicy)
          The policy for non-NCName element and attribute names.
 void setReportingDoctype(boolean reportingDoctype)
           
 void setScriptingEnabled(boolean scriptingEnabled)
          Sets whether the parser considers scripting to be enabled for noscript treatment.
 void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy)
          Sets the streamabilityViolationPolicy.
 void setTransitionHander(TransitionHandler handler)
           
 void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy)
          Whether the xmlns attribute on the root element is passed through.
 void setXmlPolicy(XmlViolationPolicy xmlPolicy)
          This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go.
 
Methods inherited from class nu.xom.Builder
getNodeFactory
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlBuilder

public HtmlBuilder()
Constructor with default node factory and fatal XML violation policy.


HtmlBuilder

public HtmlBuilder(SimpleNodeFactory nodeFactory)
Constructor with given node factory and fatal XML violation policy.

Parameters:
nodeFactory - the factory

HtmlBuilder

public HtmlBuilder(XmlViolationPolicy xmlPolicy)
Constructor with default node factory and given XML violation policy.

Parameters:
xmlPolicy - the policy

HtmlBuilder

public HtmlBuilder(SimpleNodeFactory nodeFactory,
                   XmlViolationPolicy xmlPolicy)
Constructor with given node factory and given XML violation policy.

Parameters:
nodeFactory - the factory
xmlPolicy - the policy
Method Detail

build

public nu.xom.Document build(org.xml.sax.InputSource is)
                      throws nu.xom.ParsingException,
                             java.io.IOException
Parse from SAX InputSource.

Parameters:
is - the InputSource
Returns:
the document
Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong

buildFragment

public nu.xom.Nodes buildFragment(org.xml.sax.InputSource is,
                                  java.lang.String context)
                           throws java.io.IOException,
                                  nu.xom.ParsingException
Parse a fragment from SAX InputSource.

Parameters:
is - the InputSource
context - the name of the context element
Returns:
the fragment
Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong

build

public nu.xom.Document build(java.io.File file)
                      throws nu.xom.ParsingException,
                             nu.xom.ValidityException,
                             java.io.IOException
Parse from File.

Overrides:
build in class nu.xom.Builder
Parameters:
file - the file
Returns:
the document
Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
nu.xom.ValidityException
See Also:
Builder.build(java.io.File)

build

public nu.xom.Document build(java.io.InputStream stream,
                             java.lang.String uri)
                      throws nu.xom.ParsingException,
                             nu.xom.ValidityException,
                             java.io.IOException
Parse from InputStream.

Overrides:
build in class nu.xom.Builder
Parameters:
stream - the stream
uri - the base URI
Returns:
the document
Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
nu.xom.ValidityException
See Also:
Builder.build(java.io.InputStream, java.lang.String)

build

public nu.xom.Document build(java.io.InputStream stream)
                      throws nu.xom.ParsingException,
                             nu.xom.ValidityException,
                             java.io.IOException
Parse from InputStream.

Overrides:
build in class nu.xom.Builder
Parameters:
stream - the stream
Returns:
the document
Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
nu.xom.ValidityException
See Also:
Builder.build(java.io.InputStream)

build

public nu.xom.Document build(java.io.Reader stream,
                             java.lang.String uri)
                      throws nu.xom.ParsingException,
                             nu.xom.ValidityException,
                             java.io.IOException
Parse from Reader.

Overrides:
build in class nu.xom.Builder
Parameters:
stream - the reader
uri - the base URI
Returns:
the document
Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
nu.xom.ValidityException
See Also:
Builder.build(java.io.Reader, java.lang.String)

build

public nu.xom.Document build(java.io.Reader stream)
                      throws nu.xom.ParsingException,
                             nu.xom.ValidityException,
                             java.io.IOException
Parse from Reader.

Overrides:
build in class nu.xom.Builder
Parameters:
stream - the reader
Returns:
the document
Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
nu.xom.ValidityException
See Also:
Builder.build(java.io.Reader)

build

public nu.xom.Document build(java.lang.String content,
                             java.lang.String uri)
                      throws nu.xom.ParsingException,
                             nu.xom.ValidityException,
                             java.io.IOException
Parse from String.

Overrides:
build in class nu.xom.Builder
Parameters:
content - the HTML source as string
uri - the base URI
Returns:
the document
Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
nu.xom.ValidityException
See Also:
Builder.build(java.lang.String, java.lang.String)

build

public nu.xom.Document build(java.lang.String uri)
                      throws nu.xom.ParsingException,
                             nu.xom.ValidityException,
                             java.io.IOException
Parse from URI.

Overrides:
build in class nu.xom.Builder
Parameters:
uri - the URI of the document
Returns:
the document
Throws:
nu.xom.ParsingException - in case of an XML violation
java.io.IOException - if IO goes wrong
nu.xom.ValidityException
See Also:
Builder.build(java.lang.String)

getSimpleNodeFactory

public SimpleNodeFactory getSimpleNodeFactory()
Gets the node factory.


setEntityResolver

public void setEntityResolver(org.xml.sax.EntityResolver resolver)
See Also:
XMLReader.setEntityResolver(org.xml.sax.EntityResolver)

setErrorHandler

public void setErrorHandler(org.xml.sax.ErrorHandler handler)
See Also:
XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)

setTransitionHander

public void setTransitionHander(TransitionHandler handler)

isCheckingNormalization

public boolean isCheckingNormalization()
Indicates whether NFC normalization of source is being checked.

Returns:
true if NFC normalization of source is being checked.
See Also:
Tokenizer.isCheckingNormalization()

setCheckingNormalization

public void setCheckingNormalization(boolean enable)
Toggles the checking of the NFC normalization of source.

Parameters:
enable - true to check normalization
See Also:
Tokenizer.setCheckingNormalization(boolean)
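
As an illustration of what an NFC normalization check looks at, the following is a minimal sketch that uses only the JDK class java.text.Normalizer. The class name NfcCheckSketch and the method isNfc are hypothetical and are not part of this package; how the parser performs and reports the check internally is not shown here.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of nu.validator.htmlparser.
final class NfcCheckSketch {

    /** Returns true if the text is already in Unicode Normalization Form C. */
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";     // e with acute accent as one code point
        String decomposed = "e\u0301";  // e followed by a combining acute accent
        System.out.println("composed is NFC: " + isNfc(composed));
        System.out.println("decomposed is NFC: " + isNfc(decomposed));
    }
}
```

Source text that contains the decomposed form is the kind of input a normalization check would flag.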

setCommentPolicy

public void setCommentPolicy(XmlViolationPolicy commentPolicy)
Sets the policy for consecutive hyphens in comments.

Parameters:
commentPolicy - the policy
See Also:
Tokenizer.setCommentPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
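
XML 1.0 does not allow two consecutive hyphens inside a comment, which is why a policy is needed for HTML comments that contain them. The sketch below shows the difference between coercing and failing. The enum Policy is a local stand-in for the two policies named in the class description, and inserting a space between the hyphens is only one possible coercion, assumed here for illustration; the parser's actual behavior may differ.

```java
// Hypothetical illustration, not part of nu.validator.htmlparser.
final class CommentHyphenSketch {

    /** Local stand-in for the ALTER_INFOSET and FATAL policies. */
    enum Policy { ALTER_INFOSET, FATAL }

    /** True if the comment text contains "--", which XML 1.0 comments forbid. */
    static boolean hasConsecutiveHyphens(String comment) {
        return comment.contains("--");
    }

    /** Returns the comment unchanged, coerced, or throws, depending on the policy. */
    static String apply(String comment, Policy policy) {
        if (!hasConsecutiveHyphens(comment)) {
            return comment;
        }
        if (policy == Policy.FATAL) {
            throw new IllegalArgumentException("Consecutive hyphens in comment");
        }
        // Assumed coercion: put a space after each hyphen that is followed by a hyphen.
        StringBuilder out = new StringBuilder(comment.length() + 4);
        for (int i = 0; i < comment.length(); i++) {
            char c = comment.charAt(i);
            out.append(c);
            if (c == '-' && i + 1 < comment.length() && comment.charAt(i + 1) == '-') {
                out.append(' ');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(apply("a--b", Policy.ALTER_INFOSET));
    }
}
```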

setContentNonXmlCharPolicy

public void setContentNonXmlCharPolicy(XmlViolationPolicy contentNonXmlCharPolicy)
Sets the policy for non-XML characters except white space.

Parameters:
contentNonXmlCharPolicy - the policy
See Also:
Tokenizer.setContentNonXmlCharPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
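
For reference, the characters XML 1.0 permits are those of its Char production: U+0009, U+000A, U+000D, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF. The following minimal sketch tests text against that production. The class and method names are hypothetical, and the sketch only detects offending characters; it does not reproduce what the parser does with them under each policy.

```java
// Hypothetical illustration, not part of nu.validator.htmlparser.
final class XmlCharSketch {

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first non-XML character, or -1 if there is none. */
    static int firstNonXmlChar(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstNonXmlChar("plain text"));
    }
}
```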

setContentSpacePolicy

public void setContentSpacePolicy(XmlViolationPolicy contentSpacePolicy)
Sets the policy for non-XML white space.

Parameters:
contentSpacePolicy - the policy
See Also:
Tokenizer.setContentSpacePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
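
HTML treats tab, line feed, form feed, carriage return and space as white space, whereas XML 1.0 white space is limited to space, tab, carriage return and line feed. The sketch below computes the difference between the two sets. It assumes that "non-XML white space" in this method's description refers to that difference (the form feed, U+000C); the class and method names are hypothetical.

```java
// Hypothetical illustration, not part of nu.validator.htmlparser.
final class SpaceSketch {

    /** White space as defined by HTML: space, tab, LF, FF, CR. */
    static boolean isHtmlSpace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
    }

    /** White space as defined by the S production of XML 1.0: space, tab, CR, LF. */
    static boolean isXmlSpace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    /** HTML white space that is not XML 1.0 white space. */
    static boolean isNonXmlSpace(char c) {
        return isHtmlSpace(c) && !isXmlSpace(c);
    }

    public static void main(String[] args) {
        System.out.println("form feed is non-XML white space: " + isNonXmlSpace('\f'));
    }
}
```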

isScriptingEnabled

public boolean isScriptingEnabled()
Whether the parser considers scripting to be enabled for noscript treatment.

Returns:
true if enabled
See Also:
TreeBuilder.isScriptingEnabled()

setScriptingEnabled

public void setScriptingEnabled(boolean scriptingEnabled)
Sets whether the parser considers scripting to be enabled for noscript treatment.

Parameters:
scriptingEnabled - true to enable
See Also:
TreeBuilder.setScriptingEnabled(boolean)

getDoctypeExpectation

public DoctypeExpectation getDoctypeExpectation()
Returns the doctype expectation.

Returns:
the doctypeExpectation

setDoctypeExpectation

public void setDoctypeExpectation(DoctypeExpectation doctypeExpectation)
Sets the doctype expectation.

Parameters:
doctypeExpectation - the doctypeExpectation to set
See Also:
TreeBuilder.setDoctypeExpectation(nu.validator.htmlparser.common.DoctypeExpectation)

getDocumentModeHandler

public DocumentModeHandler getDocumentModeHandler()
Returns the document mode handler.

Returns:
the documentModeHandler

setDocumentModeHandler

public void setDocumentModeHandler(DocumentModeHandler documentModeHandler)
Sets the document mode handler.

Parameters:
documentModeHandler - the documentModeHandler to set
See Also:
TreeBuilder.setDocumentModeHandler(nu.validator.htmlparser.common.DocumentModeHandler)

getStreamabilityViolationPolicy

public XmlViolationPolicy getStreamabilityViolationPolicy()
Returns the streamabilityViolationPolicy.

Returns:
the streamabilityViolationPolicy

setStreamabilityViolationPolicy

public void setStreamabilityViolationPolicy(XmlViolationPolicy streamabilityViolationPolicy)
Sets the streamabilityViolationPolicy.

Parameters:
streamabilityViolationPolicy - the streamabilityViolationPolicy to set

setHtml4ModeCompatibleWithXhtml1Schemata

public void setHtml4ModeCompatibleWithXhtml1Schemata(boolean html4ModeCompatibleWithXhtml1Schemata)
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.

Parameters:
html4ModeCompatibleWithXhtml1Schemata - true to repeat the attribute name in the value of boolean attributes

getDocumentLocator

public org.xml.sax.Locator getDocumentLocator()
Returns the Locator during parse.

Returns:
the Locator

isHtml4ModeCompatibleWithXhtml1Schemata

public boolean isHtml4ModeCompatibleWithXhtml1Schemata()
Whether the HTML 4 mode reports boolean attributes in a way that repeats the name in the value.

Returns:
the html4ModeCompatibleWithXhtml1Schemata

setMappingLangToXmlLang

public void setMappingLangToXmlLang(boolean mappingLangToXmlLang)
Whether lang is mapped to xml:lang.

Parameters:
mappingLangToXmlLang - true to map lang to xml:lang
See Also:
Tokenizer.setMappingLangToXmlLang(boolean)

isMappingLangToXmlLang

public boolean isMappingLangToXmlLang()
Whether lang is mapped to xml:lang.

Returns:
the mappingLangToXmlLang

setXmlnsPolicy

public void setXmlnsPolicy(XmlViolationPolicy xmlnsPolicy)
Whether the xmlns attribute on the root element is passed through. (FATAL not allowed.)

Parameters:
xmlnsPolicy - the policy
See Also:
Tokenizer.setXmlnsPolicy(nu.validator.htmlparser.common.XmlViolationPolicy)

getXmlnsPolicy

public XmlViolationPolicy getXmlnsPolicy()
Returns the xmlnsPolicy.

Returns:
the xmlnsPolicy

getCommentPolicy

public XmlViolationPolicy getCommentPolicy()
Returns the commentPolicy.

Returns:
the commentPolicy

getContentNonXmlCharPolicy

public XmlViolationPolicy getContentNonXmlCharPolicy()
Returns the contentNonXmlCharPolicy.

Returns:
the contentNonXmlCharPolicy

getContentSpacePolicy

public XmlViolationPolicy getContentSpacePolicy()
Returns the contentSpacePolicy.

Returns:
the contentSpacePolicy

setReportingDoctype

public void setReportingDoctype(boolean reportingDoctype)
Parameters:
reportingDoctype - true to report the doctype
See Also:
TreeBuilder.setReportingDoctype(boolean)

isReportingDoctype

public boolean isReportingDoctype()
Returns the reportingDoctype.

Returns:
the reportingDoctype

setNamePolicy

public void setNamePolicy(XmlViolationPolicy namePolicy)
The policy for non-NCName element and attribute names.

Parameters:
namePolicy - the policy
See Also:
Tokenizer.setNamePolicy(nu.validator.htmlparser.common.XmlViolationPolicy)
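
An NCName is an XML name that contains no colon. HTML parsing can produce element and attribute names that are not NCNames, for example names containing a colon or starting with a digit. The sketch below is an ASCII-only approximation: the real NCName production also admits many non-ASCII characters, which this check rejects. The class and method names are hypothetical.

```java
// Hypothetical illustration, not part of nu.validator.htmlparser.
final class NcNameSketch {

    /**
     * ASCII-only approximation of the NCName production: a letter or
     * underscore, followed by letters, digits, underscores, hyphens or dots.
     * Non-ASCII name characters are rejected here although XML allows many.
     */
    static boolean isAsciiNcName(String name) {
        if (name.isEmpty()) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean startChar = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
            boolean nameChar = startChar || (c >= '0' && c <= '9') || c == '-' || c == '.';
            if (i == 0 ? !startChar : !nameChar) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("div: " + isAsciiNcName("div"));
        System.out.println("svg:rect: " + isAsciiNcName("svg:rect"));
    }
}
```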

setHeuristics

public void setHeuristics(Heuristics heuristics)
Sets the encoding sniffing heuristics.

Parameters:
heuristics - the heuristics to set
See Also:
Tokenizer.setHeuristics(nu.validator.htmlparser.common.Heuristics)

getHeuristics

public Heuristics getHeuristics()

setXmlPolicy

public void setXmlPolicy(XmlViolationPolicy xmlPolicy)
This is a catch-all convenience method for setting name, xmlns, content space, content non-XML char and comment policies in one go. This does not affect the streamability policy or doctype reporting.

Parameters:
xmlPolicy - the policy

getNamePolicy

public XmlViolationPolicy getNamePolicy()
The policy for non-NCName element and attribute names.

Returns:
the namePolicy

setBogusXmlnsPolicy

public void setBogusXmlnsPolicy(XmlViolationPolicy bogusXmlnsPolicy)
Deprecated. 

Does nothing.


getBogusXmlnsPolicy

public XmlViolationPolicy getBogusXmlnsPolicy()
Deprecated. 

Returns XmlViolationPolicy.ALTER_INFOSET.

Returns:
XmlViolationPolicy.ALTER_INFOSET

addCharacterHandler

public void addCharacterHandler(CharacterHandler characterHandler)

setIgnoringComments

public void setIgnoringComments(boolean ignoreComments)
Sets whether comment nodes appear in the tree.

Parameters:
ignoreComments - true to ignore comments
See Also:
TreeBuilder.setIgnoringComments(boolean)