The Validator.nu HTML Parser

The Validator.nu HTML Parser is an implementation of the HTML parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)
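
As a rough illustration of the drop-in idea for SAX, here is a minimal sketch that assumes the application talks to its parser only through the standard org.xml.sax.XMLReader interface. The names TitleExtractor, extractTitle and jdkXmlReader are made up for this example. The demo runs against the JDK's own XML parser so that it has no dependencies. With the parser JAR on the classpath, an instance of the SAX entry point class (HtmlParser, mentioned in the upgrade guide below) would presumably be passed to extractTitle instead, assuming it implements XMLReader and reports HTML elements in the XHTML namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: application code that depends only on the SAX interfaces.
final class TitleExtractor {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Works with any XMLReader, so the parser implementation can be swapped
    // without touching this method.
    static String extractTitle(XMLReader reader, InputSource source) throws Exception {
        final StringBuilder title = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            private boolean inTitle;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (XHTML_NS.equals(uri) && "title".equals(localName)) {
                    inTitle = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (XHTML_NS.equals(uri) && "title".equals(localName)) {
                    inTitle = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // characters() may be called several times for one text run.
                if (inTitle) {
                    title.append(ch, start, length);
                }
            }
        });
        reader.parse(source);
        return title.toString();
    }

    // The JDK's namespace-aware XML parser, used here only for the demo.
    static XMLReader jdkXmlReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'>"
                + "<head><title>Hello</title></head><body/></html>";
        System.out.println(extractTitle(jdkXmlReader(), new InputSource(new StringReader(xhtml))));
    }
}
```

Because the handler matches on namespace URI and local name rather than on qualified names, it should not care whether the events come from an XML parser reading XHTML or from an HTML parser.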

Supported APIs

SAX, DOM and XOM are supported. Both truly streaming SAX and buffered SAX are supported. Some HTML errors require non-streamable recovery; those errors are fatal in the truly streaming mode.

When running under Google Web Toolkit, the browser’s DOM is used (createElementNS required). For document.write() support in Java apps, the application needs to provide its own TreeBuilder subclass and an IO driver. Please see the GWT-specific code for inspiration.

Get the Source

The code is available from a Git repository:

git clone https://github.com/validator/htmlparser.git; cd htmlparser; git checkout master

Download Release

You really should prefer getting the source (see above), since the latest release is over two years out of date. (Yeah, fixing that is on the todo list.)

Version 1.4 2012-06-05 (GPG sig)

The parser is also available from the Maven Central Repository (groupId: nu.validator.htmlparser, artifactId: htmlparser, version: 1.4).

Limitations of HotSpot

The distribution package comes with two precompiled JAR files: htmlparser-1.4.jar and htmlparser-1.4-with-transitions.jar. The first one works properly with HotSpot without special settings but does not support reporting tokenizer transitions to the application via TransitionHandler. The second one supports tokenizer transition reporting but does not get JITted by HotSpot unless you start the JVM with the -XX:-DontCompileHugeMethods command line switch.

Note: If you compile the parser yourself from source, you get what corresponds to the -with-transitions JAR. To get a version that works with HotSpot without the -XX:-DontCompileHugeMethods switch, you need to run nu.validator.htmlparser.generator.ApplyHotSpotWorkaround with the Tokenizer.java and HotSpotWorkaround.txt files as inputs. Warning: This will modify Tokenizer.java in place!

Run-Time Dependencies for Version 1.4

JDK 5.0 or later
Optionally ICU4J (required if source code Unicode normalization checking is enabled or if the ICU4J encoding sniffer is enabled)
Optionally jchardet (required if the Mozilla chardet encoding sniffer is enabled)
Optionally XOM (required for XOM functionality)

Sample Apps

The jar file contains sample main() entry points:

nu.validator.htmlparser.tools.XSLT4HTML5
nu.validator.htmlparser.tools.XSLT4HTML5XOM
nu.validator.htmlparser.tools.HTML2XML
nu.validator.htmlparser.tools.XML2HTML
nu.validator.htmlparser.tools.XML2XML
nu.validator.htmlparser.tools.HTML2HTML

The first two are sample apps that demo the use of XSLT with HTML5. The first one can use SAX or DOM and requires the Xalan serializer. The second one uses XOM. Running without parameters dumps usage help.

java -cp htmlparser-1.4.jar nu.validator.htmlparser.tools.XSLT4HTML5 --template=sort-ul.xsl --input-html=test.html --output-html=out.html --mode=dom

HTML2XML converts HTML5 to XML 1.0 plus Namespaces. With no arguments, it reads from stdin and writes to stdout. With one parameter, it reads the named file and writes to stdout. With two parameters, the first is the input file name and the second is the output file name.

XML2HTML, HTML2HTML and XML2XML work analogously. The *2HTML versions produce bad output if the document tree is not serializable as HTML5. It is up to the user to make sure that it is.
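
The zero, one or two argument convention of the converters can be sketched as follows. This is not the code of the bundled tools; StreamArgs, openInput, openOutput and run are hypothetical names, and the byte copy merely stands in for the actual conversion step.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical sketch of the argument convention; not taken from the tools.
final class StreamArgs {
    // First argument, if present, names the input file; otherwise read stdin.
    static InputStream openInput(String[] args) throws IOException {
        return args.length >= 1 ? Files.newInputStream(Paths.get(args[0])) : System.in;
    }

    // Second argument, if present, names the output file; otherwise write stdout.
    static OutputStream openOutput(String[] args) throws IOException {
        return args.length >= 2 ? Files.newOutputStream(Paths.get(args[1])) : System.out;
    }

    // Placeholder for the conversion: copies the input to the output unchanged.
    static void run(String[] args) throws IOException {
        try (InputStream in = openInput(args); OutputStream out = openOutput(args)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            out.flush();
        }
    }

    public static void main(String[] args) throws IOException {
        run(args);
    }
}
```

Note that run closes whichever streams it opened, including stdin and stdout, so it is meant to be the last thing the program does.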

Upgrade Guide from 1.0.x to Current Release

In all cases, you need to check that your application does not break when it receives SVG or MathML subtrees.

If you use the parser through the SAX, DOM or XOM API and do not pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:

If you really wanted the old default behavior, you should now pass XmlViolationPolicy.FATAL to the constructor.

If you did not really want to have fatal errors by default, you do not need to do anything, since ALTER_INFOSET is now the default.

If you use the parser through the SAX, DOM or XOM API and do pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:

You do not need to change your code to upgrade.

If you have your own subclass of TreeBuilder:

The abstract methods on TreeBuilder now have additional arguments for passing the namespace URI. You should upgrade your subclass to deal with the namespace URIs. (The URI is always an interned string, so you can use == to compare.)

There is a class called CoalescingTreeBuilder which you should subclass instead of TreeBuilder to get automatic text node coalescing.

The entry point for passing in a SAX InputSource has moved from the Tokenizer class to the Driver class (in the io package), so you should change your references from Tokenizer to Driver.
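
The remark above about comparing interned namespace URIs with == can be illustrated with plain Java strings. This is only a sketch: NamespaceDispatch, kindOf and the three constants are hypothetical names introduced for this example and are not the parser's own API. It relies on the fact that Java string literals are interned, so a URI that has been interned is the same object as the matching constant.

```java
// Hypothetical example; the parser's own namespace constants are not shown here.
final class NamespaceDispatch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    static final String SVG_NS = "http://www.w3.org/2000/svg";
    static final String MATHML_NS = "http://www.w3.org/1998/Math/MathML";

    // Reference comparison is only safe because the caller guarantees
    // that the namespace URI is an interned string.
    static String kindOf(String internedNamespaceUri) {
        if (internedNamespaceUri == XHTML_NS) {
            return "html";
        }
        if (internedNamespaceUri == SVG_NS) {
            return "svg";
        }
        if (internedNamespaceUri == MATHML_NS) {
            return "mathml";
        }
        return "other";
    }

    public static void main(String[] args) {
        String interned = new String(SVG_NS).intern(); // canonical instance
        String copy = new String(SVG_NS);               // equal text, different object
        System.out.println(kindOf(interned)); // prints "svg"
        System.out.println(kindOf(copy));     // prints "other"
    }
}
```

The second call shows the pitfall: a string with the same characters that has not been interned does not match, so == should only be used on values that are known to be interned, and equals() everywhere else.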

If you have your own implementation of TokenHandler:

Please refer to the JavaDocs of TokenHandler. Also note the new separation of Tokenizer and Driver mentioned above.

Change Log

1.4

No longer crashes in setErrorHandler() in the DOM case.
No longer crashes with ArrayIndexOutOfBoundsException in the meta prescan.
Correctness tweaks to HTML integration point and MathML text integration point behavior.
Slight adjustments to error and warning reporting.
The XLink namespace is now serialized more nicely.
Unicode decoder returning zero-length output in the middle of the file is now dealt with correctly.
No longer goes to infinite loop with the HotSpot workaround applied.
Builds again with Maven.

1.3.1

Fixed the release package to contain the command-line tools that were accidentally omitted from the previous release package.
Better error reporting for unclosed elements in lists.
Correct behavior when <nobr> is seen when nobr is already open.
Reduced the static memory footprint slightly.

1.3

Implemented spec changes. (Too numerous to enumerate, but, as highlights, foreign content works better now and there are limits on the growth of the number of formatting element clones.)
Made Dom2Sax robust against null localNames.
Error reporting improvements for unclosed elements.
The HTML serializer no longer serializes comments when inside an [R]CDATA element.
Line break swallowing now works correctly after a pre, textarea or listing start tag.
Added an interface that enables the application to follow the state transitions in the tokenizer.
Fixed the Uhhhhh notation output when applying infoset coercion.
Provided an infoset coercing SAX parser class that has a zero-argument constructor and can be conveniently passed to other Java code that wants to instantiate a parser class by name with no arguments (e.g. Saxon).

1.2.1

Fixed an IDness issue with the DOM implementation of the latest Xerces.

1.2.0

Fixed an issue where, under rare circumstances, attribute values leaked into element content.
Fixed a bug where isindex processing added attributes to all elements that were supposed to have no attributes.
Implemented spec changes. (Too numerous to enumerate, but, as a highlight, framesets parse much better now.)
Moved to WebKit-style foster parenting.
Changed the API for tree builder subclasses again due to new constraints. If you have previously written your own tree builder subclass, you need to change it.
Fixed the bundled XML serializer.
Made it possible to generate a C++ version that does not leak memory from the Java source.
Removed the C++ translator for the release. (Get it from SVN.)

1.1.1

Fixed JavaDocs about XML violation policy defaults.
Fixed the handling of spaces in attributes in the XML serializer.
Made getElementById work with the DOM trees built by the parser.

1.1.0

Made the SAX, DOM and XOM parser entry point constructors default to altering the infoset instead of throwing when the input needs coercing to be an XML 1.0 4th ed. plus Namespaces infoset.
Isolated Java IO dependent code from the parser core. The parser core now compiles on Google Web Toolkit.
Refactored the tokenizer to use a switch branch per state instead of method per state.
Made various performance tweaks to the tokenizer.
Implemented support for MathML and SVG foreign content. (Note that the SVG part is based on spec text that has been commented out from the spec at the request of the SVG WG.)
Made the parser suspendable after any input character.
Made it possible for custom TreeBuilder subclasses to request parser suspension. (Applications wishing to implement document.write() should provide their own TreeBuilder subclass and a document.write()-aware replacement of the Driver class. Look in the gwt-src/ directory for sample code.)
Made changes to the parser core to make it more suitable for mechanical translation into other object-oriented programming languages that have C-like control structures but not necessarily a garbage collector (with focus on targeting C++). This work is not complete.
Made the HTML serializer do the right thing when input represents a conforming XHTML+SVG+MathML tree. (Results may be bad for non-conforming input trees.)
Developed sample programs for converting between HTML5 and XHTML5 when the input is known to be conforming.
Provided an XML serializer so that the sample code no longer depends on the Xalan serializer.
Improved API documentation.
Fixed bugs in the tokenizer, tree builder and the input stream character encoding decoder.
Made coercion to an XML infoset work according to the HTML5 spec.
Added ID uniqueness checking.
Various other fixes.

1.0.7

Adds optional support for heuristic encoding sniffing using the ICU4J sniffer, jchardet or both.
Adds support for rewinding and reparsing when becoming confident about the character encoding and the tentative encoding was wrong.
Performs encoding name matching per spec instead of using the JDK mechanism.
Implements spec changes up until just before SVG and MathML support. (Those will merit 1.1 or something.)
Warning: The semantics of the doctype token have changed in case you have your own token handler (unlikely).

1.0.6

Fixes a crasher bug in bytes to characters conversion
Works around a crash when the ICU4J 3.8.1 UTF-7 decoder is in the classpath
Improves error message wording
Brings errors and warnings pertaining to legacy encodings up-to-date per the current HTML 5 draft

License

This is for the HTML parser as a whole except the rewindable input stream,
the named character classes and the Live DOM Viewer. 
For the copyright notices for individual files, please see individual files.

/*
 * Copyright (c) 2005, 2006, 2007 Henri Sivonen
 * Copyright (c) 2007-2011 Mozilla Foundation
 * Portions of comments Copyright 2004-2007 Apple Computer, Inc., Mozilla 
 * Foundation, and Opera Software ASA.
 *
 * Permission is hereby granted, free of charge, to any person obtaining a 
 * copy of this software and associated documentation files (the "Software"), 
 * to deal in the Software without restriction, including without limitation 
 * the rights to use, copy, modify, merge, publish, distribute, sublicense, 
 * and/or sell copies of the Software, and to permit persons to whom the 
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in 
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL 
 * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 
 * DEALINGS IN THE SOFTWARE.
 */

The following license is for the WHATWG spec from which the named character
data was extracted.

/*
 * Copyright 2004-2010 Apple Computer, Inc., Mozilla Foundation, and Opera 
 * Software ASA.
 * 
 * You are granted a license to use, reproduce and create derivative works of 
 * this document.
 */
 
The following license is for the rewindable input stream.

/*
 * Copyright (c) 2001-2003 Thai Open Source Software Center Ltd
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without 
 * modification, are permitted provided that the following conditions 
 * are met:
 *
 *  * Redistributions of source code must retain the above copyright 
 *    notice, this list of conditions and the following disclaimer.
 *  * Redistributions in binary form must reproduce the above 
 *    copyright notice, this list of conditions and the following 
 *    disclaimer in the documentation and/or other materials provided 
 *    with the distribution.
 *  * Neither the name of the Thai Open Source Software Center Ltd nor 
 *    the names of its contributors may be used to endorse or promote 
 *    products derived from this software without specific prior 
 *    written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 
 * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE 
 * REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, 
 * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 
 * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 
 * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN 
 * ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 
 * POSSIBILITY OF SUCH DAMAGE.
 */

The following license applies to the Live DOM Viewer:

Copyright (c) 2000, 2006, 2008 Ian Hickson and various contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

Sightings

Validator.nu HTML Parser in use elsewhere:

Bugs

Known bugs on the trunk.

Acknowledgements

Thanks to the Mozilla Foundation and the Mozilla Corporation for funding this project. Thanks to the html5lib team and Philip Taylor for test cases and bug reports. Thanks to Chris Hubick for Mavenization. Thanks to Simon Pieters and the Firefox nightly testers for finding bugs. Thanks to Mike(tm) Smith, William Chen, Mats Palmgren and Neil Rashbrook for fixing bugs. Thanks to Ian Hickson for writing the spec and the Live DOM Viewer.

Other Implementations

Please refer to the WHATWG wiki for implementations in other programming languages.

Contact

hsivonen@hsivonen.fi