The Validator.nu HTML Parser

The Validator.nu HTML Parser is an implementation of the HTML parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)

Supported APIs

SAX, DOM and XOM are supported. Both truly streaming SAX and buffered SAX are supported. Some HTML errors require non-streamable recovery. Those are fatal in the truly streaming mode.
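Because the SAX flavor is used through the standard org.xml.sax.XMLReader interface, application code that only depends on that interface does not need to know which parser is behind it. The following is a minimal sketch of that idea, not code from the distribution. The class and method names (SaxDropIn, elementNames, jdkXmlReader) are made up for illustration, and the demo uses the JDK's own XML parser so that it runs without the parser JAR. The commented-out line shows the assumed swap to nu.validator.htmlparser.sax.HtmlParser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class SaxDropIn {
    /** Collects element local names from whichever XMLReader is passed in. */
    static List<String> elementNames(XMLReader reader, String markup) {
        List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(localName);
            }
        });
        try {
            reader.parse(new InputSource(new StringReader(markup)));
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    /** Namespace-aware XMLReader from the JDK, standing in for the XML parser. */
    static XMLReader jdkXmlReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><p>Hi</p></body></html>";
        XMLReader reader = jdkXmlReader();
        // Assumed swap when the parser JAR is on the classpath:
        // XMLReader reader = new nu.validator.htmlparser.sax.HtmlParser();
        System.out.println(elementNames(reader, xhtml)); // prints [html, body, p]
    }
}
```

Only the line that creates the reader changes; the handler code stays the same.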

When running under Google Web Toolkit, the browser’s DOM is used (createElementNS required). For document.write() support in Java apps, the application needs to provide its own TreeBuilder subclass and an IO driver. Please see the GWT-specific code for inspiration.

Get the Source

The code is available from a Git repository:

git clone https://github.com/validator/htmlparser.git; cd htmlparser; git checkout master

Download Release

You really should prefer getting the source (see above), since the latest release is over two years out of date. (Yeah, fixing that is on the todo list.)

Version 1.4 2012-06-05 (GPG sig)

The parser is also available from the Maven Central Repository (groupId: nu.validator.htmlparser, artifactId: htmlparser, version: 1.4).

Limitations of HotSpot

The distribution package comes with two precompiled JAR files: htmlparser-1.4.jar and htmlparser-1.4-with-transitions.jar. The first one works properly with HotSpot without special settings but does not support reporting tokenizer transitions to the application via TransitionHandler. The second one supports tokenizer transition reporting but does not get JITted by HotSpot unless you start the JVM with the -XX:-DontCompileHugeMethods command line switch.

Note: If you compile the parser yourself from source, you get what corresponds to the -transitions JAR. To get a version that works with HotSpot without the -XX:-DontCompileHugeMethods switch, you need to run nu.validator.htmlparser.generator.ApplyHotSpotWorkaround with the Tokenizer.java and HotSpotWorkaround.txt files as inputs. Warning: This will modify Tokenizer.java in place!
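An application that depends on the transition-reporting build may want to notice at startup that the JVM was launched without the switch. The following is a rough sketch of such a check, assuming the switch was passed on the command line so that it shows up in the arguments reported by the standard RuntimeMXBean. The class and method names are hypothetical and are not part of the parser.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class HotSpotFlagCheck {
    static final String FLAG = "-XX:-DontCompileHugeMethods";

    /** True if the given JVM argument list contains the switch verbatim. */
    static boolean hugeMethodCompilationEnabled(List<String> jvmArgs) {
        return jvmArgs.contains(FLAG);
    }

    public static void main(String[] args) {
        // Arguments the JVM itself was started with (not the program arguments).
        List<String> jvmArgs =
                ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (!hugeMethodCompilationEnabled(jvmArgs)) {
            System.err.println("Warning: " + FLAG
                    + " not set; the tokenizer may not get JITted.");
        }
    }
}
```

This only looks for the exact string, so it does not account for a later argument that turns the option back on.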

Run-Time Dependencies for Version 1.4

Sample Apps

The jar file contains sample main() entry points:

The first two are sample apps that demo the use of XSLT with HTML5. The first one can use SAX or DOM and requires the Xalan serializer. The second one uses XOM. Running without parameters dumps usage help.

java -cp htmlparser-1.4.jar nu.validator.htmlparser.tools.XSLT4HTML5 --template=sort-ul.xsl --input-html=test.html --output-html=out.html --mode=dom

HTML2XML converts HTML5 to XML 1.0 plus Namespaces. With no arguments, it reads from stdin and writes to stdout. With one parameter, it reads the named file and writes to stdout. With two parameters, the first is the input file name and the second is the output file name.

XML2HTML, HTML2HTML and XML2XML work analogously. The *2HTML versions produce bad output if the document tree is not serializable as HTML5. It is up to the user to make sure that it is.
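The zero, one or two argument convention described above can be sketched roughly as follows. This is not the source of the sample apps: the class name StreamArgs is made up, and a plain byte copy stands in for the actual conversion step.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class StreamArgs {
    /**
     * No arguments: stdin to stdout. One argument: named file to stdout.
     * Two arguments: first is the input file, second is the output file.
     */
    static void run(String[] args, InputStream stdin, OutputStream stdout) {
        if (args.length > 2) {
            throw new IllegalArgumentException("Usage: [input [output]]");
        }
        try {
            InputStream in = args.length >= 1
                    ? Files.newInputStream(Paths.get(args[0])) : stdin;
            try {
                OutputStream out = args.length == 2
                        ? Files.newOutputStream(Paths.get(args[1])) : stdout;
                try {
                    in.transferTo(out); // placeholder for the real conversion
                    out.flush();
                } finally {
                    if (out != stdout) out.close(); // leave stdout open
                }
            } finally {
                if (in != stdin) in.close(); // leave stdin open
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        run(args, System.in, System.out);
    }
}
```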

Upgrade Guide from 1.0.x to Current Release

In all cases, you need to check that your application does not break when it receives SVG or MathML subtrees.
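One way to make a SAX consumer tolerate such subtrees is to track the namespace URI of each element and skip everything below the first element that is not in the XHTML namespace. The following is a sketch under assumptions: the class name ForeignSubtreeGuard is made up, the demo feeds namespaced XML through the JDK's XML parser so it runs without the parser JAR, and it is assumed that the HTML parser reports HTML elements in the http://www.w3.org/1999/xhtml namespace. Note that this simple guard also skips any HTML content nested inside a foreign subtree.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class ForeignSubtreeGuard extends DefaultHandler {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    private int foreignDepth = 0; // > 0 while inside a skipped subtree
    private final List<String> htmlElements = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (foreignDepth > 0 || !XHTML_NS.equals(uri)) {
            foreignDepth++; // entering or descending into a foreign subtree
            return;
        }
        htmlElements.add(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (foreignDepth > 0) {
            foreignDepth--;
        }
    }

    /** Returns the names of elements seen outside SVG/MathML subtrees. */
    static List<String> htmlElementNames(String markup) {
        ForeignSubtreeGuard guard = new ForeignSubtreeGuard();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(markup)), guard);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return guard.htmlElements;
    }

    public static void main(String[] args) {
        String doc = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>a</p>"
                + "<svg xmlns=\"http://www.w3.org/2000/svg\"><circle/></svg>"
                + "<p>b</p></body></html>";
        System.out.println(htmlElementNames(doc)); // prints [html, body, p, p]
    }
}
```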

If you use the parser through the SAX, DOM or XOM API and do not pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:

If you really wanted the old default behavior, you should now pass XmlViolationPolicy.FATAL to the constructor.

If you did not really want to have fatal errors by default, you do not need to do anything, since ALTER_INFOSET is now the default.

If you use the parser through the SAX, DOM or XOM API and do pass an explicit XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder:

You do not need to change your code to upgrade.

If you have your own subclass of TreeBuilder:

The abstract methods on TreeBuilder now have additional arguments for passing the namespace URI. You should upgrade your subclass to deal with the namespace URIs. (The URI is always an interned string, so you can use == to compare.)

There is a class called CoalescingTreeBuilder which you should subclass instead of TreeBuilder to get automatic text node coalescing.

The entry point for passing in a SAX InputSource has moved from the Tokenizer class to the Driver class (in the io package), so you should change your references from Tokenizer to Driver.
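The note above about comparing namespace URIs with == relies on Java string interning: string literals are interned, so an interned URI passed in by the parser is the same object as an equal literal in your code. A small sketch of the difference follows. The class, constant and method names are hypothetical; the actual TreeBuilder method signatures are in the JavaDocs.

```java
final class NamespaceIdentity {
    // String literals are interned by the Java language, so this constant
    // is the canonical instance for this URI.
    static final String SVG_NS = "http://www.w3.org/2000/svg";

    /** Identity check; only valid when the caller passes an interned string. */
    static boolean isSvg(String internedNs) {
        return internedNs == SVG_NS;
    }

    public static void main(String[] args) {
        String runtimeCopy = new String(SVG_NS); // equal contents, distinct object
        System.out.println(isSvg(runtimeCopy));          // prints false
        System.out.println(isSvg(runtimeCopy.intern())); // prints true
    }
}
```

If a namespace string can reach your code from anywhere other than the parser, use equals() or intern it first.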

If you have your own implementation of TokenHandler:

Please refer to the JavaDocs of TokenHandler. Also note the new separation of Tokenizer and Driver mentioned above.

Change Log

1.4

1.3.1

1.3

1.2.1

1.2.0

1.1.1

1.1.0

1.0.7

1.0.6

License

This is for the HTML parser as a whole except the rewindable input stream,
the named character classes and the Live DOM Viewer. 
For the copyright notices for individual files, please see individual files.

/*
 * Copyright (c) 2005, 2006, 2007 Henri Sivonen
 * Copyright (c) 2007-2011 Mozilla Foundation
 * Portions of comments Copyright 2004-2007 Apple Computer, Inc., Mozilla 
 * Foundation, and Opera Software ASA.
 *
 * Permission is hereby granted, free of charge, to any person obtaining a 
 * copy of this software and associated documentation files (the "Software"), 
 * to deal in the Software without restriction, including without limitation 
 * the rights to use, copy, modify, merge, publish, distribute, sublicense, 
 * and/or sell copies of the Software, and to permit persons to whom the 
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in 
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL 
 * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 
 * DEALINGS IN THE SOFTWARE.
 */

The following license is for the WHATWG spec from which the named character
data was extracted.

/*
 * Copyright 2004-2010 Apple Computer, Inc., Mozilla Foundation, and Opera 
 * Software ASA.
 * 
 * You are granted a license to use, reproduce and create derivative works of 
 * this document.
 */
 
The following license is for the rewindable input stream.

/*
 * Copyright (c) 2001-2003 Thai Open Source Software Center Ltd
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without 
 * modification, are permitted provided that the following conditions 
 * are met:
 *
 *  * Redistributions of source code must retain the above copyright 
 *    notice, this list of conditions and the following disclaimer.
 *  * Redistributions in binary form must reproduce the above 
 *    copyright notice, this list of conditions and the following 
 *    disclaimer in the documentation and/or other materials provided 
 *    with the distribution.
 *  * Neither the name of the Thai Open Source Software Center Ltd nor 
 *    the names of its contributors may be used to endorse or promote 
 *    products derived from this software without specific prior 
 *    written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 
 * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE 
 * REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, 
 * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 
 * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 
 * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN 
 * ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 
 * POSSIBILITY OF SUCH DAMAGE.
 */

The following license applies to the Live DOM Viewer:

Copyright (c) 2000, 2006, 2008 Ian Hickson and various contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

Sightings

Validator.nu HTML Parser in use elsewhere:

Bugs

Known bugs on the trunk.

Acknowledgements

Thanks to the Mozilla Foundation and the Mozilla Corporation for funding this project. Thanks to the html5lib team and Philip Taylor for test cases and bug reports. Thanks to Chris Hubick for Mavenization. Thanks to Simon Pieters and the Firefox nightly testers for finding bugs. Thanks to Mike(tm) Smith, William Chen, Mats Palmgren and Neil Rashbrook for fixing bugs. Thanks to Ian Hickson for writing the spec and the Live DOM Viewer.

Other Implementations

Please refer to the WHATWG wiki for implementations in other programming languages.

Contact

hsivonen@hsivonen.fi