The Validator.nu HTML Parser is
an implementation of the HTML
parsing algorithm in Java. The parser is designed to work as a drop-in replacement for
the XML parser in applications that already support XHTML 1.x content
with an XML parser and use SAX, DOM or XOM to interface with the
parser. Low-level functionality is provided for applications that wish to perform their own IO and support
document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)
SAX, DOM and XOM are supported. Both truly streaming SAX and buffered SAX are supported. Some HTML errors require non-streamable recovery. Those are fatal in the truly streaming mode.
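The drop-in idea is easiest to see in application code that only touches the SAX interfaces. The following is a minimal sketch, not code from the distribution: `ElementCounter` and its methods are hypothetical names, and the JDK's own XML parser stands in as the `XMLReader` so that the example runs without extra JARs. An application would presumably hand in the HTML parser's SAX implementation at the same spot.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Counts elements using only SAX interfaces, so the reader behind them is swappable. */
final class ElementCounter extends DefaultHandler {
    private int count;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        count++;
    }

    /** Parses the source with whichever XMLReader the caller supplies. */
    static int countElements(XMLReader reader, InputSource source) throws Exception {
        ElementCounter handler = new ElementCounter();
        reader.setContentHandler(handler);
        reader.parse(source);
        return handler.count;
    }

    /** Stand-in reader: the JDK's namespace-aware XML parser. */
    static XMLReader newJdkXmlReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    /** Convenience wrapper that parses a string with the stand-in reader. */
    static int countInString(String markup) {
        try {
            return countElements(newJdkXmlReader(), new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><p>a</p><p>b</p></body></html>";
        // Assumption: swapping the parser means changing only newJdkXmlReader().
        System.out.println(countInString(xhtml));
    }
}
```

Because `countElements` depends on `XMLReader` alone, replacing the reader does not touch the handler code.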
When running under Google Web Toolkit, the browser’s DOM is used (createElementNS required). For document.write() support in Java apps, the application needs to provide its own TreeBuilder subclass and an IO driver. Please see the GWT-specific code for inspiration.
The code is available from a Git repo: git clone https://github.com/validator/htmlparser.git; cd htmlparser; git checkout master
You really should prefer getting the source (see above), since the latest release is over two years out of date. (Yeah, fixing that is on the todo list.)
Version 1.4 2012-06-05 (GPG sig)
The parser is also available from the Maven Central Repository.
The distribution package comes with two precompiled JAR files, one of which is htmlparser-1.4-with-transitions.jar. The JAR without transition support works properly with HotSpot without special settings but does not support reporting tokenizer transitions to the application via TransitionHandler. The -with-transitions JAR supports tokenizer transition reporting but does not get JITted by HotSpot unless you start the JVM with the -XX:-DontCompileHugeMethods command line switch.
Note: If you compile the parser yourself from source, you get what corresponds to the
-transitions JAR. To get a version that works with HotSpot without the
-XX:-DontCompileHugeMethods switch, you need to run
nu.validator.htmlparser.generator.ApplyHotSpotWorkaround with the
HotSpotWorkaround.txt files as inputs. Warning: This will modify
Tokenizer.java in place!
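An application that ships the transition-reporting build may want to warn at startup when the switch is missing. The sketch below is one possible way to do that, using the standard `RuntimeMXBean.getInputArguments()` call; the class name `HugeMethodFlagCheck` and the warning text are made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Checks whether the JVM was started with the switch named in the text above. */
final class HugeMethodFlagCheck {
    static final String FLAG = "-XX:-DontCompileHugeMethods";

    /** Returns true if the given JVM argument list contains the switch. */
    static boolean hasFlag(List<String> jvmArgs) {
        return jvmArgs.contains(FLAG);
    }

    public static void main(String[] args) {
        // The arguments the running JVM was launched with (excluding main() arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (!hasFlag(jvmArgs)) {
            System.err.println("Warning: " + FLAG + " not set; the tokenizer may not be JIT-compiled.");
        }
    }
}
```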
JDK 5.0 or later
Optionally ICU4J (required if source code Unicode normalization checking is enabled or if the ICU4J encoding sniffer is enabled)
Optionally jchardet (required if the Mozilla chardet encoding sniffer is enabled)
Optionally XOM (required for XOM functionality)
The jar file contains sample
main() entry points:
The first two are sample apps that demo the use of XSLT with HTML5. The first one can use SAX or DOM and requires the Xalan serializer. The second one uses XOM. Running without parameters dumps usage help.
java -cp htmlparser-1.4.jar nu.validator.htmlparser.tools.XSLT4HTML5 --template=sort-ul.xsl --input-html=test.html --output-html=out.html --mode=dom
HTML2XML converts HTML5 to XML 1.0 plus Namespaces. With no arguments, it reads from stdin and writes to stdout. With one parameter, it reads the named file and writes to stdout. With two parameters, the first is the input file name and the second is the output file name.
XML2HTML, HTML2HTML and XML2XML work analogously. The *2HTML versions produce bad output if the document tree is not serializable as HTML5. It is up to the user to make sure that it is.
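The zero, one and two argument convention described above can be sketched as follows. This is not the source of the sample tools: `StreamSelection` is a hypothetical class, and a plain byte copy stands in for the actual conversion.

```java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Picks input and output streams from the argument count: 0, 1 or 2 arguments. */
final class StreamSelection {
    /** No arguments: standard input. Otherwise: the file named by the first argument. */
    static InputStream openInput(String[] args) {
        if (args.length == 0) {
            return System.in;
        }
        try {
            return new FileInputStream(args[0]);
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Fewer than two arguments: standard output. Otherwise: the file named by the second. */
    static OutputStream openOutput(String[] args) {
        if (args.length < 2) {
            return System.out;
        }
        try {
            return new FileOutputStream(args[1]);
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = openInput(args);
        OutputStream out = openOutput(args);
        // Placeholder for the conversion step: copy bytes through unchanged.
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        out.flush();
    }
}
```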
In all cases, you need to check that your application does not break when it receives SVG or MathML subtrees.
XmlViolationPolicy to the constructor of
If you really wanted the old default behavior, you should now pass
XmlViolationPolicy.FATAL to the constructor.
If you did not really want to have fatal errors by default, you do not need to do anything, since
ALTER_INFOSET is now the default.
XmlViolationPolicy to the constructor of
You do not need to change your code to upgrade.
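To make the difference between the two policies concrete, here is a toy illustration using a comment that contains "--", which XML 1.0 does not allow inside comments. The enum `ViolationPolicy` and the method `commentForXml` are stand-ins invented for this sketch, and the specific repair (inserting a space between the hyphens) is an assumption about what altering the infoset could look like, not a statement of what the parser does.

```java
/** Toy model of two ways to react to content that cannot be represented in XML. */
final class ViolationPolicyDemo {
    /** Stand-in for the policy values named in the text above. */
    enum ViolationPolicy { FATAL, ALTER_INFOSET }

    /** Returns comment text that is legal in XML 1.0, or fails, depending on the policy. */
    static String commentForXml(String comment, ViolationPolicy policy) {
        boolean legal = !comment.contains("--") && !comment.endsWith("-");
        if (legal) {
            return comment;
        }
        if (policy == ViolationPolicy.FATAL) {
            throw new IllegalArgumentException("Comment not representable in XML: " + comment);
        }
        // ALTER_INFOSET: change the data so that it becomes representable.
        String fixed = comment;
        while (fixed.contains("--")) {
            fixed = fixed.replace("--", "- -");
        }
        return fixed.endsWith("-") ? fixed + " " : fixed;
    }

    public static void main(String[] args) {
        System.out.println(commentForXml("a--b", ViolationPolicy.ALTER_INFOSET)); // a- -b
        try {
            commentForXml("a--b", ViolationPolicy.FATAL);
        } catch (IllegalArgumentException e) {
            System.out.println("fatal: " + e.getMessage());
        }
    }
}
```

With a fatal policy the caller must handle the failure; with the altering policy the caller gets slightly different data but no error, which is why the choice of default matters when upgrading.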
The abstract methods on
TreeBuilder now have additional arguments for passing the namespace URI. You should upgrade your subclass to deal with the namespace URIs. (The URI is always an interned string, so you can use
== to compare.)
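The remark about `==` relies on a general Java property: two interned strings with the same content are the same object. A short sketch of why that holds, and why it fails for strings that are not interned (the class name `InternedNamespaceCheck` is illustrative only):

```java
/** Shows when identity comparison of namespace URIs is safe. */
final class InternedNamespaceCheck {
    // String literals are interned automatically by the JVM.
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Identity comparison: valid only when the argument is an interned string. */
    static boolean isXhtml(String internedNamespaceUri) {
        return internedNamespaceUri == XHTML_NS;
    }

    public static void main(String[] args) {
        // A string built at run time is a distinct object until it is interned.
        String built = new StringBuilder("http://www.w3.org/").append("1999/xhtml").toString();
        System.out.println(isXhtml(built));          // false
        System.out.println(isXhtml(built.intern())); // true
        System.out.println(XHTML_NS.equals(built));  // true
    }
}
```

If a namespace URI reaches your subclass from anywhere other than the parser, compare with `equals()` or intern it first.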
There is a class called
CoalescingTreeBuilder which you should subclass instead of
TreeBuilder to get automatic text node coalescing.
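Text node coalescing means that adjacent runs of character data end up in a single text node rather than several. The sketch below shows the general buffering technique only; `TextCoalescer` is a hypothetical class and does not reproduce the method signatures of CoalescingTreeBuilder.

```java
import java.util.ArrayList;
import java.util.List;

/** Buffers character chunks and emits them as one text event at the next element boundary. */
final class TextCoalescer {
    private final StringBuilder pending = new StringBuilder();
    private final List<String> events = new ArrayList<>();

    /** Character data may arrive in arbitrary chunks; just accumulate it. */
    void characters(String chunk) {
        pending.append(chunk);
    }

    void startElement(String name) {
        flush();
        events.add("<" + name + ">");
    }

    void endElement(String name) {
        flush();
        events.add("</" + name + ">");
    }

    /** Emits the buffered text, if any, as a single event. */
    private void flush() {
        if (pending.length() > 0) {
            events.add("text:" + pending);
            pending.setLength(0);
        }
    }

    /** Returns all events seen so far, flushing any trailing text first. */
    List<String> events() {
        flush();
        return events;
    }

    public static void main(String[] args) {
        TextCoalescer c = new TextCoalescer();
        c.startElement("p");
        c.characters("Hel");
        c.characters("lo");
        c.endElement("p");
        System.out.println(c.events()); // [<p>, text:Hello, </p>]
    }
}
```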
The entry point for passing in a SAX
InputSource has moved from the
Tokenizer class to the
Driver class (in the
io package), so you should change your references from
Please refer to the JavaDocs of
TokenHandler. Also note the new separation of
Driver mentioned above.
setErrorHandler() in the DOM case.
<nobr> is seen when nobr is already open.
isindex processing added attributes to all elements that were supposed to have no attributes.
getElementById work with the DOM trees built by the parser.
switch branch per state instead of method per state.
TreeBuilder subclasses to request parser suspension. (Applications wishing to implement document.write() should provide their own TreeBuilder subclass and a document.write()-aware replacement of the Driver class. Look in the gwt-src/ directory for sample code.)
This is for the HTML parser as a whole except the rewindable input stream, the named character classes and the Live DOM Viewer. For the copyright notices for individual files, please see individual files.

/*
 * Copyright (c) 2005, 2006, 2007 Henri Sivonen
 * Copyright (c) 2007-2011 Mozilla Foundation
 * Portions of comments Copyright 2004-2007 Apple Computer, Inc., Mozilla
 * Foundation, and Opera Software ASA.
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
 * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
 * DEALINGS IN THE SOFTWARE.
 */

The following license is for the WHATWG spec from which the named character data was extracted.

/*
 * Copyright 2004-2010 Apple Computer, Inc., Mozilla Foundation, and Opera
 * Software ASA.
 *
 * You are granted a license to use, reproduce and create derivative works of
 * this document.
 */

The following license is for the rewindable input stream.

/*
 * Copyright (c) 2001-2003 Thai Open Source Software Center Ltd
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 *  * Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *  * Redistributions in binary form must reproduce the above
 *    copyright notice, this list of conditions and the following
 *    disclaimer in the documentation and/or other materials provided
 *    with the distribution.
 *  * Neither the name of the Thai Open Source Software Center Ltd nor
 *    the names of its contributors may be used to endorse or promote
 *    products derived from this software without specific prior
 *    written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
 * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
 * REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
 * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
 * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
 * ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */

The following license applies to the Live DOM Viewer:

Copyright (c) 2000, 2006, 2008 Ian Hickson and various contributors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Validator.nu HTML Parser in use elsewhere:
Known bugs on the trunk.
Thanks to the Mozilla Foundation and the Mozilla Corporation for funding this project. Thanks to the html5lib team and Philip Taylor for test cases and bug reports. Thanks to Chris Hubick for Mavenization. Thanks to Simon Pieters and the Firefox nightly testers for finding bugs. Thanks to Mike(tm) Smith, William Chen, Mats Palmgren and Neil Rashbrook for fixing bugs. Thanks to Ian Hickson for writing the spec and the Live DOM Viewer.
Please refer to the WHATWG wiki for implementations in other programming languages.