The Validator.nu HTML Parser is an implementation of the HTML parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)
SAX, DOM and XOM are supported. Both truly streaming SAX and buffered SAX are supported. Some HTML errors require non-streamable recovery. Those are fatal in the truly streaming mode.
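The drop-in idea can be illustrated with application code that depends only on the standard org.xml.sax.XMLReader interface. The sketch below is a minimal example under stated assumptions: the ElementCounter class and its method names are hypothetical, and the JDK's own XML parser is used as the reader so that the example runs without this library on the classpath. An application that wanted HTML input would presumably pass the library's SAX parser object (the HtmlParser class mentioned later in this document) wherever newXmlReader() is used here, provided it is supplied as an XMLReader.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical application code written purely against the SAX API. */
class ElementCounter extends DefaultHandler {
    private int count;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        count++; // one callback per start tag, whatever parser produced it
    }

    /** Parses the markup with whichever XMLReader the caller supplies. */
    static int countElements(XMLReader reader, String markup) throws Exception {
        ElementCounter handler = new ElementCounter();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.count;
    }

    /** Stand-in reader: the JDK's namespace-aware XML parser. */
    static XMLReader newXmlReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'>"
                + "<head><title>t</title></head><body><p>hi</p></body></html>";
        System.out.println(countElements(newXmlReader(), xhtml)); // prints 5
    }
}
```

Because countElements() never names a concrete parser class, switching parsers is a change at the call site only.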
When running under Google Web Toolkit, the browser’s DOM is used (createElementNS required). For document.write() support in Java apps, the application needs to provide its own TreeBuilder subclass and an IO driver. Please see the GWT-specific code for inspiration.
The code is available from a Git repo: git clone https://github.com/validator/htmlparser.git; cd htmlparser; git checkout master
You really should prefer getting the source (see above), since the latest release is over two years out of date. (Yeah, fixing that is on the todo list.)
Version 1.4 2012-06-05 (GPG sig)
The parser is also available from the Maven Central Repository (groupId: nu.validator.htmlparser, artifactId: htmlparser, version: 1.4).
The distribution package comes with two precompiled JAR files: htmlparser-1.4.jar and htmlparser-1.4-with-transitions.jar. The first one works properly with HotSpot without special settings but does not support reporting tokenizer transitions to the application via TransitionHandler. The second one supports tokenizer transition reporting but does not get JITted by HotSpot unless you start the JVM with the -XX:-DontCompileHugeMethods command line switch.
Note: If you compile the parser yourself from source, you get what corresponds to the -transitions JAR. To get a version that works with HotSpot without the -XX:-DontCompileHugeMethods switch, you need to run nu.validator.htmlparser.generator.ApplyHotSpotWorkaround with the Tokenizer.java and HotSpotWorkaround.txt files as inputs. Warning: This will modify Tokenizer.java in place!
JDK 5.0 or later
Optionally ICU4J (required if source code Unicode normalization checking is enabled or if the ICU4J encoding sniffer is enabled)
Optionally jchardet (required if the Mozilla chardet encoding sniffer is enabled)
Optionally XOM (required for XOM functionality)
The jar file contains sample main() entry points:
nu.validator.htmlparser.tools.XSLT4HTML5
nu.validator.htmlparser.tools.XSLT4HTML5XOM
nu.validator.htmlparser.tools.HTML2XML
nu.validator.htmlparser.tools.XML2HTML
nu.validator.htmlparser.tools.XML2XML
nu.validator.htmlparser.tools.HTML2HTML
The first two are sample apps that demo the use of XSLT with HTML5. The first one can use SAX or DOM and requires the Xalan serializer. The second one uses XOM. Running without parameters dumps usage help.
java -cp htmlparser-1.4.jar nu.validator.htmlparser.tools.XSLT4HTML5 --template=sort-ul.xsl --input-html=test.html --output-html=out.html --mode=dom
HTML2XML converts HTML5 to XML 1.0 plus Namespaces. With no arguments, it reads from stdin and writes to stdout. With one parameter, it reads the named file and writes to stdout. With two parameters, the first is the input file name and the second is the output file name.
XML2HTML, HTML2HTML and XML2XML work analogously. The *2HTML versions produce bad output if the document tree is not serializable as HTML5. It is up to the user to make sure that it is.
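The argument convention described above (no arguments: standard input to standard output; one argument: named file to standard output; two arguments: input file to output file) can be sketched as follows. This is only an illustration of the stream selection, not the tools' actual source: the ToolStreams class is hypothetical, and a plain byte copy stands in for the conversion step.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical sketch of the zero/one/two-argument stream convention. */
class ToolStreams {
    /** First argument, if present, names the input file; otherwise use stdin. */
    static InputStream openInput(String[] args) throws IOException {
        return args.length >= 1 ? new FileInputStream(args[0]) : System.in;
    }

    /** Second argument, if present, names the output file; otherwise use stdout. */
    static OutputStream openOutput(String[] args) throws IOException {
        return args.length >= 2 ? new FileOutputStream(args[1]) : System.out;
    }

    /** Placeholder for the real conversion: copies bytes unchanged. */
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = openInput(args);
        OutputStream out = openOutput(args);
        try {
            copy(in, out);
            out.flush();
        } finally {
            // Close only the streams this program opened, never stdin/stdout.
            if (args.length >= 1) in.close();
            if (args.length >= 2) out.close();
        }
    }
}
```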
In all cases, you need to check that your application does not break when it receives SVG or MathML subtrees.
If you did not pass an XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder: If you really wanted the old default behavior, you should now pass XmlViolationPolicy.FATAL to the constructor. If you did not really want to have fatal errors by default, you do not need to do anything, since ALTER_INFOSET is now the default.
If you passed an XmlViolationPolicy to the constructor of HtmlParser, HtmlDocumentBuilder or HtmlBuilder: You do not need to change your code to upgrade.
If you subclassed TreeBuilder: The abstract methods on TreeBuilder now have additional arguments for passing the namespace URI. You should upgrade your subclass to deal with the namespace URIs. (The URI is always an interned string, so you can use == to compare.) There is a class called CoalescingTreeBuilder which you should subclass instead of TreeBuilder to get automatic text node coalescing. The entry point for passing in a SAX InputSource has moved from the Tokenizer class to the Driver class (in the io package), so you should change your references from Tokenizer to Driver.
If you implemented TokenHandler: Please refer to the JavaDocs of TokenHandler. Also note the new separation of Tokenizer and Driver mentioned above.
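The TreeBuilder note above says namespace URIs are always interned strings and can therefore be compared with ==. The sketch below shows why that works in plain Java and where it stops working. The NamespaceCheck class and its constant names are hypothetical and not part of the parser; the only assumption carried over from the text is that the strings being compared are interned.

```java
/** Hypothetical illustration of identity comparison on interned strings. */
class NamespaceCheck {
    // String literals are interned by the JVM, so these are canonical instances.
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    static final String SVG_NS = "http://www.w3.org/2000/svg";

    /** Identity comparison: only reliable when ns is an interned string. */
    static boolean isSvg(String ns) {
        return ns == SVG_NS;
    }

    public static void main(String[] args) {
        // A distinct String object with the same characters, not interned.
        String copy = new String(SVG_NS);

        System.out.println(isSvg(copy));          // prints false
        System.out.println(isSvg(copy.intern())); // prints true
        System.out.println(SVG_NS.equals(copy));  // prints true
        System.out.println(isSvg(XHTML_NS));      // prints false
    }
}
```

If a namespace string might come from anywhere other than the parser (for example, from user input), equals() remains the safe comparison.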
setErrorHandler() in the DOM case.
ArrayIndexOutOfBoundsException in the meta prescan.
<nobr> is seen when nobr is already open.
Dom2Sax robust against null localNames.
pre, textarea or listing start tag.
isindex processing added attributes to all elements that were supposed to have no attributes.
getElementById work with the DOM trees built by the parser.
switch branch per state instead of method per state.
TreeBuilder subclasses to request parser suspension. (Applications wishing to implement document.write() should provide their own TreeBuilder subclass and a document.write()-aware replacement of the Driver class. Look in the gwt-src/
directory for sample code.)
This is for the HTML parser as a whole except the rewindable input stream, the named character classes and the Live DOM Viewer. For the copyright notices for individual files, please see individual files.
/*
 * Copyright (c) 2005, 2006, 2007 Henri Sivonen
 * Copyright (c) 2007-2011 Mozilla Foundation
 * Portions of comments Copyright 2004-2007 Apple Computer, Inc., Mozilla
 * Foundation, and Opera Software ASA.
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
 * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
 * DEALINGS IN THE SOFTWARE.
 */
The following license is for the WHATWG spec from which the named character data was extracted.
/*
 * Copyright 2004-2010 Apple Computer, Inc., Mozilla Foundation, and Opera
 * Software ASA.
 *
 * You are granted a license to use, reproduce and create derivative works of
 * this document.
 */
The following license is for the rewindable input stream.
/*
 * Copyright (c) 2001-2003 Thai Open Source Software Center Ltd
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * * Redistributions of source code must retain the above copyright
 *   notice, this list of conditions and the following disclaimer.
 * * Redistributions in binary form must reproduce the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer in the documentation and/or other materials provided
 *   with the distribution.
 * * Neither the name of the Thai Open Source Software Center Ltd nor
 *   the names of its contributors may be used to endorse or promote
 *   products derived from this software without specific prior
 *   written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
 * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
 * REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
 * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
 * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
 * ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */
The following license applies to the Live DOM Viewer:
Copyright (c) 2000, 2006, 2008 Ian Hickson and various contributors
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Validator.nu HTML Parser in use elsewhere:
Thanks to the Mozilla Foundation and the Mozilla Corporation for funding this project. Thanks to the html5lib team and Philip Taylor for test cases and bug reports. Thanks to Chris Hubick for Mavenization. Thanks to Simon Pieters and the Firefox nightly testers for finding bugs. Thanks to Mike(tm) Smith, William Chen, Mats Palmgren and Neil Rashbrook for fixing bugs. Thanks to Ian Hickson for writing the spec and the Live DOM Viewer.
Please refer to the WHATWG wiki for implementations in other programming languages.