About Validator.nu

Validator.nu is validation 2.0.

The Pitch

No DTD-Based Validation

Basic Usage

Validator.nu has two facets: generic (complex UI) and (X)HTML5 (simple UI).

Enter the URL (http, https or data IRI to be exact) of the document you want to validate in the field labeled “Document” and submit the form. That’s all it takes in most cases.

In the (X)HTML5 facet, the parser and the schema will be chosen based on the HTTP Content-Type of the document. In the generic facet, the parser will be chosen based on the HTTP Content-Type and a preset schema will be chosen based on the root namespace (for XML) or the doctype (for text/html).
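Validation by URL can also be driven from a script by building the same kind of request the form submits. The following is only a sketch: the endpoint https://validator.nu/ and the query parameter names doc and out are assumptions that this page does not state, so check them against the Web service API documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names ("doc", "out"); not stated on this page.
SERVICE = "https://validator.nu/"

def validation_url(document_url, output="json"):
    """Build a GET URL asking the service to validate document_url."""
    # The Document field accepts http, https and data IRIs only.
    scheme = document_url.split(":", 1)[0].lower()
    if scheme not in ("http", "https", "data"):
        raise ValueError("only http, https and data IRIs are accepted")
    return SERVICE + "?" + urlencode({"doc": document_url, "out": output})

print(validation_url("http://example.com/"))
```

The function only builds the URL; fetching it is left to whatever HTTP client you prefer.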

Alternative Modes of Input

For simplicity, the HTML5 facet only shows UI for validation by URL. Validation by text area and by file upload are available in the generic facet.

Here are bookmarklets:

There is a command-line script that uploads documents from the local filesystem to the (X)HTML5 validator. Integration into vim is available.

Configurability

Schemas

When the field for schemas is left empty, the validator will try to choose a schema on its own. If you are not happy with the guessed preset, you can specify a schema either by selecting a preset or by entering a space-separated list of schema URLs (http, https or data IRIs). In addition to actual schemas, you may use certain special URLs to invoke checkers that seem like special schemas but aren’t actually implemented as schemas.
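As an illustration of the space-separated form, the sketch below joins a custom schema URL with one of the special checker URLs. The schema URL http://example.org/my-schema.rnc is hypothetical, and the query parameter name schema is an assumption rather than something this page documents.

```python
from urllib.parse import urlencode

def schema_field(schema_urls):
    """Join schema URLs into the space-separated form used by the schema field."""
    for url in schema_urls:
        # A URL containing whitespace would be split into two entries.
        if any(ch.isspace() for ch in url):
            raise ValueError("a schema URL cannot contain whitespace: %r" % url)
    return " ".join(schema_urls)

schemas = schema_field([
    "http://example.org/my-schema.rnc",  # hypothetical custom schema
    "http://c.validator.nu/nfc/",        # non-schema checker listed on this page
])

# "schema" as a query parameter name is an assumption, not taken from this page.
query = urlencode({"schema": schemas})
print(query)
```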

Parser

If the automatic choice of parser does not work for you, you can choose the parser manually. The choice of parser affects the HTTP Accept request header that is sent.

Be lax about HTTP Content-Type

When the lax option is set, text/html, text/xsl and text/plain are allowed as XML content types and text/plain is allowed as an HTML content type and, if the URL ends with .rnc, as a Compact Syntax content type. Also, in the lax mode the US-ASCII default for text/* XML types is not enforced.

Normally, schemas using the RELAX NG XML syntax, Schematron schemas and the XML documents to be validated are expected to be served using an XML content type. Schemas using the RELAX NG Compact Syntax are expected to be served using application/relax-ng-compact-syntax content type. (The unregistered application/vnd.relax-ng.rnc content type is also understood.) HTML documents are expected to be served as text/html.
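The rules above can be summarized as a small decision function. This is a sketch of the rules as described here, not the validator's actual code. Which media types count as XML content types in the strict case (application/xml, text/xml and anything ending in +xml) is an assumption, and the US-ASCII default for text/* XML types is not modeled.

```python
# Assumption: what counts as an "XML content type" in strict mode.
XML_TYPES = {"application/xml", "text/xml"}
# Extra types allowed for XML when the lax option is set.
LAX_XML_TYPES = {"text/html", "text/xsl", "text/plain"}
# Registered and unregistered RELAX NG Compact Syntax types.
RNC_TYPES = {"application/relax-ng-compact-syntax", "application/vnd.relax-ng.rnc"}

def is_xml_type(media_type):
    return media_type in XML_TYPES or media_type.endswith("+xml")

def type_accepted(kind, media_type, url="", lax=False):
    """kind is 'xml', 'html' or 'rnc'; media_type carries no parameters."""
    media_type = media_type.strip().lower()
    if kind == "xml":
        return is_xml_type(media_type) or (lax and media_type in LAX_XML_TYPES)
    if kind == "html":
        return media_type == "text/html" or (lax and media_type == "text/plain")
    if kind == "rnc":
        return media_type in RNC_TYPES or (
            lax and media_type == "text/plain" and url.endswith(".rnc"))
    raise ValueError("unknown kind: " + kind)

print(type_accepted("xml", "text/html"))            # strict mode
print(type_accepted("xml", "text/html", lax=True))  # lax mode
```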

Show Image Report

When the “Show Image Report” checkbox is set, a report concerning the textual alternatives of img elements in the XHTML namespace is shown for accessibility review.
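To give a rough idea of the raw material for such a report, the sketch below collects the src and alt attributes of img elements in the XHTML namespace from a well-formed XML document. It is an illustration only and does not reproduce the categories or wording of the validator's own report.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def image_alternatives(xhtml_source):
    """Return (src, alt) pairs for img elements in the XHTML namespace.

    alt is None when the attribute is missing and "" when it is empty.
    """
    root = ET.fromstring(xhtml_source)
    return [(img.get("src"), img.get("alt"))
            for img in root.iter("{%s}img" % XHTML_NS)]

sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<img src="logo.png" alt="Example logo"/>'
    '<img src="spacer.gif" alt=""/>'
    '<img src="chart.png"/>'
    '</body></html>'
)
for src, alt in image_alternatives(sample):
    print(src, repr(alt))
```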

Show Source

You may check the “Show Source” checkbox to show the decoded source of the document being checked. Please note that the source may not be shown in its entirety if the parser encounters a fatal error. Moreover, the show source feature shows the decoded Unicode source. Erroneous byte sequences in the original source and characters that would render the validator output as non-conforming (e.g. U+0000) are not represented faithfully.

Web Service API

If you want to create your own alternative mode of input or want to call Validator.nu (or your own local copy) from within your own application, there is a RESTful Web service API. In addition to the modes of input that work from HTML forms, you can also POST the document to be checked as an HTTP entity body. In addition to the default HTML output, the messages are also available as XHTML, XML, JSON, GNU error format and plain text.
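A minimal sketch of the entity-body mode of input follows. It builds the request without sending it. The endpoint https://validator.nu/ and the out=json query parameter are assumptions not stated on this page; consult the Web service API documentation for the actual parameters.

```python
import urllib.request

def build_post_request(markup, service="https://validator.nu/", output="json"):
    """Build a POST request carrying the document as the HTTP entity body.

    The service URL and the "out" parameter name are assumptions.
    """
    body = markup.encode("utf-8")
    return urllib.request.Request(
        service + "?out=" + output,
        data=body,
        # The Content-Type tells the service which parser applies.
        headers={"Content-Type": "text/html; charset=utf-8"},
        method="POST",
    )

req = build_post_request("<!DOCTYPE html><title>t</title><p>Hello</p>")
print(req.get_method(), req.full_url)

# Sending is left commented out so the sketch does not touch the network:
# with urllib.request.urlopen(req) as response:
#     print(response.read().decode("utf-8"))
```

Pointing the service argument at a local copy lets the same code be used against a private instance.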

Preset Schemas

HTML5 (experimental)

HTML5 (text/html-compatible content models)

HTML5+ARIA (experimental)

HTML5 with ARIA (unendorsed integration prototype)

Mike(tm) Smith has generated documentation for this schema.

HTML 4.01 Strict + IRI / XHTML 1.0 Strict + IRI

XHTML 1.0 Strict with IRI support. Generally suitable for HTML 4.01 Strict checking as well, although there are theoretically wrong corner cases. Uses backported HTML5 datatypes.

HTML 4.01 Transitional + IRI / XHTML 1.0 Transitional + IRI

XHTML 1.0 Transitional with IRI support. Generally suitable for HTML 4.01 Transitional checking as well, although there are theoretically wrong corner cases. Uses backported HTML5 datatypes.

HTML 4.01 Frameset + IRI / XHTML 1.0 Frameset + IRI

XHTML 1.0 Frameset with IRI support. Generally suitable for HTML 4.01 Frameset checking as well, although there are theoretically wrong corner cases. Uses backported HTML5 datatypes. Do not use. :-)

XHTML5 (experimental)

XHTML5 (XML-compatible content models)

XHTML5+ARIA, SVG 1.1 plus MathML 2.0 (experimental)

XHTML5 with ARIA (unendorsed integration prototype), SVG 1.1, MathML 2.0 and holes for OpenMath, RDF and Inkscape cruft.

XHTML 1.0 Strict, SVG 1.1, MathML 2.0 + IRI

XHTML 1.0 (not 1.1), SVG 1.1 and MathML 2.0 with IRI support.

XHTML 1.0 Strict, Ruby, SVG 1.1, MathML 2.0 + IRI

XHTML 1.0 (not 1.1), Ruby, SVG 1.1 and MathML 2.0 with IRI support.

XHTML Basic + IRI

A schema for XHTML Basic with IRI support. Suitable for use with the HTML parser.

SVG 1.1 + IRI

SVG 1.1 Full with IRI support (Inkscape cruft not permitted).

Non-Schema Checkers

The service supports a few special pseudo-schema URIs that map to checkers written in a Turing-complete programming language.

http://c.validator.nu/table/

Checks (X)HTML table integrity. The current implementation should be considered a prototype that has not yet been updated to match the latest spec language for HTML5. (See more detailed discussion.)

http://c.validator.nu/nfc/

Checks that constructs in the document tree are in the Unicode Normalization Form C and don’t start with a “composing character”. Using this pseudo-schema also enables normalization checking of source text. (See more detailed discussion.)
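The two conditions can be illustrated for a single string as follows. This is a sketch, not the checker's implementation: it approximates "composing character" as a character with a non-zero canonical combining class, which is narrower than the full definition.

```python
import unicodedata

def nfc_problems(text):
    """Return a list of normalization problems found in text."""
    problems = []
    # Approximation: treat non-zero combining class as "composing character".
    if text and unicodedata.combining(text[0]) != 0:
        problems.append("starts with a combining character")
    # A string is in NFC if normalizing it to NFC leaves it unchanged.
    if unicodedata.normalize("NFC", text) != text:
        problems.append("not in Normalization Form C")
    return problems

print(nfc_problems("\u00e9"))    # precomposed e with acute
print(nfc_problems("e\u0301"))   # e followed by combining acute
print(nfc_problems("\u0301a"))   # leading combining acute
```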

http://c.validator.nu/text-content/

Checks the text content of the (X)HTML5 meter, progress and time elements for conformance. (This is a prototype with liberties taken.)

http://c.validator.nu/unchecked/

Warns about RDF, OpenMath and Inkscape holes and about the use of version="1.0" in SVG.

http://c.validator.nu/usemap/

Checks the usemap attribute for referential integrity.
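Referential integrity here means that each usemap value points at a map that exists in the same document. The sketch below shows the idea with Python's standard HTML parser. It is a simplification that treats a reference as valid when it has the form #name and some map element carries that name; the real conformance rules are more detailed.

```python
from html.parser import HTMLParser

class UsemapChecker(HTMLParser):
    """Collect map names and usemap references while parsing."""

    def __init__(self):
        super().__init__()
        self.map_names = set()
        self.references = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "map" and attrs.get("name"):
            self.map_names.add(attrs["name"])
        if attrs.get("usemap"):
            self.references.append(attrs["usemap"])

def dangling_usemaps(markup):
    """Return usemap values that do not resolve to a map in the document."""
    checker = UsemapChecker()
    checker.feed(markup)
    checker.close()
    return [ref for ref in checker.references
            if not ref.startswith("#") or ref[1:] not in checker.map_names]

page = ('<img src="a.png" usemap="#nav" alt="">'
        '<map name="nav"></map>'
        '<img src="b.png" usemap="#missing" alt="">')
print(dangling_usemaps(page))
```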

http://c.validator.nu/all/

Shorthand for http://c.validator.nu/table/ http://c.validator.nu/nfc/ http://c.validator.nu/text-content/ http://c.validator.nu/unchecked/ http://c.validator.nu/usemap/.

http://c.validator.nu/all-html4/

Shorthand for http://c.validator.nu/table/ http://c.validator.nu/nfc/ http://c.validator.nu/unchecked/ http://c.validator.nu/usemap/.
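The two shorthands can be read as simple expansions of the schema field. The sketch below mirrors the lists given above; how the service performs the expansion internally is not described on this page.

```python
BASE = "http://c.validator.nu/"

# Expansions copied from the descriptions of all/ and all-html4/ above.
SHORTHANDS = {
    BASE + "all/": ["table/", "nfc/", "text-content/", "unchecked/", "usemap/"],
    BASE + "all-html4/": ["table/", "nfc/", "unchecked/", "usemap/"],
}

def expand(schema_list):
    """Replace shorthand URLs in a space-separated schema list by their parts."""
    expanded = []
    for url in schema_list.split():
        if url in SHORTHANDS:
            expanded.extend(BASE + name for name in SHORTHANDS[url])
        else:
            expanded.append(url)
    return expanded

for url in expand("http://c.validator.nu/all-html4/"):
    print(url)
```

The only difference between the two shorthands is the text-content checker, which applies to (X)HTML5 elements only.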

http://c.validator.nu/debug/

Dumps parse events as warnings.

FAQ

My server gives the HTML5 validator a 406 status. What’s up?

Your server cannot properly deal with an Accept header that does not have */* in it. Chances are that you are using Apache 1.3, PHP and MultiViews together. MultiViews thinks the type of your page is application/x-httpd-php, which isn’t in the Accept header. Apache 2 does not have this problem.
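The mismatch can be illustrated with a simplified Accept matcher. This is a sketch that ignores q-values and media type parameters, and the Accept header value used in the example is made up for illustration rather than the one the validator actually sends.

```python
def accept_allows(accept_header, media_type):
    """Return True if media_type matches some media range in accept_header.

    Simplified: q-values (including q=0) and parameters are ignored.
    """
    type_, subtype = media_type.lower().split("/", 1)
    for item in accept_header.split(","):
        media_range = item.split(";", 1)[0].strip().lower()
        if media_range in ("*/*", type_ + "/*", type_ + "/" + subtype):
            return True
    return False

# Hypothetical header without */*, for illustration only.
header = "text/html, application/xhtml+xml;q=0.9"

# What MultiViews believes the page type is, according to the answer above.
print(accept_allows(header, "application/x-httpd-php"))
# The same check once */* is present.
print(accept_allows(header + ", */*;q=0.1", "application/x-httpd-php"))
```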

Can I get a “Valid HTML5” badge?

No, Validator.nu does not give badges.

I have observed that once people are given badges they start to feel entitled to the badges and become hostile if the validation service is changed so that some documents that previously were proclaimed valid no longer are. I do not want to deliberately incite an opposition to bug fixes. I know some of the schemas are not as tight as the corresponding spec prose. If I make them tighter, consider it a bug fix. Moreover, the HTML 5 spec is still changing, so the schema will change as well. Finally, I may (and even intend to) change the namespace associations of preset schemas in the future.

In addition to the problem with changing the validator after badges have been awarded, badges don’t provide value to the readers of validated pages. Validation is a tool for you as a page author—not something your readers need to verify. However, if you are writing about Web authoring and want to refer others to Validator.nu, please, by all means feel free to link to Validator.nu.

Java? Eww. Why didn’t you write it in Python or Ruby?

By the time Ruby on Rails hit everyone’s radar, this project was already underway. However, Ruby would still have been a bad choice had I considered it seriously earlier. Ruby lacks a solid Unicode infrastructure. I’ve already been in a situation where I had to stop writing app code and spend time writing the very basics of Unicode infrastructure. I don’t want to be in that situation again. Ruby lacks solid XML infrastructure as well.

I chose Java over Python for three reasons: SAX, Jing and more experience with Java. Apart from Java feeling like a more secure choice because I had more experience with it, the choice between Java and Python also comes down to infrastructure. Having a platform-wide unified way for plugging together XML tools is extremely important when what you are doing entails plugging together XML tools efficiently.

Java is in a unique position when it comes to XML tool infrastructure. Java has a lot of XML-related libraries available and they pretty much all plug into the same interface. Not only is there a platform-wide XML API, it also happens to be one of the most complete and correct of the XML APIs around. From the point of view of RELAX NG, Java being the language Jing is written in is an extremely important consideration. Jing is a seriously good piece of software. Moreover, Java is the native language of the extensibility interface for RELAX NG datatype libraries.

While I’m on a soap box, I should mention that ICU4J is a seriously good piece of software, too, and having Java’s notion of Unicode frozen as UTF-16 from the dawn of time until eternity is very important considering the stability of infrastructure. It is a horribly bad idea that the meaning of Python programs changes (due to datatypes changing underneath) depending on how the interpreter was compiled. Unicode is optimized for 16-bit units. The stability of sticking to UTF-16 in RAM everywhere outweighs the theoretical purity of UTF-32 in RAM. (On disk and network, use UTF-8, of course.)

I do want to make the validator functionality available to applications that are not written in Java, though. This is why Validator.nu has a Web service interface that can be used either with the instance running at validator.nu or with your private instance running at localhost. I encourage you to write a wrapper library for the Web service in your favorite programming language.

What’s wrong with DTDs?

I think DTDs are bad in four ways:

  1. DTDs pollute the document with schema-specific syntax. Since the document itself declares the rules, the question answered by DTD validation is not the question that should be asked. DTD validation answers the question “Does this document conform to the rules it declares itself?” The interesting question is “Does this document conform to these rules?” when the person who asks the question chooses the rules the question is about.

  2. DTDs mix a validation mechanism, an inclusion mechanism and an infoset augmentation mechanism. The inclusion mechanism is mainly used for character entities, which solve (but only if the DTD is processed and processing it is not required!) an input problem by burdening the recipient instead of keeping input matters between the editing software and the document author.

  3. DTDs aren’t particularly expressive.

  4. DTDs don’t support Namespaces in XML.

I hope providing an online validation service for RELAX NG removes the excuse that DTDs are needed for online validators.

Validation has a clear and precise meaning. Can’t you kids read ISO 8879?

“Validation” and “validator” in the name and the user interface of the service refer to the ISO/IEC FDIS 19757-2 definition of “validator” (which performs validation), to the Schematron “validation” function (which is performed by a validator), and to the HTML 5 definition of “validator”.

Known Issues and Ideas for Future Development

Schemas for XHTML 1.0 are used for HTML 4.01, because XHTML 1.0 is supposed to be a reformulation of HTML 4.01 in XML. However, there are some subtle spec bugs introduced in the reformulation. For this reason, some errors for HTML 4.01 are wrong. For example, XHTML 1.0 (in the DTD) forbids the name attribute on the form element, although it is allowed in HTML 4.01.

Please refer to the bug tracker for other known issues and for ideas for future development.

Reporting Bugs and Getting Help

The preferred forum for discussing issues related to using the (X)HTML5 validator is the WHATWG Help mailing list. The preferred forum for discussing issues related to implementing (X)HTML5 validators in general and this one in particular is the WHATWG Implementors mailing list. Bugs should be reported to the Validator.nu Bugzilla.

Feature Details for Custom Schemas

Source Code

The code is hosted on GitHub. Please see the build instructions.

Acknowledgments

I would like to thank the Mozilla Foundation and the Mozilla Corporation for funding this project.

I would like to thank James Clark for writing Jing and for championing RELAX NG and XML. I would also like to thank everyone who tested the development builds, the writers of test cases and everyone who has developed library code and schemas that the service uses.

Mike(tm) Smith has contributed numerous fixes and updates to HTML5 validation and is the most active developer of the project as of 2014.

Philip Jägenstedt contributed Microdata validation support.

The XHTML 1.0 schemas were originally written by James Clark and have been improved by Petr Nálevka.

fantasai designed the (X)HTML5 schema framework, wrote the (X)HTML5 Core schemas and helped along the way when I added features.

JavaScript bits, the favicon and a lot of bug reports were contributed by Simon Pieters.

The schemas for RELAX NG and XSLT were written by James Clark.

The principal author of the schema for DocBook is Norman Walsh.

The SVG schemas come from the W3C.

The MathML schema was written by Yutaka Furubayashi.

Test cases written by fantasai, Anne van Kesteren and Christoph Schneegans were very useful in developing this service.

This product includes software developed by The Apache Software Foundation (http://www.apache.org/).

This product uses The SAXON XSLT Processor from Michael Kay.

Validome by The Validome Team

Focuses on HTML, XHTML, WML. Uses SGML DTDs and custom code for HTML. Uses XSD and custom code for XHTML. Recently added support for RSS and Atom, but that feature is still in flux.

XHTML 1.0 schema validator by Christoph Schneegans

Validates using the XSD implementation of XHTML 1.0.

Relaxed by Petr Nálevka

Uses RELAX NG and Schematron for validating XHTML and HTML. (The XHTML 1.0 schemas offered here as presets are based on the schemas used in Relaxed.)

Page Valet by WebThing / Nick Kew

DTD-based SGML and XML validation.

Feed Validator by Sam Ruby, Mark Pilgrim, Joseph Walton, and Phil Ringnalda

Checks Atom and RSS feeds. Uses Python as the schema language. :-)

The W3C CSS Validation Service

Checks CSS style sheets.

The W3C Markup Validation Service

DTD-based SGML and XML validation.

Terms of Service

These terms only apply to the service hosted on the validator.nu domain. If you arrived at this page from another instance of the software run by someone else, such as the W3C, that instance may have different terms.

If you do not accept these terms, do not use the service. You can run your own copy of the software under the applicable Open Source licenses without having to agree to these terms.

These terms may be updated from time to time. There are no email notifications of updates in order not to have to collect your email address.

Point of contact

The software instance on validator.nu is operated by Henri Sivonen on Gandi's infrastructure. The point of contact in all matters related to the deployment instance on validator.nu is Henri Sivonen. (For matters relating to the validator software itself rather than the specific deployment instance on validator.nu, please refer to GitHub issues of the software project.)

No Guarantee of Service Level

There is absolutely no warranty or guarantee of level of service. If you want uptime guarantees, please run your own copy of the software. The service may be discontinued at any time without prior notice.

Appropriate Use

The service at validator.nu is meant for validating public Web pages (GET request mode) and for validating drafts of pages that are being prepared to be published on the Web (POST request mode). By design, the service does not ask for passwords to be able to validate pages that are behind login. You must not grant the validator instance at validator.nu special access to your site, e.g. by IP address. If you wish to validate behind-login or otherwise private pages, please run your own copy of the validator software. Do not upload sensitive data as a POST request. (E.g. do not upload real confidential records within your HTML if you are developing an HTML UI that deals with such data.)

You must not use the service to validate illegal content or engage in activity that has the appearance of botnet activity.

Do not place excessive load on the service. It's fine to use the API from the content management system of your personal blog. If you have a large blog hosting service, please run your own copy of the software. You must not use a browser extension that sends the content of every page you browse to the validator. If you want to see a validity indicator for every page, please run your own copy of the validator software.

Privacy

For HTTP requests, the service is typically configured to log non-personally-identifiable usage information including the virtual server host name accessed, the path accessed, the HTTP method, the response code, the number of bytes transferred, the access time, and the User-Agent header your client software sent (i.e. the name and version of your Web browser).

In successful normal operation, your IP address is not logged in the clear. An anonymized hash thereof may be logged even during normal operation, using a keyed hash function whose key is kept in RAM and discarded from time to time. This makes general usage statistics analysis possible while making it infeasible to reverse the hash by brute force, even for a small search space such as the space of IPv4 addresses.
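The general technique can be sketched as follows. This is not the service's actual code: the choice of HMAC-SHA-256, the key length and the class name are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets

class IpAnonymizer:
    """Keyed hashing of IP addresses with a key that lives only in memory."""

    def __init__(self):
        # Random key kept in RAM only; never written to disk.
        self._key = secrets.token_bytes(32)

    def rotate_key(self):
        # Discarding the old key makes earlier hashes impossible to recompute,
        # even by trying every IPv4 address.
        self._key = secrets.token_bytes(32)

    def anonymize(self, ip_address):
        digest = hmac.new(self._key, ip_address.encode("ascii"), hashlib.sha256)
        return digest.hexdigest()

anonymizer = IpAnonymizer()
first = anonymizer.anonymize("192.0.2.1")
# Same key, same address: same hash, so visits can be counted together.
print(first == anonymizer.anonymize("192.0.2.1"))
anonymizer.rotate_key()
# After rotation the same address hashes differently.
print(first == anonymizer.anonymize("192.0.2.1"))
```

An unkeyed hash would not be enough here, since the roughly four billion IPv4 addresses can be enumerated quickly.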

If the service encounters an error, it may log the error and include your IP address and/or the URL being validated in the logged error event. These logs are deleted from time to time after fixing the errors or ignoring them as unactionable. More general IP address logging may be temporarily turned on to investigate abuse of the service. Afterwards, the IP addresses will be anonymized as described in the above paragraph. However, IP addresses deemed to have caused abusive traffic may be retained as part of a blocklist.

The URLs of the pages you validate may be kept for a limited time to understand abuse of the service. (Since anyone can validate anyone else's public Web page and you are only allowed to validate public pages by URL, the URLs are not considered personally identifying of the person asking for the validation.)

These logs are meant to be visible to Henri Sivonen only, but there's no technical way for him to prevent Gandi from gaining access to these logs (though they aren't supposed to look). Aggregate usage statistics may be shared publicly. Government requests may be responded to.

The content of POST requests may be written to a temporary file. While these are deleted after processing the request, in principle they might leave forensically recoverable data on disk until actually overwritten.