HTML 4, HTML 5, XHTML, MIME types - the definitive resource

Contents.

  • Terminology
  • Languages and Serializations
  • Specifications
  • Browser Parsers and Content (MIME) Types
  • Browser Support
  • Validators and Document Type Definitions
  • Quirks, Limited Quirks, and Standards modes.

Terminology

One of the difficulties of describing this subject clearly is that the terminology within the official specifications has changed over the years since HTML was first introduced. What follows is based on HTML5 terminology. Also, "file" is used as a generic term to mean a file, document, input stream, octet stream, etc., to avoid having to make fine distinctions.

Languages and Serializations

HTML and XHTML are defined in terms of a language and a serialization.

The language defines the vocabulary of the elements and attributes, and their content model, i.e. which elements are permitted inside which other elements, which attributes are allowed on which element, along with the purpose and meaning of each element and attribute.

The serialization defines how mark-up is used to describe these elements and attributes within a text document. This includes which tags are required and which can be inferred, and the rules for those inferences. It describes such things as how void elements should be marked up (e.g. “>” vs “/>”) and when attribute values need to be quoted.
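
As a rough illustration of the difference between the two serializations, the sketch below writes the same start tag in either syntax. The function name serializeStartTag and the short list of void elements are assumptions made for this example, and it covers only the two rules mentioned above (void element syntax and attribute quoting), not the full rule sets.

```javascript
// Hypothetical helper: serialize one start tag in either syntax.
// A sketch only; it does not cover every rule of either specification.
const VOID_ELEMENTS = new Set(["br", "hr", "img", "input", "link", "meta"]);

function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;");
}

function serializeStartTag(name, attributes, syntax) {
  let out = "<" + name;
  for (const [key, value] of Object.entries(attributes)) {
    // HTML syntax allows unquoted values when they are non-empty and free of
    // whitespace and of " ' = < > ` (ampersands are also quoted here, to be
    // safe). XML syntax always requires quotes.
    const canOmitQuotes = syntax === "html" && /^[^\s"'=<>`&]+$/.test(value);
    out += canOmitQuotes
      ? ` ${key}=${value}`
      : ` ${key}="${escapeAttribute(value)}"`;
  }
  // Void elements end with ">" in HTML syntax and "/>" in XML syntax.
  if (VOID_ELEMENTS.has(name)) {
    return out + (syntax === "xml" ? " />" : ">");
  }
  return out + ">";
}

console.log(serializeStartTag("img", { src: "a.png", alt: "A cat" }, "html"));
console.log(serializeStartTag("img", { src: "a.png", alt: "A cat" }, "xml"));
```

Both calls describe the same element in the language; only the mark-up that represents it differs.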

Specifications

The HTML 4.01 specification is the current specification that defines both the HTML language and the HTML serialization.

The XML 1.0 specification defines a serialization but leaves the language to be defined by other specifications, which are termed “XML applications”.

The XHTML 1.0 and 1.1 specifications are both in use. Essentially, they use the same language as HTML 4.01 but a different serialization, one that is compatible with the XML 1.0 specification. In other words, XHTML is an XML application.

The HTML5 specification (a draft as of 2010-04-18) describes a new language for both HTML and XHTML. This language is mostly a superset of the HTML 4.01 language, but where differences arise it is intended to be backward compatible only with existing web tools (e.g. browsers, search engines and authoring tools), not with previous specifications. As a result, the meaning of some elements occasionally differs from the earlier specifications. Similarly, each of the serializations is backward compatible with current tools.

Browser Parsers and Content (MIME) Types

When a text file is sent to a browser, the browser parses it into an internal memory structure (object model). To do so, it uses a parser that follows either the HTML serialization rules or the XML serialization rules. Which parser it uses depends on what it deduces the content type to be; for non-local files, this is based on the “content-type” HTTP header. Once the file has been parsed, the browser treats the object model in almost the same way, regardless of whether it was originally supplied using an HTML or XHTML serialization.

For a browser to use its XHTML parser, the content type HTTP header must be one of the XML content types. Most commonly, this is either application/xml or application/xhtml+xml. With any non-XML content type, the file will not be processed by the browser as XHTML, regardless of whether it meets all the XHTML language and serialization rules.

An HTTP content type of text/html will cause the browser to use its HTML serialization parser. The same happens in most fallback scenarios, where the content type is missing or is any other non-XML type.
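
The parser choice described above can be sketched as a small function. This is a simplified model rather than any browser's actual code: the names isXmlContentType and chooseParser are made up for this example, and it treats every non-XML type as falling back to the HTML parser, which follows the simplification used in this article.

```javascript
// Simplified model of parser selection by HTTP Content-Type.
// Assumption: an XML content type is text/xml, application/xml,
// or any type whose subtype ends in "+xml".
function isXmlContentType(contentType) {
  if (!contentType) return false;
  // Drop parameters such as "; charset=utf-8" and normalise the case.
  const essence = contentType.split(";")[0].trim().toLowerCase();
  return (
    essence === "text/xml" ||
    essence === "application/xml" ||
    essence.endsWith("+xml")
  );
}

function chooseParser(contentType) {
  return isXmlContentType(contentType) ? "xml" : "html";
}

console.log(chooseParser("application/xhtml+xml; charset=utf-8")); // "xml"
console.log(chooseParser("text/html"));                            // "html"
console.log(chooseParser(undefined));                              // "html"
```

Note that the content of the file plays no part in the decision: a perfectly well-formed XHTML file sent as text/html still goes to the HTML parser.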

One key difference between the two parsers is that the HTML serialization parser performs error recovery. If the input file does not meet the HTML serialization rules, the parser will recover in ways reverse-engineered from previous browsers and carry on building its object model until it reaches the end of the file. HTML5 contains the first normative definition of this recovery, but as of 2010-04-26 no mainstream browser has shipped a release version with an implementation of the algorithm enabled.

In contrast, the XML serialization parser will stop when it encounters anything that it cannot interpret as XML (i.e. when it discovers that the file is not well-formed XML). This is required of parsers by the XML 1.0 specification.
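
The contrast between the two error-handling styles can be shown with a toy tag matcher. This is emphatically not the HTML5 recovery algorithm or a real XML parser: the function checkTags, its token format and its recovery strategy are all invented for this illustration. It only demonstrates "stop at the first error" versus "repair and carry on".

```javascript
// Toy model only: real HTML and XML parsers are far more involved.
// Input is a list of start and end tag tokens such as ["<p>", "<b>", "</p>"].
function checkTags(tokens, mode) {
  const open = [];      // stack of currently open element names
  const recovered = []; // notes about errors the lenient mode repaired
  for (const token of tokens) {
    const closing = token.startsWith("</");
    const name = token.replace(/[<>/]/g, "");
    if (!closing) {
      open.push(name);
      continue;
    }
    if (open[open.length - 1] === name) {
      open.pop();
      continue;
    }
    // Mismatched end tag: this is where the two behaviours diverge.
    if (mode === "xml") {
      throw new Error(`not well-formed: unexpected </${name}>`);
    }
    const index = open.lastIndexOf(name);
    if (index === -1) {
      recovered.push(`ignored stray </${name}>`);
    } else {
      // Implicitly close everything opened after the matching start tag.
      const implied = open.splice(index).slice(1).reverse();
      recovered.push(`implied end tags for: ${implied.join(", ")}`);
    }
  }
  if (open.length > 0) {
    if (mode === "xml") {
      throw new Error(`not well-formed: unclosed <${open[open.length - 1]}>`);
    }
    recovered.push(`closed at end of file: ${open.reverse().join(", ")}`);
  }
  return recovered;
}

const tokens = ["<p>", "<b>", "</p>"];
console.log(checkTags(tokens, "html")); // recovers and reports what it repaired
try {
  checkTags(tokens, "xml");
} catch (error) {
  console.log(error.message);           // stops at the first error
}
```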

Browser Support

Most modern browsers contain support for both an HTML parser and an XML parser. However, in Microsoft Internet Explorer versions 8.0 and earlier, the XML parser cannot directly create an object model for rendering as an HTML page. The XML structure can, however, be processed with an XSLT file to create a stream which can in turn be parsed using the HTML parser to create an object model that can be rendered.

Starting with Internet Explorer 9 Platform Preview, XHTML supplied using an XML content type can be parsed directly in the same way as the other modern browsers.

When their XML parsers detect that an input file is not well-formed XML, browsers differ in what they do. Some display an error message, others show the page as constructed up to the point where the error was detected, and some offer the user the opportunity to have the file re-parsed using their HTML parser.

Validators and Document Type Definitions

HTML and XHTML files can begin with a Document Type Definition (DTD) declaration which indicates the language and serialization that is being used in the document. Validators, such as the one at http://validator.w3.org/, use this information to match the language and serialization used within the file against the rules defined in the DTD. They then report errors wherever the rules in the DTD are violated by mark-up in the file.

Not all HTML serialization and language rules can be described in a DTD, so validators only test for a subset of all the rules described by the specifications.

HTML 4.01 and XHTML 1.0 define Strict, Transitional, and Frameset DTDs which differ in the language elements and attributes that are permitted in compliant files.

Validators based on HTML5, such as validator.nu, behave more like browsers: they process the page according to the HTTP content type and use a non-DTD-based rule set, so they catch errors that cannot be described by DTDs.

Quirks, Limited Quirks, and Standards modes.

Browsers do not validate the files sent to them. Nor do they use any DTD declaration to determine the language or serialization of the file. However, they do use it to guess the era in which the page was created, and therefore the likely parsing and rendering behaviour the author would have expected of a browser at that time. Accordingly, they define three parsing and rendering modes, known as Quirks mode, Limited Quirks (or Almost Standards) mode and Standards mode.

Any file served using an XML content type is always processed in standards mode. For files parsed using the HTML parser, if there is no DTD provided or the DTD is determined to be very old, browsers use their quirks mode. Broadly speaking, HTML 4.01 and XHTML files processed as text/html will be processed with limited quirks mode if they contain a transitional DTD and with standards mode if using a strict DTD.

Where the DTD is not recognised, the mode is determined by a complex set of rules. One special case is where the public and system identifiers are omitted and the declaration is simply <!DOCTYPE html>. This is known to be the shortest doctype declaration for which current browsers will process the file in standards mode. For that reason, it is the declaration specified for use in HTML5-compliant files.
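
The broad rules in this section can be summarised in code. The sketch below is a deliberate simplification: the function name renderingMode and the keyword matching on "transitional" and "strict" are assumptions for illustration, whereas real browsers compare against a table of known public and system identifiers. Anything the sketch does not recognise is reported as "not modelled" rather than guessed.

```javascript
// Simplified sketch of the mode choice described above; real browsers use
// a longer table of known public and system identifiers.
function renderingMode(contentType, doctype) {
  const type = (contentType || "").split(";")[0].trim().toLowerCase();
  const isXml =
    type === "text/xml" || type === "application/xml" || type.endsWith("+xml");
  if (isXml) return "standards"; // XML content types: always standards mode
  if (!doctype) return "quirks"; // HTML parser and no declaration at all

  const d = doctype.replace(/\s+/g, " ").trim().toLowerCase();
  if (d === "<!doctype html>") return "standards";
  // Assumption: crude keyword matching stands in for the identifier table.
  if (d.includes("transitional")) return "limited quirks";
  if (d.includes("strict")) return "standards";
  return "not modelled"; // the "complex set of rules" is out of scope here
}

const xhtmlStrict =
  '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" ' +
  '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">';
const html4Transitional =
  '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" ' +
  '"http://www.w3.org/TR/html4/loose.dtd">';

console.log(renderingMode("text/html", xhtmlStrict));       // "standards"
console.log(renderingMode("text/html", html4Transitional)); // "limited quirks"
console.log(renderingMode("text/html", null));              // "quirks"
```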


HTML

QA

  • HTML5 still has rather immature QA tools
  • HTML 4 has been around a long time and has very mature QA tools

Browser Support

  • HTML 5: bits and pieces are supported by various browsers. You need JavaScript to support most things, and basic structural elements (like <section>) fall over very badly if JavaScript isn't available. *
  • HTML 4 is well supported

* Some clarification and examples needed.


Strict vs. Transitional vs. Frameset

Why?

HTML as well as XHTML comes in different flavors, namely Strict, Transitional and Frameset. Each "dialect" specifies a different set of elements which are allowed to be used.

Jumping in at the deep end with Strict removes some options out of the box (e.g. you cannot specify target attributes), which makes it a no-go for many.

Main Differences

Please expand


XHTML

QA

XHTML has mature QA tools, but looser DTDs (e.g. <textarea rows=""> is a conformance error in HTML 4.01 and XHTML 1.0, but only a validity error in HTML 4.01*). This is despite XHTML 1.0 being, theoretically, HTML 4.01 expressed as XML. There are numerous differences that are not documented in the "Differences with HTML 4" section of the spec.

An XHTML document served with a MIME type of application/xhtml+xml (see below) needs to conform 100% to XML standards, i.e. it needs to be well-formed XML. Even a single unescaped ampersand can cause the parser (the browser) to report an error and refuse to render the document. When creating dynamic XHTML sites which may include content supplied by third parties (e.g. any user input), great care needs to be taken to escape any and all invalid character sequences, to reject invalid tags or attributes, and to nest all elements properly.
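
A minimal sketch of the escaping step is shown below. The function name escapeXml is made up for this example. It only covers the characters that have special meaning in XML mark-up; it does not check for characters that XML forbids outright, and it does not validate tags or nesting, so it is one part of the care described above rather than a complete solution.

```javascript
// Escape the five characters that have special meaning in XML mark-up.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first so later entities stay intact
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// User-supplied text is escaped before being placed into the document.
const comment = "Fish & <Chips>";
console.log("<p>" + escapeXml(comment) + "</p>");
// <p>Fish &amp; &lt;Chips&gt;</p>
```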

Browser Support

  • XHTML as text/html is well supported, but you have to jump through compatibility hoops. Without jumping through those hoops a perfectly valid page may fail to render (e.g. <script type="text/javascript" src="foo" /> causing the rest of the document to be treated as a script instead of HTML) or display other issues.
  • XHTML as application/xhtml+xml is reasonably well supported by most browsers (minor bugs may exist). It does not work at all in Internet Explorer 8 and earlier.

MIME type application/xhtml+xml vs text/html

XHTML served as text/html is neither XHTML nor HTML. It's handled like HTML by the browser, but since it's not HTML, it's treated as tag soup. Since Internet Explorer (version 8 and earlier) does not know how to handle XHTML served as application/xhtml+xml, the document will need to be served as text/html for IE only. This means XHTML for IE is always tag soup, unless the differences between HTML and XHTML are tended to as well (see Differences with HTML 4).

Welcome to a world of pain. You get downstream proxy issues (you have to vary caching based on whatever request header you perform your conditional on). The document structure changes (e.g. tables without a <tbody> tag may or may not have a <tbody> element depending on the content-type). It is a lot of work to produce, essentially, two almost identical documents.
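
One common way to make the decision is to look at the Accept request header, which is sketched below. The helper name negotiateContentType is hypothetical, the choice of the Accept header is an assumption (the text above does not say which header to use), and the q-value handling is reduced to treating q=0 as "not acceptable". Whatever header is used, the Vary response header is how downstream caches are told about it.

```javascript
// Hypothetical helper: choose a content type from the request's Accept header.
function negotiateContentType(acceptHeader) {
  const accepted = (acceptHeader || "")
    .split(",")
    .map((part) => part.trim().toLowerCase())
    .some((part) => {
      const [type, ...params] = part.split(";").map((s) => s.trim());
      if (type !== "application/xhtml+xml") return false;
      // Treat an explicit q=0 as "not acceptable".
      const q = params.find((p) => p.startsWith("q="));
      return !q || parseFloat(q.slice(2)) > 0;
    });
  return {
    "Content-Type": accepted ? "application/xhtml+xml" : "text/html",
    // Tell caches that the response depends on the Accept request header.
    Vary: "Accept",
  };
}

console.log(negotiateContentType("text/html,application/xhtml+xml,*/*;q=0.8"));
console.log(negotiateContentType("image/gif, image/jpeg, */*"));
```

The first call yields application/xhtml+xml and the second yields text/html; both responses carry Vary: Accept so that a proxy does not hand the XML version to a browser that cannot parse it.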

XHTML and JavaScript

When an XHTML document is parsed with a proper application/xhtml+xml MIME type, there may be differences when manipulating DOM elements via JavaScript. Scripts that have not been written with this in mind may work differently or fail in an XHTML environment.
Examples: under HTML, a JavaScript statement like console.log(document.body.tagName); would output "BODY", whereas the same statement under XHTML would output "body".
Similarly, if the mark-up contains <table><tr>..</tr></table>, the table's firstChild in JavaScript would be the tr in XHTML, but a TBODY in HTML.
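
Scripts can be written so that they do not depend on either difference. The helpers below are a sketch with made-up names (hasTagName and firstRow); they rely only on the tagName and children properties, so the plain objects used here to stand in for DOM elements behave the same way as real elements would.

```javascript
// Compare element names without depending on the parser's letter case.
function hasTagName(element, name) {
  return element.tagName.toLowerCase() === name.toLowerCase();
}

// Find the first row of a table whether or not a tbody was inserted.
function firstRow(table) {
  for (const child of table.children) {
    if (hasTagName(child, "tr")) return child;
    if (hasTagName(child, "tbody")) {
      for (const row of child.children) {
        if (hasTagName(row, "tr")) return row;
      }
    }
  }
  return null;
}

// Stand-ins for the two object models built from <table><tr>..</tr></table>.
const xhtmlRow = { tagName: "tr", children: [] };
const xhtmlTable = { tagName: "table", children: [xhtmlRow] };
const htmlRow = { tagName: "TR", children: [] };
const htmlTable = {
  tagName: "TABLE",
  children: [{ tagName: "TBODY", children: [htmlRow] }],
};

console.log(firstRow(xhtmlTable) === xhtmlRow); // true
console.log(firstRow(htmlTable) === htmlRow);   // true
```

In a browser, the built-in rows collection of a table element (table.rows[0]) gives the first row without any such helper.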

Advantages of using XHTML (as application/xhtml+xml)

  • Allows direct interleaving of other XML formats like MathML and SVG.
  • Is theoretically faster to parse. The difference is negligible in practice though.

* Paragraph needs some polishing.