MAMA: Document Encodings

By Brian Wilson


  1. Introduction
  2. How encodings are specified
  3. Can we agree to disagree?


Introduction

A critical part of rendering a document is the browser's determination of the correct character encoding. MAMA tracked most of the methods that can be used to detect the encoding, but it did not attempt to declare a "One, True Encoding" in cases where discrepancies existed. In this section we examine the usage of three of the main encoding sources, how often they overlapped, and whether they agreed with each other when they did.

How encodings are specified

The HTTP headers can provide the encoding through the "charset" parameter of the Content-Type field. The Content-Type (and consequently its "charset" parameter) can also be specified in HTML via the META element, and XML documents have an additional means of signaling the document's encoding: the "encoding" attribute of the XML declaration in the prolog. The encoding was specified using at least one of these methods in 2,626,228 URLs from MAMA (74.84%). As the Venn diagram below indicates, authors strongly prefer the META element over the other two methods for specifying a document's encoding.

Figure: Venn diagram for breakdown of encoding source specification methods (region sizes are not to scale)
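
The three sources described above can be read with a few lines of Python. The following is a minimal sketch, not MAMA's actual detection code: the function names and the regular expressions are illustrative assumptions, and the patterns are deliberately simplified rather than a full HTML or XML parser.

```python
import re
from email.message import Message

# Simplified, hypothetical patterns; real-world markup can defeat both of them.
META_CHARSET = re.compile(
    r"""<meta\b[^>]*?\bcharset\s*=\s*["']?([^\s"'/>;]+)""", re.IGNORECASE
)
XML_ENCODING = re.compile(
    r"""\s*<\?xml\b[^>]*?\bencoding\s*=\s*["']([^"']+)["']"""
)


def charset_from_http(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset")


def charset_from_meta(markup):
    """Return the charset named in the first matching META element, or None."""
    match = META_CHARSET.search(markup)
    return match.group(1) if match else None


def encoding_from_xml(markup):
    """Return the encoding named in a leading XML declaration, or None."""
    match = XML_ENCODING.match(markup)
    return match.group(1) if match else None


def declared_encodings(content_type, markup):
    """Collect the raw, un-normalized value from each of the three sources."""
    return {
        "http": charset_from_http(content_type),
        "meta": charset_from_meta(markup),
        "xml": encoding_from_xml(markup),
    }
```

Calling declared_encodings() with a Content-Type header value and the document text returns one entry per source, with None for any source that declares nothing. The values are returned exactly as written by the author, so any comparison between them still has to decide how to normalize them.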

Can we agree to disagree?

Having multiple encoding sources can quickly become a can of worms: what happens when those encodings do not agree with each other? Nothing is worse than a Web page arguing with itself about its encoding identity. To compare encodings, the various values were all forced to lowercase and leading/trailing spaces were removed. Under this scheme, encoding variations such as "iso-8859-1", "iso_8859_1" and "iso 8859-1" are all treated as different values. The results of this comparison show that in the majority of encoding overlap situations, the values agree (72.96% of all overlap scenarios). However, values are expected to agree; another (negative) way to frame the results is that 133,968 URLs (27.04% of the overlap scenarios) specify multiple encodings that clash with each other. So in more than a quarter of the cases where a browser has more than one encoding source, it must resort to tortuous gymnastics to determine the outcome.
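
The comparison rule described above is simple enough to sketch directly. This is an illustration under the stated rule only (lowercase, trim surrounding whitespace, then compare exactly); the function names and the dictionary layout are assumptions, not part of MAMA.

```python
def normalize(label):
    """Lowercase and trim: the only normalization the comparison applies."""
    return label.strip().lower()


def compare_sources(declared):
    """Classify a mapping of source name -> declared encoding (or None).

    Returns "agree" or "clash" when at least two sources declare an
    encoding, and None when there is no overlap to compare.
    """
    values = [normalize(v) for v in declared.values() if v]
    if len(values) < 2:
        return None
    return "agree" if len(set(values)) == 1 else "clash"


# Differences in case and surrounding spaces are forgiven...
print(compare_sources({"http": " UTF-8 ", "meta": "utf-8", "xml": None}))
# ...but spelling variants of the same encoding still count as a clash.
print(compare_sources({"http": "iso-8859-1", "meta": "iso_8859_1"}))
```

Because no alias resolution is attempted, some of the clashes counted this way are only differences in spelling rather than genuinely different encodings.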

Fig 3-1: Encoding agreement when specified by multiple sources

Encoding specification method            Total      Agree      % Agree
HTTP Header and META only                417,113    293,868    70.45%
META and XML only                        49,115     45,029     91.68%
HTTP Header and XML only                 6,791      4,553      67.04%
All three: HTTP Header, META and XML     22,500     18,101     80.45%
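
The overall figures quoted in the text (72.96% agreement, 133,968 clashing URLs) can be cross-checked against the rows of Fig 3-1. The snippet below is only an arithmetic check; it assumes that the second number in each row is the count of URLs whose declared encodings agree.

```python
# (method, total URLs with this overlap, URLs whose encodings agree)
ROWS = [
    ("HTTP Header and META only", 417_113, 293_868),
    ("META and XML only", 49_115, 45_029),
    ("HTTP Header and XML only", 6_791, 4_553),
    ("All three: HTTP Header, META and XML", 22_500, 18_101),
]


def summarize(rows):
    """Return (total, agreeing, clashing) URL counts over all overlap rows."""
    total = sum(t for _, t, _ in rows)
    agree = sum(a for _, _, a in rows)
    return total, agree, total - agree


total, agree, clash = summarize(ROWS)

for name, row_total, row_agree in ROWS:
    print(f"{name}: {100 * row_agree / row_total:.2f}% agree")

print(f"Overall: {agree:,} of {total:,} agree ({100 * agree / total:.2f}%)")
print(f"Clashing: {clash:,} ({100 * clash / total:.2f}%)")
```

The four rows sum to 495,519 overlapping URLs, of which 361,551 agree and 133,968 clash, which matches the percentages given in the text.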

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.

