MAMA: W3C validator research
Page 1 index : Page 2 index : Page 3 index
- About markup validation—an introduction
- Previous validation studies
- Sources and tools: The URL set and the validator
- What use is markup validation to an author?
- How many pages validated?
- Interesting views of validation rates, part 1: W3C-Member companies
Note that this document is large, so has been broken up into 3 pages; use the navigation at the bottom of the document to navigate between pages.
About markup validation—an introduction
MAMA is an in-house Opera research project developed to create a repeatable and cross-referenced analysis of a significant population of Web pages that represent real world markup. Of course, part of that examination must also cover markup validation—an important measure of a page's adherence to a specific standard. The W3C markup validation tool produces useful metrics that add to the rest of MAMA's breakdown of its URL set. We will look at what validation reveals about these URLs, what it means to validate a document, and what benefits or drawbacks are derived from the process.
The readership of this section of MAMA's research is expected to be the casual Web page author out for a relaxing weekend browse, as well as those developing the W3C validator tool itself, looking for incisive statistics about the validation "State Of The Union". As a result of this diverse audience, some readers will find that some sections are redundant or mystifying (possibly both at the same time even!). Feel free to skip around the article as needed, but the best first-time reading flow is definitely a linear read. Some of the data presented may need some prerequisite knowledge, but I hope that even the most detailed examinations here may be of interest to all readers in some way. There are some positive trends, some surprises, and some disappointments in the figures to follow.
A quick summary:
The good news: Markup validation pass rates are definitely improving over time.
The bad news: The overall validation pass rate is still miserably low and is not increasing as fast as one would hope.
Previous validation studies
There are two previous, large-scale studies of markup validation to which we can compare MAMA's results regarding markup validation trends. Direct correlation with these previous studies was not an original goal of MAMA, but it is a happy accident, given that many of MAMA's design choices happen to coincide.
- Dec. 2001: Dagfinn Parnas's "How to cope with incorrect HTML" thesis; University of Bergen, Norway
- Jun. 2006: Rene Saarsoo's "Coding practices of Web pages" bachelor thesis [PDF, In Estonian] [English summary]
The analysis tools and target URL group were roughly the same between MAMA and these other projects. Both Parnas's and Saarsoo's studies used the WDG validator (see next section), which shares much of the same back-end mechanics with the W3C validator. Both studies also used the DMoz URL set (see next section). The main difference between the URL sets used lies in the amount of DMoz analyzed; where MAMA's research overlaps with Parnas's and Saarsoo's studies, we will attempt to compare results.
|Study||Date||URL Set||Full DMoz Size||Study Set Size|
|Parnas||Dec. 2001||DMoz||~2.5 million||~2.4 million|
|Saarsoo||Jun. 2006||DMoz||~4.4 million||~1.0 million|
|MAMA||Jan. 2008||DMoz||~4.7 million||~3.5 million|
Sources and tools: The URL set and the validator
[For more details about the URLs and tools used in this study, take a look at the Methodology Appendix section of this document.]
Treading on familiar ground: The Open Directory Project (DMoz)
There is a lot of MAMA coverage elsewhere about the DMoz URL set and the decision to use it as the basis of MAMA's research. MAMA did not analyze ALL of the DMoz URLs, though. Transient network issues, dead URLs, and other problems inevitably reduced the set of URLs actually analyzed to its final total of about 3.5 million. The number of URLs from any given domain was also limited in order to decrease per-domain bias in the results. This was an important design decision, because DMoz has a big problem with domain bias (~5% of all URLs in it are solely from cnn.com, for example). Parnas and Saarsoo did not do this, but it has proven to be a useful strategy to employ. I set an arbitrary per-domain limit of 30 URLs, and this seems to be a fair limitation. This restriction policy also helps track per-domain trends; if any are noticeable, they will be presented where they seem interesting.
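The per-domain limit described above could be applied along the following lines. This is only a minimal sketch, not MAMA's actual code: the function name `cap_urls_per_domain` is hypothetical, and it assumes that a "domain" can be approximated by the lower-cased host name of each URL.

```python
from collections import defaultdict
from urllib.parse import urlsplit

MAX_PER_DOMAIN = 30  # the per-domain limit quoted in the text


def cap_urls_per_domain(urls, limit=MAX_PER_DOMAIN):
    """Keep at most `limit` URLs per host name, preserving input order."""
    seen = defaultdict(int)
    kept = []
    for url in urls:
        # Assumption: the host name stands in for the "domain".
        host = (urlsplit(url).hostname or "").lower()
        if seen[host] < limit:
            seen[host] += 1
            kept.append(url)
    return kept


if __name__ == "__main__":
    sample = [f"http://cnn.com/story{i}" for i in range(100)]
    sample.append("http://example.org/")
    # 30 URLs survive for cnn.com, plus the single example.org URL.
    print(len(cap_urls_per_domain(sample)))
```

A first-come, first-kept policy is the simplest choice here; a random sample per domain would be another reasonable option if the input order carries its own bias.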
Any comparison of MAMA's data to other similar studies, even if they also use DMoz, must take into account that DMoz grows and changes over time as editors add, freshen, or delete URLs from its roster. URLs can grow stale or obsolete through removal, and domains can and do die on a distressingly regular basis. The aggregation source of these URLs remains the same, but the set itself is an evolving, dynamic entity.
The W3C validator
To test the URL set, MAMA used the W3C Markup Validator tool (http://validator.w3.org/, v. 0.8.2 released Oct. 2007), which uses the OpenSP parser for its main validation engine. The W3C Markup Validator is a free service from the W3C that helps authors improve the quality of their documents by checking adherence to standards via DTDs. The Parnas and Saarsoo studies both used the WDG validator, but for MAMA's analysis, the W3C validator was the validation tool of choice. As stated on the WDG's Web site, there are many similarities between these two validators,
"Most of the previous differences between the two validators have disappeared with recent development of the W3C validator".
So, even though the validators used are different, there is significant overlap between MAMA's validation study data and the other previous studies. The W3C Quality Assurance group has produced many excellent tools and processes over the years, and that hard work definitely deserves to be showcased in a study like this. Kudos to the W3C validator team!
What use is markup validation to an author?
Why would an author validate a document at all? A validator does not write a Web page for you; the inspiration and perspiration must still come completely from the author. There do not appear to be any real negative consequences to omitting this step. Sticking rigorously to a standard does not necessarily spell success: using a validator on a page and correcting any problems it brings to light does not guarantee that the result will look right on one browser, let alone all of them. Conversely, an invalid page may render exactly the way an author was expecting.
Both authors and readers have come to expect that all browsers perform impeccable error recovery in the face of the worst tag soups the Web can throw at them. Forgiveness is perhaps the most under-appreciated yet important feature we expect from a browser. However, that is asking a lot, especially for the increasingly lightweight devices that are being used to browse the Web. If there are any consequences for sloppy authoring practices, it would be here.
Henri Sivonen properly framed the role of the markup validator in an author's toolkit:
"[A] validator is just a spell checker for the benefit of markup writers so that they can identify typos and typo-like mistakes instead of having to figure out why a counter-intuitive error handling mechanism kicks in when they test in browsers."
Continuing with the spell-checker analogy, there are no dire consequences for a page failing to validate, just as there is seldom a serious consequence of having spelling typos in a document—the overall full meaning is still conveyed well enough to get the point across.
Using the spell-checker analogy also helps dispel a practice that the W3C encourages, something that we will talk more about in a later section—proclaiming that a page has been validated. This is a pointless exercise and means nothing (W3C tool evangelism aside). It is like saying a document has been spell-checked at some time during its history. Any subsequent change to a document can introduce errors—both spelling and syntax-wise—and make the claim superfluous code baggage. As we will show in later sections, pages that have passed validation in the past often do not STAY validated!
Markup validation is a useful tool to help ensure that a page conforms to a target you are aiming for. The most obvious thing to take away from the entirety of the MAMA research is that people are BAD at this "HTML thing". Improper tag nesting is rampant, and misspelled or misplaced element and attribute names happen all the time. It is very easy to make silly, casual mistakes; we all make them. Validation of Web pages would expose all these types of simple (and avoidable) errors in moments.
For even more (and probably better) reasons to validate your documents, have a look at the W3C's excellent treatment of the subject: "Why Validate?".
How many pages validated?
The raw validation numbers
The validator's SOAP response has an element with Boolean content values of "true" and "false". A "true" value is considered a successful validation.
MAMA found that 145,009 out of 3,509,180 URLs passed validation.
|Study||Date||Passed validation||Total validated||Percentage|
Another interesting view of MAMA's URL validation study is how many domains in MAMA contained ANY page that validated: 130,398 (of 3,011,661 distinct domains validated) [4.33%]
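To make the pass/fail test and the resulting percentages concrete, here is a small sketch. It is not MAMA's code: the namespace URI and the `m:validity` element name are assumptions based on the W3C validator's SOAP 1.2 output, and the sample response is hand-written for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace and element names as used by the W3C validator's
# SOAP 1.2 output. The sample document below is hand-written.
NS = {"m": "http://www.w3.org/2005/10/markup-validator"}

SAMPLE = """<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Body>
    <m:markupvalidationresponse
        xmlns:m="http://www.w3.org/2005/10/markup-validator">
      <m:uri>http://example.org/</m:uri>
      <m:validity>true</m:validity>
    </m:markupvalidationresponse>
  </env:Body>
</env:Envelope>"""


def passed_validation(soap_xml):
    """Return True only when the Boolean validity element contains 'true'."""
    root = ET.fromstring(soap_xml)
    node = root.find(".//m:validity", NS)
    return node is not None and (node.text or "").strip().lower() == "true"


def pass_rate(passed, total):
    """Percentage of items that passed validation."""
    return 100.0 * passed / total


if __name__ == "__main__":
    print(passed_validation(SAMPLE))
    # Figures quoted in the text: per-URL and per-domain pass rates.
    print(round(pass_rate(145_009, 3_509_180), 2))
    print(round(pass_rate(130_398, 3_011_661), 2))
```

The two rounded rates correspond to the ~4.13% per-URL and 4.33% per-domain figures given in this article.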
Validation rates where select Web-page authoring features are also involved
Now, we need to ask the same basic "does it validate?" question multiple ways, keeping our main variable (validation rate) constant, while varying other criteria. This has the potential to say some interesting things about the validation rates as a whole, while also providing insight into biases that can arise when mixing popular factors and technologies found in Web pages. Note: instead of listing overall URL totals, the totals mentioned are only for the URLs that use each technology.
|Criteria used to match||Quantity
|IIS Web Server||
|Apache Web Server||
Validation, content management systems (CMS), and editors
MAMA looked at the value to find popular CMSes and editors in use for the following table, looking for any noticeable trends in validation rates. One might expect per-domain numbers to be more interesting in this case than per-URL, because sites are often developed using a single platform, but there is very little difference between the two views. In general, CMSes generate valid pages at markedly higher rates than the overall average, with "Typo3" variants leading at almost 13%. On the other hand, the editor situation has some wild differences. Microsoft's FrontPage has a VERY wide deployment rate, but a depressingly low validation pass rate of ~0.5%. Apple's iWeb editor, however, has a freakishly high validation rate. Kudos to iWeb for this happy discovery.
|Microsoft Visual Studio||272
|Claris Home Page||48
Interesting views of validation rates, part 1: W3C-Member companies
The W3C is the organization that creates the markup standards and the markup validator used in this study. One would hope that the individual companies that support and comprise the W3C would spearhead the effort to follow the standards that the W3C creates. Well, it turns out that is indeed the case. The top pages of W3C-member companies definitely adhere to markup standards at much higher rates than the rest of the Web. However, these "standard-bearers" (pun intended) could definitely do better at this than they currently do.
In February 2002, Marko Karppinen validated 506 URLs of all the W3C-member companies at that time. Only 18 of these pages passed validation. Compared to Parnas's validation study of the DMoz URLs just two months before, the W3C-member company validation rate of 3.56% was considerably better than the 0.7% rate for URLs "in the wild", but it is nothing for the paragons of Web standards to brag about. Such a low validation pass rate could easily be perturbed by any number of transient conditions or other factors.
Saarsoo also did a study of W3C-member company validation rates in Jun. 2006. By that point, the validation situation had improved nicely for the member companies to 17.00%. Fast-forward now to Jan. 2008 [W3C-member-company list snapshot], and we see that the general Web-at-large has caught up to, and even exceeded, the previous validation pass rate of W3C-member companies from Karppinen's study era. The general validation pass rate in the DMoz population is now running at ~4.13%, and the W3C-member company pass rate is a strong 20.15%, with more member companies than ever claiming the validation crown.
|W3C-member list study||Date||Total in
|Marko Karppinen||Feb. 2002||506||506||18||3.56%|
Just showcasing the increased validation rate does not tell the whole story. Saarsoo left an excellent data trail to which to compare the present validation pass rate. It is interesting to note that, although the overall pass rate has increased, many of the sites that passed validation previously no longer do so at the time of writing. Achieving a passing validation status does not seem to be as hard as maintaining that status over time. Compared to Saarsoo's study, there are just as many URLs that previously validated but currently do not as there are URLs that maintained their passing validation status.
|URLs that validated before and do now||25|
|URLs that validated before but do not now and are still in W3C-member-company list||25|
|URLs that validated before but are no longer in W3C-member-company list||11|
Saarsoo commented in 2006 on the dynamic nature of the W3C company roster. In early 2002 there were 506 member companies; the roster dipped to 401 in mid-2006 and, at the present time (early 2008), is back up to 429. To put the change in some perspective, the net loss of companies in the list over this time-frame is 77, which is almost as many companies as the number that currently pass validation. Put simply, a pessimist might say that a company on this list is just about as likely to drop out of the W3C as it is to achieve a successful validation.
The W3C-Member List successful validation Honor Roll
In his 2002 study, Karppinen prominently listed the W3C-member companies whose main URLs passed validation in order to,
"highlight the effort that goes into making an interoperable web site".
This is an excellent idea and is becoming a bit of a time-honored tradition that both the Saarsoo study and this one have followed. The first list from Karppinen was easy to keep inline with the rest of the study, because it was (unfortunately) short and sweet. As the pass rate has improved over time, this list becomes progressively longer. This is the goal, though; everyone wants the list to be too long to display easily. [See the Honor Roll list here.]
And the crown goes to ...
Two companies have maintained valid sites throughout all three studies from 2002-2008. These companies deserve extra congratulations for this feat.
- Joint Info. Systems Comm. of the UK Higher Ed. Funding Council (JISC)
- Opera Software (the company for which the author works)
Many sites are constantly changing, but being a member of an organization that creates standards should be compulsion enough to attain a recognized level of excellence in those standards. Saarsoo ended his 2006 look at the W3C-member list with an optimistic wish for the future,
"Maybe at 2008 we have 50% of valid W3C member sites."
Unfortunately, that number is nowhere close to the current reality. It may be too much for the W3C to require its member-companies' sites to pass validation, but they should definitely try to push for higher levels than they currently attain, to serve as a good example if nothing else.
Interesting views of validation rates, part 2: Alexa Global Top 500
About the Alexa Global Top 500
Now, we will look at another "interesting" small URL set, this time from the Alexa service from Amazon. Alexa utilizes Web crawling and user-installed browser toolbars to track "important sites". It maintains, among many other useful measures, a global "Top 500" list of URLs considered popular on the Web. The Alexa list was chosen primarily because its size was similar to that of the W3C list, so even though MAMA might be comparing apples to oranges, at least it compares a fairly equal number of apples and oranges. The W3C-company list skews toward academic and "big money" commercial computer sites. The Alexa list is representative of what people actually use and experience on the Web on a day-to-day basis.
While few would dispute that Alexa's "Top 500" list is relevant and popular, there are some definite biases in its list:
- It is prejudiced toward big/popular sites with many country-specific variants, such as Google, Yahoo!, and eBay. This ends up reducing the breadth of the list. Google is the most extreme example of this, with 63 of the 487 URLs in the analyzed set being various regional Google sites.
- It includes the top pages of domain aggregators with varied user content, such as LiveJournal, Facebook, and fc2.com. These top pages are not representative of the wide variety of the user-created content they contain.
- The list consists entirely of top-level, entrance, or "surface" pages of a site. There is no intentional "deep" URL representation.
Validating the Alexa Top 500
On 28 January 2008, the then-latest Alexa Top 500 list was inserted into MAMA [January 2008 snapshot list, latest live version]. About half of these URLs were already in MAMA, having been part of other sources. Of the 500 URLs in this list, 487 were successfully analyzed and validated. Only 32 of these URLs passed validation (6.57%). This is a slightly higher percentage rate than the much larger overall MAMA population, but the quantity and difference are still too small to declare any trends.
|Alexa Top 500 List study||Date||Passed validation||Total set-size||percentage|
For future Alexa studies
OK, so the Alexa Top 500 does have some drawbacks. Should the URL set be tossed out entirely? Can this set be improved? Aside from the Top 500, Alexa has a very deep catalog and categorization of URLs, some of them available freely, but most are available only for a fee. Some categories of URLs include division by country and by language. Alexa currently has publicly-available lists of the top 100 URLs for 21 different languages (2,100 URLs) and 117 countries (11,700 URLs). Note: The per-country list represents popularity among users in a country, not sites hosted in the country. An undoubtedly-interesting expanded list of the Alexa Global Top 500 could be created by aggregating all of these sources, which would probably yield 5,000-10,000 URLs (if duplicates were eliminated).
If the validation rates of the Alexa Global Top 500 are studied in the future, the current version of the Top 500 list of URLs will likely be quite different than it is at this time of writing. The topicality of the list is a strength that promotes the relevance of the analysis, but it also makes cross-comparisons over time difficult. Documenting the list that was used in each analysis will help with such comparisons.
Validation badge/icons: An interesting diversion?
Before MAMA had validated even a single URL, the author discovered this page at the W3C's site: http://www.w3.org/QA/Tools/Icons. This page lists icons that,
"may be used on documents that successfully passed validation for a specific technology, using the W3C validation services".
It seemed like an interesting idea to compare the pages that were using these images claiming validation with how they actually validate. This can only be a crude measure for a number of reasons, but, by far, the main one is as follows: an author can easily host the validation icon/badge on their own server and name it anything they want.
For those gearheads in the audience who have some "regexp savvy", the following Perl regular expression was used to identify validation icon/badges utilizing the W3C naming scheme. This pattern match was used against the Src attribute of the IMG elements of URLs analyzed:
This seems to capture fully all the variations of the W3C's established naming conventions (any corrections are very welcome if it does not). Note that the regexp errs on the cautious side and can also capture unintended matches like JPEG files matching the naming scheme. One might think this an error, but it turns out it is not. JPEG versions of the validation icons are not (currently) listed on the W3C's Web site, but a random spot-check showed that the JPEG images thus detected by MAMA ARE validation badge icons! In this case, what appear to be false positives are actually valid after all.
is stored as 'html401-blue'
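As an illustration only, badge detection of this kind might look like the sketch below. The pattern is hypothetical and is NOT the Perl expression used in the study. It assumes W3C-style file names such as "valid-html401" or "valid-xhtml10-blue", optionally followed by an image extension, and it stores the part after "valid-" in the way the 'html401-blue' example suggests.

```python
import re

# Hypothetical pattern, not the study's original Perl regular expression.
# Assumption: W3C-style names of the form "valid-<technology>[-blue]" with an
# optional image extension (JPEG included, as discussed in the text).
BADGE_RE = re.compile(
    r"(?:^|/)valid-([a-z0-9]+(?:-blue)?)(?:\.(?:png|gif|jpe?g))?$",
    re.IGNORECASE,
)


def badge_type(src):
    """Return the badge token for an IMG src value, or None if no match."""
    # Ignore any query string or fragment before matching the file name.
    path = re.split(r"[?#]", src, maxsplit=1)[0]
    match = BADGE_RE.search(path)
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    for src in (
        "http://www.w3.org/Icons/valid-html401-blue.png",
        "images/valid-xhtml10.jpg",
        "/img/logo.png",
    ):
        print(src, "->", badge_type(src))
```

Like the original approach, this only catches badges that keep the W3C naming scheme; an author who renames the image file will not be detected.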
Validation rates of URLs having validation badge/icons
Now we will look at the list of W3C Validation Image Badges found in MAMA by URL [also by domain]. Even with the various pitfalls that could occur with MAMA's pattern matching, there is still a comparison that is interesting to explore: how many pages that use a badge actually validate? If we consider that the only type of badge of real interest in our sample is an HTML variant (html, xhtml), looking for the substrings "html" and "xhtml" within this field in MAMA gives us:
|Type of badge
This is just under 50% in each case, which is frankly a rather miserable hit ratio. If these URLs do not validate, do they bear ANY resemblance to the badge they are claiming?
Comparison of stated validation badge/icon type versus actual detected Doctype
Next, we will try comparing the Doctypes actually detected with the badges claiming compliance to those respective Doctypes. Doctypes detected in both the validator and MAMA analyses are listed for comparison. The situation definitely improves here over the previous figures. Note: Fatal validation errors cause the validator to under-report Doctypes by reporting no Doctype at all in such cases.
|Type Of badge
The validation badges certainly increase public awareness of validation as something for which authors strive, but they do not appear to be the best measure of reality. For the half of badged URLs that claim validation compliance but currently do not validate, one has to wonder whether they ever did validate in the past. Pages definitely tend to change over time, and removing or updating an icon badge may not be high on most authors' lists of "Things To Do". The next time you see such an icon, take its current state with a grain of salt.
For future W3C badge studies
After this survey was completed, the following rather prominent quote was noticed on the W3C's Validation Icons page,
"The image should be used as a link to re-validate the document."
It may be useful to incorporate this fact to identify further validation badges in the future.
What are we examining?
First up is the Doctype. The Doctype statement tells the validator which DTD to use when validating—it is the basic evaluation metric for the document. MAMA used its own methods to divine the Doctype for every document, but the validator actually detects the Doctype in two slightly different ways: one by the validator itself and the other by the SGML parser at the core of the validator.
|Information being used|
|MAMA||Detected Doctype statement|
|Validator||'W09'/'W09x' warning messages|
This is a good time to dissect a Doctype and see what makes it tick. We will look at a typical Doctype statement, and examine all of its parts:
Ex: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|"<!DOCTYPE"||The beginning of the Doctype|
|"html"||This string specifies the name of the root element for the markup type.|
|"PUBLIC"||This indicates the availability of the DTD resource. It can be a publicly-accessible object ("PUBLIC") or a system resource ("SYSTEM") such as a local file or URL. HTML/XHTML DTDs are specified by "PUBLIC" identifiers.|
|"-//W3C//DTD XHTML 1.0 Transitional//EN"||This is the Formal Public Identifier (FPI). This compact, quoted string gives a lot of information about the DTD, such as its Registration, Organization, Type, Label, and the Encoding language. For HTML/XHTML DTDs, the most interesting part of this is the label portion (the "XHTML 1.0 Transitional" part). If the processing entity does not already have local access to this DTD, it can get it from the System Identifier (next portion).|
|"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"||The System Identifier (SI); the URL location of the DTD specified in the FPI|
|">"||The ending of the Doctype|
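The dissection above can be mirrored in a few lines of code. The following is a rough sketch rather than a full SGML parser: the helper names `parse_doctype` and `split_fpi` are hypothetical, only double-quoted identifiers are handled, and the FPI is assumed to have the four "//"-separated parts shown in the example.

```python
import re

# Simplified pattern: root element, optional PUBLIC/SYSTEM keyword, and up to
# two double-quoted identifiers. Real-world Doctypes can be messier than this.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>[^\s>]+)'
    r'(?:\s+(?P<kind>PUBLIC|SYSTEM))?'
    r'(?:\s+"(?P<first>[^"]*)")?'
    r'(?:\s+"(?P<second>[^"]*)")?'
    r'\s*>',
    re.IGNORECASE,
)


def split_fpi(fpi):
    """Split an FPI such as '-//W3C//DTD XHTML 1.0 Transitional//EN'."""
    parts = [part.strip() for part in fpi.split("//")]
    if len(parts) != 4:
        return None  # not in the four-part form assumed here
    registration, organization, type_and_label, language = parts
    doc_type, _, label = type_and_label.partition(" ")
    return {
        "registration": registration,
        "organization": organization,
        "type": doc_type,
        "label": label,
        "language": language,
    }


def parse_doctype(text):
    """Return the parts of the first Doctype statement in text, or None."""
    match = DOCTYPE_RE.search(text)
    if match is None:
        return None
    kind = (match.group("kind") or "").upper() or None
    if kind == "PUBLIC":
        fpi, si = match.group("first"), match.group("second")
    else:
        # With SYSTEM, the only quoted string is the System Identifier.
        fpi, si = None, match.group("first")
    return {"root": match.group("root"), "kind": kind, "fpi": fpi, "si": si}


if __name__ == "__main__":
    example = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
               '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">')
    parsed = parse_doctype(example)
    print(parsed)
    print(split_fpi(parsed["fpi"]))
```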
MAMA's analysis stores the entire DOCTYPE statement, but the validator's SOAP response returns only a portion of it: generally the FPI, although some situations may return the SI instead, or even nothing at all if an error condition is detected. These situations are infrequent, though; only 70 URLs analyzed by the validator returned the Doctype's SI, for example.
The validator examined 3,509,180 URLs overall. Of those, the validator says that 1,474,974 (42.03%) "definitely" did not use a DOCTYPE (indicated by empty content for the <m:doctype> element in the SOAP response). In addition to the empty <m:doctype> element in the SOAP response, the validator also returns explicit warnings for the instances where it does not encounter a Doctype statement: specifically, warning codes 'W09' and 'W09x' are generated by the SGML parser layer of the validator. Is there any correlation between these warning codes and the "official" empty Doctype mentioned in the SOAP response? The quick answer is yes. Some 1,373,352 URLs have either the 'W09' or 'W09x' warnings. Looking closer for a direct correlation, 1,371,899 URLs were issued a 'W09'/'W09x' warning AND do not have a Doctype listed in the SOAP response. This leaves 1,453 URLs that had some sort of validator-detectable Doctype, but for which a No Doctype warning was still issued. Sampling several URLs from the above set showed that, in every case, the Doctype statement was not at the very beginning of the document. So, it appears that the OpenSP parser does not like this, but the validator itself is OK with this scenario.
MAMA also looked at Doctypes in its main analysis, so we can compare the cases where each tool found no Doctype. MAMA found 1,720,886 URLs without a Doctype. This is a rather large discrepancy compared to the validator's numbers above. We must adjust this figure because the SOAP response for a validation failure error returns empty <m:doctype> and <m:charset> elements. To improve the quality of our comparison between MAMA's and the validator's results, we must exclude from our mutual examination all URLs with a positive validator failure count. After this minor adjustment, the numbers are much more in line with each other. To the numbers:
|MAMA detected no Doctype.||1,465,367|
|Validator detected no Doctype.||1,474,974|
|MAMA and validator both detected no Doctype.||1,423,478|
|MAMA detected no Doctype, but the validator did.||41,889|
|Validator detected no Doctype, but MAMA did.||51,496|
The final two numbers are the most interesting. These discrepancies are still quite large (~3% of the overall 'no Doctype detected' count). What could account for this? Some reasons noticed for the differences (there could be others):
- MAMA did not look for a Doctype in the destination document of a META refresh/redirect. The validator appears to do this.
- MAMA does not request or handle gzipped content, but it was occasionally served to it anyway. The validator appears to handle this.
- MAMA looked anywhere in the document for a Doctype, but the validator only looks near the beginning of the document. A rather large set of URLs unfortunately fit this description.
- URL content can change over time, including the addition or deletion of Doctypes. MAMA's analysis occurred in November 2007, and the validation of those same URLs happened in January 2008—over 2 months later. In sampling random parts of the URL set where MAMA did not initially detect a Doctype, a current, live analysis by MAMA does indeed detect a Doctype in most cases tried. Other than a bug existing in MAMA (unfortunately, always possible in any software), this is the best explanation to put forth.
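The rows of the "no Doctype" comparison above are simple set relationships. As a sketch, assuming each tool's results are available as a collection of URLs for which no Doctype was detected (the function name and sample URLs are hypothetical), the breakdown can be computed like this:

```python
def compare_no_doctype(mama_missing, validator_missing):
    """Break two 'no Doctype detected' URL collections into agreement and
    disagreement counts."""
    mama_missing = set(mama_missing)
    validator_missing = set(validator_missing)
    return {
        "both": len(mama_missing & validator_missing),
        "mama_only": len(mama_missing - validator_missing),
        "validator_only": len(validator_missing - mama_missing),
    }


if __name__ == "__main__":
    # Tiny made-up sample, for illustration only.
    mama = {"http://a.example/", "http://b.example/", "http://c.example/"}
    validator = {"http://b.example/", "http://c.example/", "http://d.example/"}
    print(compare_no_doctype(mama, validator))

    # Consistency check on the counts published in the table above.
    mama_total, validator_total, both = 1_465_367, 1_474_974, 1_423_478
    print(mama_total - both, validator_total - both)
```

The final line reproduces the two disagreement rows of the table (41,889 and 51,496) from the three totals.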
Doctype statement present details
What about URLs that had validator-detectable Doctypes? We will linger on the comparison between MAMA's Doctype detection and the Validator's before looking in depth at what those Doctypes were.
|MAMA detected a Doctype.||1,788,294|
|The validator detected a Doctype.||1,625,509|
|MAMA and the validator both detected a Doctype, and it was the same.||1,583,620|
|MAMA and the validator both detected a Doctype, and it was different.||36,119|
Where MAMA and the validator both found a Doctype, they disagree 2.28% of the time. Other than the aforementioned time delay between the MAMA and validator analyses, could there be other reasons to account for this difference? Scanning a list of results for MAMA/validator Doctypes that differed, there may indeed be a trend—and a positive one at that. Of the 36,119 URLs that changed Doctype, 23,390 of them (64.76%) changed from an HTML Doctype to an XHTML Doctype. There are a few reasons mentioned above that could be affecting these results, and the above numbers could be a coincidence, but this looks like a data point supporting the gradual shift from HTML to XHTML.
To summarize the per-URL and per-domain frequency tables for validator Doctype, Transitional FPI flavors have a lock on the top three most popular positions. The other variants trail far behind. If a document has a Doctype, it is likely to be a Transitional flavor of XHTML 1.0 or (even more likely) HTML 4.0x. XHTML 1.0 Strict dominates over any other Strict variant (98% of all Strict types).
Totals for common substrings found in the validator Doctype field
A survey of the FPIs the validator exposed is like a microcosm of the evolution of HTML: there are documents claiming to adhere to "ancient" versions from the early days all the way through to the language's present XHTML incarnations. Searching for a few well-chosen substrings demonstrates this variety well, and we can see how often an author's choice of Doctype FPI goes along with actually passing validation. Out of the 1,625,509 URLs exposing a Doctype to the validator, Strict Doctypes pass validation twice as often as the other flavors, and XHTML Doctypes are much more heavily favored for passing validation than other Doctypes. More could be said about the final two items in the table below (to say the least), but that is left for a future discussion.
|Doctype markup language||Qty||Percentage||Qty passing validation||Percentage passing validation|
|" html 4" (HTML 4 variants)||987,701||60.76%||66,535||6.74%|
|" xhtml 1.0"||544,622||33.50%||71,537||13.14%|
|" html 3.2"||44,642||2.75%||1,753||3.93%|
|" xhtml 1.1"||19,984||1.23%||4,074||20.39%|
|" html 2"||4,792||0.29%||176||3.67%|
|" html 3.0"||884||0.05%||44||4.98%|
|" xhtml 2"||11||0.00%||0||0.00%|
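A tally like the one in this table can be sketched as follows. This is an illustration under stated assumptions, not MAMA's query: it assumes the match is a case-insensitive substring search on the Doctype field, and that the leading space in each substring is what keeps " html 4" from also matching XHTML identifiers. The sample input is made up from FPIs that appear in this article.

```python
# Substrings taken from the first column of the table above.
SUBSTRINGS = [" html 4", " xhtml 1.0", " html 3.2", " xhtml 1.1",
              " html 2", " html 3.0", " xhtml 2"]


def count_by_substring(results, substrings=SUBSTRINGS):
    """Tally [total, passed] per substring.

    `results` is an iterable of (doctype_fpi, passed_validation) pairs.
    Assumption: matching is a case-insensitive substring test.
    """
    totals = {substring: [0, 0] for substring in substrings}
    for fpi, passed in results:
        lowered = fpi.lower()
        for substring in substrings:
            if substring in lowered:
                totals[substring][0] += 1
                totals[substring][1] += int(passed)
    return totals


if __name__ == "__main__":
    sample = [
        ("-//W3C//DTD XHTML 1.0 Strict//EN", True),
        ("-//W3C//DTD HTML 4.01 Transitional//EN", False),
        ("-//W3C//DTD HTML 4.0 Transitional//EN", True),
        ("-//W3C//DTD XHTML 1.1//EN", True),
    ]
    for substring, (total, passed) in count_by_substring(sample).items():
        if total:
            print(repr(substring), total, passed, f"{100.0 * passed / total:.2f}%")
```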
The studies from Parnas and Saarsoo did not use the W3C validator, and, as a consequence, there was not such an extreme focus on Doctype usage. Generally, the validator they used only tracked whether a Doctype was used at all. The main reported error type in Parnas's study was a missing Doctype, with only 18.8% of URLs having one present. By the time of Saarsoo's study, the number of URLs having a Doctype moved up to 39.08%. Fast-forward to now, and that number has grown considerably yet again, to 57.7% according to the W3C validator. This is a very respectable increase over time. If few authors are actually creating valid documents, at least most of them seem to understand that there IS a standard to which they should be adhering.
Doctypes for our small, special interest URL sets
Backtracking just a little, the next two tables are a quick look at the Doctypes used for the W3C-member-company URLs and the Alexa Top 500 list. Almost 76% of those URLs passing validation are XHTML variants in the W3C-company set, and in the Alexa list it is almost 66%.
|Doctype FPI (W3C-member-company URLs)||Qty passing validation||Total of FPI type||Percentage passing|
|-//W3C//DTD XHTML 1.0 Transitional//EN||36||145||24.83%|
|-//W3C//DTD XHTML 1.0 Strict//EN||23||45||51.11%|
|-//W3C//DTD HTML 4.01 Transitional//EN||16||95||16.84%|
|-//W3C//DTD XHTML 1.1//EN||4||8||50.00%|
|-//W3C//DTD HTML 4.0 Transitional//EN||3||22||13.64%|
|-//W3C//DTD HTML 4.01//EN||1||7||14.29%|
|-//W3C//DTD HTML 3.2//EN||0||1||0.00%|
|-//W3C//DTD HTML 4.01 Frameset//EN||0||1||0.00%|
|-//W3C//DTD HTML 3.2 Final//EN||0||1||0.00%|
|-//W3C//DTD XHTML 1.0 Strict//FI||0||1||0.00%|
|-//W3C//DTD XHTML 1.0 Frameset//EN||0||1||0.00%|
|Doctype FPI (Alexa Top 500 URLs)||Qty passing validation||Total of FPI type||Percentage passing|
|-//W3C//DTD XHTML 1.0 Strict//EN||10||37||27.03%|
|-//W3C//DTD XHTML 1.0 Transitional//EN||9||130||6.92%|
|-//W3C//DTD HTML 4.01 Transitional//EN||5||77||6.49%|
|-//W3C//DTD HTML 4.0 Transitional//EN||3||22||13.64%|
|-//W3C//DTD HTML 4.01//EN||2||12||16.67%|
|-//W3C//DTD XHTML 1.1//EN||2||5||40.00%|
|-//iDNES//DTD HTML 4//EN||1||1||100.00%|
|-//W3C//DTD HTML 4.01 Frameset//EN||0||1||0.00%|
|-//W3C//DTD XHTML 1.1//EN http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd||0||1||0.00%|
|-//W3C//DTD XHTML 1.0 Strict //EN||0||1||0.00%|
|-//W3C//DTD XHTML 1.0 Transitional//ES||0||1||0.00%|
|-//W3C//DTD HTML 4.0 Strict//EN||0||1||0.00%|
In the previous section on Doctypes, there were many ways to look at just a single variable (presence or lack of a Doctype). With character sets, it becomes even more complex. Even a simplistic view of character set determination can involve at least three aspects of a document. MAMA, the validator, and the validator's SGML parser ALL have something to say about the choice of a document's character set. To cover every permutation and difference between the many possible charset specification vectors would definitely exhaust the author and most likely bore the reader. Every effort will be made to present some of this data in a way that is not TOO overwhelming.
There are three main areas of interest when determining the character set to use when validating a document:
- The charset parameter of the Content-Type field in a document's HTTP header
- The charset parameter of the Content attribute for a META element
- The encoding attribute of the XML prologue
For brevity, these will be shortened to "HTTP", "META", and "XML" respectively.
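As a rough illustration of these three sources, the following Python sketch pulls a charset value out of each one. It is not MAMA's or the validator's actual detection logic: the function names `charset_sources` and `sources_agree` are hypothetical, and the regular expressions only cover the common, well-formed cases.

```python
import re

# "charset=value" inside a Content-Type style string (quotes optional).
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
# Any META start tag; checked afterwards for an http-equiv Content-Type.
META_TAG_RE = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
HTTP_EQUIV_RE = re.compile(r"http-equiv\s*=\s*[\"']?content-type", re.IGNORECASE)
# The encoding attribute of the XML prologue.
XML_DECL_RE = re.compile(
    r"<\?xml\b[^>]*\bencoding\s*=\s*[\"']([\w.:-]+)[\"']", re.IGNORECASE
)


def charset_sources(content_type_header, markup):
    """Return the charset named by each source, or None where a source is absent."""
    http = CHARSET_RE.search(content_type_header or "")
    meta = None
    for tag in META_TAG_RE.findall(markup):
        if HTTP_EQUIV_RE.search(tag):
            found = CHARSET_RE.search(tag)
            if found:
                meta = found.group(1).lower()
                break
    xml = XML_DECL_RE.search(markup)
    return {
        "HTTP": http.group(1).lower() if http else None,
        "META": meta,
        "XML": xml.group(1).lower() if xml else None,
    }


def sources_agree(sources):
    """True when every source that names a charset names the same one."""
    return len({value for value in sources.values() if value}) <= 1
```

A document served as `text/html; charset=UTF-8` whose META element says `iso-8859-1` would report two different values here, which is the kind of disagreement examined later in this section.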
Character set differences between MAMA and the validator
An important difference exists between MAMA and the validator when talking about character sets. There is an HTTP header that allows a request to specify which character sets it prefers. MAMA sent this "Accept-Charset" header with a value of "windows-1252, utf-8, utf-16, iso-8859-1;q=0.6, *;q=0.1". This header field value is used by Opera (9.10), and MAMA tried to emulate this browser as closely as possible. The character sets that were specified reflect the author's own particular language bias. The validator is another story. It does not send an "Accept-Charset" header field at all. This may cause differences between the two and affect the reported character set results.
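To make the difference concrete, here is a minimal Python sketch of the two request styles. MAMA's real driver was written in Perl, so this is only an analogue; `build_request` is a hypothetical helper, and the header value is the one quoted above.

```python
import urllib.request

# Header value quoted in the text (the one Opera 9.10 sends).
ACCEPT_CHARSET = "windows-1252, utf-8, utf-16, iso-8859-1;q=0.6, *;q=0.1"


def build_request(url, send_accept_charset=True):
    """Build (but do not send) a request, with or without Accept-Charset."""
    request = urllib.request.Request(url)
    if send_accept_charset:
        request.add_header("Accept-Charset", ACCEPT_CHARSET)
    return request


# MAMA-style request: states a charset preference.
mama_style = build_request("http://example.com/")
# Validator-style request: no Accept-Charset header at all.
validator_style = build_request("http://example.com/", send_accept_charset=False)
```

A server that varies its response by `Accept-Charset` could legitimately answer these two requests differently, which is one reason the charset results may not line up.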
MAMA's view of character sets
First up is a look at what MAMA was able to determine about these three fields, and how they are used in combination with each other. The totals here account for all cases where a non-empty value was present for any of the HTTP/META/XML charset specification types. The following tables show the frequencies for the different ways that character sets are established and mixed. A document can have none, any or all of these factors. Note: The XML level in Fig 9-1 appears to be very low in comparison to the other specification methods, but this is because the number of documents with an XML declaration is also rather low. Looked at in this way, that ratio is actually the highest, being even more favorable than the META case at 96,264 of 104,722 URLs (91.92%). Fig 9-2 offers a breakdown of all the combinations of ways to specify a character set. By a large majority, authors do this using only the META element method. The final table, Fig 9-3, shows what happens when more than one source for a character set existed in a document, and whether these multiple values agreed with one another.
|HTTP and META||417,109||2,626,206||15.88%|
|HTTP and XML||6,791||2,626,206||0.26%|
|META and XML||49,115||2,626,206||1.87%|
|All three sources||22,500||2,626,206||0.86%|
|HTTP and META||123,245||417,109||29.55%|
|HTTP and XML||2,238||6,791||32.96%|
|META and XML||4,086||49,115||8.32%|
|All three sources||4,399||22,500||19.55%|
The validator's view of character sets
Now, we will look at the way the markup validator views charset information. The validator generally looks for the same three document sources mentioned previously to determine charset information. Before looking at these actual charset values, it is useful to examine whether the validator's view of charset information is internally consistent or not. It can also be instructive to compare, where possible, the validator's view of charset information versus MAMA's view.
To directly compare validator and MAMA charset information, we must remove some URLs from consideration. The validator's SOAP response returns an empty charset value in all cases where there is a validator failure. It is useful to know if the validator is returning a "truly" empty charset value, so all URLs with a failure error are removed from the examination set for this phase. This immediately reduces our URL group by 408,687 URLs.
The items of interest to look at in the validator response are the contents of the <m:charset> element and warnings issued for no detected charset or charset value mismatch from differing sources. We will explore how/if all these factors mesh when the validator is determining which charset to use.
Validator-detected charsets versus MAMA-detected charsets
The following table is mostly for sanity checking to see if the validator's results resemble MAMA's results. The first two entries have very low totals, but this may involve some corner charset detection cases worth taking a second glance. The third case is a definite indication that the validator has default fallback values used for character set when none is detected through the typical methods.
|Validator charset detected||Condition||Total|
|No||No MAMA charsets detected||47|
|No||MAMA charset detected||1,179|
|Yes||No MAMA charsets detected||592,361|
|Yes||Validator also issued: "Warning! Conflicting charsets..." message||118,367|
|Yes||Validator also issued: "Warning! No charset found..." message||480,942|
Validator Warning 04 issued: No character encoding found
This table might be a little confusing with some of the double negatives being tossed around. The presence of a Warning 04 means that the SGML parser portion of the validator did not detect a character set. This result may differ from what the validator ends up deciding should be used for the charset. Note that the table splits the same URLs two ways: rows 1 and 2 (split by validator charset) add up to the same total as rows 3 and 4 (split by MAMA charset), 2,619,541 URLs, and rows 5 and 6 add up to the same total as rows 7 and 8, 480,942 URLs. Row 5, with a total of zero, is another indication that the validator uses a default character set value.
|Warning 04||Charset state||Total|
|No||No validator charset detected||1,226|
|No||Validator charset detected||2,618,315|
|No||No MAMA charset detected||137,286|
|No||MAMA charset detected||2,482,255|
|Yes||No validator charset detected||0|
|Yes||Validator charset detected||480,942|
|Yes||No MAMA charset detected||455,122|
|Yes||MAMA charset detected||25,820|
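The table is easier to read as two independent splits of the same URLs: once by whether the validator reported a charset, and once by whether MAMA detected one. The short Python sketch below re-adds the figures from the table to confirm that both splits reach the same totals; the dictionary layout and the `split_totals` helper are only illustrative.

```python
# Figures copied from the Warning 04 table above.
ROWS = {
    ("No", "No validator charset detected"): 1_226,
    ("No", "Validator charset detected"): 2_618_315,
    ("No", "No MAMA charset detected"): 137_286,
    ("No", "MAMA charset detected"): 2_482_255,
    ("Yes", "No validator charset detected"): 0,
    ("Yes", "Validator charset detected"): 480_942,
    ("Yes", "No MAMA charset detected"): 455_122,
    ("Yes", "MAMA charset detected"): 25_820,
}


def split_totals(warning_04):
    """Return (total by validator charset state, total by MAMA charset state)."""
    by_validator = (
        ROWS[(warning_04, "No validator charset detected")]
        + ROWS[(warning_04, "Validator charset detected")]
    )
    by_mama = (
        ROWS[(warning_04, "No MAMA charset detected")]
        + ROWS[(warning_04, "MAMA charset detected")]
    )
    return by_validator, by_mama


for state in ("No", "Yes"):
    validator_total, mama_total = split_totals(state)
    print(f"Warning 04 = {state}: {validator_total:,} vs {mama_total:,}")
```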
Validator Warnings 18-20 issued: Character encoding mismatches
In these cases, the validator discovers more than one encoding source, and there is some disagreement between them. The validator does not say what the disagreement was, so for some idea, we can look at the data MAMA discovered about these sources. Note that the final row in each table is the expected scenario for the warning to be generated; naturally, those totals are the highest by a wide margin. URLs from the other rows may merit further testing, but there is one reason mentioned before that can explain at least some of these quantities: the two-month delta between MAMA's analysis and the validator's analysis of the URL set.
Validator-detected charset values
We have saved the best of our character set discussion for last: what values are actually used by the validator for character set? (We will be looking at similar frequency tables for each of the MAMA-detected charset sources (HTTP header, META, XML) in another section of this study.) The full per-URL and per-Domain frequency tables for validator charset show very little movement between the two—you have to go down to #17 before there is a difference! Below is an abbreviated per-URL frequency table for validator character-set values (out of 243 unique values found for this field).
- Validator failures
- Validator warnings
- Validator errors
- Summing up ...
- Appendix: Validation methodology
When the validator runs into a condition that does not allow it to validate a document, a failure notice is issued. The validator defines nine different conditions as fatal errors, but MAMA only encountered four of them among all the URLs it has processed through the validator. It is certainly possible that MAMA's URL selection mechanism helped prevent the other five from occurring. Some 408,920 URLs out of the 3,509,180 URLs validated (11.65%) officially failed validation for various reasons.
|Failure type||Encountered by MAMA||Description|
|Transcode error||No||Occurs when attempting to transcode the character encoding of the document|
|Byte Error||Yes||Bytes found that are not valid in the specified character encoding|
|URI Error||No||The URL Scheme/protocol is not supported by the validator|
|No-content error||No||No content found to validate|
|IP Error||No||IP address is not public|
|HTTP Error||Yes||Received unexpected HTTP response|
|MIME Error||Yes||Unsupported MIME type|
|Parse External ID Error||Yes||Reference made to a system-specific file instead of using a well-known public identifier|
|Referer Error||No||Referer check requested but 'Referer' HTTP header not sent|
Frequencies of failure types in MAMA
By far the most frequent failure is the "Fatal Byte Error", which occurred 300,008 times (8.55% of all URLs validated). This error type occurs when bytes in the document are not valid in the detected character encoding. This is an indication to the validator that it cannot trust the information it has about the document, so it chooses to quit trying rather than attempt to validate incorrectly.
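The idea behind a byte error can be illustrated with a strict decode. The Python sketch below is an analogue, not the validator's own (Perl) implementation, and `has_byte_error` is a hypothetical helper name.

```python
def has_byte_error(raw, charset):
    """True if the raw bytes cannot be decoded strictly in the given charset."""
    try:
        raw.decode(charset, errors="strict")
    except UnicodeDecodeError:
        return True
    return False


latin1_bytes = "café".encode("iso-8859-1")  # b'caf\xe9'
utf8_bytes = "café".encode("utf-8")         # b'caf\xc3\xa9'

# ISO-8859-1 bytes mislabeled as UTF-8 are caught ...
print(has_byte_error(latin1_bytes, "utf-8"))      # True
# ... but UTF-8 bytes mislabeled as ISO-8859-1 are not, because every
# byte value is a valid ISO-8859-1 character.
print(has_byte_error(utf8_bytes, "iso-8859-1"))   # False
```

The second case shows why the absence of a byte error does not prove the declared encoding is the right one; it only means the bytes are legal in it.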
An additional failure mode relating to MAMA's processing of the validator's activities should be mentioned. If MAMA did not receive a response back from the validator, or some other (possibly) temporary factor caused an interruption between MAMA and the validator, an "err" message code was generated. MAMA encountered this type of error 34,950 times out of the 3,509,180 URLs (1.00%) that were passed to the validator. Note that MAMA has not yet tried to re-validate any of these URLs. There are various pluses and minuses to dismissing the "err" state, or any other validator failure mode from the overall grand total of URLs validated. These failed URLs remain in the final count, but if you disagree, there is enough numerical data to be able to arrive at your own tweaked numbers and percentages.
|Failure type||Number of occurrences||Percentage of URLs validated|
|Fatal byte error||300,008||8.55%|
|Fatal HTTP error||63,908||1.82%|
|Fatal Parse Extid error||8,360||0.24%|
|Fatal MIME error||1,709||0.05%|
Number of failures
A field was created in the MAMA database to store the number of failures encountered in a document. The expectation was that the validator could only experience one failure mode at a time, so this field would hold either a '0' or '1'. Imagine the surprise when 248 URLs registered as having two failure types at the same time! It turns out that in every one of these cases, it was the "Fatal Byte Error" and "Fatal MIME Error" occurring at the same time.
[Note: 98 of the 248 URLs returning these double-failure modes are definitely text files (ending in ".txt") and should be removed from consideration]
The validator issues a Warning if it detects missing or conflicting information important for the validation process. In such cases, the validator must make a "best guess"; if the validator has chosen wrong, it can negate the entire validation results. The validator suggests that all Warning issues be addressed so that the validator can produce results that have the highest confidence.
The validator can produce 27 different types of Warnings, but MAMA only encountered 14 of them in its journeys through DMoz and friends. A specific Warning type will only be issued once for a URL if it is encountered, but multiple Warning types can be issued for the same URL.
Frequencies of Warning types
The most common Warning type in MAMA's URL set was W06/"Unable to determine parse mode", with W09/"No DOCTYPE found" coming a close second. These two each dwarf all other Warning types combined by a factor of two. For full explanations of the Warning codes, see the Validator CVS.
|Warning code||Warning description||Frequency||Percentage|
|W06||Unable to determine parse mode (XML/SGML)||1,585,029||45.17%|
|W09||No DOCTYPE found||1,372,864||39.12%|
|W04||No character encoding found||480,942||13.71%|
|W19||Character encoding mismatch (HTTP header/META element)||113,927||3.25%|
|W11||Namespace found in non-XML document||65,807||1.88%|
|W23||Conflict between MIME type and document type||19,097||0.54%|
|W21||Byte-order mark found in UTF-8 File||17,148||0.49%|
|W22||Character Encoding suggestion: use XXX instead of YYY||8,237||0.23%|
|W24||Rare or unregistered character encoding detected||7,149||0.20%|
|W18||Character encoding mismatch (HTTP header/XML encoding)||3,220||0.09%|
|W20||Character encoding mismatch (XML encoding/META element)||1,220||0.04%|
|W09x||No DOCTYPE found. Checking XML syntax only||488||0.01%|
|W07||Contradictory parse modes detected (XML/SGML)||72||0.00%|
|W01||Missing 'charset' attribute (HTTP header for XML)||21||0.00%|
Warnings in combination
MAMA never encountered more than five different Warning types at a time for any given URL. The most common scenario found was for a URL to have two types of Warnings at a time. There is a definite correlation between the two most frequent Warning types and that big "bump" in the Warning-count list below. Of the 1,025,319 cases where only two different Warning types were encountered, 951,957 (92.84%) were the W06 and W09 type together.
Ex: 5 Warning types in combination: http://www.hazenasvinov.cz
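A tally like this can be produced by treating each URL's Warnings as a set and counting both the set sizes and the sets themselves. The Python sketch below uses a tiny made-up sample (the URLs and their Warning lists are hypothetical) and then recomputes the W06 + W09 share from the two totals quoted above.

```python
from collections import Counter


def combination_counts(url_warnings):
    """Count URLs by number of distinct Warning types and by exact combination."""
    by_size = Counter()
    by_combo = Counter()
    for codes in url_warnings.values():
        combo = frozenset(codes)  # each Warning type counts once per URL
        by_size[len(combo)] += 1
        by_combo[combo] += 1
    return by_size, by_combo


# Hypothetical sample data, for illustration only.
sample = {
    "http://example.com/a": ["W06", "W09"],
    "http://example.com/b": ["W09", "W06"],
    "http://example.com/c": ["W04"],
    "http://example.com/d": [],
}
by_size, by_combo = combination_counts(sample)

# Share of two-Warning URLs that are W06 + W09, from the totals in the text.
w06_w09_share = 951_957 / 1_025_319 * 100
print(f"{w06_w09_share:.2f}%")
```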
... And, er ... those other types of warnings too
The truth is, the validator seems to define a warning somewhat loosely, hence the capitalized use of "Warning" in the previous section to make the validator's two interpretations distinct. Firstly, it defines a "Warning" according to the warning codes and meanings in the above section, where MAMA encountered no more than 5 Warning types at a time. The validator additionally has a warnings section in its SOAP output, and a warning summary count. When the validator uses this latter interpretation of warning, it seems to have a more liberal meaning. It lumps other error types in with the strict Warnings measure as classified before. By this accounting, a number of URLs in DMoz have more than 10,000 of these warnings each.
The URL that contained the most "warnings" of this expanded type is a blog at: http://club-aguada.blogspot.com/. In MAMA's initial analysis, it reported 19,602 warnings! When this research was being collected soon after, this URL was re-checked through the validator on 16 Feb., 2008, and it still had 14,838 warnings, plus an additional 14,949 errors. This URL only has about 10-20 paragraphs of text content and an additional 1,400 or so non-visible search engine spam hyperlinks. Such a big change in results in a short amount of time seems somewhat suspect, but content in blogs tends to change rather rapidly, which could account for the difference.
What IS of concern is how a page that is less than 250KB in size generates over 26MB from the validator's SOAP output mode. The SOAP version is much more terse than the HTML output, so the validation results size could have been even bigger. A validation result like this is just far too excessive. Perhaps the validator should offer a way (at least as an option) to truncate the warnings and/or errors after a certain amount to control this problem.
Any problem or issue that the validator can recognize that is not a failure or a warning is just a common "error". Errors have the most variety—446 are currently defined in the error_messages.cfg file in the validator's code. The validator only encountered 134 of them through MAMA's URL set. The validation studies done by Parnas and Saarsoo kept track of far fewer error types—perhaps to decrease the studies' complexity. MAMA kept track of them all in the hopes that it might be useful to those developing or using the validator. First we will take a look at the various error types and error frequencies. To wrap things up, we will showcase URLs demonstrating some of the extreme error scenarios discovered (the URLs exhibited the error behavior at the time of writing but can change over time).
For each error type found in a URL, MAMA stored only the error code and the number of times that error type occurred. Shown below is a short "Top 10" list of the most frequent error types. The frequency ratios for the top errors generally agree with Saarsoo's research, with a few minor differences. The error that happens most often in the analyzed URL set is #108 (2,253,893 times), followed closely by #127 (2,013,162 times). Coming in third is an interesting document structural error, #344: "No document type declaration; implying X". This error appears to mirror the functionality of Warning W09/W09x, "No DOCTYPE found" (see previous section), very closely; notice that the occurrence numbers for the two types are almost identical.
|Error code||Error description||Frequency||Percentage|
|108||There is no attribute X||2,253,893||64.23%|
|127||Required attribute X not specified||2,013,162||57.37%|
|344||No document type declaration; implying X||1,371,836||39.09%|
|79||End tag for element X which is not open||1,232,169||35.11%|
|64||Document type does not allow element X here||1,229,145||35.03%|
|76||Element X undefined||1,114,796||31.77%|
|325||Reference to entity X for which no system identifier could be generated||859,846||24.50%|
|25||General entity X not defined and no default entity||859,636||24.50%|
|338||Cannot generate system identifier for general entity X||859,636||24.50%|
|247||NET-enabling start-tag requires SHORTTAG YES||798,046||22.74%|
The full validator error-type frequency table for MAMA's study is in a separate document. For brevity, only the error codes are listed there. The complete list of validator error codes and their explanations can be found on the W3C's site. Note that a few error message codes are not described in the aforementioned W3C document, and need a little extra exposition:
- "xmlw": XML well-formedness error
- "no-x": No-xmlns error (No XML Namespace declared)
- "wron": Wrong-xmlns (Incorrect XML Namespace declared)
Quantity of error types
There were 3,000,493 URLs where at least one validation error occurred. But among these URLs, there was a great variety in the types of errors encountered. The vast majority of URLs with errors had 10 or fewer types of errors. The average total number of validation errors per page is 46.70.
DMoz has many URLs, and some are bound to have unbelievable numbers of errors. Believe it, though—the following three tables showcase the most extreme offenders in generating validator error messages.
The URLs in these lists are fairly diverse. Some of the documents are long, yet some are also fairly brief (considering the error quantity). Some use CSS or scripting, while others do not. IIS and Apache are usually both well-represented. The only noticeable tendency is found in the last table (Fig 13-5) for the widest variety of error types; five of the eight worst offenders in this category use Microsoft IIS 6.0/ASP.NET servers (note the same URL pattern in 4 of them). There is no noticeable correlation other than this. One plausible explanation for the inflated error numbers could be that these IIS servers sniff the User-Agent header string and deliver lower-quality content based on the validator's UA value "W3C_Validator/1.575".
|URL||Error Type||Error Qty|
Summing up ...
Parnas' study presented an interesting statistic:
"In October 2001, the W3C validator validated approximately 80,000 documents per day"
Olivier Théreaux, who currently works on development of the W3C validator, provided an updated usage statistic in February 2008 of roughly 700,000-800,000 URLs per day. This is about a ten-fold increase. Awareness of the process of validating documents definitely seems to be increasing over time, as this sharp increase in usage of the validator indicates. The perceived importance of having documents pass validation, though, needs to improve. Yes, the pass rate in the general Web population has also increased at a respectable rate, from 0.71% to 4.13% in "just" six years. It has increased similarly for the W3C member companies in that time. However, the W3C members appear to regress in their validation pass state about as often as they attain this goal. How can the Web-at-large strive to do better when these key companies do not seem to be trying harder? As we have seen with the (non-)success of the validation icon badge, it is one thing to say you support the standards (and validation as a means to that end), but it does not necessarily reflect reality.
If we relax our concentration on simply passing validation, we notice that support for other parts of this process is improving nicely over time. At least one aspect of the validation process has made great strides and definitely contributes to a perceived importance for document correctness: Doctype usage. Doctypes help concentrate author focus toward thinking about what standards their documents are trying to adhere to. This can only help the validation cause over time. The Web may be crossing an important threshold in this regard. The number of URLs in this study carrying a Doctype of some kind has just barely crossed the 50% boundary. In the U.S. political system this is called a "clear mandate", so an avalanche of authors validating their documents must not be far behind ... right? Joking aside, there is a clear and obvious connection between claiming to adhere to a standard and then actually doing so. Increased outreach by the standards community to help developers connect these two points can only help matters here.
Appendix: Validation methodology
Markup validation was the last main phase of the research completed. MAMA only attempted to validate the URLs that were successfully analyzed in the other big analysis phase, so as to maximize the possibility for data cross-referencing.
The URL set
MAMA employed several strategies to refine and improve the analysis set of URLs. The full size of the DMoz URL set was ~4.5 million as of Nov. 2007, which was distilled down to ~3.5 million URLs. Saarsoo's study chose to follow, as closely as possible, the URL-selection strategy that Parnas used in his study, to ensure maximum compatibility between the two. MAMA's URL selection methods do not directly match these other studies. Even with the set size reduction, this appears to be the largest URL sample of validation trends to date.
- URL sets analyzed:
- Basic filtering: Domain limiting of the randomized URL set to no more than 30 URLs analyzed per domain
- Other filtering: Excluded non-HTTP/HTTPS protocols
- Skipped analysis of URLs that hit any failure conditions
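The filtering steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MAMA's actual code: `limit_per_domain` is a hypothetical name, and "domain" is taken here to mean the full hostname, which the study does not specify.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def limit_per_domain(urls, max_per_domain=30, seed=None):
    """Shuffle the URL list, drop non-HTTP(S) URLs, and cap URLs per hostname."""
    rng = random.Random(seed)
    shuffled = list(urls)
    rng.shuffle(shuffled)  # randomize first, so the cap keeps a random subset

    seen = defaultdict(int)
    kept = []
    for url in shuffled:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue  # exclude other protocols (ftp, mailto, ...)
        host = parts.hostname  # already lowercased by urlsplit
        if not host or seen[host] >= max_per_domain:
            continue
        seen[host] += 1
        kept.append(url)
    return kept
```

Shuffling before applying the cap matters: capping an ordered list would always keep the same first 30 URLs of a large domain rather than a random sample of it.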
Various parts of the examined URL sets have definite bias. Alexa's top URL lists, for example, are the result of usage stats from voluntary installation of a Windows-only MSIE toolbar. The DMoz set has definite top-page-itis—it is skewed heavily toward the root/home pages of domains by as much as 80%!
The W3C validator
MAMA was only able to employ two local copies of the W3C validator on separate machines. One of these machines was very "old" and weak by today's hardware standards, while the other one was more of a "typical" modern system. The weak machine was simply not up to the task and could only handle about 1/10th of the load that the more powerful machine easily handled. MAMA would feed a URL to the validator, parse the output result, then send it to the MAMA database for storage, and move on to the next URL in the list to be analyzed ... rinse and repeat until complete. The big bottleneck was the validator. If MAMA had more validators available to use, the processing time would be drastically cut from weeks to days.
- Validator machine 1: CPU: Intel 2.4GHz dual core P4; RAM: 1GB
- Validator machine 2: CPU: AMD 800MHz; RAM: 768MB
- Driver script: Perl (using LWP module for validator communication and DBI module for database connectivity)
- Number of driver scripts: Usually about 10 at a time
- Duration of validation: 8-29 January, 2008 (~ 3 weeks), usually 24/7
- Processing rate: ~150,000 URLs per day
- How many URLs validated: 3,509,170 URLs from 3,011,661 domains.
- URL list: randomized
The markup validator has a number of processing options, but a main goal for the validation process was to keep the analysis simple and direct. Each candidate URL was passed to the validator using the following options. The SOAP output was chosen for its brevity and ease of results parsing.
- Charset: Detect automatically
- Doctype: Detect automatically
- Output: SOAP
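For readers who want to try something similar, a request to a validator instance with these options might be built as in the Python sketch below. The base URL is a hypothetical local install, and the sketch assumes the validator's `check` script with its `uri` and `output=soap12` parameters; leaving out any charset or Doctype override is assumed to leave both on automatic detection.

```python
from urllib.parse import urlencode

# Hypothetical address of a locally installed validator.
VALIDATOR_CHECK_URL = "http://localhost/w3c-validator/check"


def validator_request_url(page_url, base=VALIDATOR_CHECK_URL):
    """Build a check URL that asks for SOAP output and overrides nothing else."""
    query = urlencode({"uri": page_url, "output": "soap12"})
    return f"{base}?{query}"


print(validator_request_url("http://example.com/a b?x=1"))
```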
MAMA stored a compacted version of the results of each URL validation. In retrospect, it would also have been useful to store at least part of each error description (the unique arguments portion), but during this first time through there was no way to know just how much storage all that data would need. So, MAMA opted to store as little as possible. As it is, MAMA's abbreviated format stored over 25 million rows of data for the abbreviated error messages alone. A goal for "next time" is to store all the unique error arguments in addition to what MAMA currently stores.
- Did it validate? (Pass/Fail)
- Doctype FPI
- Character set
- Number of warnings
- Number of errors
- Number of failures
- Date the URL was validated
- An aggregated list of error types and the quantity of those errors for the URL
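A compact record along these lines could be filled from a SOAP response as in the following Python sketch. It is not MAMA's Perl code. The element names (`m:validity`, `m:doctype`, `m:charset`, `m:errorcount`, `m:warningcount`, `m:messageid`) and the namespace are assumptions based on the validator's SOAP 1.2 output format, the sample response is hand-written and heavily trimmed, and the record omits the failure count and validation date for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from dataclasses import dataclass, field

# Assumed namespace of the validator's SOAP response elements.
NS = {"m": "http://www.w3.org/2005/10/markup-validator"}


@dataclass
class ValidationRecord:
    passed: bool
    doctype: str
    charset: str
    warning_count: int
    error_count: int
    error_types: Counter = field(default_factory=Counter)  # error code -> quantity


def _text(root, path, default=""):
    node = root.find(path, NS)
    return (node.text or default).strip() if node is not None else default


def parse_soap(xml_text):
    """Reduce a SOAP response to the compact per-URL record."""
    root = ET.fromstring(xml_text)
    codes = Counter(
        (node.text or "").strip()
        for node in root.iterfind(".//m:error/m:messageid", NS)
    )
    return ValidationRecord(
        passed=_text(root, ".//m:validity") == "true",
        doctype=_text(root, ".//m:doctype"),
        charset=_text(root, ".//m:charset"),
        warning_count=int(_text(root, ".//m:warningcount", "0")),
        error_count=int(_text(root, ".//m:errorcount", "0")),
        error_types=codes,
    )


# Hand-written, trimmed sample response (illustrative only).
SAMPLE = """\
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
 <env:Body>
  <m:markupvalidationresponse xmlns:m="http://www.w3.org/2005/10/markup-validator">
   <m:uri>http://example.com/</m:uri>
   <m:doctype>-//W3C//DTD XHTML 1.0 Transitional//EN</m:doctype>
   <m:charset>utf-8</m:charset>
   <m:validity>false</m:validity>
   <m:errors>
    <m:errorcount>3</m:errorcount>
    <m:errorlist>
     <m:error><m:messageid>108</m:messageid></m:error>
     <m:error><m:messageid>127</m:messageid></m:error>
     <m:error><m:messageid>108</m:messageid></m:error>
    </m:errorlist>
   </m:errors>
   <m:warnings>
    <m:warningcount>1</m:warningcount>
    <m:warninglist>
     <m:warning><m:messageid>W09</m:messageid></m:warning>
    </m:warninglist>
   </m:warnings>
  </m:markupvalidationresponse>
 </env:Body>
</env:Envelope>
"""

record = parse_soap(SAMPLE)
print(record.passed, record.charset, dict(record.error_types))
```

Storing only the error code and its quantity per URL, as the `error_types` counter does, is what keeps the record small; the unique error arguments mentioned above would need an additional field.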
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.