MAMA: Markup report, part 1: the basics
Our first look at markup in MAMA's URL set will begin with the basics. We will look at some of the housekeeping concerns of every document: how they are encoded, how big documents are, how much of the document is actual markup, doctype usage, etc. We will also give a quick overview of the most popular elements and attributes before we start digging deeper into those sub-topics. This overview should give you an idea of what markup documents look like in the broadest sense.
For a deeper look at these areas and more, the following MAMA article topics are also available this week:
Before a document's content can be examined, its encoding must be determined. The biggest trouble with specifying HTML encoding is that there are so many ways to do it. A document may specify none, one, or even ALL of the different methods. And, if there is disagreement at any of these levels of twirled spaghetti, a precarious dance must ensue.
MAMA tracked the specific encoding values from three primary locations:
- The "charset" parameter of the HTTP header:
Content-Type: text/html; charset=ISO-8859-1
- The "charset" parameter of the Content-Type META element
- The encoding value given in the XML declaration
The HTML encoding was specified using at least one of these methods in 2,626,228 of the URLs MAMA examined (74.8%). Of these, the dominant scheme was the META element syntax, used in ~90% of the cases where MAMA detected any of the 3 encoding sources.
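As a rough illustration of the three locations listed above, the sketch below pulls the declared encoding from each one and reports whether they disagree. It is not MAMA's actual detection code: the function names and regular expressions are assumptions made for this example, the "utf-8" values in the usage note are illustrative, and the patterns are deliberately simple (for instance, the META pattern expects http-equiv to appear before content).

```python
import re

# Hypothetical patterns for this sketch; a real HTML parser is more robust.
HEADER_RE = re.compile(r'charset\s*=\s*"?([\w.:-]+)', re.IGNORECASE)
META_RE = re.compile(
    r'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?[^>]*'
    r'charset\s*=\s*([\w.:-]+)',
    re.IGNORECASE,
)
XML_RE = re.compile(
    r'<\?xml[^>]*encoding\s*=\s*["\']([\w.:-]+)["\']', re.IGNORECASE
)


def declared_encodings(content_type_header, markup):
    """Return the encoding declared at each of the three locations (or None)."""
    found = {}
    for name, pattern, text in (
        ("http", HEADER_RE, content_type_header or ""),
        ("meta", META_RE, markup),
        ("xml", XML_RE, markup),
    ):
        match = pattern.search(text)
        found[name] = match.group(1).lower() if match else None
    return found


def sources_disagree(found):
    """True when at least two locations declare different encodings."""
    return len({value for value in found.values() if value}) > 1
```

For example, passing the header value "text/html; charset=ISO-8859-1" together with a document whose META element declares utf-8 yields a disagreement, which is the situation where a browser has to decide which source wins.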
[Figure not reproduced here. Note: its region sizes were not drawn to scale.]
Overall document sizes
Among the many statistics MAMA gathered about the documents it analyzed, the length statistics are interesting to examine. One such criterion is the basic overall document length. This is simply the length of the original document, without adding in any of the page's external dependencies such as CSS or scripting. The average basic document size of MAMA's analyzed URLs was 16,406 characters.
|Size range (characters)||Frequency||Percentage|
|0 && <= 5,000||1,113,224||31.7%|
|>5,000 && <= 10,000||717,825||20.5%|
|>10,000 && <= 15,000||509,456||14.5%|
|>15,000 && <= 20,000||324,765||9.3%|
|>20,000 && <= 25,000||213,093||6.1%|
|>25,000 && <= 50,000||432,129||12.3%|
|>50,000 && <= 75,000||112,481||3.2%|
|>75,000 && <= 100,000||40,349||1.1%|
|>100,000 && <= 200,000||35,354||1.0%|
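A distribution like the one above can be produced by sorting each document length into a range. The following is a minimal sketch of that bucketing, not MAMA's own code: the function names are hypothetical, and the final ">200,000" catch-all bucket is an assumption, since the table stops at 200,000.

```python
from bisect import bisect_left
from collections import Counter

# Inclusive upper bounds of the size ranges used in the table above.
UPPER_BOUNDS = [5_000, 10_000, 15_000, 20_000, 25_000,
                50_000, 75_000, 100_000, 200_000]


def size_bucket(length):
    """Return a range label for a document length in characters."""
    index = bisect_left(UPPER_BOUNDS, length)
    if index == 0:
        return f"<= {UPPER_BOUNDS[0]:,}"
    if index == len(UPPER_BOUNDS):
        # Assumed catch-all; the table does not show a row beyond 200,000.
        return f">{UPPER_BOUNDS[-1]:,}"
    return f">{UPPER_BOUNDS[index - 1]:,} && <= {UPPER_BOUNDS[index]:,}"


def size_histogram(lengths):
    """Count how many document lengths fall into each range."""
    return Counter(size_bucket(length) for length in lengths)
```

Note that the average of 16,406 characters lands in the ">15,000 && <= 20,000" range, even though the first two rows of the table already account for 52.2% of the documents. A relatively small number of large documents pulls the average up.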
But what about all those dependencies?
One other interesting length factor MAMA tracked was labeled "extras". This measure added up the sizes of all external CSS, scripting, frames, and IFrames referenced by the main document. While the basic document length gives some idea of a user's initial download penalty, this "extras" length gives a better sense of the overall weight of a page before its objects (such as images and plug-ins) are loaded. Averaged over all documents, the "extras" total 20,296 characters; averaged over only the documents that have any "extras", the figure rises to 28,038 characters. This is definitely a case where the documents that are "extras heavy" are throwing off the total average: as you can see, most documents actually have an "extras" sum of less than 10,000 characters. So, by at least one measure, the average page will download at least as much in "extras" content as it must download for the main document.
|Size range (characters)||Frequency||Percentage|
|>0 && <= 5,000||753,392||21.5%|
|>5,000 && <= 10,000||361,460||10.3%|
|>10,000 && <= 15,000||207,283||5.9%|
|>15,000 && <= 20,000||182,116||5.2%|
|>20,000 && <= 25,000||190,039||5.4%|
|>25,000 && <= 50,000||467,214||13.3%|
|>50,000 && <= 75,000||173,156||4.9%|
|>75,000 && <= 100,000||78,497||2.2%|
|>100,000 && <= 200,000||95,093||2.7%|
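The two "extras" averages quoted above also tell us roughly what share of documents reference any "extras" at all. Documents without "extras" contribute zero to the total, so the overall average equals the non-zero average multiplied by that share. The short sketch below works through the arithmetic; the constant and function names are made up for the example, and the input figures are the ones reported in this article.

```python
# Averages reported in this article, in characters.
MEAN_MAIN = 16_406            # basic document length
MEAN_EXTRAS_ALL = 20_296      # "extras" averaged over every document
MEAN_EXTRAS_NONZERO = 28_038  # "extras" averaged over documents that have any


def share_with_extras(mean_over_all, mean_over_nonzero):
    """Fraction of documents with at least one 'extra'.

    Follows from: mean_over_all = share * mean_over_nonzero,
    because documents with no extras add nothing to the sum.
    """
    return mean_over_all / mean_over_nonzero


print(f"{share_with_extras(MEAN_EXTRAS_ALL, MEAN_EXTRAS_NONZERO):.1%}")  # 72.4%
print(f"{MEAN_EXTRAS_ALL / MEAN_MAIN:.2f}")  # 1.24 (extras relative to main document)
```

The implied share of about 72% is in line with the table above, whose rows add up to roughly 2.5 million documents.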
The markup validator has a lot to say about Doctypes. They are a key component in determining a successful validation. MAMA stored the information about a document's Doctype pulled from the W3C validator, but it also looked for the Doctype information separately. In MAMA's URL set, 1,788,294 of the URLs analyzed (50.96%) had a Doctype present. About 85% of MAMA's URLs would be rendered in most browsers using their "Quirks" mode.
Different versions of the HTML standard can be detected via unique strings in the Doctype statement. The leading space in most of the values below helps differentiate between HTML and XHTML versions. HTML 4 variants are about twice as popular as the next most common version, XHTML 1.0.
|Doctype substring||Frequency||Percentage (of documents with a Doctype)|
|" html 4" (HTML 4 variants)||1,122,392||62.8%|
|" xhtml 1.0"||548,307||30.7%|
|" html 3.2"||57,354||3.2%|
|" xhtml 1.1"||20,958||1.2%|
|"softquad" or "//sq//"||9,950||0.6%|
|" html 2"||7,640||0.4%|
|" html 3.0"||1,711||0.1%|
Beginning with HTML 4.0, HTML was stratified into 3 separate variants: Strict, Transitional, and Frameset. A portion of the Doctype statement directly reflects these variants and we can easily discern the "flavors" of HTML by searching for the substrings. The Transitional configuration is more than 10 times as likely as the other types.
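The substring approach described above can be sketched in a few lines. This is only an illustration under stated assumptions: the version markers come from the Doctype table, but the flavor keywords ("strict", "transitional", "frameset") are assumed, since the exact substrings MAMA searched for are not listed here, and the function name is hypothetical.

```python
# Version markers from the Doctype table, matched against the lowercased
# Doctype. The leading space keeps " html 4" from matching inside "xhtml".
VERSION_MARKERS = [
    (" xhtml 1.0", "XHTML 1.0"),
    (" xhtml 1.1", "XHTML 1.1"),
    (" html 4", "HTML 4"),
    (" html 3.2", "HTML 3.2"),
    (" html 3.0", "HTML 3.0"),
    (" html 2", "HTML 2"),
    ("softquad", "SoftQuad"),
    ("//sq//", "SoftQuad"),
]

# Assumed flavor keywords; not taken from the article.
FLAVORS = ("strict", "transitional", "frameset")


def classify_doctype(doctype):
    """Return (version, flavor) for a Doctype string; either may be None."""
    lowered = doctype.lower()
    version = next(
        (label for marker, label in VERSION_MARKERS if marker in lowered), None
    )
    flavor = next((name for name in FLAVORS if name in lowered), None)
    return version, flavor
```

Because the whole Doctype string is searched, a flavor keyword that appears only in the DTD URL (for example "strict.dtd") is also picked up. Whether MAMA counted such cases the same way is not stated.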
For more about doctypes, read the doctypes section of the Basic structure article.
Miscellaneous document structure matters
MAMA calculated a "Tag Ratio" for each document. This was the total length of the content within all tags divided by the overall page length. A Tag Ratio of 0 would be all plain text, while a Tag Ratio of 100.0 would be completely tags, without even having linefeeds or spaces between the tags. The average document had a Tag Ratio of 61.64%, so almost 2/3 of each document is tags.
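To make the Tag Ratio definition concrete, here is a minimal sketch of the calculation. It assumes that the angle brackets count as part of each tag (which is what makes a tags-only document score 100.0) and uses a naive regular expression rather than a real parser, so a stray "<" in script or comment content would be miscounted. The function name is hypothetical and this is not MAMA's implementation.

```python
import re

# Naive tag matcher: every "<...>" run is treated as one tag.
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratio(markup):
    """Percentage of the document's characters that sit inside tags."""
    if not markup:
        return 0.0
    tag_chars = sum(len(match.group(0)) for match in TAG_RE.finditer(markup))
    return 100.0 * tag_chars / len(markup)
```

For the nine-character document `<b>hi</b>`, seven characters belong to tags, which gives a Tag Ratio of about 77.8.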
MAMA also kept its eye out for character entities, a portable way to express characters that are not in the page's specified character set. All Unicode characters can be given as a numeric entity, while many of these can also be expressed via special name codes. In almost every case where a named entity counterpart exists for a character, the named version is at least as popular as the numeric version, and often much more so.
ex: &nbsp; (used 2,537,947 times); its numeric counterpart &#160; was found 41,390 times
The character entity used most often was the "non-breaking space", found in 72.3% of all MAMA's documents.
|Character||Symbol||Entity name||Frequency|
|Non-breaking space||(blank)||nbsp||2,537,947|
|Ampersand||&||amp||1,256,005|
|Copyright Sign||©||copy||776,051|
|Quotation Mark||"||quot||520,902|
|Greater-Than Sign||>||gt||276,149|
|Small 'u' with Diaeresis||ü||uuml||226,695|
|Small 'e' with Acute||é||eacute||207,322|
|Small 'a' with Diaeresis||ä||auml||204,855|
|Small 'o' with Diaeresis||ö||ouml||184,313|
|Right Double-Angle Quotation Mark||»||raquo||123,207|
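The named-versus-numeric comparison discussed above can be reproduced on a single document with a small counter. The sketch below is an assumption about how such a tally might be done, not MAMA's method: the function name and the regular expressions are hypothetical, and numeric references are grouped by the character they decode to so that decimal and hexadecimal forms are counted together.

```python
import html
import re
from collections import Counter

NAMED_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
NUMERIC_RE = re.compile(r"&#(?:[0-9]+|[xX][0-9A-Fa-f]+);")


def entity_usage(markup):
    """Tally named entities by name and numeric entities by decoded character."""
    named = Counter(NAMED_RE.findall(markup))
    numeric = Counter(html.unescape(ref) for ref in NUMERIC_RE.findall(markup))
    return named, numeric
```

Since `html.unescape("&nbsp;")` and `html.unescape("&#160;")` both return the same character (U+00A0), the two tallies can be lined up character by character to see which form an author preferred.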
Popular markup elements and attributes
For authors who have spent any time at all writing HTML documents, there will be no real surprises about which elements are the most popular. The 10 most popular markup elements can be divided into three basic categories:
- Basic document-structure elements
- Tables
- Hyperlinks and images
The list of the most popular attributes in MAMA comes primarily from only 4 of the previously covered top 10 elements. Attributes for these four elements are the most popular. It is interesting that none of the top structural elements, BODY included, places any attributes in the top 10 attribute beauty pageant.
This overview of the basics of markup seems to be too thin to allow us to come to any real "conclusions" on the topic just yet. We are just getting started. The information here on document encodings and length, doctypes, tag ratios, and character entities glosses over many of the details found in the deeper writeup. Our final, simplistic mention of popular markup elements and attributes merely sets the stage for what will follow—in the coming weeks, we will devote considerably more attention to the details of elements and attributes in common use.
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.