- XML processing instructions (PIs)
- The XML prolog and its encoding
- XML Namespaces
- Other XML-related attributes
MAMA tracked a number of XML-related details in order to get a better sense
of how XML is used on the Web. We have already seen evidence of XML in some
of the other sections of this write-up. The Content-Type
HTTP header revealed just over 1,000 documents using XML-related MIME types,
and just under 1,000 of the URLs analyzed ended in ".xml" or
".xhtm". Certain conditions in scripting also mark a document with the
"stamp" of XML. In this section, we will look at some additional factors which
contribute to the evidence of XML usage in Web documents: XML processing
instructions (PIs), the XML prolog (a type of PI), and XML namespaces. We can
even consider the presence of an XHTML Doctype as another tip that XML is in use.
Combining these factors together, over 700,000 URLs (~20% of all URLs analyzed)
exhibited evidence of trying to be XML in one form or another.
Adding in other details that MAMA did not directly analyze, the true number is
likely even higher. XML syntax definitely deserves some scrutiny
in MAMA's research.
XML processing instructions (PIs)
The number of URLs reporting PIs was 104,413. This is slightly lower than the number of XML prologs reported below (104,722). In the MAMA code this should not be possible, because XML prolog detection is a child condition of the XML PI case. The difference exposes a small bug in PI detection when frames are used; it appears to affect ~2-4% of the PI URLs and will be fixed in the next version.
The XML PI quantity full frequency table shows that some documents have significant numbers of PIs. Investigating such cases exposes some practices that would be considered sloppy; some authors put multiple XML prologs in a single document, while others misplace PI-looking constructs from pre-processing languages.
Ex: 33 XML prologs in a document: http://www.711.ru/
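As a rough illustration of the kind of counting described above, the following Python sketch tallies PI-like constructs and XML prologs in a document's source. The regular expression and the function name count_pis are assumptions made for this example; MAMA's actual detection code is not shown in this write-up and may differ.

```python
import re

# Assumption: a PI is anything of the form <?target ...?>. This also
# catches misplaced pre-processing constructs such as <?php ... ?>.
PI_PATTERN = re.compile(r"<\?([A-Za-z_][\w.\-]*)(.*?)\?>", re.DOTALL)


def count_pis(source):
    """Return (total PI count, XML prolog count) for a markup string."""
    targets = [m.group(1) for m in PI_PATTERN.finditer(source)]
    # The target of an XML prolog is exactly "xml".
    prologs = sum(1 for target in targets if target == "xml")
    return len(targets), prologs


# Hypothetical document with a repeated prolog in the body.
sample = (
    '<?xml version="1.0"?>\n'
    '<?xml-stylesheet type="text/css" href="a.css"?>\n'
    '<html><body><?xml version="1.0"?></body></html>'
)
total, prologs = count_pis(sample)
print(total, prologs)
```

A document with more than one prolog, like the sample here, is the sort of case the frequency table brings to light.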
CSS stylesheets in XML
In all, 569 CSS stylesheet PIs were detected. MAMA used the following approach to judge a positive match:
- The PI begins with the string "xml-stylesheet".
- The PI has a "Type" attribute value of "text/css".
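The two conditions above can be sketched in Python as follows. This is only an approximation under stated assumptions: the regular expressions, the function name css_stylesheet_pis, and the case-insensitive comparison of the type value are all choices made for this example, not a description of MAMA's implementation.

```python
import re

# Condition 1: the PI begins with the string "xml-stylesheet".
STYLESHEET_PI = re.compile(r"<\?xml-stylesheet(.*?)\?>", re.DOTALL)
# Pseudo-attributes of the form name="value" or name='value'.
PSEUDO_ATTR = re.compile(r"""([\w:.\-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def css_stylesheet_pis(source):
    """Return the pseudo-attributes of each PI that links a CSS stylesheet."""
    matches = []
    for pi in STYLESHEET_PI.finditer(source):
        attrs = {}
        for name, double_quoted, single_quoted in PSEUDO_ATTR.findall(pi.group(1)):
            attrs[name.lower()] = double_quoted or single_quoted
        # Condition 2: the "type" value is "text/css" (compared
        # case-insensitively here, which is an assumption).
        if attrs.get("type", "").strip().lower() == "text/css":
            matches.append(attrs)
    return matches


# Hypothetical input: one CSS stylesheet PI and one XSLT stylesheet PI.
sample = (
    '<?xml-stylesheet type="text/css" href="main.css"?>\n'
    '<?xml-stylesheet type="text/xsl" href="page.xsl"?>\n'
)
found = css_stylesheet_pis(sample)
print(found)
```

Only the first PI in the sample satisfies both conditions; the XSLT one passes the first test but fails the second.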
The XML prolog and its encoding
An XML prolog is a type of processing instruction and is an optional component of an XML document. MAMA found 104,722 URLs with XML prologs amongst its URLs. The following is a typical XML prolog:
<?xml version="1.0"?>
The XML prolog can also have an optional encoding
attribute, which specifies the character set used in the document. Use of the
encoding attribute in the prolog is very popular: of all URLs that actually
use an XML prolog, 96,264 (about 92%) specify the document's encoding in this
manner. The "iso-8859-1" value is twice as popular as any other encoding.
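To make the encoding tally concrete, here is a minimal Python sketch that pulls the encoding value out of a document's prolog and counts the values across a handful of documents. The function name prolog_encoding, the patterns, and the sample documents are hypothetical; lowercasing the value before counting is also an assumption.

```python
import re
from collections import Counter

# Requires whitespace after "xml", so "<?xml-stylesheet ...?>" is not matched.
PROLOG = re.compile(r"<\?xml\s+([^?]*)\?>")
ENCODING = re.compile(r"""encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._\-]*)["']""")


def prolog_encoding(source):
    """Return the lowercased encoding from the first XML prolog, or None."""
    prolog = PROLOG.search(source)
    if prolog is None:
        return None
    encoding = ENCODING.search(prolog.group(1))
    return encoding.group(1).lower() if encoding else None


# Hypothetical documents: two with an encoding, one prolog without, one with no prolog.
docs = [
    '<?xml version="1.0" encoding="ISO-8859-1"?><html/>',
    '<?xml version="1.0" encoding="utf-8"?><html/>',
    '<?xml version="1.0"?><html/>',
    '<html></html>',
]
tally = Counter(prolog_encoding(doc) for doc in docs)
print(tally)
```

Note that this sketch does not distinguish a prolog without an encoding from a document with no prolog at all; both are counted under None.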
XML Namespaces
Although detecting the XML prolog gives some idea of how XML is used on the Web, the prolog is not a required item in an XML document. MAMA also looked for XML namespace URIs used in documents, and the number was much higher than for the XML prolog: XML namespaces were found in 656,808 URLs (18.72% of all URLs analyzed). The XHTML namespace is the prevalent value here, but another trend is easy to spot: Microsoft-related namespaces are very prominent, accounting for twenty-two of the top 100 namespaces. Conversely, some interesting XML-related technologies had very low representation in the URLs that MAMA analyzed; XLink was detected 152 times, but XML Events, XHTML2, XForms and XSLT had only 1-2 dozen cases each. Server-side XSLT (a separate implementation vector from this evidence of client-side XSLT) likely has higher usage rates than MAMA's XSLT numbers indicate; evidence from the "Server" and "X-Powered-By" HTTP header fields supports this view.
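One plausible way to gather namespace URIs from markup is to look for xmlns declarations, as in the Python sketch below. The pattern and the function name namespace_uris are assumptions for illustration, and the sample start tag is invented; it is not taken from MAMA's data.

```python
import re

# Matches both default (xmlns="...") and prefixed (xmlns:v="...") declarations.
XMLNS = re.compile(r"""\bxmlns(?::([\w.\-]+))?\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def namespace_uris(source):
    """Return the set of namespace URIs declared anywhere in the markup."""
    return {
        double_quoted or single_quoted
        for _prefix, double_quoted, single_quoted in XMLNS.findall(source)
    }


# Illustrative start tag mixing the XHTML namespace with two
# Microsoft Office namespaces.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:v="urn:schemas-microsoft-com:vml" '
    'xmlns:o="urn:schemas-microsoft-com:office:office">'
    '<body/></html>'
)
uris = namespace_uris(sample)
microsoft = sorted(uri for uri in uris if "microsoft" in uri)
print(len(uris), microsoft)
```

Counting each URI once per document, as the set does here, corresponds to the per-URL figures quoted above rather than to raw occurrence counts.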
Other XML-related attributes
The xml:space attribute is used to signal that contained spacing is important
and should be preserved. It can be applied to any element but, in practice, it
is relevant to only a few. In the URLs that MAMA analyzed, this holds true: the
attribute was used 520 times in the SCRIPT element and 140 times in the
STYLE element, but was not detected with any other elements.
The xml:lang attribute is used to define the natural language of an element's
contents. It takes as a value an RFC 3066
language code. As you can see from the table below (Fig 5-1), the most popular
place to use this attribute is the
HTML element; it
dwarfs all other usages by a factor of almost 100.
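A per-element breakdown like the ones given for xml:space and xml:lang could be produced along the following lines. This Python sketch is deliberately rough: the start-tag pattern would be confused by a ">" inside an attribute value, and the function name xml_attr_usage and the sample markup are hypothetical.

```python
import re
from collections import Counter

# Start tags only: the character after "<" must be a letter, so closing
# tags, comments and PIs are skipped.
START_TAG = re.compile(r"<([A-Za-z][\w:.\-]*)([^<>]*)>")


def xml_attr_usage(source, attr):
    """Count, per element name, the start tags that carry the given attribute."""
    # The lookbehind stops "xml:lang" from matching inside a longer name.
    attr_pattern = re.compile(r"(?<![\w:.\-])" + re.escape(attr) + r"\s*=")
    counts = Counter()
    for name, attrs in START_TAG.findall(source):
        if attr_pattern.search(attrs):
            counts[name.lower()] += 1
    return counts


# Hypothetical document using both attributes.
sample = (
    '<html xml:lang="en" lang="en"><head>'
    '<script xml:space="preserve">var a = 1;</script>'
    '<style xml:space="preserve">p { margin: 0 }</style>'
    '</head><body><p xml:lang="fr">Bonjour</p></body></html>'
)
lang_usage = xml_attr_usage(sample, "xml:lang")
space_usage = xml_attr_usage(sample, "xml:space")
print(lang_usage, space_usage)
```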
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.