This week's overview on MAMA's scripting tokenization analysis can only be called "brief" compared to the following detailed articles that it summarizes:
Tokenization: Problems encountered
MAMA has done well with this initial tokenizer design, but it is a first-generation
effort that could be improved. For instance, a major bug was discovered after the
analysis phase—a database field created to store token identifier chains never
stored anything at all. This field would have allowed better correlation for keywords
with ambiguous meaning. For example, the open method is used by multiple
objects, including Window and XMLHttpRequest; the lost MAMA field would have
allowed a greater degree of clarity with multiple uses of the same keyword.
The next major MAMA crawl will definitely address this gap and will go even
further in its examination of scripting.
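To make the idea of a "token identifier chain" concrete, here is a minimal sketch of how dotted chains could be collected and grouped by their final keyword. This is not MAMA's actual tokenizer; the function names `identifierChains` and `ownersByKeyword` and the sample script are hypothetical, and the regular expression is a simplification that does not handle strings, comments, or bracket access.

```javascript
// Hypothetical sketch, not MAMA's tokenizer: collect dotted identifier
// chains such as "window.open" from a piece of script source.
// Limitation: the pattern also matches inside strings and comments.
function identifierChains(source) {
  var pattern = /[A-Za-z_$][\w$]*(?:\s*\.\s*[A-Za-z_$][\w$]*)+/g;
  var matches = source.match(pattern) || [];
  return matches.map(function (chain) {
    // Normalize "window . open" to "window.open".
    return chain.replace(/\s+/g, "");
  });
}

// Group chains by their final keyword, recording what precedes it.
function ownersByKeyword(chains) {
  var owners = Object.create(null); // no inherited keys such as "toString"
  chains.forEach(function (chain) {
    var parts = chain.split(".");
    var keyword = parts[parts.length - 1];
    var owner = parts.slice(0, -1).join(".");
    if (!owners[keyword]) {
      owners[keyword] = [];
    }
    owners[keyword].push(owner);
  });
  return owners;
}

var sample =
  'window.open("popup"); var xhr = new XMLHttpRequest(); xhr.open("GET", "/data");';
var chains = identifierChains(sample);
var owners = ownersByKeyword(chains);
```

With the chain stored, the two uses of open can at least be separated by what precedes them. A variable name such as `xhr` still does not prove the object's type, so a chain narrows the ambiguity rather than removing it.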
Some other basic issues were noticed during this process:
- The fewer characters in a keyword, the less likely it is to be used for a single
specific purpose. This is especially true for common or ambiguous words like
- Mixed-case keywords stand a higher chance of representing unique usage,
especially camel-case property and method keywords, such as
- The keyword function is used the most of all keywords, with 87.2% of all scripting cases. In all, 84.7% of pages that used function had at least one case of the
- function is more popular than var, and in turn var is used at higher rates than
- Conditional constructs (if) are favored over looping (for), and the alternate looping mechanism while is less popular than the primary
- The else keyword for conditional code flow was used 80% as often as the companion keyword
- break is favored over
- Boolean values
false are used a similar number of times and are used together 1,314,911 times (91.2% of false cases and 85.6% of
- catch syntax is used a similar number of times. The fallback condition finally is only used in 5.5% of the cases using
The Array object
This object's property and method keywords were detected in 1,835,275 URLs.
The Array object is mostly concerned with manipulations of array structures,
but it has one main informational property, length,
which reports the number of items in the array. This property is used
far more often than any of the Array-specific methods, but that is
to be expected, because it is also used by the String object to
count the number of characters in a string. The way MAMA is currently set up,
it cannot distinguish between these two uses. Of the standard array operations,
push is much more popular than
shift is much more popular than
shift has only marginally higher representation than
This is not an object class, but it covers references to all the major predefined objects, including the Error object types. MAMA detected the use of these global object references in 1,817,657 URLs (almost 70% of all URLs using scripting). The Array object is referenced most, used in over 55% of all URLs using script. The Date object is explicitly used in over 42% of scripting cases. Each of the other global objects was mentioned in 15% or less of scripting cases.
The String object
The String object is used to manipulate groups of individual characters. MAMA
discovered 1,982,954 URLs using String object-specific keywords, but that
length property, which is also used as
a property by the Array object. This name collision issue also happens with the
which are used by both the String and Location objects. The way MAMA is currently
set up, it cannot distinguish between multiple uses of a keyword. Judging by
the relative popularity of other properties and methods between the two objects,
it is likely that the majority of the 1,825,953 uses of the
property are in a String object context.
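The collision is easy to see in a small example. The sketch below (the function name `describeLength` is hypothetical) reads the same length token from two different kinds of value. Only a runtime type check tells them apart, which is information a source-level keyword count does not have.

```javascript
// Illustration only: the same `length` token, two different owners.
function describeLength(value) {
  if (typeof value === "string") {
    // Counts the characters in a string.
    return "String length: " + value.length;
  }
  if (Array.isArray(value)) {
    // Counts the items in an array.
    return "Array length: " + value.length;
  }
  return "other";
}

var fromString = describeLength("MAMA");
var fromArray = describeLength(["a", "b", "c"]);
```

In both branches the scanned source contains nothing but `value.length`, so a tokenizer that records only the keyword counts the two cases together.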
Most of the String object methods have high usage rates;
all used in more than half of the URLs that use scripting. There are some
clear trends though—
MUCH more popular than
substring is MUCH more popular
MAMA found scripting in use in 2,617,305 URLs. This section is devoted to the results uncovered for 294 DOM-related keywords in 15 categories encompassing the largest objects and conceptual areas of the DOM. The DOM is a very large domain to cover, so in the end I limited the initial set to objects and keywords that I thought might be the most popular or interesting.
The Document object
Forty keywords were associated with the Document object for MAMA, with 2,353,632
of the URLs having at least one of the keyword snippets (89.9% of all URLs
using script). The parent substring
the highest popularity here, and it actually has the highest occurrence of
ANY tokenized keyword detected by MAMA (in 89.6% of all
script). This could be persuasive evidence in demonstrating that dynamically
keywords are understandably quite popular, being the basic historic methods for
addressing and dynamically creating parts of a document; each was found in over
50% of all script cases. The W3C DOM method of addressing content,
document.getElementById, is more popular than document.all by a comfortable
margin. The getElementById method is almost twice as popular as
getElementsByTagName, and both trounce getElementsByName by a wide margin.
The write method is clearly preferred over writeln, 4.5 to 1.
Other keywords from the Document object can tell us a lot about many aspects of usage in
Web-page authoring. Testing for the layers keyword is actually
the most common way to detect (browser sniff) Netscape Navigator 4,
which explains why the use of this keyword in script is so
large compared to the
element (the script keyword is used over 34 times as much)! The
keyword can give a good measure of how often client-side cookies are used by
script (22.4% of all Web pages). This is probably a much better measure than
the Navigator object's
cookieEnabled property reflecting
only 45,411 cases. The
images keyword here is just one
useful factor in determining whether scripting is dynamically controlling images;
top keywords from the token remainders list also suggest Image usage (
height). These could also be leveraged to discover
scripts that are manipulating Images. Direct use of the
in markup were detected in 1,068,842 cases, while the DOM level 0
keyword was detected 665,305 times. However, these factors occurred together
only 293,048 times. What this disparity might suggest about form control via
script is not really clear; perhaps, in a significant number of cases, form
widgets are generated dynamically.
The Element object
The keywords collected under the Element object umbrella were found in 1,336,464
URLs from MAMA. The MSIE shorthand
innerHTML, which is
used to read and dynamically write content in a document, is very popular. If we
or any of the Node object's methods for accessing and writing child nodes, it
appears that it may actually be less popular these days than equivalent W3C DOM
methods. Writing attributes with the
method appears to be a more frequent authoring task than merely reading it with
currentStyle keyword (used 111,964 times) comes
from IE and is only slightly more widespread than the W3C DOM version
window.getComputedStyle (used 99,815 times). These
two methods of accessing a browser's CSS interpretation share usage in a large
majority of the cases (92,505 times), indicating an author preference to get
the job done using any and all methods at their disposal.
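The "any and all methods" pattern typically looks something like the following sketch. The function name `getStyleValue` is hypothetical, and the window object is passed in as a parameter (an assumption made here so the function can also be exercised outside a browser).

```javascript
// Sketch: read a computed style value using whichever API is available.
// `win` is normally the browser's window object.
function getStyleValue(win, element, property) {
  if (win && typeof win.getComputedStyle === "function") {
    // W3C DOM route.
    return win.getComputedStyle(element, null)[property];
  }
  if (element.currentStyle) {
    // IE route.
    return element.currentStyle[property];
  }
  return undefined;
}

// In a browser: getStyleValue(window, document.body, "color");
```

A script written this way contains both keywords regardless of which branch actually runs, which would be consistent with the high overlap between the two counts.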
The offset/scroll methods originated by IE show clear trends.
are more popular than either
offsetWidth. Similarly, Top and Left are both more
popular than Height and Width for the "scroll" methods. The Top and Height
properties are always more popular than the Left and Width properties for
both the offset and scroll method groups. In cases where the Left and Width
component methods are used, the overwhelming majority (more than 90% each)
are used in conjunction with the more dominant Top/Height methods.
The Node object
The appendChild keyword was especially popular in
this group. Authors apparently like to dynamically add content to documents—what a surprise! It was detected in 713,711 of MAMA's URLs—more than twice
as often as the next-nearest Node object keyword. This number may seem unusually
high compared to its other keyword siblings, but not if we look outside the Node
object for a correlation. The related DOM method
is a likely companion to
appendChild, and it was
seen 731,116 times.
Some other relative comparisons can also be interesting;
is four times as popular as
removeChild is MUCH more popular
is approximately three times as popular as
nextSibling is more than three times as popular as
nodeName are used a similar number of times and
are used in combination ~2/3 of the time (93,546 cases).
The Window object
This object represents a browser window or sub-frame. Of all the keywords in
window was obviously going to be the
most popular. There are a number of intriguing comparisons to be made between
the various keyword couplings.
of the Window object. Of these,
alert is used most—17.8% of URLs using script utilize it in some fashion;
prompt are only found in 4.1% and 1.2% of scripted
setTimeout is almost twice as popular as
clearTimeout is almost NEVER
setTimeout (found together in 490,124
setInterval is significantly more
popular than its companion
clearInterval use is almost always paired with
setInterval (detected in unison 311,890 times).
Some of the keywords in this group are generic in nature and can be used across
multiple objects. The keywords
were placed here, but also apply to other objects (like Input and Link). The
open keyword definitely applies as the Window object method, but as a concept
open is very generic and there may be some name collision (such as another
official use as a separate method of the XMLHttpRequest object).
XML related objects, properties, and methods
Not all of these keywords are dedicated solely to XML processing. The keyword
with the highest detected frequency here was
which is MSIE's generic system for using ActiveX controls in Web pages. How do
we filter out non-XML related usages of
Firstly, authors wanting to use XMLHttpRequest these days will typically allow for
both types of objects. These two keywords are used together
in 105,013 cases (93.5% of
Another notable pairing is the incidence of the
keyword, which also tracks very close to use of
(94.9%). The readyState is a vital part of XMLHTTP processing, so tracking
its numbers can also expose MSIE-only uses of XMLHTTP. The keywords
onreadystatechange were used together 104,763 times.
The remainder of the readyState cases (in 45,329 URLs)
will likely be MSIE-centric syntax.
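As a rough illustration of why these keywords travel together, here is a sketch of the kind of cross-browser request setup authors commonly write. It assumes the two object types are the native XMLHttpRequest constructor and MSIE's ActiveXObject. The function names `createRequest` and `whenDone` and the `env` parameter (standing in for window) are hypothetical.

```javascript
// Sketch: create a request object, allowing for both object types.
// `env` stands in for the browser's global object (normally `window`).
function createRequest(env) {
  if (typeof env.XMLHttpRequest !== "undefined") {
    return new env.XMLHttpRequest();
  }
  if (typeof env.ActiveXObject !== "undefined") {
    // "Microsoft.XMLHTTP" is a ProgID used by older MSIE versions.
    return new env.ActiveXObject("Microsoft.XMLHTTP");
  }
  return null;
}

// Attach a handler that calls back once the response is complete
// (readyState 4).
function whenDone(request, callback) {
  request.onreadystatechange = function () {
    if (request.readyState === 4) {
      callback(request);
    }
  };
  return request;
}

// In a browser: var request = whenDone(createRequest(window), handleResponse);
```

Code of this shape mentions both constructors plus readyState and onreadystatechange in one place, which fits the pairings reported above.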
Saarsoo also looked for "XMLHttpRequest" usage and only encountered it 6,125
study. By comparison, MAMA's usage rate is quite a bit higher. Considering
only the same metric (use of
was found in 4.3% of MAMA URLs that were using script.
Overall summary - MAMA phase 1
So, this brings the release of the current crop of MAMA analysis data to completion—I truly hope this data has been useful, and that it answers some of the burning questions you have about what is out there on the Web. However, this is by no means the end of what MAMA has to offer; there will be a short pause until after the new year while MAMA gathers more data. The next phase of MAMA's life will involve a full re-crawl of the URL set used in this study in order to examine how Web pages change over time. During that process, a number of brand-new search criteria will also be analyzed. New data resulting from this update will of course end up published in additional articles here on dev.opera.com as soon as they are ready.
There has been considerable interest in making MAMA's data available for general consumption, and we are definitely moving in that direction as resources allow it. Please let us know if you would like to be included in the preliminary betas of this project.
And of course, please let us know also if you have any ideas for further data mining you would like to see done, or think there is anything noticeably absent from the current data set.
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.