MAMA: Scripting report, part 1: Basic scripting syntax and features
To my knowledge, there has never been an in-depth examination of scripting factors in Web pages. Rene Saarsoo's study of Coding Practices of Web Pages was able to analyze some factors, but a bug in his analysis program prevented deeper investigation.
A number of strategies were employed by MAMA to extract information from scripting content. Substring matching was used, in addition to regular expressions and complex scripting language tokenization. We will look this week at the basics of scripting, leaving room for next week's thorough examination of MAMA's scripting tokenization. For a deeper look at the details of MAMA's scripting examination, the following MAMA article topics are also available this week:
Script inclusion methods used in Web pages
Scripting was detected in 2,617,305 of MAMA's URLs, from four different sources:
- External scripts via the SCRIPT element's src attribute
- Embedded scripts as inline content of the SCRIPT element
- Common event handler attributes (attributes beginning with the string "on")
All of these sources together form an interesting and complex backdrop on which to paint our analysis of what MAMA discovered about script usage on the Web.
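The three sources listed above can be counted per page with a small parser. The following is only a minimal sketch built on Python's standard `html.parser`, not MAMA's own implementation; the class name `ScriptSourceCounter` and the helper `count_script_sources` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class ScriptSourceCounter(HTMLParser):
    """Counts three script inclusion methods in one HTML document."""

    def __init__(self):
        super().__init__()
        self.external = 0        # <script src="...">
        self.embedded = 0        # <script> elements with non-blank inline content
        self.event_handlers = 0  # attributes whose name begins with "on"
        self._in_inline_script = False
        self._inline_has_content = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        names = [name for name, _ in attrs]
        self.event_handlers += sum(1 for name in names if name.startswith("on"))
        if tag == "script":
            if "src" in names:
                self.external += 1
            else:
                self._in_inline_script = True
                self._inline_has_content = False

    def handle_data(self, data):
        if self._in_inline_script and data.strip():
            self._inline_has_content = True

    def handle_endtag(self, tag):
        if tag == "script" and self._in_inline_script:
            if self._inline_has_content:
                self.embedded += 1
            self._in_inline_script = False


def count_script_sources(html):
    """Return per-page counts for each script inclusion method."""
    counter = ScriptSourceCounter()
    counter.feed(html)
    counter.close()
    return {
        "external": counter.external,
        "embedded": counter.embedded,
        "event_handlers": counter.event_handlers,
    }
```

Treating every attribute that begins with "on" as an event handler mirrors the simple criterion described in the list; it will also count any non-standard attribute that happens to start with those two letters.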
Quantities of script components in Web pages
Of the four possible methods to specify scripting, the most popular technique found by MAMA is the embedded script: 88.0% of the URLs using scripting used this method. External scripts and event-handler attributes were used in a similar number of cases (each appeared in roughly two-thirds of all scripting cases). The "quantity per page" values and other counters represent the number of occurrences of the specific syntax that was discovered for a URL. For example, the maximum number of external scripts encountered in any single page was 264; the maximum number of event handlers discovered was 37,658. The average "per-page" numbers listed in the table below apply only where that type of scripting was used and do not cover the total MAMA URL space.
|Script type||Description||Total URLs||% of scripting URLs||Min per page||Max per page||Avg per page|
|Embedded scripts||Inline content of the SCRIPT element||2,303,363||88.0%||1||2,010||3.6|
|Event handlers||Content of attributes beginning with "on"||1,707,594||65.2%||1||37,658||19.2|
|External scripts||Content from files referenced by the SCRIPT element's src attribute||1,651,383||63.1%||1||264||2.5|
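The percentages in the table can be reproduced from the URL counts, using the 2,617,305 URLs where scripting was detected as the denominator. A short sketch of that arithmetic follows; the function name `share_of_scripting_urls` is a hypothetical helper.

```python
# Counts taken from the table above; the denominator is the number of
# MAMA URLs in which any scripting was detected.
SCRIPTING_URLS = 2_617_305

TOTAL_URLS_BY_TYPE = {
    "Embedded scripts": 2_303_363,
    "Event handlers": 1_707_594,
    "External scripts": 1_651_383,
}


def share_of_scripting_urls(count, total=SCRIPTING_URLS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


for script_type, count in TOTAL_URLS_BY_TYPE.items():
    print(f"{script_type}: {share_of_scripting_urls(count)}%")
```

The shares add up to more than 100% because a single page can use several inclusion methods at once.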
Diagram: Script usage by type
Note: Region sizes are not to scale
Scripts dynamically creating/writing other technologies
During MAMA's development process, a number of URL examples I tested exhibited behaviors that appeared to be distressingly common. So common, in fact, that it seemed imperative for MAMA to measure just how frequently it was happening in the wild. Scripts have the ability to dynamically add markup and code to a document, and some even go so far as to dynamically create other scripts. Full script parsing and execution would be necessary to track down, detect and analyze ALL of these cases, but MAMA is not able to do that in the current version. Instead, MAMA settled for simply detecting the situations where external dependencies are dynamically written in order to gauge the relative importance of this type of behavior. MAMA's discovery that as many as 25% of the URLs using scripting matched its rudimentary "Script writing a Script" criteria definitely warrants future MAMA attention!
|Scenario||What was detected||Frequency||% Total|
|Script writing Script||Substring/Regexp, and parsed JS String tokens containing: |
|Script writing CSS||Substring/Regexp: ||95,066||3.6%|
|Script writing Frames||Substring/Regexp: ||14,840||0.6%|
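To give a feel for this kind of detection, here is a minimal sketch that looks for markup fragments inside script text. The regular expressions are assumptions made for illustration only; they are not the search criteria MAMA used, which are not reproduced in the table above. The name `detect_dynamic_writes` is likewise hypothetical.

```python
import re

# Hypothetical patterns (NOT MAMA's actual criteria): each one looks for an
# opening tag of the technology in question somewhere in the script text.
WRITE_PATTERNS = {
    "script": re.compile(r"<\s*script\b", re.IGNORECASE),
    "css": re.compile(r"<\s*(?:style\b|link\b[^>]*\bstylesheet\b)", re.IGNORECASE),
    "frames": re.compile(r"<\s*(?:i?frame|frameset)\b", re.IGNORECASE),
}


def detect_dynamic_writes(script_text):
    """Return the set of technologies that the script text appears to write."""
    return {
        kind
        for kind, pattern in WRITE_PATTERNS.items()
        if pattern.search(script_text)
    }
```

A matcher this simple misses tags that authors split across concatenated strings (for example `'<scr' + 'ipt'`), which is one reason full script parsing and execution would be needed to catch every case.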
Mentioning specific browsers in script content
This feature began as a generic question many at Opera had: "How many authors write their Web pages with Opera in mind?". Opera already had evidence that some authors make use of browser-specific workarounds, and this is especially true of scripting. For a simple answer to this question, MAMA detected the use of browser name keywords (case-insensitively). These keywords were expected to be unique enough to give a good idea of how many authors were at least thinking about specific browsers when they developed their documents. MAMA's approach searched against all scripting content, including script comments. This method does not give 100% reliable numbers, because simple keyword matching can easily produce false positives. Still, the keywords chosen were expected to reveal true browser name mentions in the majority of cases.
It turns out that the most difficult of all the browsers to detect in script is Opera, because authors generally refer to Opera with only the single "opera" keyword. This keyword can also match "operator", for example; about 25,000 of MAMA's URLs used the keywords "operator" or "operators".
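The "operator" problem described above is easy to demonstrate. The sketch below uses the keywords named in this section for a case-insensitive substring search. The word-boundary variant is an added illustration of one way to reduce false positives, not something MAMA is stated to have done, and both function names are hypothetical.

```python
import re

# Keywords as named in this section, lowercased for case-insensitive matching.
BROWSER_KEYWORDS = {
    "Microsoft Internet Explorer": ["internet explorer", "msie"],
    "Mozilla Firefox": ["mozilla", "gecko", "firefox"],
    "Opera": ["opera"],
}


def browsers_mentioned(script_text):
    """Plain case-insensitive substring matching over all script content."""
    lowered = script_text.lower()
    return {
        browser
        for browser, keywords in BROWSER_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    }


def mentions_opera_strict(script_text):
    """Assumed refinement: require 'opera' to stand alone as a whole word."""
    return re.search(r"\bopera\b", script_text, re.IGNORECASE) is not None
```

With plain substring matching, `browsers_mentioned("var operator = '+';")` reports Opera even though the script never refers to the browser, while the whole-word check rejects it and still accepts `window.opera`.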
|Browser||Keywords detected||Frequency||% Total|
|Microsoft Internet Explorer||"Internet Explorer", "MSIE"||916,306||35.0%|
|Mozilla Firefox||"Mozilla", "Gecko", "Firefox"||475,628||18.2%|
Assuming that all Web scripting is JavaScript is a reasonable expectation, but it is somewhat unrealistic. Some fraction of Web pages are definitely known to use Microsoft's IE-only VBScript. There have not been any big public studies into script usage before, so MAMA had no idea at the beginning of the study how prevalent VBScript might be. A special check was added to detect the use of this scripting language: all opening SCRIPT tags and all script content were examined for (case-insensitive) traces of the substring "vbscript". 103,485 URLs in MAMA were found satisfying this condition (4.0% of pages using scripting).
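The VBScript check described above can be approximated as follows. This is a rough sketch that uses a regular expression to pull out SCRIPT elements, which is an assumption made for brevity; MAMA's actual extraction code is not shown in this article, and `uses_vbscript` is a hypothetical name.

```python
import re

# Captures the opening SCRIPT tag and the element's inline content.
# Regex-based HTML extraction is a simplification chosen for this sketch.
SCRIPT_BLOCK = re.compile(
    r"(<script\b[^>]*>)(.*?)</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def uses_vbscript(html):
    """True if any opening SCRIPT tag or script content mentions 'vbscript'."""
    for match in SCRIPT_BLOCK.finditer(html):
        opening_tag, content = match.groups()
        if "vbscript" in opening_tag.lower() or "vbscript" in content.lower():
            return True
    return False
```

Because only SCRIPT tags and their content are examined, a page that merely mentions VBScript in its visible text is not counted.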
Script library evidence using MAMA search factors as archaeology tools
Top scripting libraries detected by function name
To see script library activity in action, we need to look at the top 75 entries in the full function name list (cutoff value chosen to demonstrate the proximity effect of libraries in the list):
- The most popular values are Macromedia-related (function names prefixed by "MM_"). The first two have similar frequencies, and the next pair have similar frequencies as well.
- Google's Urchin tracker comes next, with 29 of the top 75 spots, all with very similar frequencies (384,000-394,000 times each). The function names are prefixed with "__utm" or "_u". Not coincidentally, an external script file name "urchin.js" was found 383,870 times.
- Google's ad-syndication platform is also well represented in the function name list. The function names are all very compact, typically 1-2 letters long. The entire code for this ad-syndication script is also compacted, with no linefeeds or extra spacing. These function names are all adjacent in the frequency list, being used 160,000-185,000 times. Again, it is no coincidence that the external script file name "show_ads.js" was used 178,697 times.
- A group of image control/rollover effect functions is very popular. They all seem to be related, based on their similar naming schemes and their proximity in the frequency list.
- Adobe's "Active Content" seems to control Flash instances in Web pages. These 5 "Active Content" functions have names prefixed by "AC_" and occurred between 60,000 and 64,000 times in MAMA. A corresponding external script with the name "AC_RunActiveContent.js" was found 60,428 times and is no doubt related to these instances.
- Two adjacent entries appear to read and write browser cookies.
- In the top 75, two menu-related function names can be found, but if you go below position 75 you can find many more functions obviously relating to menus.
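The "proximity effect" used in the list above can be sketched as a grouping of function names by prefix, followed by a look at how tightly each group's frequencies cluster. The prefixes come from this section; the function names and counts in `SAMPLE` are made-up placeholder data for illustration and are not MAMA's results. Both helper names are hypothetical.

```python
from collections import defaultdict

# Prefixes named in this section.
KNOWN_PREFIXES = ["MM_", "__utm", "_u", "AC_"]

# Placeholder (name, frequency) pairs, invented for illustration only.
SAMPLE = [
    ("MM_exampleA", 500_000),
    ("MM_exampleB", 498_000),
    ("__utmExample", 390_000),
    ("_uExample", 388_000),
    ("AC_Example", 61_000),
    ("unrelatedFunction", 45_000),
]


def group_by_prefix(frequency_list):
    """Group (name, count) pairs under the first known prefix they start with."""
    groups = defaultdict(list)
    for name, count in frequency_list:
        for prefix in KNOWN_PREFIXES:
            if name.startswith(prefix):
                groups[prefix].append((name, count))
                break
    return dict(groups)


def frequency_range(entries):
    """Return (lowest, highest) frequency within one prefix group."""
    counts = [count for _, count in entries]
    return min(counts), max(counts)
```

A group whose members share a prefix and fall within a narrow frequency range is a strong hint that the functions ship together in one library file.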
This is just a small sample; a number of other unique prefixes are noticeable when glancing further down the frequency list. Adobe GoLive has many functions prefixed by "CS" (after finding 100 such unique function names, I stopped counting). Functions common to Lycos/Angelfire/Tripod scripts were well represented with the common prefixes "lhb_" (17 times), "LR_" (18 times) and "lycos_" (11 times).
...And there is more evidence
Detecting libraries was a very important task for MAMA. The external script file names and function names were the passive evidence found. MAMA also identified unique strings that would track usage of a number of specific script libraries in common use (e.g., Prototype and jQuery), tracking systems (e.g., Urchin, Omniture, and Hitbox), and DHTML menu systems (e.g., Milonic). Every effort was made to guarantee that the patterns were distinctive, but the criteria used may not be totally reliable. There can, of course, always be the occasional false positive, and future versions of these script libraries may alter some of the (currently) unique criteria that MAMA seeks. The full script syntax article details the results for all 24 of the libraries that MAMA looked for.
Note: All of the search criteria are case-sensitive regular expressions.
|DHTML menu/library name||Search criteria (regexp)||Frequency|
|Macromedia functions from Dreamweaver/Fireworks||Script: ||682,019|
|Google Analytics/Urchin Tracker||Script: |
|Omniture/SiteCatalyst Analytics||Script: |
|JQuery Library||Script: |
|Dynamic Drive HV Menu||Script: ||15,111|
|Milonic DHTML Menu||Script: ||13,585|
|WebSideStory/HitBox Analytics||Script: |
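As a rough illustration of case-sensitive regular expression matching for library detection, consider the sketch below. The patterns are assumptions invented for this example; they are not the search criteria MAMA used, which are not reproduced in the table above. The name `detect_libraries` is hypothetical.

```python
import re

# Hypothetical, case-sensitive patterns (NOT MAMA's actual criteria).
LIBRARY_PATTERNS = {
    "Macromedia functions": re.compile(r"\bMM_[A-Za-z]+\s*\("),
    "Urchin Tracker": re.compile(r"\burchinTracker\s*\("),
    "jQuery": re.compile(r"\bjQuery\b"),
}


def detect_libraries(script_text):
    """Return the sorted names of libraries whose pattern matches the text."""
    return sorted(
        name
        for name, pattern in LIBRARY_PATTERNS.items()
        if pattern.search(script_text)
    )
```

No `re.IGNORECASE` flag is passed, so a differently cased identifier such as `mm_preloadimages` does not match. Case sensitivity is one way to keep such patterns distinctive.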
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.