By Brian Wilson

MAMA: What is the Web made of?

The Web has search engines, many of them. However, they are typically concerned only with the text content of a Web page. What about a search engine for a Web page's structure? Say you want to find a sampling of Web pages that have more than 100 hyperlinks, or pages that use the font-size CSS property alongside the FONT element with a Size attribute. Many parties would be interested in such a service, even if the market would be smaller than for a "traditional" search engine. For browser makers and standards bodies, the structure and composition of the Web is a more pressing issue than its content. A content search engine numbers its market in millions, even billions, of individuals; the browser makers and standards bodies who would benefit most from a structural search engine are far fewer, but they include some of the most powerful software companies in the marketplace. Yet, there have been no obvious efforts to address this shortcoming ... until now.

Enter MAMA, the "Metadata Analysis and Mining Application". MAMA is a structural Web-page search engine: it trawls Web pages and returns results detailing their structure, including what HTML, CSS, and script each page uses, as well as whether its HTML validates. In this document, and the ones that link from it, you'll find data that has been pulled from MAMA so far. There is a lot of information here, but every effort has been made to keep it readable and interesting for the various types of people who might be interested in such data.
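To make the idea of a structural query more concrete, here is a minimal Python sketch of the two example questions asked earlier: does a page have more than 100 hyperlinks, and does it use the font-size CSS property together with a FONT element carrying a Size attribute? This is only an illustration, not MAMA's actual implementation. The names StructureProbe and probe are hypothetical, and the check for font-size is a simple pattern match over inline styles and STYLE blocks rather than a real CSS parser.

```python
import re
from html.parser import HTMLParser

# Illustrative pattern for a "font-size:" declaration; not a full CSS parser.
FONT_SIZE_RE = re.compile(r"(?<![\w-])font-size\s*:", re.IGNORECASE)


class StructureProbe(HTMLParser):
    """Hypothetical helper that records a few structural facts about one page."""

    def __init__(self):
        super().__init__()
        self.hyperlinks = 0          # A elements that carry an href attribute
        self.font_with_size = False  # saw <font size=...>
        self.css_font_size = False   # saw a font-size declaration in CSS
        self._in_style = False       # currently inside a <style> block

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.hyperlinks += 1
        if tag == "font" and "size" in attrs:
            self.font_with_size = True
        if tag == "style":
            self._in_style = True
        inline_style = attrs.get("style")
        if inline_style and FONT_SIZE_RE.search(inline_style):
            self.css_font_size = True

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        # Only text inside <style> is treated as CSS.
        if self._in_style and FONT_SIZE_RE.search(data):
            self.css_font_size = True


def probe(html):
    """Return the structural facts for one page of HTML source."""
    parser = StructureProbe()
    parser.feed(html)
    parser.close()
    return {
        "hyperlinks": parser.hyperlinks,
        "more_than_100_links": parser.hyperlinks > 100,
        "font_size_css_and_font_element": (
            parser.css_font_size and parser.font_with_size
        ),
    }


if __name__ == "__main__":
    sample = (
        "<style>p { font-size: 12px }</style>"
        '<FONT SIZE="2">old-school text</FONT>'
        + '<a href="#">link</a>' * 101
    )
    print(probe(sample))
```

A real structural search engine would run checks like these across a large set of URLs and store the results so they can be queried later; this sketch only covers a single page held in memory.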

What can you get from MAMA?

The intent has always been for MAMA to provide those developing the Opera Web browser with a tool to quickly find live examples of markup and other Web page structural components. We at Opera believe this tool can also be useful to other stakeholders in the standards and browser-making world. For example:

  • Browser manufacturers and others can use MAMA data on the popularity of widely used technologies to prioritize bugs and justify adding support for new technology to in-progress releases.
  • Standards bodies can use the data to measure the success and adoption rates of various technologies.
  • Web developers can use the same data to justify support of various technologies in their work.
  • It can provide real-world, practical samples of the Web developer's "art", for inspiration and instruction.

MAMA can definitely provide data on discrete issues such as "what is the 18th most popular element?" (SPAN), or "how popular is Flash?" (found in 33.5% of MAMA URLs). It can also dig deeper, by yielding regional and other data breakdowns. This allows us to discover that some countries like Germany show a decreased tendency for Flash (25% of pages), while other countries have much higher incidences (Chinese URLs in MAMA used Flash 67% of the time). For a quick look at some of MAMA's basic results, the Key Findings document provides a summary.

There will inevitably be those who are looking for simple, blanket answers to complex questions, such as what the prevalence of "Web 2.0" is, or how many sites are "mobile-ready". Questions like these require answers that consider numerous facets and issues. Only a tool like MAMA is up to this challenge. MAMA can give answers to many components of such questions. As its net of features and detections is cast ever wider, the granularity of the answers will only increase.

The history of MAMA

I have been working in Opera's QA department since 2002. In the projects on which I worked, there has always been a great need to justify various features and bug fixes. A tester can come up with all the isolated test scenarios they like, but it is difficult to know if such tests truly represent how those things are actually used in the real world of the live, evolving Web. Authors do the darnedest things. Real-world Web pages are often convoluted, complex, and messy, not insular or sanitary. This is the true environment in which a Web browser must survive, not the simple conditions of a test case.

When the seeds of this project were sown in early 2004, there was little in the way of effective data about the state of the Web. MAMA was designed to bridge the divide between the test case and the actual Web, to provide examples of real-world sites that use existing and emerging technologies. Browser makers and standards bodies cannot control the authoring population—on the Web, you merely have to point to a single concrete usage of code, and that is sometimes enough to win an argument about usage; the question then switches from "who in their right mind would do that??" to a simple statement of reality: "authors ACTUALLY do that—now, how do we react?"

Because of my past experience with creating my own comprehensive HTML and CSS resource, I personally had many interesting questions to ask about the Web, but I knew that there were many other stakeholders at Opera who would have additional questions and insight that would be useful to consider. I talked with a variety of my co-workers at Opera about what they wanted to know about the Web: program managers, core developers, QA, those involved with writing specs at the W3C, marketing people, and anyone else I could think of who might provide unique perspectives and questions. Almost every person involved with creating Opera had different questions about "what is out there" on the Web; thus began the genesis of MAMA.

MAMA's full results

Previous studies into Web-page structure have focused primarily on overall statistics. MAMA's nature as a structural search engine automatically provides that sort of data and much, much more. MAMA's extensive results should provide easy comparison to previous Web page studies as well as a broad baseline for any future studies. We will start by providing analysis of a number of Web page topics mined from the MAMA data. For a quick look at some of MAMA's basic results, be sure to look at the Key findings page.

What follows is not as informal and witty as a blog nor as dry and formal as a research paper; it lies somewhere in between. Those expecting rigorous academia will forgive the occasional humorous turns of phrase or moments where personal observations and experience intrude. I try to limit them to places where they seem useful or interesting. For blog junkies, this will grow into a long, multi-part saga (hopefully) worthy of a company from Scandinavia. Go get some coffee and buckle up: it's about to get bumpy!

MAMA: Table of contents

Here is what MAMA found:

  1. Key findings
  2. What does an "average" Web page look like?
  3. HTTP Headers (Summary report available: HTTP headers)
  4. Markup validation (Summary report available: Validation summary)
  5. Markup (Summary reports available: Markup basics | Primary functional and structural markup | body markup | Forms, tables and plugins)
  6. CSS (Summary report available: CSS)
  7. Script (Summary reports available: Scripting syntax and features | JavaScript and DOM tokenization)
  8. Appendices

This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.

