MAMA: Code comments
- Comment quantities
- Overall comment sizes
In another section of this research, we looked at the sizes of documents and their components. Those sizes are affected by any SGML/XML comments present in the content. Browsers do not care about comments, and neither do users or readers of Web pages, so why should MAMA? To MAMA, comments are just "extraneous fluff" that only serves to bulk up the overall size of a document and increase its loading time. However, comments deserve at least a cursory overview in our discussion to expose just how much "extraneous fluff" is out there in the wild.
Where might large quantities of comments be encountered? Longer documents and auto-generated content are definitely the leading candidates. We will look at how many comments documents have as well as how much data is contained in those comments.
Comment quantities
MAMA used an unsigned MySQL SMALLINT column (maximum value: 65,535) to store the number of comments on a page. Surprisingly, in the most extreme case this was not big enough. Only one URL in MAMA's URL set exceeded this limit, but automated page generation could plausibly recreate these conditions with regularity. The lone URL with the copious comments was http://genforum.genealogy.com/ny/all.html. MAMA stored its maximum value of 65,535, but a live analysis showed 146,376 comments in a 9.2MB HTML file!
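To make the storage limit concrete, here is a minimal Python sketch of counting comments on a page and capping the result at the unsigned SMALLINT maximum. It is not MAMA's actual code: the names `CommentCounter`, `comment_stats`, and `clamp_to_smallint` are hypothetical, and measuring comment size as the characters between `<!--` and `-->` is an assumption.

```python
from html.parser import HTMLParser

# Upper bound of an unsigned MySQL SMALLINT column.
SMALLINT_UNSIGNED_MAX = 65_535


class CommentCounter(HTMLParser):
    """Counts comments and the characters inside them (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.total_chars = 0

    def handle_comment(self, data):
        # `data` is the text between "<!--" and "-->" (an assumed size measure).
        self.count += 1
        self.total_chars += len(data)


def comment_stats(html):
    """Return (number of comments, total comment characters) for a page."""
    parser = CommentCounter()
    parser.feed(html)
    parser.close()
    return parser.count, parser.total_chars


def clamp_to_smallint(value):
    """Cap a count at the column maximum, as the stored value was capped."""
    return min(value, SMALLINT_UNSIGNED_MAX)
```

With this sketch, a live count of 146,376 passed through `clamp_to_smallint` comes back as 65,535, which matches the value MAMA recorded for the outlier URL. A wider column type would avoid the cap.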
In all, 2,840,900 URLs contained at least 1 comment, which is 80.96% of the total URLs analyzed. When comments are present in a document, the average quantity is 12.4. The most common number of comments for a page to have is 1 (448,326 URLs). The full frequency table for comment quantities decreases monotonically: each additional comment sees a corresponding drop in frequency, at least out past the 20th position.
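The figures above (share of URLs with comments, average when present, most common quantity, and the monotonic drop-off) can be derived from a list of per-URL comment counts. The following Python sketch shows one way to do so; the function names and the input format (one integer per URL) are assumptions for illustration, not a description of MAMA's database queries.

```python
from collections import Counter


def summarize_comment_counts(counts):
    """Summarize per-URL comment counts (one integer per URL, assumed format)."""
    with_comments = [c for c in counts if c > 0]
    if not with_comments:
        return None
    freq = Counter(with_comments)
    most_common_count, urls = freq.most_common(1)[0]
    return {
        "share_with_comments": len(with_comments) / len(counts),
        "mean_when_present": sum(with_comments) / len(with_comments),
        "most_common_count": most_common_count,
        "urls_at_most_common": urls,
    }


def is_monotonically_decreasing(counts, up_to=20):
    """Check that the number of URLs with k comments strictly falls for k = 1..up_to."""
    freq = Counter(c for c in counts if c > 0)
    series = [freq.get(k, 0) for k in range(1, up_to + 1)]
    return all(a > b for a, b in zip(series, series[1:]))
```

Note that URLs without comments are excluded from the average, mirroring the "when comments are present" qualifier in the text.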
Overall comment sizes
In the original sample, the maximum overall comment size was 4,980,813 characters, but comment size seems to vary a lot. The URL that originally reported this size has far less comment content at the time of writing, but other URLs have been found with comment sizes as big as, or even far exceeding, MAMA's recorded maximum. In the original URL sampling, the average comment size, when comments are present, was 2,252 characters. In any case, a number of URLs have comments that are several megabytes in size. One big contributor to this problem of burgeoning comments is the "Conditional Comments" syntax. Hands down, code produced by Microsoft's Publisher and Word products creates huge amounts of content in conditional comments, on sites such as http://www.rodontinipasquale.it/ and http://www.norman.k12.ok.us/160/.
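To gauge how much of a page's comment bulk comes from conditional comments, one could split comment sizes into two buckets. The Python sketch below is illustrative only: `ConditionalCommentSizer` and `comment_size_split` are hypothetical names, and it assumes a conditional comment is any comment whose body opens with a `[if ...]>` expression, as in `<!--[if gte mso 9]>...<![endif]-->`.

```python
import re
from html.parser import HTMLParser

# Assumed signature of a conditional comment body: "[if <expression>]>".
CONDITIONAL_OPEN = re.compile(r"^\[if\s[^\]]*\]>", re.IGNORECASE)


class ConditionalCommentSizer(HTMLParser):
    """Tallies comment characters, split by conditional vs. ordinary comments."""

    def __init__(self):
        super().__init__()
        self.conditional_chars = 0
        self.other_chars = 0

    def handle_comment(self, data):
        # `data` excludes the "<!--" and "-->" delimiters.
        if CONDITIONAL_OPEN.match(data):
            self.conditional_chars += len(data)
        else:
            self.other_chars += len(data)


def comment_size_split(html):
    """Return (conditional comment chars, other comment chars) for a page."""
    parser = ConditionalCommentSizer()
    parser.feed(html)
    parser.close()
    return parser.conditional_chars, parser.other_chars
```

On pages exported from Word or Publisher, the first number would be expected to dominate, since the markup inside the conditional block is counted as comment content.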
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.