MAMA: Character entities
- Character entity usage
- A popularity contest: Named or numeric character entities?
- Illegal code points for numeric entities
- The mystery of the word jumble
I am not aware of any past studies of character entities (either numeric or named). MAMA's study appears to be the first, although, in hindsight, more could have been done. Knowing which character entities are used is a good start, but it would also be useful to know how many character entities each document contains.
In all, 3,002,458 of the 3,509,180 URLs analyzed (85.56%) use at least one character entity. The most popular entity reference is the "Non-breaking space", found in 2,537,947 URLs (72.32% of pages overall), about twice as often as the next most popular entity, the ampersand.
|Entity name||Character||Entity||Frequency||Entity name||Character||Entity||Frequency|
|Non-breaking space|| ||nbsp||2,537,947||Small 'o' with diaeresis||ö||ouml||184,313|
|Ampersand||&||amp||1,256,005||Right double-angle quotation mark||»||raquo||123,207|
|Copyright sign||©||copy||776,051||Small 'a' with grave||à||agrave||119,984|
|Quotation mark||"||quot||520,902||Small 'e' with grave||è||egrave||104,890|
|Greater-than sign||>||gt||276,149||Less-than sign||<||lt||100,218|
|Small 'u' with diaeresis||ü||uuml||226,695||Small Sharp S||ß||szlig||94,842|
|Small 'e' with acute||é||eacute||207,322||Apostrophe||'||39||89,642|
|Small 'a' with diaeresis||ä||auml||204,855||Small 'o' with acute||ó||oacute||86,211|
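A per-document tally of the kind wished for above could be sketched as follows. This is a minimal illustration and not MAMA's actual method: the regular expression, the `count_entities` helper and the sample markup are all assumptions made for the example, and only references terminated by a semicolon are counted.

```python
import re
from collections import Counter

# Matches named (&nbsp;), decimal (&#160;) and hexadecimal (&#xA0;) references.
# Requiring the trailing semicolon is a simplification made for this sketch.
ENTITY_RE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def count_entities(markup: str) -> Counter:
    """Return how many times each entity reference occurs in the markup."""
    return Counter(match.group(1) for match in ENTITY_RE.finditer(markup))


# Hypothetical sample document, not taken from MAMA's URL set.
sample = "<p>Fish&nbsp;&amp;&nbsp;chips &copy; 2008 &#39;MAMA&#39;</p>"
counts = count_entities(sample)
total = sum(counts.values())

for name, frequency in counts.most_common():
    print(name, frequency)
print("total entity references:", total)
```

Summing the counter gives the number of entity references in one document, while its keys give the set of distinct entities used, which is the figure the table above reports across URLs.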
A popularity contest: Named or numeric character entities?
Among the most popular entities, there is a definite preference for the named version over the numeric version: in the frequency table above (Fig 2-1), a numeric entity does not appear until the 15th slot. In almost every case where a named entity counterpart exists, the named version is at least as popular as the numeric version, if not much more so.
An allowed alternate form of numeric character entity is a hexadecimal version of the entity number, like so:
Standard numeric entity: &#nnnn; (decimal digits) = Hexadecimal numeric entity: &#xhhhh; (hexadecimal digits)
This form of numeric entity was detected many times in MAMA's URLs, but its usage is sharply lower than the equivalent standard decimal representations of the same entity.
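To make the equivalence of the two forms concrete, here is a small sketch using Python's standard `html` module. The choice of the non-breaking space (code point 160) is only an illustration, and `to_hex_entity` is a hypothetical helper introduced for this example.

```python
import html

decimal_form = "&#160;"  # standard (decimal) numeric entity
hex_form = "&#xA0;"      # hexadecimal form of the same code point
named_form = "&nbsp;"    # named counterpart

# All three references decode to the same single character, U+00A0.
decoded = [html.unescape(ref) for ref in (decimal_form, hex_form, named_form)]
all_same = len(set(decoded)) == 1


def to_hex_entity(code_point: int) -> str:
    """Build the hexadecimal numeric entity for a code point (hypothetical helper)."""
    return f"&#x{code_point:X};"


print(all_same, to_hex_entity(160))
```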
Illegal code points for numeric entities
The range of code points from 127-159 is designated as "system control characters" in the ISO-8859-* and Unicode character sets and should not be used. This does not stop authors from including them as numeric character entities in the wild, though. The most popular entities in this range correspond to Windows-specific characters that are not very portable. As mentioned before, the legal named entity versions of these Windows-specific characters are quite a bit more popular, as are the legal Unicode numeric entity forms of the characters. The only slight exception is the "Bullet" character: it is slightly more popular in its illegal &#149; incarnation than either of its legal forms separately.
|Character name||Character||Illegal code||Frequency||Legal Unicode code|
|Left single quotation mark||‘||145||3,284||8216|
|Right single quotation mark||’||146||25,056||8217|
|Left double quotation mark||“||147||9,165||8220|
|Right double quotation mark||”||148||8,536||8221|
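The relationship between the illegal codes and their legal Unicode counterparts in the table above can be reproduced by treating the illegal number as a Windows-1252 byte. The sketch below assumes that interpretation, and `legal_code_point` is a hypothetical helper name rather than anything from MAMA.

```python
def legal_code_point(illegal_code: int) -> int:
    """Map a code in 128-159 to the Unicode code point of the Windows-1252 character.

    A few bytes in this range are unassigned in Windows-1252; for those,
    Python's cp1252 codec raises UnicodeDecodeError.
    """
    if not 128 <= illegal_code <= 159:
        raise ValueError("expected a code in the 128-159 range")
    return ord(bytes([illegal_code]).decode("cp1252"))


# The four quotation marks from the table, plus the bullet (149).
pairs = {code: legal_code_point(code) for code in (145, 146, 147, 148, 149)}

for illegal, legal in pairs.items():
    print(f"&#{illegal}; -> &#{legal};")
```

A clean-up tool could use such a mapping to rewrite the non-portable references into their legal numeric forms.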
The mystery of the word jumble
Upon assembling a list of the top numeric character entities, a number of seemingly unrelated, unremarkable ASCII characters stand out. The most popular numeric entity characters do not reflect the letters with the highest relative frequencies (in the English language at least). This group of characters only makes sense when they are put together: they indicate that obfuscated e-mail addresses are very popular. The following e-mail-related characters and word groupings stand out: "@", "at", ".", ":", "nospam", "email" and "com". Combined, they could make, for example:
"email: test at foo.com.nospam"
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.