MAMA: HTTP Headers
- Previous article—MAMA: The "average" Web page
- Next article—MAMA: W3C validator research
- Table of contents
- About MAMA's HTTP requests
- The HTTP response—general anatomy
- Most popular HTTP Header fields and other additional data
- HTTP protocols
- HTTP Header fields:
- Content-Type: MIME types and character sets
- The other HTTP Header fields
In the beginning, there were only a few HTTP Headers with which MAMA was concerned
(such as the
fields). As feature requests for HTTP Header data accumulated over time, managing
the results became more difficult. MAMA generally was not as concerned with what
the HTTP Headers contained as with what the rest of the HTTP response had to say.
At a certain point, the various checks that MAMA was doing on the HTTP Headers
became too numerous, so the decision was made to store the entire
HTTP Header in the database. This way, any new requests could be quickly completed
locally without having to do an entirely fresh re-crawl of the entire MAMA URL set.
For this study, we will first look at the general shape and composition of the HTTP Headers MAMA encountered before looking at some of the results found for select individual HTTP Headers. Saarsoo's study is the only comparable large-scale study of HTTP Headers of which I am aware, and MAMA's discoveries will be compared with Saarsoo's data where possible.
About MAMA's HTTP Requests
An HTTP response is often heavily dependent on the original HTTP request. It
is important to look at what MAMA is sending as its HTTP request before
looking at the responses received. An original goal for MAMA was to mimic, as
accurately as possible, what an Opera Web browser would encounter when surfing
the net; this likely led to a coloring of some of the data returned. Servers
can and do discriminate on the basis of User-Agent or other HTTP Header fields.
The HTTP request headers used in this study are shown in figure 2-1 below. The
biggest difference between Opera's HTTP request headers and MAMA's lies in the
Accept-Encoding value. Opera can handle gzip,
deflate and other encodings. This functionality was not added to MAMA in order
to limit the coding and analysis effort MAMA needed to do for each URL.
The Accept-Charset values chosen reflect the author's
own particular language bias.
|Header name||Header value|
|Accept||"text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1"|
|Accept-Charset||"windows-1252, utf-8, utf-16, iso-8859-1;q=0.6, *;q=0.1"|
|User-Agent||"Opera/9.10 (Windows NT 5.1; U; en)"|
The HTTP Response - general anatomy
HTTP/1.1 (RFC 2616) goes into great detail about the allowed configurations that an HTTP response can take on. In this section, only the HTTP Header block that comes before the main HTTP response body will be covered. In general terms, the header block consists of a "status line" followed by any number of newline-separated header field/value pairs. The status line contains important basic information about the entire HTTP transaction and has the following format:
[Protocol]/[HTTP Version] [HTTP Status Code] [HTTP Status reason text]
The format of the Header name/value pairs that follow is generally:
[Field Name (case-insensitive)]: [Field Value]
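As a rough illustration of the two formats above, the following Python sketch splits a header block into its status line and its field list. The function name `parse_header_block` and the sample response are assumptions made for this example; MAMA itself relied on Perl's LWP module, and this sketch ignores obsolete folded (multi-line) field values.

```python
def parse_header_block(block):
    """Split a raw HTTP header block into (status, fields).

    Hypothetical helper for illustration only; not MAMA's code.
    """
    # Normalize line endings and drop empty lines.
    lines = [ln for ln in block.replace("\r\n", "\n").split("\n") if ln]

    # Status line: [Protocol]/[Version] [Status Code] [Reason text]
    parts = lines[0].split(" ", 2)
    protocol, _, version = parts[0].partition("/")
    status = {
        "protocol": protocol,
        "version": version,
        "code": int(parts[1]),
        "reason": parts[2] if len(parts) > 2 else "",
    }

    # Field lines: [Field Name]: [Field Value]
    fields = []
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a name/value pair, skip it
        # Field names are case-insensitive, so store them lowercased.
        fields.append((name.strip().lower(), value.strip()))
    return status, fields


sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=ISO-8859-1\r\n"
    "Connection: close\r\n"
)
status, fields = parse_header_block(sample)
```

Keeping the fields as a list of pairs rather than a dictionary preserves repeated field names, which matters for cases like the repeated header fields discussed later in this article.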
Most popular HTTP Header fields and other additional data
In this section, we will look at which HTTP Headers were the most popular ones encountered.
To start off our look at HTTP Headers, Fig 4-1 below is an abbreviated look
at a frequency table of the most popular header fields found in MAMA's URL
set (see also the more extensive per-URL
frequency table). The astute reader may have already noticed that the
caption in Fig 4-1 below reads, "Top 10...", but there are 13
values. This is because the first three field names—the ubiquitous ones
prefixed by "Client-"—are generated in MAMA's process as a result of the
usage of Perl's LWP module. These fields will be ignored in this study. As
a result, the most frequent HTTP Header field name then becomes the expected
Content-Type. With this adjustment, MAMA's
frequency table generally agrees very closely with Saarsoo's study of HTTP
Headers for the top values.
|Header name||Header value||Header name||Header value|
Other semi-random data about HTTP Headers
The most common number of HTTP Header fields encountered in this study was 12 (please see also the full frequency table)—actually 9, if you take into account the adjusted value due to the ignored "Client-*" headers. The overall length of the header had a fairly wide distribution, with the average length being 381 characters. The longest header block length encountered was 9,725 characters, found at http://www.studentenwerk.uni-freiburg.de/ in an apparently isolated case; the URL has an overabundance of "EACCELERATOR hit" header fields repeated 100 times! This is definitely not typical, and the Header block for that URL is not otherwise remarkable.
HTTP protocol versions
MAMA used a native Perl LWP method to get the protocol and version used in the URL's HTTP response. It then did a simple substring match for "1.1", "1.0" or "0.9" within that value. To allow for instances where some other version was detected, MAMA also had "unknown" as a fallback default value, but all HTTP responses fell within the three expected version types anyway. Almost half (50/104) of the HTTP/0.9 URLs were from galeon.com variants (ex: http://equipobarzamudio.galeon.com/). Whittling down duplicate domains from the HTTP/0.9 result, MAMA discovered only 42 unique servers using this HTTP protocol version (insert witty joke referencing Douglas Adams here).
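The substring check described above can be sketched in a few lines of Python. This is only an approximation of the logic as described in the text; the function name `classify_http_version` is an assumption, and MAMA's actual Perl implementation may differ in detail.

```python
def classify_http_version(protocol_string):
    """Return "1.1", "1.0", "0.9", or the fallback "unknown"."""
    for version in ("1.1", "1.0", "0.9"):
        # Simple substring match, as described in the text.
        if version in protocol_string:
            return version
    return "unknown"
```

For example, `classify_http_version("HTTP/1.1")` yields `"1.1"`, while a value containing none of the three version strings falls through to `"unknown"`.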
HTTP Header fields
"Content-Type" HTTP Header field:
MIME types and character sets
Of the 3,509,180 URLs that MAMA analyzed, the vast majority (~99.9%) used a "text/html" MIME type (see full frequency table). "text/plain" and "application/xhtml+xml" types also had some occasional representation in the set (~1,000 cases each). Other values were also encountered, albeit very rarely—including some values that would clearly indicate they should not be analyzed by MAMA as markup (like "text/rtf"). It was not known in advance of performing the analysis whether some content could be masquerading using unexpected MIME types (such as HTML being served as "text/plain"), so MAMA was not as discerning in this area as it could have been. In the future, more will be done to filter out unprocessable MIME types from analysis, including checking file extensions for inappropriate types (there are 694 URLs still present in MAMA's set that have a ".txt" file extension, for example).
The Content-Type 'Charset' parameter
The character set used for a document can be specified in several ways. One
of those ways is through the
Content-Type HTTP Header
field using a "charset"
parameter (we'll look at the other methods in the Document
encodings section). The HTTP Header line syntax for defining the character
set via this method looks like this:
Content-Type: text/html; charset=ISO-8859-1
Although the Content-Type Header field was
encountered in almost every single URL analyzed, the "charset"
parameter was only detected in 688,819 (19.63%) of them. A look at the
frequency table of the values for
the charset parameter shows that the value is usually (~88% of the time)
either "utf-8" or "iso-8859-1".
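To make the syntax concrete, here is a minimal Python sketch that separates the MIME type from the "charset" parameter of a Content-Type value. The helper name `split_content_type` is hypothetical, and the parsing is deliberately simple: it assumes parameter values do not themselves contain semicolons.

```python
def split_content_type(value):
    """Return (mime_type, charset) from a Content-Type field value.

    charset is None when the parameter is absent.
    """
    media_type, *params = [part.strip() for part in value.split(";")]
    charset = None
    for param in params:
        name, sep, val = param.partition("=")
        # The parameter name is case-insensitive per HTTP/1.1.
        if sep and name.strip().lower() == "charset":
            charset = val.strip().strip('"')
    return media_type.lower(), charset
```

With this helper, `split_content_type("text/html; charset=ISO-8859-1")` gives `("text/html", "ISO-8859-1")`, and a bare `"text/html"` gives `("text/html", None)`.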
Silly HTTP Header pet tricks - Case-sensitivity of the 'Charset' keyword
Although HTTP/1.1 defines the media type "charset" parameter as being case-insensitive, what sort of capitalization is found on the Web? The dominating usage is all lowercase in 99.2% of the cases. A recent question from a co-worker prompted MAMA to answer this minor question. Answers big and small, MAMA can provide them all!
"Server" HTTP Header field
This field contains information about the Web server used to serve the
HTTP request. MAMA again used a built-in Perl LWP method to detect this
field. The value of the
Server HTTP Header field
is expected to be the same for all pages from that server, so rather than
look at results on a per-URL basis, it is more instructive to look at them per domain.
In the brief summary below (Fig 6B-1), notice the obvious (and expected)
dominance of the Apache and IIS Web servers. In fact, in the full
per-domain and per-URL
unique value frequency tables for the Server HTTP Header
field, all the values in the top ten are either Apache or IIS
related. Apache is represented in a whopping 2,011,088 (67.72%) of domains
in MAMA while IIS is used in 769,375 (25.91%) domains. The popularity ranking
of the Web servers mentioned below is very similar to
|Server substring value||Quantity||Server substring value||Quantity|
The count for Apache also includes an additional Server string added for
completeness: "a p a c h e" (2,616 times). The
results in Fig 6B-1 are also slightly muddied by an odd value that is undoubtedly
some form of joke: 86 domains had a Server header field value with both
"Apache" and "IIS" in them (all of them being the string
"Apache-IIS/5.0"), from domains such as
The value is undoubtedly a spoof or hoax— one would expect a real hybrid of
the dominant Web servers to be a bit more popular. Lastly, notice also that
a fair number of servers use the
Server Header field to rebuff our desire for deeper knowledge of the Web servers
that are in use: 9,746 domains tell us that it is "NOYB"
(None Of Your Business) ... the nerve!
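A per-domain tally of Server substrings, along the lines described above, might look like the following Python sketch. Several things here are assumptions for illustration: the function name, the use of the URL host name as the "domain", and the choice to take the first Server value seen for each host. MAMA's own grouping rules are not spelled out in this article.

```python
from collections import Counter
from urllib.parse import urlsplit


def server_counts_by_domain(records, substrings=("apache", "iis")):
    """Count Server substrings once per host rather than once per URL.

    records is an iterable of (url, server_header_value) pairs.
    """
    first_seen = {}
    for url, server in records:
        host = urlsplit(url).hostname or ""
        # Keep only the first Server value seen for each host.
        first_seen.setdefault(host, server)

    counts = Counter()
    for server in first_seen.values():
        lowered = server.lower()
        for sub in substrings:
            if sub in lowered:
                counts[sub] += 1
    return counts
```

Note that a value such as "Apache-IIS/5.0" is counted under both substrings, which is exactly how the odd hybrid value muddies the totals.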
"Connection" HTTP Header field
The Connection field specifies options that are to be used for the current particular HTTP connection. In MAMA's analysis, the value is almost always "close" (~98.5%) when it is present. This result is very different than Saarsoo's research where "close" was actually a minority value (~41.3%) compared to the dominant value of "keep-alive" (~58.7%). A number of factors may have influenced this, with the most likely culprit being the facilities used in the respective studies to fetch the URLs— the Perl LWP module in this MAMA study and GNU Wget in Saarsoo's case. For now we will let this discrepancy stand, but the issue may be interesting to revisit in the future.
|keep-alive, timeout=50, maxreq=60||203|
|keep-alive, te, close||143|
Somebody get that kid a dictionary
In the list of HTTP Header fields encountered (Fig 4-1 above), some
variations are noticeable. Often these variations are misspellings, but it
is difficult to know whether these are deliberate or not. The HTTP Header
field with the most variations and frequency was definitely the
Connection field. Some of the misspellings
are so demonstrably wrong that one wonders how they could survive even the
simplest of inspections, but 13,764 occurrences of "Cneoction"
seems far too high to be an accident. Table Fig 6C-2 below
shows the strange and slightly bizarre list of erroneous
Connection header misspellings.
"X-Powered-By" HTTP Header field
This is a common Header extension field used to identify the Web server
pre-processing engine in use (if any). ASP and PHP dominate this field (you
have to go down past the 20th position in the popularity
frequency table to find a value that does not contain
either ASP or PHP). Combined, the various ASP and PHP values comprise 98.2%
of the X-Powered-By values. PHP is the most diversified
of the values in use, with about 450 of the 750 values (~60%!) in the frequency
table being unique PHP flavors. Finally, let us pause a moment and contemplate
all the hard work that the fine folk mentioned in the 6th position put into all of this...
...OK, that is enough. Back to the analysis.
|"the blood, sweat and tears of the fine, fine textdrive staff"||633|
"Expires" HTTP Header field
The Expires header documents a "best before"
date. Unlike with food products, an expired date does not necessarily mean
that the resource has changed or disappeared. The field is used to give the
date and time after which the content is considered "stale"— proxy
servers need to be especially mindful of this value to prevent old cached
content from being passed on to an end-user instead of fresher content from
the originating source. Other than some extremes and error cases, this field
is somewhat tedious to sift through—as you can see from
the full Expires frequency table,
the values are mostly simple dates.
Those who cannot remember the past are condemned to repeat it (George Santayana)
Looking closer at the proper format for the
Expires field in
HTTP 1.1 (RFC 2616), MAMA
uncovers quite a few transgressions:
"[It] MUST be in RFC 1123 date format, such as: 'Tue, 26 Oct 1999 19:00:00 GMT'...HTTP/1.1 clients and caches MUST treat other invalid date formats, especially including the value '0', as in the past (i.e., 'already expired')"
Not only are values of "0" for the date used, but also more extreme values like "now", "never", "-1", "-1d" and "-10000". Values in the past generally don't go further back than the UNIX origin date favorite of "01 Jan 1970", but the occasional URL makes a foray in the time machine back to the turn of the last century (1900). An enterprising group of 27 URLs made the jump back to Bastille Day ("14 Jul, 1789") for their expiration; it might be entertaining to double-check whether those were all French URLs. Ten URLs authoritatively stated they are expired (and probably mummified) by using an expiry of "01 jan 0001".
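The rule quoted from RFC 2616 (anything that is not a valid date counts as already expired) can be sketched with Python's standard `email.utils.parsedate_to_datetime`. The function name `classify_expires`, its return labels, and the 366-day cutoff used to approximate "one year" are assumptions chosen for this example; the standard library parser is also somewhat more lenient than a strict RFC 1123 check would be.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def classify_expires(value, now):
    """Classify an Expires value relative to an aware datetime `now`."""
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Values such as "0", "now" or "-1" are not dates at all;
        # RFC 2616 says to treat them as already expired.
        return "invalid"
    if when.tzinfo is None:
        # A "-0000" zone yields a naive datetime; assume UTC here.
        when = when.replace(tzinfo=timezone.utc)
    if when <= now:
        return "past"
    if when - now > timedelta(days=366):
        return "far future"
    return "future"


# Hypothetical reference time, not the actual MAMA crawl date.
reference = datetime(2008, 1, 1, tzinfo=timezone.utc)
label = classify_expires("Tue, 26 Oct 1999 19:00:00 GMT", reference)
```

A cache following the specification would handle the "invalid" and "past" labels identically; they are kept separate here only because the distinction is what makes the frequency table interesting.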
Back to the future!
Going in the other direction, MAMA also discovered many
Expires dates beyond the MAMA analysis timeframe. HTTP
1.1 (RFC 2616) has an interesting comment on future expiries:
"To mark a response as 'never expires,' an origin server sends an Expires date approximately one year from the time the response is sent. HTTP/1.1 servers SHOULD NOT send Expires dates more than one year in the future."
Contrary to this mentioned proviso, a number of URLs (92) jump forward to
the future, but not the recommended single year. URLs with expiries set
clearly in the future used a smattering of dates between 2010 and 2035, quite
a bit further forward in time than what is suggested. The wording of
RFC 2119—a document
about the wording of requirements in RFCs—says that pesky, previously
mentioned "SHOULD" terminology indicates that the future date values MAMA
encountered are permissible if the creator knows what they are doing. One
hopes that the creators of four URLs in MAMA know what they are doing when
they set their
Expires field in the year 2999!
What would the Web even look like then?!
"Cache-Control" HTTP Header field
This field communicates information used to override normal caching strategies employed by proxies or clients. The value is a comma-separated list of directives that are relevant when deciding the caching status for a document. MAMA's raw per-URL frequency table for this field is a list of unique compound values, but that does not really reveal the popularity of the sub-components very well. The table below (Fig 6F-1) goes further by showing the most popular frequencies of the components referenced in each "Cache-Control" header value from the full frequency table.
|Value component||Frequency||Value Component||Frequency|
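Breaking compound Cache-Control values into their components, as described above, could be done roughly as in the following Python sketch. The function name is hypothetical, and two simplifications are assumed: components are counted by name only (so "max-age=60" and "max-age=3600" both count as "max-age"), and commas inside quoted arguments are not handled.

```python
from collections import Counter


def count_cache_control_components(values):
    """Tally component names across many Cache-Control field values."""
    counts = Counter()
    for value in values:
        for part in value.split(","):
            # Drop any "=argument" and normalize case and whitespace.
            name = part.split("=", 1)[0].strip().lower()
            if name:
                counts[name] += 1
    return counts


sample_values = [
    "no-cache, must-revalidate",
    "max-age=3600",
    "private, max-age=0",
    "No-Cache",
]
component_counts = count_cache_control_components(sample_values)
```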
It is just a matter of time
One thing that stands out when looking at the complete
Cache-Control frequency table is the wide variety
of Max-age time values. Since the
Max-age value should take precedence over any
Expires header value, it can be informative to
look closer at the times represented. In the distribution table below (Fig
6F-2), noticeable spikes are apparent. Other values were also detected, but
their frequencies were below the chosen threshold.
|Max-age value (sec)||Frequency||Max-age value (sec)||Frequency|
|0||33,882||10800 (3 hr)||408|
|1||2,188||14400 (4 hr)||238|
|10||508||18000 (5 hr)||347|
|20||579||21600 (6 hr)||1,091|
|30||305||28800 (8 hr)||318|
|60 (1 min)||11,553||43200 (12 hr)||297|
|120 (2 min)||676||86400 (1 day)||5,546|
|300 (5 min)||2,224||172800 (2 day)||262|
|600 (10 min)||9,848||259200 (3 day)||482|
|900 (15 min)||808||432000 (5 day)||384|
|1200 (20 min)||261||604800 (7 day)||665|
|1800 (30 min)||1,295||864000 (10 day)||200|
|3600 (1 hr)||3,735||1209600 (14 day)||651|
|7200 (2 hr)||1,773||2592000 (1 month)||715|
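The precedence of Max-age over Expires can be illustrated with a short Python sketch that computes a freshness lifetime in seconds. The function names are assumptions for this example, and the sketch covers only the simple case from RFC 2616 (max-age if present, otherwise the Expires time minus the Date time); heuristic freshness and "s-maxage" are left out.

```python
import re
from datetime import datetime, timedelta, timezone


def max_age_seconds(cache_control):
    """Return the max-age value in seconds, or None if absent."""
    match = re.search(
        r'(?:^|,)\s*max-age\s*=\s*"?(\d+)', cache_control, re.IGNORECASE
    )
    return int(match.group(1)) if match else None


def freshness_lifetime(cache_control, expires, date):
    """Prefer max-age; fall back to Expires minus Date; else None.

    expires and date are datetime objects (or None).
    """
    max_age = max_age_seconds(cache_control or "")
    if max_age is not None:
        return max_age  # max-age overrides any Expires value
    if expires is not None and date is not None:
        return max(0, int((expires - date).total_seconds()))
    return None


# Hypothetical response times used only for demonstration.
sent = datetime(2008, 1, 1, tzinfo=timezone.utc)
lifetime = freshness_lifetime("private, max-age=600", sent + timedelta(days=1), sent)
```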
"Vary" HTTP Header field
This field consists of a comma-separated list of other header fields that are used to determine:
"... whether a cache is permitted to use the response to reply to a subsequent request without revalidation. For uncacheable or stale responses, the Vary field value advises the user agent about the criteria that were used to select the representation."
The full per-URL frequency table of unique values found for this field is not very extensive, but a quick summary is still useful. Notice that "accept-encoding" is the dominant value here.
"SSL-Cipher" HTTP Header field
Popularity: a small sample space, but more to come
In the URL set that MAMA analyzed, there were relatively few URLs using
the HTTPS protocol—only 4,994. MAMA detects all
HTTP Header fields due to a past request from an Opera developer to discover what cipher
types are in use, but MAMA's sample space is not overly large. To satisfy
the original request fully, a much deeper study of HTTPS domains was performed
and will be presented as an adjunct to this study at a later time. As a
consequence, I will not go into great depth about this header field here.
The SSL settings are expected to be uniform across a given Web server, so
the focus here is on
SSL-Cipher values on a per-domain basis
(full frequency table). The 4,994
HTTPS URLs are from 4,355 unique domains. Among these domains, two cipher strings
are rather evenly dominant: "RC4-MD5" and
"DHE-RSA-AES256-SHA". Other values also
occurred, although with far lower frequency.
HTTPS headers from an HTTP URL?
One would expect that only HTTPS URLs would deliver an
SSL-Cipher header, but that is not always the
case. Some 4,994 URLs used the HTTPS protocol, but in fact 4,997 URLs had
an SSL-Cipher header field: three
non-HTTPS servers were sending the header as well.
The other HTTP Header fields
Several of the other popular HTTP Headers were also analyzed but did not
yield much in the way of interesting trends to present here. For instance,
when the Client-Transfer-Encoding header is actually
used, it only yielded the value "chunked". The
Content-Language have a fair amount of variety
but are not otherwise very remarkable. Other fields like
Set-Cookie produced so many unique random
values that there was no point in searching for trends. For some fields that
had small result sets, like the Pragma field,
the only thing that stood out was how dominant a single value was
("no-cache": 98% of the time). The
Pragma field also demonstrated the same sort of
curiosity found in fields like the Connection
header: that the specification writers' original choice of the field
name was sometimes unfortunate in that some keywords are just easily misspelled,
such as "cache"
giving way to semi-frequent variations like "chache",
"cashe" and "cahce".
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.