| Interpreting
Cookies
Although cookies were initially implemented to facilitate
shopping carts, a common use of cookies is to uniquely
identify users within a web site. Cookies work in the
following manner. When a person visits a cookie-enabled
web site, the server replies with the content and a
unique identifier called a cookie, which the browser
stores on the user's machine. On subsequent requests
to the same web site, the browser software includes
the value of the cookie with each request. Because the
identifier is unique, all requests that carry the
same cookie are known to be from the same browser. Since
multiple people may use the same browser, each cookie
may not actually represent a single user, but most web
sites are willing to accept this limitation and treat
each cookie as a single user. Recently, browser vendors
have provided users with controls to select the cookie
policy that matches their privacy preferences. These
controls let users choose how much notice they receive
when visiting a server that issues cookies, and also
let them stop the browser from sending cookies
altogether. Consequently, unless a site requires
people to use cookies to receive content, the cookie
field may be null, which leaves the task of identifying
user paths to the other recorded fields.
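The grouping described above can be sketched in a few
lines. This is only an illustration: it assumes each log
entry has already been parsed into a dictionary, and the
field names ("host", "cookie", "path") are hypothetical
rather than taken from any particular log format.

```python
from collections import defaultdict

def group_by_cookie(requests):
    """Split parsed requests into per-cookie lists and a no-cookie list."""
    by_cookie = defaultdict(list)
    no_cookie = []
    for req in requests:
        cookie = req.get("cookie")
        if cookie:
            # Same cookie value: treat as the same browser (and user).
            by_cookie[cookie].append(req)
        else:
            # Null or missing cookie: must be resolved from other fields.
            no_cookie.append(req)
    return dict(by_cookie), no_cookie

# Hypothetical sample entries, for illustration only.
sample = [
    {"host": "10.0.0.1", "cookie": "abc", "path": "/"},
    {"host": "10.0.0.2", "cookie": "abc", "path": "/cart"},
    {"host": "10.0.0.3", "cookie": None, "path": "/"},
]
users, unresolved = group_by_cookie(sample)
```

Requests left in the unresolved list are the ones that
have to be attributed using the other recorded fields.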
Given the limitations of the information recorded
in Web access logs, it is not surprising that sites
require users to accept cookies and defeat caching
to gain more accurate usage information. Still, numerous
sites either do not use cookies or do not require users
to accept a cookie to gain access to content. In these
cases, determining unique users and their paths through
a web site is typically done heuristically.
Even when cookies are used, several scenarios are
possible when a previously encountered cookie is processed.
If the request is coming from the same host regardless
of the user agent, the request is treated as being issued
by the same user. This is because a unique cookie is
issued to only one browser. If the user agent field
remains the same but the host changes, it is still the
same user, and some form of IP/domain name change is
taking place. This often happens with users behind firewalls
and ISPs that load-balance proxies. However, if we have
the same cookie with a different user agent, then an
error has most likely occurred as cookies are not shared
across browsers. If no cookies are present, then the
site statistics software can resort to using IP addresses.
If the request comes from a known host, then it could
be from a new user or the same user; otherwise, the request
is from a different user. It is important to point out
that these latter two cases could also arise from
crawling software that does not support cookies.
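As a rough sketch, the cases above can be written as a
pair of small functions. The function names, field names
and return labels are hypothetical, and the checks are
applied in the order the text gives them (host first, then
user agent), which is an assumption about precedence.

```python
def classify_known_cookie(prev, cur):
    """Compare a request with the previous request that carried the same cookie.

    prev and cur are dicts with 'host' and 'agent' keys (hypothetical names).
    """
    if cur["host"] == prev["host"]:
        # Same host, regardless of user agent: same user.
        return "same user"
    if cur["agent"] == prev["agent"]:
        # Host changed but agent did not: same user behind a changing IP.
        return "same user, host changed"
    # Same cookie, different host and agent: cookies are not shared
    # across browsers, so this is most likely an error.
    return "probable error"

def classify_without_cookie(host, known_hosts):
    """Fall back to the host when the request carries no cookie."""
    if host in known_hosts:
        return "ambiguous: new or same user"
    return "different user"
```

Either no-cookie result may also come from crawling
software, so a real analysis tool would probably check the
user agent against a list of known crawlers as well.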
An interesting set of scenarios occurs when a new cookie
is encountered. If the request is from a host that has
already been processed and the previous value of the
cookie was “null” and the user agent is the same, it
is fair to conclude that the request is from a new user
who just received their first cookie from the server
in the previous request. If the client is not using
cookie obfuscation software, one would expect the following
requests from this user to all contain the same cookie.
However, if the previous value from the same host
and agent was a different cookie, it could be the same
user obfuscating cookie requests, or a new user from
the same ISP using the same browser version and platform
as the user from the previous request. Barring any other
piece of supporting evidence, like the referrer field
or the site's topology, it is difficult to
determine which scenario is correct. If the user
agent is different from the previous request, but accompanies
a new cookie from the same host, it is fair to assume
that a new user has entered the site. Of course, a new
cookie from a new host regardless of the agent is a
new user.
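The new-cookie scenarios can be sketched the same way.
This is a minimal illustration, assuming the analysis
program keeps the last request seen from each host in a
dictionary; the names last_by_host, "agent" and "cookie"
are hypothetical.

```python
def classify_new_cookie(cur, last_by_host):
    """Classify a request whose cookie value has not been seen before.

    cur: dict with 'host', 'agent' and 'cookie' keys.
    last_by_host: maps each host to the last request dict seen from it.
    """
    prev = last_by_host.get(cur["host"])
    if prev is None:
        # New cookie from a new host, regardless of agent.
        return "new user"
    if prev["agent"] != cur["agent"]:
        # Known host, different agent, new cookie.
        return "new user"
    if prev["cookie"] is None:
        # Known host, same agent, previous cookie was null: the user
        # received their first cookie in the previous request.
        return "new user, first cookie"
    # Known host, same agent, different previous cookie: cannot tell
    # cookie obfuscation from a second user without more evidence.
    return "ambiguous: obfuscation or new user"
```

The ambiguous case is where the referrer field or the
site's topology would be consulted, which this sketch
leaves out.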
IP and Domain Name Counting
You can also learn something about visitors by studying
their domain names. Though the log file may record IP
addresses, your log analysis program can determine from
many of these IP numbers the associated domain or ISP.
This might tell you if your most important client --
or competitor -- has been looking at your web pages.
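A log analysis program might perform this lookup with a
reverse DNS query. The sketch below uses Python's standard
socket.gethostbyaddr; the function name domain_for_ip and
the rule of keeping only the last two labels of the host
name are assumptions made for illustration, and that rule
is too crude for suffixes such as co.uk.

```python
import socket

def domain_for_ip(ip, resolver=socket.gethostbyaddr):
    """Return a rough domain for an IP address, or None if it has no name.

    resolver is injectable so the function can be tried without
    network access.
    """
    try:
        hostname = resolver(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return None
    parts = hostname.split(".")
    # Keep only the last two labels, e.g. proxy1.example.com -> example.com.
    return ".".join(parts[-2:]) if len(parts) >= 2 else hostname
```

Many addresses have no reverse DNS entry at all, which is
why only some of the IP numbers in a log can be mapped to
a domain or ISP.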
The simplest assumption to make about users
is that each IP address or domain name represents a
unique user. Using this method, all the requests made
by the same host are treated as though from a single
user. When a new host is detected, a new user profile
is created and the corresponding requests are associated
with the new user. Several methods that use additional
information recorded in the access logs or other heuristics
are also possible. One refinement is to use the user
agent field. Using this method, new users are identified
as above as well as when requests coming from the same
machine have different user agents. Another refinement
is to place session timeouts on requests made from the
same machine. The intuition is that if a certain amount
of time has elapsed, then the old user has left the
site and a new user has entered.
When using these methods for identifying users, the
following situations occur when sequentially processing
access logs:
- a new IP address is encountered (assume this is
  a new user),
- an already processed IP address is encountered:
  - the user agent matches prior requests (assume
    this is the same user),
  - the user agent field does not match any prior
    requests from the same IP (assume this is a new
    user),
- a session is terminated due to a timeout (assume
  a new user has entered the site).
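The situations listed above can be sketched as a single
pass over the log. This is a minimal illustration, not a
complete analyzer: it assumes requests arrive as
(timestamp, IP, user agent) tuples sorted by time, and the
30-minute timeout is an arbitrary value chosen for the
example.

```python
from datetime import datetime, timedelta

def assign_users(requests, timeout=timedelta(minutes=30)):
    """Assign a user id to each (timestamp, ip, agent) request."""
    current = {}   # (ip, agent) -> (user_id, time of last request)
    next_id = 0
    ids = []
    for ts, ip, agent in requests:
        key = (ip, agent)
        entry = current.get(key)
        if entry is None or ts - entry[1] > timeout:
            # New IP, new agent on a known IP, or session timeout:
            # assume a new user.
            user_id = next_id
            next_id += 1
        else:
            # Known IP and agent within the timeout: same user.
            user_id = entry[0]
        current[key] = (user_id, ts)
        ids.append(user_id)
    return ids

t0 = datetime(2000, 1, 1, 12, 0)
log = [
    (t0, "1.1.1.1", "A"),
    (t0 + timedelta(minutes=5), "1.1.1.1", "A"),   # same user
    (t0 + timedelta(minutes=6), "1.1.1.1", "B"),   # new agent
    (t0 + timedelta(minutes=7), "2.2.2.2", "A"),   # new IP
    (t0 + timedelta(minutes=60), "1.1.1.1", "A"),  # timed out
]
print(assign_users(log))  # [0, 0, 1, 2, 3]
```

Counting how many new ids come from timeouts and how many
from new hosts gives the statistics discussed next.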
If your statistics show that many of the new hosts
and timeouts came from hosts in the same domain/IP
address space, you can infer either that a large number
of web site users connect to the Web via ISPs with
load-balancing proxies, or that a large number of
different users access the site from within the same
domain, as would occur with a large company, or that
some combination of both cases exists.
Regardless, a significant number of page requests
can result in ambiguous cases, where it is not possible
to determine the existence of new users with certainty.
While the incidence rate can vary considerably from
Web site to Web site, the results can be inaccurate,
since these IP-based methods and their derivatives
are used precisely where unique identifiers like
cookies are not present.
Caching
Another major problem that dilutes the quality of
the data is caching. There are two major types of caching.
First, browsers automatically cache files when they
are downloaded. Once a file is cached, it is not necessary
to download the entire page again. Depending
on its settings, the browser may check with the server
whether the page has changed, in which case the server
does know about the visit and a page request is
recorded. However, if the browser
is not set to verify whether a page has changed, then the
user can read the page without any entry being recorded
in the web log.
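When a browser does verify a cached page, the server
typically answers with HTTP status 304 (Not Modified)
instead of 200, and that request appears in the log. The
sketch below tallies status codes to show how many
recorded requests were such revalidations. It assumes log
lines in the Common Log Format, where the status code
follows the quoted request; the sample lines are invented.

```python
import re
from collections import Counter

# Status code: three digits right after the closing quote of the request.
STATUS_RE = re.compile(r'"\s(\d{3})\s')

def count_statuses(lines):
    """Count HTTP status codes found in access log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Invented sample lines, for illustration only.
sample_lines = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:14:02:11 -0700] "GET /index.html HTTP/1.0" 304 -',
]
statuses = count_statuses(sample_lines)
```

Page views served from the browser cache without any
verification leave no line at all, so they cannot be
recovered this way.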
In addition, almost all ISPs now have their own cache.
This means that when a user requests a page that
anyone else at the same ISP has requested recently,
the cache will have saved it and will serve it without
any request being made to the original site. Therefore
many people could request a site's pages from the same
cache without the original web site (or its logs) even
knowing about it.
For example, AOL uses caching extensively, and a single
user with an AOL account may be reflected in your server
logs by several different IP numbers as AOL uses its
caching to grab the files for its user. If this happens,
the logs will fail to identify a repeat customer. In
addition, the logs will not be able to record whether a visitor
typed a URL into their browser after seeing a particular
advertisement. If the page is already cached when requested,
no page requests at all might show up in the logs.
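Sites that want to defeat caching usually do so by sending
response headers that ask browsers and proxies not to keep
a copy. The sketch below builds one common set of such
headers. The function name is hypothetical, how the
headers are attached to a response depends on the server
software in use, and caches are not guaranteed to honor
them.

```python
def no_cache_headers():
    """Return response headers that ask caches not to store the page."""
    return {
        # HTTP/1.1 caches: do not store, and revalidate before reuse.
        "Cache-Control": "no-store, no-cache, must-revalidate",
        # Older HTTP/1.0 caches.
        "Pragma": "no-cache",
        # An invalid date is treated as already expired.
        "Expires": "0",
    }

headers = no_cache_headers()
```

The trade-off is that every page view then reaches the
server, which makes the logs more accurate but gives up
the speed and bandwidth savings that caching provides.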
|