Web Hosting

Interpreting Cookies

Although cookies were initially implemented to facilitate shopping carts, a common use of cookies is to uniquely identify users within a web site. Cookies work in the following manner. When a person visits a cookie-enabled web site, the server replies with the content and a unique identifier called a cookie, which the browser stores on the user's machine. On subsequent requests to the same web site, the browser includes the value of the cookie with each request. Because the identifier is unique, all requests that carry the same cookie are known to be from the same browser. Since multiple people may use the same browser, a cookie may not actually represent a single user, but most web sites are willing to accept this limitation and treat each cookie as a single user. Recently, browser vendors have given users controls to select the cookie policy that matches their privacy preferences. These controls let users choose how much notice they receive when visiting a server that issues cookies, and also let them stop their browser from sending cookies altogether. Consequently, unless a site requires people to accept cookies to receive content, the cookie field may be null, which leaves the task of identifying user paths to the other recorded fields.
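The exchange described above can be sketched in a few lines of Python using the standard library's http.cookies module. This is only an illustration under stated assumptions: the cookie name visitor_id and the helper functions read_visitor_id and handle_request are hypothetical, and a real site would normally let its web framework handle cookies.

```python
from http.cookies import SimpleCookie
import uuid

COOKIE_NAME = "visitor_id"  # hypothetical cookie name, not from the article


def read_visitor_id(cookie_header):
    """Return the visitor ID carried in a Cookie request header, or None."""
    if not cookie_header:
        return None
    jar = SimpleCookie()
    jar.load(cookie_header)
    morsel = jar.get(COOKIE_NAME)
    return morsel.value if morsel else None


def handle_request(cookie_header):
    """Return (visitor_id, set_cookie_value) for one incoming request.

    set_cookie_value is None when the browser already sent a cookie;
    otherwise it is the value for a Set-Cookie response header.
    """
    visitor_id = read_visitor_id(cookie_header)
    if visitor_id is not None:
        return visitor_id, None          # returning browser
    visitor_id = uuid.uuid4().hex        # first visit: mint a new identifier
    jar = SimpleCookie()
    jar[COOKIE_NAME] = visitor_id
    jar[COOKIE_NAME]["path"] = "/"
    return visitor_id, jar[COOKIE_NAME].OutputString()
```

A browser that refuses cookies never sends the header back, so every one of its requests looks like a first visit. That is the "null" cookie field case discussed in this article.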

Given the limitations of the information recorded in Web access logs, it is not surprising that sites require users to accept cookies, and take steps to defeat caching, in order to gain more accurate usage information. Still, numerous sites either do not use cookies or do not require users to accept a cookie to gain access to content. In these cases, determining unique users and their paths through a web site is typically done heuristically.

Even when cookies are used, several scenarios are possible when a previously encountered cookie is processed. If the request is coming from the same host, regardless of the user agent, the request is treated as being issued by the same user. This is because a unique cookie is issued to only one browser. If the user agent field remains the same but the host changes, it is still the same user and some form of IP/domain name change is occurring. This often occurs with users behind firewalls and ISPs that load-balance proxies. However, if we have the same cookie with a different user agent, then an error has most likely occurred, as cookies are not shared across browsers. If no cookies are present, then the site statistics software can resort to using IP addresses. If the request comes from a known host, then we could have a new user or the same user; otherwise the request is from a different user. It is important to point out that these latter two cases could also be issued by crawling software that does not support cookies.
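The rules in the paragraph above can be written as a small decision function. The sketch below is one possible reading, not a definitive implementation: the Request type, the function name classify_request, and the returned labels are all hypothetical. It also assumes that the "same host" rule takes precedence, so a request is flagged as a probable error only when both the host and the user agent differ from the ones last seen with that cookie.

```python
from typing import NamedTuple, Optional


class Request(NamedTuple):
    host: str               # IP address or resolved host name
    agent: str              # User-Agent field
    cookie: Optional[str]   # None when the cookie field is null


def classify_request(req, cookie_history, known_hosts):
    """Label one request.

    cookie_history maps a cookie value to the (host, agent) pair that
    last presented it; known_hosts is the set of hosts already seen.
    """
    if req.cookie is None:
        # No cookie: fall back to the host / IP address.
        if req.host in known_hosts:
            return "new or same user"
        return "different user"
    if req.cookie not in cookie_history:
        return "new cookie"          # not a previously encountered cookie
    last_host, last_agent = cookie_history[req.cookie]
    if req.host == last_host:
        return "same user"           # same host, regardless of agent
    if req.agent == last_agent:
        return "same user, host changed"
    return "probable error"          # host and agent both differ
```

The two cookie-less labels also cover crawlers that do not support cookies, so they are best treated as weak evidence.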

An interesting set of scenarios occurs when a new cookie is encountered. If the request is from a host that has already been processed, the previous value of the cookie was “null” and the user agent is the same, it is fair to conclude that the request is from a new user who received their first cookie from the server in response to the previous request. If the client is not using cookie obfuscation software, one would expect the following requests from this user to all contain the same cookie. However, if the previous request from the same host and agent carried a different cookie, it could be the same user obfuscating cookie requests, or a new user from the same ISP using the same browser version and platform as the user from the previous request. Barring any other piece of supporting evidence, like the referrer field or the site's topology, it is difficult to determine which scenario is correct. If the user agent is different from the previous request, but accompanies a new cookie from the same host, it is fair to assume that a new user has entered the site. Of course, a new cookie from a new host, regardless of the agent, is a new user.
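The new-cookie scenarios above can also be expressed as a short function. Again this is a hedged sketch: classify_new_cookie, its arguments, and the labels are hypothetical names chosen for illustration, and the function assumes the caller has already established that the cookie value has not been seen before.

```python
def classify_new_cookie(host, agent, previous_by_host):
    """Label a request that carries a cookie value not seen before.

    previous_by_host maps a host to the (agent, cookie) pair of the most
    recent request from that host; cookie is None for a null cookie field.
    """
    if host not in previous_by_host:
        return "new user"                    # new cookie from a new host
    prev_agent, prev_cookie = previous_by_host[host]
    if agent != prev_agent:
        return "new user"                    # same host, different agent
    if prev_cookie is None:
        # Same host and agent, previous request had no cookie: the visitor
        # was most likely just issued this cookie.
        return "new user, first cookie"
    # Same host and agent, but a different cookie last time: either cookie
    # obfuscation or another user behind the same ISP.
    return "ambiguous"
```

Resolving the ambiguous case would need extra evidence such as the referrer field, which this sketch does not attempt.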

IP and Domain Name Counting

You can also learn something about visitors by studying their domain names. Though the log file may record IP addresses, your log analysis program can determine from many of these IP numbers the associated domain or ISP. This might tell you if your most important client -- or competitor -- has been looking at your web pages.
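As a rough illustration of domain counting, the sketch below groups already-resolved host names by their last two labels. The function names domain_of and count_by_domain are hypothetical, the host names are made-up examples, and the "last two labels" rule is a deliberate simplification that will misgroup suffixes such as co.uk. Resolving IP addresses to names (for example with socket.gethostbyaddr) is left out, so unresolved addresses are simply counted as themselves.

```python
from collections import Counter
import ipaddress


def domain_of(host):
    """Return a coarse domain for a host name, or the host itself if it is an IP."""
    try:
        ipaddress.ip_address(host)
        return host                      # unresolved IP address: keep as is
    except ValueError:
        pass
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])         # naive: last two labels only


def count_by_domain(hosts):
    """Count requests per coarse domain."""
    return Counter(domain_of(h) for h in hosts)


if __name__ == "__main__":
    sample = ["proxy1.example.com", "proxy2.example.com",
              "192.0.2.7", "www.example.org"]
    for domain, hits in count_by_domain(sample).most_common():
        print(domain, hits)
```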

The simplest assumption to make about users is that each IP address or domain name represents a unique user. Using this method, all the requests made by the same host are treated as though they came from a single user. When a new host is detected, a new user profile is created and the corresponding requests are associated with the new user. Several methods that use additional information recorded in the access logs or other heuristics are also possible. One refinement is to use the user agent field. Using this method, new users are identified as above as well as when requests coming from the same machine have different user agents. Another refinement is to place session timeouts on requests made from the same machine. The intuition is that if a certain amount of time has elapsed, then the old user has left the site and a new user has entered.

When using these methods for identifying users, the following situations occur when sequentially processing access logs:

  1. a new IP address is encountered (assume this is a new user),
  2. an already processed IP address is encountered
    • the user agent matches prior requests (assume this is the same user),
    • the user agent field does not match any prior requests from the same IP (assume this is a new user),
    • when a session is terminated due to a timeout, assume a new user has entered the site.
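
The cases listed above can be sketched as a single pass over the log. This is a minimal illustration under assumptions: the function name assign_users is hypothetical, the 30-minute timeout is an arbitrary choice (the article does not name a value), and the input is assumed to be (timestamp, ip, agent) tuples already sorted by time.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed value, not from the article


def assign_users(requests, timeout=SESSION_TIMEOUT):
    """Assign a user number to each (timestamp, ip, agent) request.

    Requests must be in chronological order. Returns a list of user
    numbers, one per request, in the same order.
    """
    last_seen = {}   # (ip, agent) -> (user number, time of last request)
    next_user = 0
    assigned = []
    for timestamp, ip, agent in requests:
        key = (ip, agent)
        entry = last_seen.get(key)
        if entry is None or timestamp - entry[1] > timeout:
            # New IP, new agent on a known IP, or a timed-out session.
            user = next_user
            next_user += 1
        else:
            user = entry[0]              # same IP and agent, still active
        last_seen[key] = (user, timestamp)
        assigned.append(user)
    return assigned
```

Keying the lookup on the (ip, agent) pair covers both "new IP" and "known IP, new agent" with one test, which keeps the timeout check as the only extra condition.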

Therefore, if your statistics show that many of the new hosts and timeouts come from hosts in the same domain/IP address space, you can infer that a large number of web site users connect to the Web via ISPs with load-balancing proxies, or that a large number of different users access the site from within the same domain, as would occur with a large company, or that some combination of both cases exists.

Regardless, a significant number of page requests can result in ambiguous cases, where it is not possible to determine the existence of new users with certainty. The incidence rate can vary considerably from Web site to Web site, but the results can be inaccurate wherever these IP-based methods and their derivatives are used because unique identifiers like cookies are not present.

Caching

Another major problem that dilutes the quality of the data is caching. There are two major types of caching. First, browsers automatically cache files when they are downloaded, so it is not necessary to download the entire page again on a later visit. Depending on its settings, the browser may check with the server whether the page has changed; in that case the server does see the request, and a page request is recorded. However, if the browser is not set to verify whether a page has changed, then the user can read the page without any entry being recorded in the web log.
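One way to gauge how often browsers are revalidating cached pages is to count response status codes in the access log. A revalidation that finds the page unchanged is normally answered with HTTP status 304 (Not Modified) rather than 200. The sketch below assumes the log is in Common Log Format; the function name status_counts and the sample lines are hypothetical.

```python
import re
from collections import Counter

# In Common Log Format the status code follows the quoted request line.
STATUS_RE = re.compile(r'"[^"]*" (\d{3}) ')


def status_counts(lines):
    """Count HTTP status codes found in access log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '192.0.2.1 - - [01/Jan/2024:12:00:00 +0000] "GET /index.html HTTP/1.0" 200 2326',
        '192.0.2.1 - - [01/Jan/2024:12:05:00 +0000] "GET /index.html HTTP/1.0" 304 -',
    ]
    print(status_counts(sample))
```

Page views served straight from the browser cache without revalidation leave no line at all, so these counts are only a lower bound on actual reads.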

In addition, almost all ISPs now have their own cache. This means that when someone requests a page that another customer of the same ISP has requested recently, the cache will already hold a copy and will serve it without any request being made to the original site. Therefore many people could request a site's pages from the same cache without the original web site (or its logs) even knowing about it.

For example, AOL uses caching extensively, and a single user with an AOL account may be reflected in your server logs by several different IP numbers as AOL uses its caching to grab the files for its user. If this happens, the logs will fail to identify a repeat customer. In addition, the logs will not be able to record whether a visitor typed a URL into their browser after seeing a particular advertisement. If the page is already cached when it is requested, no page request at all may show up in the logs.
