This is not a new problem: AOL has been working this way for as long as I can remember, so you can't put much stock in web log data. It can only be used for generalised info.
Let me try to explain it in basic terms for people who are unaware of the full process. We'll use your SearchTrax site as an example....
1. The Basics
Let's suppose I visit the SearchTrax web site. I follow a link from somewhere else to your front page, read some pages, and then follow one of your links out of your site.
So, what do you know about it? Well firstly, I make one request (or 'hit') for your front page. You know the date and time of the request and which page I asked for (of course), and the internet address of my computer (my host). I do not tell you my username or my email address.
Next, I look at the page (or rather my browser does) to see if it's got any graphics on it. If so, and if I've got image loading turned on in my browser, I make a separate connection to retrieve each of these graphics. I never log into your site: I just make a sequence of requests (hits), one for each new file I want to download. The referring page for each of these graphics is your front page. Maybe there are 10 graphics on the SearchTrax Home page. Then so far I've made 11 'hits' to your server.
After that, I go and visit some of your other pages, making a new request (or hit) for each page and graphic that I want. Finally, I follow a link out of the SearchTrax site. You never know about that at all. I just connect to the next site without telling you.
2. Caching
It's not always quite as simple as that. One major problem is caching. There are two major types of caching. First, my browser automatically caches files when I download them. This means that if I visit the SearchTrax site again, the next day say, I don't need to download all of the pages again. Depending on the settings on my browser, I might check with you that the page hasn't changed: in that case, you do know about it, and the logs will count it as a new request for the page. But I might set my browser not to check with you: then I will read each page again without you ever knowing about it.
The other sort of cache is on a larger scale. Almost all ISPs (Internet Service Providers - BTInternet, Freeserve, AOL etc) now have their own cache. This means that if I try to look at one of the SearchTrax pages and anyone else from the same ISP has looked at that page recently, the ISP cache will have saved it, and will give it out to me without ever telling you about it. (This applies whatever my browser settings.) So hundreds of people could read your pages, even though you'd only sent them out once.
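To make the ISP cache effect concrete, here is a toy simulation in Python. It is only a sketch: the class names, the hostname cache.isp.example and the figure of 300 readers are all made up for illustration, and a real cache would expire or revalidate its entries rather than keep them forever.

```python
class OriginServer:
    """Stands in for the web server; it only sees requests that reach it."""

    def __init__(self):
        self.log = []  # one (host, path) entry per request actually received

    def fetch(self, host, path):
        self.log.append((host, path))
        return f"<html>contents of {path}</html>"


class IspCache:
    """Toy shared cache: only the first request for a page goes to the origin."""

    def __init__(self, origin, host):
        self.origin = origin
        self.host = host  # the only address the origin ever sees
        self.store = {}   # simplification: entries never expire

    def get(self, path):
        if path not in self.store:
            self.store[path] = self.origin.fetch(self.host, path)
        return self.store[path]


origin = OriginServer()
cache = IspCache(origin, host="cache.isp.example")  # hypothetical hostname

readers = 300  # invented number of people at the same ISP
for _ in range(readers):
    cache.get("/index.html")

print(f"{readers} readers, {len(origin.log)} request(s) in the server log")
```

Everyone behind the cache got the page, but the server's log holds a single line, and that line names the cache rather than any reader.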
3. What you can know
The only things you can know for certain are the number of requests made to the SearchTrax server, when they were made, which files were asked for, and which host asked you for them.
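As a rough illustration, this is about all a log line gives you. The sketch below assumes the server writes Common Log Format (yours may be configured differently), and the sample lines and hostnames are invented.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: Common Log Format. Adjust the pattern if your server differs.
LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Invented sample data, not real SearchTrax logs.
SAMPLE_LOG = """\
host1.example.com - - [08/Jul/2004:11:58:00 +0100] "GET /index.html HTTP/1.0" 200 5120
host1.example.com - - [08/Jul/2004:11:58:01 +0100] "GET /images/logo.gif HTTP/1.0" 200 2048
host2.example.net - - [08/Jul/2004:12:03:10 +0100] "GET /index.html HTTP/1.0" 200 5120
"""


def parse(line):
    """Return the host, time and file of one request, or None if unparseable."""
    m = LINE.match(line)
    if m is None:
        return None
    when = datetime.strptime(m["when"], "%d/%b/%Y:%H:%M:%S %z")
    return {"host": m["host"], "when": when, "path": m["path"]}


requests = [r for r in map(parse, SAMPLE_LOG.splitlines()) if r]
hits_per_file = Counter(r["path"] for r in requests)
hits_per_host = Counter(r["host"] for r in requests)

print(len(requests), "requests")
print(hits_per_file.most_common())
print(hits_per_host.most_common())
```

Note what is missing from the parsed record: there is no person, no visit and no reading time, only a host, a timestamp and a file.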
4. What you can't know
You can't tell the identity of your readers. Unless you explicitly require users to provide a password, you don't know who connected or what their email addresses are.
You can't tell how many visitors you've had. You can guess by looking at the number of distinct hosts that have requested things from you. Indeed this is what many programs mean when they report "visitors". But this is not always a good estimate for three reasons. First, if users get your pages from a local cache server, you will never know about it. Secondly, sometimes many users appear to connect from the same host: either users from the same company or ISP, or users using the same cache server. Finally, sometimes one user appears to connect from many different hosts. As you said, AOL allocates users a different hostname for every request. So if your home page has 10 graphics on it, and an AOL user visits it, most programs will count that as 11 different visitors!
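Here is a small sketch of the "distinct hosts" guess going wrong in both directions. The hostnames are invented, and the figure of 50 office users is just an example.

```python
# (host, path) pairs as the server would log them. All hostnames are made up.

# One AOL-style user whose every request arrives from a different proxy host:
# the home page plus its 10 graphics.
paths = ["/index.html"] + [f"/img/{n}.gif" for n in range(10)]
aol_user = [(f"proxy-{i:02d}.aol.example", path) for i, path in enumerate(paths)]

# Fifty different people at one company, all behind the same gateway.
office_users = [("gateway.bigcorp.example", "/index.html") for _ in range(50)]


def count_visitors(requests):
    """The usual estimate: number of distinct hosts seen in the log."""
    return len({host for host, _ in requests})


print(count_visitors(aol_user))      # 11 "visitors" for one person
print(count_visitors(office_users))  # 1 "visitor" for fifty people
```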
You can't tell how many visits you've had. Many programs, under pressure from advertisers' organisations, define a "visit" (or "session") as a sequence of requests from the same host until there is a half-hour gap. This is an unsound method for several reasons. First, it assumes that each host corresponds to a separate person and vice versa. This is simply not true in the real world, as discussed in the last paragraph. Secondly, it assumes that there is never a half-hour gap in a genuine visit. This is also untrue. I quite often follow a link out of a site, then step back in my browser and continue with the first site from where I left off. Should it really matter whether I do this 29 or 31 minutes later? Finally, to make the computation tractable, such programs also need to assume that your logfile is in chronological order. It isn't always, whereas a plain count of requests gives the same results however you jumble the lines up.
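The half-hour rule is easy to write down, which is probably why it is so popular. The sketch below is my own minimal version, not taken from any particular log analyser: it treats a gap of strictly more than 30 minutes as the start of a new visit, and it sorts the requests first because the method only works on a chronological log.

```python
from datetime import datetime, timedelta


def count_visits(requests, timeout=timedelta(minutes=30)):
    """Count runs of requests from one host with no gap longer than timeout."""
    last_seen = {}
    visits = 0
    # The rule needs chronological order, so sort by timestamp first.
    for host, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(host)
        if previous is None or when - previous > timeout:
            visits += 1
        last_seen[host] = when
    return visits


start = datetime(2004, 7, 8, 11, 0)


def reader(gap_minutes):
    """One reader (hypothetical host) who pauses, then carries on reading."""
    return [
        ("host1.example.com", start),
        ("host1.example.com", start + timedelta(minutes=gap_minutes)),
    ]


print(count_visits(reader(29)))  # 1
print(count_visits(reader(31)))  # 2
```

Same person, same behaviour, two minutes' difference, and the "visit" count doubles.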
Cookies don't solve these problems. Some sites try to count their visitors by using cookies. A cookie is a small text file saved to your computer which identifies your computer to the web site the next time you return. This reduces the errors. But it can't solve the problem unless you refuse to let people read your pages who can't or won't take a cookie. And you still have to assume that your visitors will use the same cookie for their next request.
You can't follow a person's path through your site. Even if you assume that each person corresponds one-to-one to a host, you don't know their path through your site. It's very common for people to go back to pages they've downloaded before. You never know about these subsequent visits to that page, because their browser has cached them. So you can't track their path through your site accurately.
You often can't tell where they entered your site, or where they found out about you from. If they are using a cache server, they will often be able to retrieve the SearchTrax home page from their cache, but not all of the subsequent pages they want to read. Then the first page you know about them requesting will be one in the middle of their true visit.
You can't tell how they left your site, or where they went next. They never tell you about their connection to another site, so there's no way for you to know about it.
You can't tell how long people spent reading each page. Once again, you can't tell which pages they are reading between successive requests for pages. They might be reading some pages they downloaded earlier. They might have followed a link out of searchtrax.com, and then come back later. They might have interrupted their reading for a quick game of Minesweeper. You just don't know.
You can't tell how long people spent on your site. Apart from the problems in the previous point, there is one other complete show-stopper. Programs which report the time on the site count the time between the first and the last request. But they don't count the time spent on the final page, and this is often the MAJORITY of the whole visit.
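A quick worked example of the final-page problem, with invented page names and reading times. The "reported" figure is computed the way described above: last request minus first request.

```python
from datetime import datetime, timedelta

start = datetime(2004, 7, 8, 11, 0, 0)

# What the reader really did (hypothetical): seconds spent on each page.
true_reading = [
    ("/index.html", 20),
    ("/features.html", 40),
    ("/pricing.html", 600),  # the last page, read carefully for ten minutes
]

# The server only logs the moment each page was requested.
request_times = []
t = start
for path, seconds in true_reading:
    request_times.append(t)
    t += timedelta(seconds=seconds)

reported = (request_times[-1] - request_times[0]).total_seconds()
actual = sum(seconds for _, seconds in true_reading)

print(f"reported time on site: {reported:.0f}s")  # 60s
print(f"actual time on site:   {actual}s")        # 660s
```

The ten minutes on the last page leave no trace, because nothing is requested when the reader finally closes the window or wanders off.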
The bottom line is that HTTP is a stateless protocol. That means that people don't log in and retrieve several documents: they make a separate connection for each file they want. And a lot of the time they don't even behave as if they were logged into one site. The world is a lot messier than this naive view implies. That's why the log file reports requests (hits), i.e. what is going on at the SearchTrax server, which you know, rather than guessing what the users are doing.
Defenders of counting visits etc. claim that these are just small approximations. I disagree. For example, almost everyone is now accessing the web through a cache. If the proportion of requests retrieved from the cache is 50% (a not unrealistic figure) then half of the users' requests aren't being seen by the servers.
Other defenders of these methods claim that they're still useful because they measure something which you can use to compare sites. But this assumes that the approximations involved are comparable for different sites, and there's no reason to suppose that this is true. And even once you've agreed on methodology, different users on different sites have different patterns of behaviour, which affect the approximations in different ways: for example, you will usually find different characteristics of weekday and weekend users at your site.
I've presented a somewhat negative view here, emphasising what you can't find out - as what you can't find out far, far outweighs what you can. However, web statistics are still informative: it's just important not to slip from "the SearchTrax site has received 100,000 hits" to "100,000 people have visited searchtrax.com". In some sense these problems are not really new to the web -- they are present just as much in print media too. For example, a publisher only knows how many of their magazines they've distributed, not how many people have read them. In print media we have learnt to live with these issues, using the data which are available, and it would be better if we did on the web too, rather than making up spurious numbers.
Hope that clears a few things up for some people,
Last edited by theUKdude; 8th July 2004 at 11:58 AM.