
Human and Robot visitors

27-01-2007

On occasion I like to browse the Detox server logs to check my stats and to make sure no problems are being reported. The logs tell me how many unique visitors I receive each day, week and month, and which web pages get the most traffic. Digging deeper, you can find content that visitors are looking for but not finding. This can be a file that was deleted, moved, or was never actually there due to a typo in a local or remote link.

This activity can prove very useful to a webmaster because it lets you actively reduce the amount of traffic your site loses. Take the 404 error for example. If several visitors arrive at content that does not exist, they receive a 404 error, a message informing them that there is no such page. This is where you can step in and fix the problem. If it is simply a typo in a link on one of your own pages, it can be easily corrected. If the link is on a remote server, you could contact the owner of that site and point out the typo. You could also create the page that is being linked to (at least temporarily, until the issue is resolved) and have it redirect to the correct content.

However, if the content no longer exists, you could add a replacement page, as I have. Recently I was receiving hits from a forum that discusses mobile phone technology. I had written an article on "drive-by blue jacking" several years ago when blue jacking was a hot topic in the media. It received a lot of hits at the time, but for my own reasons I removed the article and promptly forgot about it. Now, for whatever reason, the pages on this forum linking to my article were being visited and the link was being followed back to my site, but the article was no longer there. My server logs reported the errors, so I created a page with nothing more than a basic message saying that the content was not found, plus a link to similar content that might interest the visitor. That way their visit to my site did not have to end at an error page. Looking at my logs later, I could see that several visitors actually clicked the new link and browsed further content.
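Finding the missing pages in the first place comes down to counting 404 responses per requested path. Here is a minimal sketch of that idea in Python. It assumes an Apache-style access log line (the post does not say which log format the server writes), and the sample lines, IP addresses and the /articles/bluejacking.html path are made up purely for illustration.

```python
import re
from collections import Counter

# Assumed Apache-style access log line (an assumption, not taken from the post):
# 10.0.0.1 - - [27/Jan/2007:10:00:00 +0000] "GET /path HTTP/1.1" 404 209
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)


def count_missing(lines):
    """Return a Counter mapping each requested path to its number of 404s."""
    missing = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("status") == "404":
            missing[match.group("path")] += 1
    return missing


# Hypothetical sample data; in practice you would pass an open log file instead.
sample = [
    '10.0.0.1 - - [27/Jan/2007:10:00:00 +0000] "GET /robots.txt HTTP/1.0" 404 209',
    '10.0.0.2 - - [27/Jan/2007:10:05:00 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '10.0.0.3 - - [27/Jan/2007:10:07:00 +0000] "GET /articles/bluejacking.html HTTP/1.1" 404 209',
    '10.0.0.4 - - [27/Jan/2007:10:09:00 +0000] "GET /robots.txt HTTP/1.0" 404 209',
]

# Print the most frequently requested missing paths first.
for path, hits in count_missing(sample).most_common():
    print(hits, path)
```

Paths that appear near the top of the list are the ones worth fixing, redirecting or replacing first.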

The above tips are great for human visitors but what about robots? By robots I am of course referring to web crawlers and spiders, the programs search engines use to trawl through your site and index its content for their databases. While looking through my server logs I noticed that over 2 MB of the error log was devoted entirely to the fact that, on every single day of my server's operation over the last year, robots.txt could not be found. It was an interesting read: multiple IP addresses were looking for robots.txt in my root directory each day, and each time my server logged an error saying it was not found. What this also meant was that each robot, finding no robots.txt, then went on to trawl through my entire web site, and that eats into my bandwidth allowance. And this was going on every single day!

So I decided to create my very first robots.txt file. After a quick look online for information on how to create one, I had one on my server five minutes later. It is easy to make one yourself. Just open your favourite text editor (I use Notepad) and start with a header as follows:

User-agent: *

This tells any robot that reads the file that the rules which follow apply to it. You can create sections for specific robots, but I don't need one rule for one robot and another rule for a different robot. This one section applies to any robot that tries to index my site.

After that line you just add lines allowing or disallowing access to files and directories. If you are not specifying different rules for each robot type (such as Googlebot) then it's far easier to create a list of content that you do not want any robot visitor to look at. I specified the directories on my site that did not contain any web pages (html/php/asp/pl..) so that my bandwidth was not wasted by robots looking through my image and css directories hunting for content that was not there.

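To do this, just add a new Disallow line for every file or directory you want the robot to ignore. The Disallow lines go directly under the User-agent line, with no blank line in between, because a blank line ends the section.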

Disallow: /images
Disallow: /css

Now upload robots.txt to the root directory of your web site (e.g. public_html) and that's that. The amount of bandwidth used each day by well-behaved robots, web crawlers and spiders should be greatly reduced without any effect on your existing search engine listings.
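If you want to check the rules before relying on them, Python's standard urllib.robotparser module can read a robots.txt and report whether a given path would be allowed. The following is only a sketch: it uses the two Disallow lines from this post, while the test paths (/index.html, /images/logo.png, /css/site.css) and the is_allowed helper are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The complete file described in this post: one section for every robot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images
Disallow: /css
"""


def is_allowed(path, agent="*"):
    """Return True if the given robot may fetch the path under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical paths, used only to show which requests the rules block.
for path in ("/index.html", "/images/logo.png", "/css/site.css"):
    print(path, "allowed" if is_allowed(path) else "blocked")
```

One thing worth knowing is that a Disallow value is matched as a path prefix, so /images also blocks a page such as /images.html. Writing the rule as /images/ with a trailing slash limits it to the directory itself.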




Copyright 2006 detoxcomic.com