From Directions in Desktop Development (D3), a sister publication
Automatic software agents are already carrying out important tasks on the WWW, but they bring their own brand of problems.

If you’ve seen the film Star Wars often enough – say, three or four times – then you’ll remember the scene I’m about to describe. It’s the one where the heroes are aboard the Death Star, and C-3PO has just located a computer outlet socket. A straight-faced Ben Kenobi gestures to R2-D2 and says, “Plug in. He should be able to interpret the entire Imperial computer network.” Now nobody ever pretended that Star Wars was supposed to be cerebral, but that particular line always struck me as moronic. Don’t get me wrong, I like the film, but I’ve always cringed at that bit. It seems so hopelessly naive to expect this little dustbin robot to make sense of some huge network.

But it seems that I ought to stop cringing and give Kenobi his due, because this is exactly the sort of thing that robots do best, apparently. Sadly, the robots in question aren’t as cute as R2-D2 – they’re not even physical. They’re programs that automatically traverse the World Wide Web, navigating by following hyperlinks. This class of software robot (call them ‘softbots’ if you want to sound hip) is sometimes referred to as ‘wanderers’, ‘crawlers’, ‘ants’ or ‘spiders’. The colorful names should not be taken too literally, however, as they tend to give the mistaken impression that the program moves from site to site, like a worm or a virus. This is not the case. The robot must reside on a client machine and ‘visit’ sites by downloading information, just as a browsing human would.
By Lem Bingley
Web robots can do some very useful things. They can be designed to maintain a Website, clearing out the dead wood by periodically exploring links to ensure that referenced resources are still online. Or they can be built to explore a given range of sites repeatedly, giving an alert when there are updates or changes. But perhaps the most useful task that a robot can undertake is resource discovery. A software robot can be instructed to recursively follow hyperlinks – proceeding from one Website to another, building a database of useful things that it finds along the way – a tractor beam control unit here, an imprisoned princess there, perhaps. This approach, of using a robot to build an index of a part of the Web, has already been well exploited. WebCrawler, InfoSeek and Lycos – three of the most popular search services available on the Web – have databases compiled by robots that continuously seek out new Websites and new data to link to. Digital’s relatively new Alta Vista engine also uses the Web crawling technique.
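To make the mechanism concrete, here is a rough sketch in modern Python of the recursive link-following at the heart of resource discovery. The starting URL is invented, and a real robot would add the politeness measures and exclusion checks discussed later in this article.

    # Minimal sketch of a resource-discovery robot: fetch a page, harvest its
    # links, and repeat until a depth limit is reached. The start URL is purely
    # illustrative; a production robot would also honour robots.txt and pace
    # its requests.
    import re
    import urllib.request
    from urllib.parse import urljoin

    LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

    def crawl(start_url, max_depth=2):
        seen = set()                      # URLs already visited
        index = {}                        # URL -> raw HTML (a toy 'database')
        frontier = [(start_url, 0)]
        while frontier:
            url, depth = frontier.pop(0)
            if url in seen or depth > max_depth:
                continue
            seen.add(url)
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue                  # dead link: a maintenance robot would log this
            index[url] = html
            for href in LINK_RE.findall(html):
                frontier.append((urljoin(url, href), depth + 1))
        return index

    if __name__ == "__main__":
        pages = crawl("http://example.com/", max_depth=1)   # hypothetical start point
        print(len(pages), "pages indexed")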
Personal Softbots
Clearly, the Web is of such size and complexity that maintaining a useful index is a massive undertaking – both Lycos and Alta Vista store links to more than 16 million distinct URLs. But there is no reason why a personal robot could not be given a simplified version of the same task. It’s very tempting to exploit a robot’s ability to remorselessly keep on searching to meet our own, more specific needs. This situation has been described by future-visionaries from Arthur C Clarke to Nicholas Negroponte. Someday soon, we are told, Intelligent Agents will cruise around the information superhighway on our behalf, looking for bargains, compiling personalized newspapers, and searching for things of interest to fit our individual likes and dislikes. This picture seems tantalizingly close – it needs no great leap of imagination to jump from manual browsing to a Web populated with softbots that do the bulk of the browsing for us.

One company that has made an effort to explore the feasibility of this kind of automated assistant is Andersen Consulting. In June 1995 its C-STaR (Center for Strategic Technology Research) division launched what it hoped might become a progenitor of just this type of service. Its experimental shopping agent, called BargainFinder (BF), is not terribly refined (or particularly intelligent), but then the job it attempts is very specific. BF looks for CDs on behalf of Web users, who must fill out an online form stating the name of the album they wish to locate. Andersen’s automaton then visits the nine record outlets that it knows have searchable Web catalogs. It returns in short order with a digest of pricing and shipping details, making it a snip to pick the best deal.

BF expects no favors from the sites that it visits – it operates by downloading forms like the rest of us. But by automating the business of wading through all nine Websites, and by presenting a unified face, it makes things dramatically easier for the user. Indeed, BF is not really a robot at all but an alternative interface, according to Jeff Leane, former Research Associate at C-STaR and part of the team behind BF: “an intelligent interface [to help people] to use the Web productively”. That’s the theory. In practice, if you try the BargainFinder service today, you’ll be informed that several of the stores are blocking Andersen’s agent, and that you should browse through those sites yourself.
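For the curious, the pattern BF follows can be boiled down to a few lines. The sketch below is illustrative only: the store endpoints, query parameter and price markup are all invented, and the real service’s forms and page formats are not documented here.

    # Illustrative sketch of a BargainFinder-style comparison agent: submit the
    # same query to several catalogue sites and digest the results. Every URL,
    # query parameter and price pattern below is hypothetical.
    import re
    import urllib.parse
    import urllib.request

    STORES = {                                    # hypothetical store search endpoints
        "Store A": "http://store-a.example/search?album=",
        "Store B": "http://store-b.example/cgi-bin/find?title=",
    }
    PRICE_RE = re.compile(r"\$(\d+\.\d{2})")      # assumes prices appear as $xx.xx

    def find_bargains(album):
        quotes = []
        for store, base in STORES.items():
            url = base + urllib.parse.quote(album)
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    page = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue                          # treat unreachable stores as 'blocking'
            match = PRICE_RE.search(page)
            if match:
                quotes.append((float(match.group(1)), store))
        return sorted(quotes)                     # cheapest deal first

    for price, store in find_bargains("A New Hope: Original Soundtrack"):
        print(f"{store}: ${price:.2f}")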
Volume of Traffic
BF’s problems have actually been very instructive. Knowing that its agent needed no special facilities, Andersen saw no need to actively seek the permission of the stores which it targeted – in fact it viewed their reactions as part of the experiment. So the first the stores knew of it was when the same site kept turning up in their server logs – typically as often as 2,000 times a day. Several of the stores found this volume of requests too much to handle. “The key lesson of the BF project is that ordinary users are ready for agents on the Net, but many ‘leading’ businesses are not,” thinks Leane. Some stores welcomed BF, however: “Some stores not targeted by BF wrote us letters asking to be added, and took steps to make it easy for BF to visit. Those are the stores that are planning for success, not for failure.”

The volume of traffic at an individual server is not the only consideration, however; the Web at large has to carry the traffic to and fro. Andersen’s experiment is actually fairly modest. It goes out onto the Web with explicit knowledge of where it is going, and targets a fairly small number of sites. But imagine a next-generation agent that promises to shop for any item, by finding sites on the fly. A necessary element of this sort of service already exists in the form of a personal resource-discovery robot called the Fish Search – a C program distributed with X-windows versions of the Mosaic browser, so-called because it spawns streams of enquiry like fish in a school. A Fish-like shopping agent would expect to do a lot more browsing than BF, so would generate much larger volumes of traffic.

This kind of behavior prompted David Eichmann, an assistant professor of software engineering at the University of Houston, to comment in his 1994 paper Ethical Web Agents that agents “can be pests, generating substantial loads on already overloaded servers, and generally increasing Internet backbone traffic”. A future where every Web client is capable of generating large numbers of automated, rapid-fire requests is going to be unworkable without an infrastructure vastly superior to the one we have today. It is quite likely that a widespread adoption of Internet agents could yield a marked decrease in the speed with which useful information might be got from the Web, unless they are used wisely.
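Using robots wisely is largely a matter of pacing. Neither Eichmann nor the BF team prescribes a particular mechanism in the material quoted here, but one obvious mitigation is to throttle requests to any single host, along these lines:

    # Sketch of a 'polite' request scheduler: never hit the same host more often
    # than once every few seconds, however fast the robot generates URLs.
    # The five-second interval is an arbitrary illustrative choice.
    import time
    from urllib.parse import urlparse

    class PoliteFetcher:
        def __init__(self, min_interval=5.0):    # seconds between hits on one host
            self.min_interval = min_interval
            self.last_hit = {}                    # host -> time of last request

        def wait_turn(self, url):
            host = urlparse(url).netloc
            elapsed = time.time() - self.last_hit.get(host, 0.0)
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
            self.last_hit[host] = time.time()

    # Usage: call fetcher.wait_turn(url) immediately before each download in a
    # crawling loop such as the one sketched earlier in this article.
    fetcher = PoliteFetcher(min_interval=5.0)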
Robot Exclusion
Some of the earliest attempts at resource discovery showed us all exactly how not to use robots wisely. In 1994, Martijn Koster, who now works for America Online as part of the WebCrawler project, was visited twice by robots, one of which ran through his Website at such speed that the volume of requests brought his machine down. “That annoyed me sufficiently that I decided to do something about it,” recalls Koster. He has since been active in compiling guidelines for the responsible use of Web robots, and has led the effort to put together a proposal called the Standard for Robot Exclusion.
The proposed standard is designed to allow Webmasters to specify which parts of their sites should not be visited by robots. It allows discrimination between different robots, and uses a simple syntax of commands which must be contained within a file called ‘robots.txt’, stored at the root directory of the site’s Webspace. The standard has been workable and in action for a year or so, but the take-up so far has been slow. “The proposal has been adopted by robot authors, with one or two notable exceptions,” says Koster, “but most Webmasters don’t really have a clue. Digital has just released some numbers which indicate that less than 5% of sites actually have a robots.txt file. And I suspect that a large percentage of the people that do have the file have one that makes no sense. In implementing compliance for WebCrawler, I’ve noticed files that say things like ‘Who are you? Why do you keep asking for this file?’ So it’s not yet widely used and this indicates that most people don’t care, or don’t know that there are robots out there.”

In the long term, Koster aims to see the robot exclusion standard submitted to the IETF (Internet Engineering Task Force, one of the governing collectives that guides Web standards) so that it might become ‘official’ Web policy. In the interim, there are still plenty of areas which need more work. For example, the robots.txt approach, though simple, has severe limitations.
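The proposed format is simple enough to show in miniature. The file below is an invented example for a hypothetical site (the ‘ExampleBot’ name and the paths are assumptions), and the check uses the robots.txt parser in Python’s standard library rather than anything specific to WebCrawler:

    # An invented robots.txt in the proposed format, and a check against it
    # using Python's standard-library parser.
    from urllib.robotparser import RobotFileParser

    ROBOTS_TXT = """\
    # robots.txt for http://www.example.com/
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/

    User-agent: ExampleBot
    Disallow: /
    """

    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())

    # ExampleBot is banned outright; other robots may fetch public pages only.
    print(parser.can_fetch("ExampleBot", "http://www.example.com/index.html"))    # False
    print(parser.can_fetch("AnyOtherBot", "http://www.example.com/index.html"))   # True
    print(parser.can_fetch("AnyOtherBot", "http://www.example.com/private/x"))    # False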
Stumbling Block
A big stumbling block is that many sites are maintained through third-party ISPs (Internet Service Providers), so individual Webmasters don’t always have the opportunity to provide a robots.txt file at the root of the URL-space. In these cases, the ISP may not oblige by compiling a composite file to cover all of its customers. A related problem appears where a site is particularly complex. “If you have, say, a server with 10,000 documents on it,” comments Koster, “and there are 5,000 documents all over the site that you don’t want robots to access, you can end up with a robots.txt file that is several thousand entries long. Now nobody is going to want to write that in the first place, and the robot author won’t want to comply.” What’s needed is a way of controlling access on a per-document basis, but unfortunately that doesn’t fit within the current format of the standard. “It fits more into an HTML ‘meta’ tag or something like that,” remarks Koster (a rough sketch of this idea appears at the end of this section). He hopes that further work in this area will be undertaken as part of the WebCrawler project.

Of course robot exclusion is a voluntary process – it has to be. There is no way to prevent a robot from browsing a site if people are still to be granted access. Perhaps the best way to encourage responsible robots is to turn the current proposal on its head. “Telling robots where they shouldn’t go is okay, but telling robots where they can go would be better,” Koster observes. “And people are likely to be more prepared to put effort into getting their site visited than into making parts of their site out of bounds. This would be an extra push to help make the robots.txt file more used, more known, and more prevalent,” he adds.

Many of the problems associated with Web robots stem from the fact that they must operate in an environment designed for people, not machines. A better solution to the problem of robot access might be to provide not just a file full of instructions, or a few HTML tags, but a scheme whereby much more of the Web is designed with machine-reading in mind. Tim Berners-Lee, the author of the initial Web protocols, addressed this subject last November, when in Paris for the launch of the European branch of the W3C (the World Wide Web Consortium). “For a long time we [the W3C] have been interested in making relationships between objects on the Web more explicit – so that a computer can process them, so that a computer can understand the relationships – between a person, a house, and a document that says they own the house, say – so that computers can start to help us more,” said Berners-Lee. “This is something which we’ve always wanted to do, but there is very little industry pressure to do it. It’s something we have to put on our long-term agenda, and try to do as we do all the short-term things.”

But perhaps the pressure will build, because as Jeff Leane points out, robots operated in a proactive environment could actually decrease network loads and increase efficiency. “Agent-assisted interactions clearly are good for users, but they could also be good for merchants and for Net traffic, because their use of resources can be made much more efficient than a human’s. For example, one of the more outspoken agent-blocking stores had as many as two dozen images on its home page – that’s two dozen separate http ‘gets’ and you haven’t even entered the store yet. In contrast, one store in the BF project modified its interface, so a BF visit uses a single http connection and takes only one second.”
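Returning to the per-document idea Koster raises above, a robot could look for an in-page hint before indexing a document or following its links. The tag name and the ‘noindex’/‘nofollow’ directives below are illustrative assumptions about the general shape such a hint might take, not part of the current exclusion proposal:

    # Sketch of per-document robot control: scan a fetched page for a 'robots'
    # meta tag and honour 'noindex' / 'nofollow' style directives. The directive
    # vocabulary here is illustrative, not taken from the exclusion proposal.
    import re

    META_RE = re.compile(
        r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']+)["\']',
        re.IGNORECASE)

    def robot_directives(html):
        match = META_RE.search(html)
        directives = {d.strip().lower() for d in match.group(1).split(",")} if match else set()
        return {
            "index":  "noindex" not in directives,   # may this page join the database?
            "follow": "nofollow" not in directives,  # may its links be traversed?
        }

    page = '<html><head><meta name="robots" content="noindex, follow"></head>...</html>'
    print(robot_directives(page))   # {'index': False, 'follow': True}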
Objective Picture
One thing is certain. As the Web grows, the databases that we currently rely on to guide our own, human browsing – databases like WebCrawler and Lycos – will find it harder and harder to maintain their accuracy. Client-side robots may well become the only way for a user to build an objective picture of the relevant sites within a realistic period. “In this world of little free time and countless sites competing for attention,” argues Leane, “agents could assure the merchant that users really will see at least some of what he has to offer.” For the mature Webmaster, Leane reckons, the robotic agents will not be enemies but allies, fighting to get the message across despite a brutal empire of information overload.

More information about softbots, and the proposed Standard for Robot Exclusion, can be found via Martijn Koster’s Website: http://info.webcrawler.com/mak/projects/robots/robots.html. Andersen Consulting’s BargainFinder is at: http://bf.cstar.ac.com/bf/