March 25, 1996

BEHIND THE SCREENS: THE ALTA VISTA INTERNET SEARCH ENGINE

By CBR Staff Writer

From D3 (Directions in Desktop Development), a sister publication

Like all the world’s pivotal innovations, Alta Vista started life on the back of a napkin. Just about a year ago, Louis Monier and Paul Flaherty, both engineers at Digital’s Palo Alto research labs, sat down to lunch and got talking about big numbers. The newspapers were full of Internet stories at the time, and editorials were predicting that the amount of information online would soon be too much to imagine, much less quantify. Meanwhile, Digital had just launched the Turbo Laser, an Alpha-based server with a 64-bit address bus – theoretically capable of addressing 17 billion gigabytes.
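The "17 billion gigabytes" figure can be checked with a few lines of arithmetic. The sketch below assumes a flat, byte-addressed 64-bit space and binary gigabytes (2^30 bytes); both are assumptions made for illustration rather than details given in the article.

```python
# Size of a flat 64-bit, byte-addressed space expressed in gigabytes.
ADDRESS_BITS = 64        # width of the address, as quoted in the article
BYTES_PER_GB = 2 ** 30   # assumption: binary gigabytes

addressable_bytes = 2 ** ADDRESS_BITS
addressable_gb = addressable_bytes // BYTES_PER_GB  # equals 2 ** 34

print(f"{addressable_gb:,} GB")  # 17,179,869,184 GB, roughly 17 billion
```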

By Lem Bingley

“We just had this crazy idea,” recalls Monier, “of putting two and two together.” Twelve months have since passed and the idea – crazy or otherwise – has become a reality. Alta Vista is now online at https://altavista.digital.com. Its Turbo Laser, an eight-processor AlphaServer 8400 5/300 with a massive 6GB memory and 210GB of RAID, provides the largest full-text searchable index currently available on the Web.

Scooter the spider

Monier, now principal engineer on the project, spent much of last summer fashioning a web crawler capable of retrieving the contents of the entire Net. His scratch-built design, called Scooter, is a multi-threaded spider capable of retrieving as many as 1,000 documents simultaneously. It runs from a single Alpha workstation (a DEC 3000 Model 900 with 1GB of memory and 30GB of RAID) at Palo Alto, and has been designed to be a good web citizen – it obeys the Standard for Robot Exclusion (see D3 p39, last month) and avoids hitting the same site repeatedly.
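The two "good citizen" behaviours mentioned above, honouring robot exclusion rules and not hitting the same site repeatedly, can be illustrated with a short sketch. This is not Scooter's code (which the article says was written in C); the class name `PolitenessGate`, the user-agent string and the 60-second delay are all hypothetical. It uses Python's standard `urllib.robotparser` module and works on robots.txt text supplied by the caller, so it makes no network requests.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleSpider"   # hypothetical crawler name
MIN_DELAY_SECONDS = 60.0       # hypothetical gap between hits on one host


class PolitenessGate:
    """Decides whether a crawler may fetch a URL right now."""

    def __init__(self, min_delay=MIN_DELAY_SECONDS, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self.rules = {}       # host -> parsed robots.txt rules
        self.last_fetch = {}  # host -> time of the most recent fetch

    def load_rules(self, host, robots_txt):
        """Parse robots.txt text that was already retrieved for this host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def may_fetch(self, url):
        host = urlsplit(url).netloc
        parser = self.rules.get(host)
        # Simplification: a host with no loaded rules is treated as unrestricted.
        if parser is not None and not parser.can_fetch(USER_AGENT, url):
            return False  # excluded by the site's robots.txt
        last = self.last_fetch.get(host)
        if last is not None and self.clock() - last < self.min_delay:
            return False  # this host was visited too recently
        self.last_fetch[host] = self.clock()
        return True
```

A multi-threaded crawler would also need a lock around `last_fetch`; that is omitted here to keep the sketch short.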

Prototype index

While work on Scooter was under way, Monier hooked up with Mike Burrows, a fellow researcher at Palo Alto. Burrows had developed a prototype indexing technology as part of another project, and this proved crucial. Monier describes it as “probably the fastest and best indexing technology in the world.” The indexer, which runs on the Turbo Laser, can handle about 1GB of text per hour, building a database that preserves the full text of the pages it has read. The indexer is the main bottleneck of the whole process; Scooter could actually run much faster. The indexer has so far processed around 100GB – retrieved from around 22 million pages of text. The resulting index is around 33GB. “The fact that we provide a full-text search is the biggest factor in keeping it so big,” Monier says. A full-text index allows a number of techniques not possible by other means, such as searching for names like ‘John Smith’ while ignoring documents that just happen to contain the words ‘John’ and ‘Smith’, or finding specific phrases or juxtaposed words.

Unlike most of the other search engines on the Web, Alta Vista also tackles Usenet newsgroups. An AlphaStation 250 4/266, with 13GB disk and 196MB memory, keeps an up-to-date index of all the newsgroups. Since the Usenet postings form a collection that is constantly changing, this indexer is kept extremely busy, even though the index itself is much smaller than the main index. A news server at the Alta Vista site will serve Usenet postings using HTTP, so that a user can link to an article within a normal browsing session.

The whole project, including Scooter, the indexer, and a custom-built multi-threaded Web server, has been written in C, under Digital Unix. “We didn’t dare use C++,” confesses Monier, mainly because of the relatively untried nature of the C++ standard template libraries. The complete system came together during autumn 1995. Scooter was run, and a large index of part of the Web was built.
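The ‘John Smith’ example describes phrase matching: a document should match only when the words appear next to each other. One common way to support this is a positional inverted index, which records where each word occurs in each document. The sketch below illustrates that general technique only; the article does not describe how Burrows' indexer works internally, and the names `PositionalIndex`, `add` and `phrase` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class PositionalIndex:
    """Toy full-text index: word -> {doc_id -> [word positions]}."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text):
        for position, word in enumerate(tokenize(text)):
            self.postings[word][doc_id].append(position)

    def phrase(self, query):
        """Return ids of documents containing the query words consecutively."""
        words = tokenize(query)
        if not words:
            return set()
        hits = set()
        # .get() avoids creating empty entries for words never indexed.
        for doc_id, starts in self.postings.get(words[0], {}).items():
            for start in starts:
                if all(
                    start + offset in self.postings.get(word, {}).get(doc_id, ())
                    for offset, word in enumerate(words[1:], start=1)
                ):
                    hits.add(doc_id)
                    break
        return hits


index = PositionalIndex()
index.add(1, "John Smith wrote the report")
index.add(2, "John met Anna Smith yesterday")
print(index.phrase("John Smith"))  # only document 1 has the words adjacent
```

Storing a position for every occurrence of every word is also a plausible reason why a full-text index stays large relative to the text it covers.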
Scooter was brought online for the Web at large, with little fanfare, on December 15th. Within three weeks it was serving two million queries per day. Scooter was sent out again in January, and this time 21 million pages were indexed. At present, the service is answering around four million queries a day – all of which are handled by a single AlphaStation 250 4/266, with 256MB memory and 4GB disk. The system manages an average response time of less than half a second per query – not bad for a service which, according to Monier, is still very much in the Beta phase.

“Once we have settled down some more, we will run the spider continuously as it was designed to be run,” Monier hopes. When that happens, Alta Vista will finally be in a position to attempt a full sweep – gathering in the whole Web. Monier reports that Scooter knows of the existence of more than 40 million pages, and has found 150,000 HTTP servers – a record number. When running non-stop, Scooter won’t simply carry out a round-robin search of those servers; it will use an adaptive schedule to prioritize visits to Websites which change most often (making use of ‘last modified’ lines where possible). This should make Alta Vista not just the most complete, but also the most up-to-date index available.

Above all, Monier just can’t disguise his pleasure at having done something that nobody believed possible. “Before Alta Vista,” he says proudly, “everyone just knew that you couldn’t index the whole of the Internet. You just didn’t even think about it.” Now, of course, we know differently.
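The adaptive schedule described above can be pictured as a priority queue in which each site's revisit interval shrinks when its ‘last modified’ value has changed and grows when it has not. The sketch below is one simple way to do that, not Scooter's actual algorithm, which the article does not detail. The halving and doubling rule, the interval bounds and all names are assumptions.

```python
import heapq

MIN_INTERVAL = 1.0    # days; hypothetical lower bound
MAX_INTERVAL = 64.0   # days; hypothetical upper bound


class AdaptiveSchedule:
    """Revisits sites that change often more frequently than static ones."""

    def __init__(self):
        self.queue = []          # heap of (next_visit_time, site)
        self.interval = {}       # site -> current revisit interval in days
        self.last_modified = {}  # site -> Last-Modified value seen last time

    def add(self, site, now=0.0, interval=8.0):
        self.interval[site] = interval
        heapq.heappush(self.queue, (now, site))

    def next_site(self):
        """Pop the site that is due soonest, as (due_time, site)."""
        return heapq.heappop(self.queue)

    def record_visit(self, site, now, last_modified):
        previous = self.last_modified.get(site)
        if previous is not None:
            if last_modified != previous:
                # Changed since the last visit: come back sooner.
                self.interval[site] = max(MIN_INTERVAL, self.interval[site] / 2)
            else:
                # Unchanged: back off.
                self.interval[site] = min(MAX_INTERVAL, self.interval[site] * 2)
        self.last_modified[site] = last_modified
        heapq.heappush(self.queue, (now + self.interval[site], site))
```

With a plain round-robin every site would be due at the same interval; here a site whose timestamp keeps changing drifts toward the minimum interval while a static one drifts toward the maximum.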

D3 https://www.computerwire.com
