By Rachel Chalmers
A new study suggests that the leading search engines, taken together, have indexed less than half the pages published on the world wide web. It also appears that search engines do not treat all sites equally, and that new pages may not appear on any search engine for months. In an article published this week in Nature, Steve Lawrence and C. Lee Giles of the NEC Research Institute in Princeton, New Jersey, observe that the web is becoming a major communications medium. They call for the data on it to be made more accessible.
Lawrence and Giles have estimated that the web now contains around 800 million pages. That’s 6 terabytes of data distributed across nearly 3 million servers. Last time the two measured the web, for a December 1997 article in Science, they came up with a figure of 320 million pages. But the difference may not be directly attributable to the growth of the web, since Lawrence and Giles used different methods to arrive at the two estimates. For the Science article in 1997, they used a variation of the capture-recapture method for estimating populations. They captured sites with one search engine and tried to recapture them with another search engine, then estimated the total number of pages based on the overlap.
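For readers unfamiliar with the technique, the standard form of capture-recapture is the Lincoln-Petersen estimator: capture one sample, capture a second, and scale up by the size of the overlap. The sketch below illustrates only that calculation; the query counts are made up and have nothing to do with the study's actual data.

```python
# Illustrative sketch of the capture-recapture (Lincoln-Petersen) idea behind
# the 1997 Science estimate. The numbers below are hypothetical; only the
# formula reflects the method described above.

def capture_recapture_estimate(captured_a, captured_b, overlap):
    """Estimate a total population from two overlapping samples.

    captured_a: pages returned by search engine A
    captured_b: pages returned by search engine B
    overlap:    pages returned by both engines
    """
    if overlap == 0:
        raise ValueError("No overlap: the estimate is unbounded")
    return captured_a * captured_b / overlap

# Hypothetical counts: engine A finds 600 pages for a set of queries,
# engine B finds 400, and 150 pages appear in both result sets.
print(capture_recapture_estimate(600, 400, 150))  # ~1600 pages in total
```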
For the new study, the researchers chose random internet protocol (IP) addresses and tested each for a web server at the standard HTTP port. Under IP version 4, there are about 4.3 billion possible addresses (IPv6 will greatly increase the address space). Based on the fraction of random tests that successfully located a server, Lawrence and Giles calculated that there are 16 million web servers. Discounting servers that are not publicly accessible – for example, those behind firewalls, those that return only a default page, and printers, routers and so forth – they concluded that there are 2.8 million servers on the publicly indexable web.
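The sampling procedure itself is simple enough to sketch. What follows is an illustrative approximation, not the researchers' actual test harness: it probes randomly chosen IPv4 addresses on port 80 and scales the hit rate up to the full address space, skipping the filtering of firewalled hosts, default pages, printers and routers described above.

```python
# Minimal sketch of random-IP sampling: pick IPv4 addresses uniformly at
# random, try to connect to port 80, and scale the observed hit rate up to
# the ~4.3 billion possible addresses. A real survey must also discount
# servers that are not publicly indexable, which this sketch does not do.
import random
import socket

def has_web_server(ip, timeout=2.0):
    """Return True if something accepts a TCP connection on port 80."""
    try:
        with socket.create_connection((ip, 80), timeout=timeout):
            return True
    except OSError:
        return False

def estimate_server_count(samples=10_000):
    hits = 0
    for _ in range(samples):
        ip = ".".join(str(random.randint(0, 255)) for _ in range(4))
        if has_web_server(ip):
            hits += 1
    # Scale the fraction of successful probes to the full IPv4 space.
    return (hits / samples) * 2**32
```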
Next the researchers had to calculate the average number of pages per server. They did this by choosing 2,500 servers at random and taking the mean number of pages on each: it was 289. Multiply that by 2.8 million and you get 809.2 million pages – call it a round 800 million. Finally, to estimate the total coverage provided by various search engines, Lawrence and Giles obtained the number of pages indexed by each engine from the engines themselves. At 128 million pages, Northern Light covers 16% of the known web, with AltaVista and Snap a hair's breadth behind it at 15.5%. HotBot has 11.3%, Microsoft 8.5%, Infoseek 8.0%, Google 7.8%, Yahoo 7.4%, Excite 5.6%, Lycos 2.5% and EuroSeek a mere 2.2%.
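The arithmetic is straightforward; the snippet below simply reproduces the figures quoted above, using the rounded 800 million total as the base for the coverage percentage.

```python
# The arithmetic behind the paragraph above: mean pages per server times the
# number of publicly indexable servers gives the size estimate, and an
# engine's index size divided by the (rounded) total gives its coverage.
mean_pages_per_server = 289
public_servers = 2.8e6

web_size = mean_pages_per_server * public_servers
print(f"{web_size / 1e6:.1f} million pages")              # 809.2 million

rounded_web_size = 800e6
northern_light_index = 128e6                              # pages indexed
print(f"{northern_light_index / rounded_web_size:.1%}")   # 16.0% coverage
```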
“Our results show that the search engines are increasingly falling behind in their efforts to index the web,” the researchers concluded. It’s not all bad news, however. “The overlap between the engines remains relatively low,” they noted; “combining the results of multiple engines greatly improves coverage of the web during searches.” Giles and Lawrence estimate that, between them, the eleven search engines studied have indexed 335 million pages, or 42% of the web.
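The case for combining engines follows directly from the low overlap: the union of several partially overlapping indexes covers more of the web than any single one. The sketch below makes the point with hypothetical URL sets; only the 335-million and 800-million figures come from the study.

```python
# Why low overlap helps metasearch: the union of partially overlapping
# result sets covers more pages than any single engine. The per-engine
# sets below are hypothetical.

def combined_results(result_sets):
    """Union of per-engine result sets (each a set of URLs)."""
    combined = set()
    for results in result_sets:
        combined |= results
    return combined

engine_a = {"url1", "url2", "url3"}
engine_b = {"url3", "url4"}
engine_c = {"url5"}
print(len(combined_results([engine_a, engine_b, engine_c])))  # 5 distinct pages

# At the scale of the study: eleven engines together indexed ~335 million
# of ~800 million pages, i.e. roughly 42% of the publicly indexable web.
print(f"{335e6 / 800e6:.0%}")  # 42%
```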
So will the search engines use the researchers’ method of random sampling of IP addresses to increase their coverage? “Maybe they’re doing that already,” Giles told ComputerWire. Or maybe not: “It’s not evident that the search engines want to fill out their coverage,” Giles observed. “They want to maximize the number of people who come to their site. They want to cater to the maximum number of people.” That means indexing the most popular information – thus creating a niche for specialized search engines and portals to dig out the information the big players no longer bother with.