AT&T Co’s Bell Laboratories in Murray Hill, New Jersey is claiming a breakthrough in software fault tolerance. As the world comes to depend more and more on telecommunications and computer networks, AT&T is striving ever harder to pre-empt or avoid the kinds of breakdowns that have plagued its long-distance network and caused chaos in parts of the US over the past couple of years. The researchers' target in this case was to enable AT&T's systems to recover from transient errors and other faults in the software.
The developments at Bell Labs are two software components designed to run under Unix, called watchd and libft. The components provide automatic on-line retry as a way to achieve high system availability, AT&T says, adding that software fault tolerance can be provided in an application whether or not the underlying hardware or operating system is fault-tolerant. It is also sometimes more economical to provide fault tolerance in the software than in the hardware. Watchd and libft are designed to separate fault detection, process restart and volatile data recovery facilities from the application functions, and are claimed to provide different levels of fault tolerance with minimal effort and high flexibility.
Watchd is composed of distributed algorithms that can run on a single machine or a network of machines. It is designed to watch the life of an application continuously, and when it detects that the application has crashed or hung, it restarts that application either in its initial internal state or at the point at which data was last saved. Libft is a library of C language functions that can be used in application programs to specify and checkpoint critical data, recover the checkpointed data, log events, handle exceptions, and do N-version programming, a technique in which several independently developed versions of the same function are run and their results compared, so that a fault in one version can be masked by the others.
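The watch-and-restart idea described above can be illustrated with a small sketch. This is not AT&T's code and does not use the real watchd or libft interfaces, which the article does not describe. It is written in Python rather than C for brevity, and every name in it (`WORKER`, `watch`, `demo`, the JSON checkpoint file) is a hypothetical stand-in. It assumes a worker that saves its critical data after each step and a supervisor that reruns the worker after an abnormal exit or a timeout.

```python
import json
import os
import subprocess
import sys
import tempfile

# Hypothetical worker: counts to 5, checkpointing after every step.
# On its first run only, it simulates a transient fault by exiting at 3.
WORKER = r"""
import json, os, sys

path, attempt = sys.argv[1], int(sys.argv[2])

# Recover the last checkpoint if one exists, else use the initial state.
state = {"count": 0}
if os.path.exists(path):
    with open(path) as f:
        state = json.load(f)

while state["count"] < 5:
    state["count"] += 1
    # Checkpoint: write a temp file, then rename it into place, so a
    # crash never leaves a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)
    if attempt == 0 and state["count"] == 3:
        os._exit(1)  # simulated crash
"""


def watch(checkpoint_path, max_restarts=3, timeout_s=30):
    """Run the worker, restarting it after a crash or hang.

    Returns the number of restarts that were needed.
    """
    for attempt in range(max_restarts + 1):
        try:
            result = subprocess.run(
                [sys.executable, "-c", WORKER, checkpoint_path, str(attempt)],
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            continue  # treated as a hang; run() has already killed the child
        if result.returncode == 0:
            return attempt
    raise RuntimeError("worker kept failing")


def demo():
    """Supervise one worker run; return (restarts, final count)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "state.json")
        restarts = watch(path)
        with open(path) as f:
            return restarts, json.load(f)["count"]
```

Calling `demo()` returns `(1, 5)`: the worker crashes once at 3, is restarted, resumes from the saved count instead of starting over, and finishes at 5. The sketch runs on a single machine and uses a simple timeout to detect hangs, so it says nothing about how watchd's distributed algorithms work across a network.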
AT&T says that it is already using the two components in its New Generation Testing system, which troubleshoots toll-free 800 number services and can pinpoint problems on up to 42 lines simultaneously. Bell Labs notes that fault tolerance is common in hardware and in operating systems but generally too costly for many software systems. It says that, as far as it knows, these are the first general-purpose software modules of this kind, and that they set a trend toward low-cost fault tolerance in user-level software. AT&T says that it is now exploring the possibility of marketing the components externally.