Thursday, August 23, 2007

Fault Tolerance: COSA vs. Erlang

When Joe Armstrong and his colleagues designed the concurrent programming language Erlang, their main intention was to provide a tool with which to create software systems that are robust and impervious to local failures. Failure localization is an absolute must for high-availability applications. However, the difference between the COSA and the Erlang philosophies regarding failure localization is telling.

Armstrong wrote in his thesis that “large systems will probably always be delivered containing a number of errors in the software, nevertheless such systems are expected to behave in a reasonable manner.” It is clear that Erlang’s creators were mainly concerned with finding a way to prevent hidden software errors from crashing mission-critical systems. The COSA philosophy, by contrast, is that software should never fail and that fault tolerance should be needed only to prevent catastrophic failure in the case of malfunctioning hardware (sensors, effectors, etc.). In this vein, mission-critical systems should incorporate redundancy as an additional preventive measure.
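To make the idea of failure localization concrete, here is a minimal sketch in Python. It is only a sequential approximation of the "contain the crash and restart the worker" pattern, not Erlang's actual process and supervisor machinery, and the names supervise and make_reciprocal_worker are hypothetical ones chosen for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Worker = Callable[[float], float]


def supervise(make_worker: Callable[[], Worker],
              messages: Iterable[float]) -> Tuple[List[float], int]:
    """Feed messages to a worker; if it crashes, replace it and carry on.

    Hypothetical helper: the failure stays local to one message instead of
    taking down the whole loop.
    """
    worker = make_worker()
    results: List[float] = []
    crashes = 0
    for msg in messages:
        try:
            results.append(worker(msg))
        except Exception:
            # The error is contained here: count it and start a fresh worker.
            crashes += 1
            worker = make_worker()
    return results, crashes


def make_reciprocal_worker() -> Worker:
    """Build a worker with a hidden bug: it fails on a zero input."""
    def worker(x: float) -> float:
        return 1.0 / x
    return worker


results, crashes = supervise(make_reciprocal_worker, [1, 0, 4])
```

In this sketch the zero input crashes one worker, yet the messages before and after it are still processed. That is the Erlang-style guarantee in miniature: the bug is still there, but its effect is contained.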

I really have nothing against Erlang's use of pervasive concurrency and asynchronous messaging to create fault-tolerant software systems. COSA uses the same techniques. The reason that I keep hammering at the software reliability problem is that this is a crucial issue for the software and hardware industries. Fault tolerance is not nearly enough. In some applications, safety is so critical that any software failure, even if localized, is not an option. Unless and until a solution is found, unreliability will continue to put an upper limit on the complexity of our software systems. As an example, we could conceivably be riding in self-driving vehicles right now, but concerns over reliability, safety and high development costs will not allow it. As a result, over 40,000 people die every year in traffic accidents in the U.S. alone. Something must be done!

The solution will require a radical change in both software and CPU architectures. Now is the right time to make the switch as the industry is turning a corner: computing is going parallel. We should use the opportunity to abandon the old algorithmic approach and adopt a non-algorithmic paradigm. This is the main goal of Project COSA.
