Wednesday, August 22, 2007

The Problem with Erlang

The problem with Erlang and other functional programming languages is that their designers assume that unreliability is an essential part of complex software systems. Joe Armstrong, the main inventor of Erlang, wrote the following at the beginning of his thesis “Making Reliable Distributed Systems in the Presence of Software Errors”:

How can we program systems which behave in a reasonable manner in the presence of software errors? This is the central question that I hope to answer in this thesis. Large systems will probably always be delivered containing a number of errors in the software, nevertheless such systems are expected to behave in a reasonable manner.

I apologize to Joe for using his thesis as an example, but it seems that the entire computer industry has swallowed Fred Brooks’ “No Silver Bullet” arguments regarding the relationship between complexity and reliability, hook, line and sinker. This is very unfortunate because, as I explain on the Silver Bullet page, Brooks’ arguments are fundamentally flawed. The COSA philosophy is that unreliability is not an essential characteristic of complex software systems. As I wrote in a previous article, there are situations where safety or uptime is so critical that not even extreme reliability is good enough. In such cases, unless a program can be guaranteed 100% reliable, it must be considered defective and should not be deployed. Achieving rock-solid reliability in a complex software system is not impossible.

First and foremost (and this is what is missing in Erlang and other concurrent languages), the timing of all processes in the system must be 100% deterministic and must be based on change. Nothing should happen unless something changes. Second, the system must automatically enforce the resolution of data/event dependencies in order to eliminate blind code. The only way to achieve these two goals is to adopt a software model that is based on elementary, concurrent, non-algorithmic, synchronous, communicating processes. In COSA, these concurrent processes are called cells, of which there are only two types: sensors and effectors. The main difference between a COSA cell and a concurrent process in other languages is that COSA cells are synchronous, meaning that their execution times are equal. Every COSA cell executes its elementary function in exactly one system cycle. This results in deterministic timing, a must for reliable software.
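To make the idea concrete, here is a minimal sketch of the cell model described above. It is a hypothetical illustration, not the COSA specification: the class names (`Sensor`, `Effector`), the `signals` dictionary, and the lockstep scheduler are all assumptions made for the sake of the example. The two properties it demonstrates are the ones the paragraph names: every cell executes its elementary function exactly once per system cycle, in a fixed order (deterministic timing), and a sensor fires only when the value it watches actually changes (change-driven behavior).

```python
# Hypothetical sketch of the cell model (illustrative names, not COSA's API).

class Cell:
    """Base class: a cell executes at most one elementary function per cycle."""
    def cycle(self, signals):
        raise NotImplementedError

class Sensor(Cell):
    """Fires only when the value it watches changes (change-driven)."""
    def __init__(self, source, key):
        self.source, self.key = source, key
        self.last = None
    def cycle(self, signals):
        value = self.source[self.key]
        if value != self.last:          # nothing happens unless something changes
            self.last = value
            signals[self.key] = value   # signal the change to effectors

class Effector(Cell):
    """Performs one elementary action when its input signal is present."""
    def __init__(self, key, action):
        self.key, self.action = key, action
    def cycle(self, signals):
        if self.key in signals:
            self.action(signals[self.key])

def run(cells, n_cycles):
    """Lockstep scheduler: every cell gets exactly one cycle per tick,
    always in the same order, so timing is fully deterministic."""
    for _ in range(n_cycles):
        signals = {}
        for cell in cells:
            cell.cycle(signals)

# Usage: an effector that reacts only to changes in "temp".
env = {"temp": 20}
log = []
cells = [Sensor(env, "temp"), Effector("temp", log.append)]
run(cells, 3)       # temp changes once (unset -> 20), then stays constant
env["temp"] = 25
run(cells, 2)       # one more change, so the effector fires exactly once more
```

After both runs, `log` holds `[20, 25]`: five cycles elapsed, but the effector fired only on the two cycles where the sensed value changed.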

Note: There is a difference between synchronous messaging and synchronous processing. COSA and most concurrent languages use asynchronous messaging, i.e., the message sender does not have to wait for a reply before continuing its execution.
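The asynchronous-messaging part of that distinction can be sketched in a few lines. This is a generic illustration using a thread and a queue (the `mailbox` name and the sentinel-based shutdown are assumptions for the example, not anything from COSA or Erlang): the sender enqueues a message and continues immediately, without waiting for the receiver to reply.

```python
import queue
import threading

mailbox = queue.Queue()   # illustrative stand-in for a process mailbox
received = []

def receiver():
    """Drains the mailbox until it sees the shutdown sentinel."""
    while True:
        msg = mailbox.get()
        if msg is None:           # sentinel: stop receiving
            break
        received.append(msg)

t = threading.Thread(target=receiver)
t.start()
mailbox.put("hello")   # returns at once; the sender never blocks for a reply
mailbox.put("world")   # sender keeps executing regardless of the receiver
mailbox.put(None)      # shutdown sentinel
t.join()
```

Synchronous messaging, by contrast, would have the sender block after each `put` until the receiver acknowledged the message; synchronous *processing*, in the COSA sense, is about every cell taking exactly one system cycle, which is independent of either messaging style.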

2 comments:

Keith Sader said...

I think there's a conflation between software systems and distributed applications.

Erlang was designed, in part, to get around the problems described by Peter Deutsch.

I don't see, from my very brief skimming of the in-depth COSA site, where COSA manages to solve these fundamental problems of distributed computing.

Thanks.

David Hopwood said...

"As I wrote in a previous article, there are situations where safety or uptime is so critical that not even extreme reliability is good enough. In such cases, unless a program can be guaranteed 100% reliable, it must be considered defective and should not be deployed. Achieving rock-solid reliability in a complex software system is not impossible."

I write safety-critical software, and I don't find your assertion that less than 100% reliable software should not be deployed to be realistic. There are defects that can and must be found before deployment, and there are defects that it is impossible to find before deployment, because without having demonstrated a behaviour to the customer or end-users, that behaviour is not known to be defective.

In a system that has had proper pre-deployment testing (using simulations, etc.), at least half of the remaining defects should be in the category of "working as designed, but not as intended". While better implementation techniques and tools are always welcome, they won't fix defects in this category.

I can well believe that it is possible to improve programming tools and languages so that more than 90% of remaining defects in an initially deployed system would be of the "working as designed, but not as intended" type. That still leaves the issue of how you ensure fault isolation and safety in the face of faults arising from those defects, and from hardware failure. That is the problem that software fault tolerance techniques (among other things, such as hardware watchdogs and safety circuits) are designed to address.

So, I believe your criticism of Erlang is misplaced. Actually, I can't see any coherent criticism of Erlang itself here -- only of one possible motivation behind part of its design.