Tuesday, September 11, 2007

Why Timing Is the Most Important Thing in Computer Programming

An Analogy

Architects, carpenters, and mechanical and civil engineers expect things to have specific sizes. If, during construction, the sizes of parts are found to differ from what was specified, or the parts fail to fit properly with one another, a search is conducted to find and correct the cause of the discrepancy. In this article, I will argue that size (distance) is to architecture what timing is to computing. In other words, deterministic timing is essential to software reliability.

Deterministic Timing in Reactive Concurrent Systems

In a von Neumann computer, it is unrealistic to expect every operation of a program to occur at a specific time relative to a real-time clock, because operations must wait their turn to be processed by the CPU. The problem is twofold: first, the CPU load varies from moment to moment; second, in algorithmic software it is impossible to predict the duration of every subroutine in an average program. However, it is possible to simulate a parallel, signal-based reactive system driven by a virtual system clock. In such a system, every operation is required to be purely reactive, that is to say, it must execute within one system cycle immediately upon receiving its signal. These two requirements (every elementary operation is reactive and is processed in one cycle) are sufficient to enforce deterministic timing in a program relative to the virtual system clock. Deterministic timing means that reaction times are predictable. It does not mean that the events that trigger the reactions (such as the movements of a mouse) are predictable. However, one event may trigger one or more chains of reactions, and these, too, are deterministic relative to the triggering event.
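To make this concrete, here is a minimal sketch in Python of such a virtual-clock reactive engine. The Engine class and the cell names are my own illustrative inventions, not part of any published COSA design: a signal emitted during cycle t is reacted to in cycle t+1, so every reaction chain completes a fixed number of cycles after its triggering event.

from collections import defaultdict

class Engine:
    def __init__(self):
        self.clock = 0                      # virtual system clock, in cycles
        self.listeners = defaultdict(list)  # signal name -> reactive cells
        self.pending = []                   # signals awaiting the next cycle

    def connect(self, signal, cell):
        self.listeners[signal].append(cell)

    def emit(self, signal):
        # A signal emitted during cycle t is reacted to in cycle t+1.
        self.pending.append(signal)

    def step(self):
        # Advance the virtual clock by exactly one cycle; every cell that
        # received a signal reacts now, within this single cycle.
        self.clock += 1
        current, self.pending = self.pending, []
        for signal in current:
            for cell in self.listeners[signal]:
                cell(self, signal)

# Two elementary operations chained by signals; op_b always fires exactly
# two cycles after the external "start" event, whenever that event occurs.
def op_a(engine, signal):
    engine.emit("a-done")

def op_b(engine, signal):
    print("cycle", engine.clock, ": op_b reacted")

engine = Engine()
engine.connect("start", op_a)
engine.connect("a-done", op_b)
engine.emit("start")          # the external event (e.g., a mouse movement)
for _ in range(3):
    engine.step()             # prints: cycle 2 : op_b reacted

Note that the engine never consults a real-time clock; the determinism comes from counting virtual cycles, not from measuring wall-clock time.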

Timing Watchdogs

One nice thing about concurrent reactive systems is that interval detectors can be used to automatically discover invariant intervals between any number of signals within a program. We can place timing watchdogs at various places in the program (this, too, can be done automatically) so that any discrepancy between an expected interval and the actual measured value triggers an alarm. The temporal signature of a reactive system remains fixed for the life of the system, and this makes for rock-solid reliability. There are thus only two ways a timing watchdog can trigger an alarm: either the code was modified or there was a local physical system failure.
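Building on the Engine sketch above, a watchdog might look like the following; the Watchdog class and its learn-then-compare behavior are my own assumptions about how such an interval detector could work. It records the cycle interval between two signals on first observation and raises an alarm whenever a later observation deviates.

class Watchdog:
    # Learns the invariant interval (in virtual cycles) between two signals
    # and alarms on any deviation from that temporal signature.
    def __init__(self, first, second):
        self.first, self.second = first, second
        self.start_cycle = None
        self.expected = None          # the learned invariant interval

    def __call__(self, engine, signal):
        if signal == self.first:
            self.start_cycle = engine.clock
        elif signal == self.second and self.start_cycle is not None:
            interval = engine.clock - self.start_cycle
            if self.expected is None:
                self.expected = interval    # record the signature
            elif interval != self.expected:
                raise RuntimeError(
                    f"timing alarm: expected {self.expected} cycle(s) between "
                    f"{self.first} and {self.second}, got {interval}")

watchdog = Watchdog("start", "a-done")
engine.connect("start", watchdog)     # a watchdog is just another reactive cell
engine.connect("a-done", watchdog)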

Automatic Discovery and Resolution of Data and Event Dependencies

Another nice aspect of concurrent reactive systems is that they are based on change. A change to a program variable is immediately communicated to every part of the program that may be affected by the change. The development environment can automatically link every entity or operator that changes a variable to every sensor that detects the change. This essentially eliminates blind code.
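As a sketch of this idea, here is a reactive variable in Python whose every change is routed to all registered sensors. The ReactiveVar name and the explicit watch() call are illustrative only; in the model described here, the development environment would discover and register these links automatically.

class ReactiveVar:
    # A variable whose every effective change is immediately signaled to
    # all sensors that depend on it, so no change can go unobserved.
    def __init__(self, value):
        self._value = value
        self._sensors = []

    def watch(self, sensor):
        self._sensors.append(sensor)   # linked by hand here, not by the tools

    def set(self, value):
        if value != self._value:       # signal only on an actual change
            old, self._value = self._value, value
            for sensor in self._sensors:
                sensor(old, value)

altitude = ReactiveVar(100)
altitude.watch(lambda old, new: print("altitude changed:", old, "->", new))
altitude.set(250)    # prints: altitude changed: 100 -> 250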

Side Effects in Complex Systems

We all know how hard it is to maintain complex legacy systems. A minor modification often triggers unforeseen side effects that may lie dormant for a long time after release, until the right combination of events causes a system failure that can be traced directly back to the modification. For this reason, most system managers will look for alternative ways around a problem before committing to modifying the code. The side-effects problem not only places an upper limit on the complexity of software systems; the cost of continued development and maintenance soon becomes prohibitive. This problem will never go away as long as we continue to use algorithmic systems. Luckily, it becomes nonexistent in the temporally deterministic reactive system described above, because the elimination of blind code and the use of timing watchdogs make it impossible to introduce undetected side effects. Indeed, timing is so deterministic and precise in a purely reactive system that the smallest modification is bound to violate a temporal expectation and trigger an alarm. It is up to the designer to accept the new temporal signature, change it, or revert to the old code. As a result, we can make our software as complex as we wish without having to worry about hidden bugs. In fact, and this is rather counterintuitive, more complex software means more timing constraints and thus more correct and robust systems, i.e., systems that work according to spec.
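To illustrate with the earlier sketches (again, Engine and Watchdog are my own constructions), suppose a maintenance change inserts one hypothetical extra stage between the "start" and "a-done" signals. The reaction chain now takes two cycles instead of one, and the recorded signature catches the modification on the very first run.

def op_extra(engine, signal):
    engine.emit("extra-done")     # the "minor" modification: one more stage

def op_a_moved(engine, signal):
    engine.emit("a-done")         # the old operation, now one stage later

modified = Engine()
wd = Watchdog("start", "a-done")
wd.expected = 1                   # signature recorded from the old build
modified.connect("start", wd)
modified.connect("a-done", wd)
modified.connect("start", op_extra)
modified.connect("extra-done", op_a_moved)
modified.emit("start")
for _ in range(3):
    modified.step()   # raises: timing alarm: expected 1 cycle(s) ... got 2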

The Entire Computer Industry Is Wrong

All right, I know this sounds arrogant, but it is true. We have been building and programming computers the wrong way from the beginning (since the days of Charles Babbage and Lady Ada Lovelace). To solve the unreliability and low-productivity problems that have plagued the industry, we must change to a new way of doing things. We cannot continue with the same antiquated approach; it is no good. We must abandon the algorithmic software model and adopt a concurrent, reactive model. And what better time to change than now, when the industry is just beginning its transition from sequential computing to massive parallelism? Reactive concurrent systems are right at home in a parallel, multicore universe. We must change now or continue to suffer the consequences of increasingly unreliable and hard-to-develop software. This is what Project COSA is about.
