In Part I, I wrote that a processor should be seen as a necessary evil. It is needed only because cells (elementary objects or instructions) in computer memory cannot process themselves. Thus a processor should be seen as a cell processor. From a programming point of view, the cells are the elementary operations that make up a program residing in memory. In a truly parallel programming environment, parallelism is implicit and sequential order is explicit. In other words, unless specified otherwise, all cells are assumed to run concurrently with other cells. Every cell must explicitly specify its successor or successors, if any. A parallel program thus consists of multiple linked lists of cells (there are ways to use simple lists for optimization purposes, but that’s another story). Immediately after executing its operation, a cell must send a signal to its successor or successors, if any, alerting them that it is their turn to execute. In a reactive system like COSA, execution and signaling count as one system cycle, meaning that they happen simultaneously.
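To make the cell model concrete, here is a minimal sketch in Python. The names `Cell`, `fire`, and `signaled` are my own illustrations, not part of COSA; the point is only that a cell bundles an elementary operation with an explicit successor list, and that execution and signaling go together:

```python
# Hypothetical sketch of a cell: an elementary operation plus an
# explicit list of successors. All names here are illustrative.
class Cell:
    def __init__(self, name, op, successors=()):
        self.name = name
        self.op = op                        # the elementary operation
        self.successors = list(successors)  # explicit sequential order

signaled = []  # cells alerted that it is their turn next cycle

def fire(cell):
    # Execution and signaling together make up one system cycle.
    cell.op()
    signaled.extend(cell.successors)

out = []
done = Cell("done", lambda: out.append("done"))
# 'start' explicitly names 'done' as its successor; cells that are
# not linked this way simply run concurrently.
start = Cell("start", lambda: out.append("start"), [done])

fire(start)
# 'start' has executed; 'done' has been signaled for the next cycle.
```

Note that the successor is only signaled here, not executed; deciding when signaled cells actually run is exactly the job of the buffering scheme described next.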
Two Instruction Buffers
In order to prevent the troublesome signal race conditions that would otherwise occur, the cell processor must use two instruction (or cell) buffers, A and B. The way it works is rather straightforward. While the cells in buffer A are being processed, buffer B is populated with the successor cells that will be processed during the next cycle. As soon as all the cells in buffer A have been processed, the two buffers are swapped and the cycle begins anew. Simple. I use this exact method in my own neural network experiments. Of course, this method can be modified and optimized for performance in a processor. Instead of just two buffers, one can have three, four, or more, and an instruction prefetch mechanism can be used to fill the buffers ahead of time. This is important because load balancing must happen before the processor is ready to process a buffer.
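As a sketch of the swap, assuming a hypothetical `Cell` that carries an operation and an explicit successor list, the whole two-buffer scheme fits in a few lines. Because signaled successors go into buffer B, never back into buffer A, a cell can never run in the same cycle in which it was signaled:

```python
# Illustrative two-buffer execution loop; Cell and run are my own
# hypothetical names, not from COSA itself.
class Cell:
    def __init__(self, op, successors=()):
        self.op = op
        self.successors = list(successors)

def run(entry_cells):
    buffer_a = list(entry_cells)   # cells processed this cycle
    buffer_b = []                  # successors, processed next cycle
    while buffer_a:
        for cell in buffer_a:
            cell.op()
            # Signaling writes only into buffer B, which prevents a
            # successor from racing ahead within the current cycle.
            buffer_b.extend(cell.successors)
        buffer_a, buffer_b = buffer_b, []  # swap; the cycle begins anew

order = []
c = Cell(lambda: order.append("c"))
b = Cell(lambda: order.append("b"), [c])
a = Cell(lambda: order.append("a"), [b])
run([a])
print(order)  # ['a', 'b', 'c']
```

In hardware the swap would be a pointer exchange rather than a new list, but the invariant is the same: the buffer being processed is never the buffer being filled.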
The beauty of the two-buffer scheme is that the parallelism is fine-grained and scalable, and it can be handled entirely by the processor. It can be used in both single-core and multicore architectures. It should be done using an MIMD configuration, if possible, in order to take full advantage of vectorization. Highly parallel programs will automatically scale to take advantage of all available cores. In fact, neither the program nor the programmer need be aware of the number of cores. The cores are presented with a list of cells to process at every system cycle and divide the cell load among themselves. Program modules can send messages to each other via shared memory structures. Both buffers should reside on the chip for fast processing. The buffers should be big enough to accommodate the most demanding situation so as to avoid falling back on main memory. In a single-core processor, managing the buffers is straightforward. In a multicore processor, however, we need a special controller, a task manager. In Part III, I will describe the task manager’s job, which is mostly load balancing.
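As a toy illustration of how a cycle's cell load could be divided among cores without the program knowing the core count, here is a round-robin split. This is a hypothetical placeholder for the real load balancing done by the task manager, which is deferred to Part III:

```python
# Hypothetical sketch: partition one cycle's buffer among the cores.
# Round-robin is the simplest possible balancer; the task manager in
# Part III would do something smarter.
def split_load(buffer, num_cores):
    """Divide a cell buffer into near-equal chunks, one per core."""
    chunks = [[] for _ in range(num_cores)]
    for i, cell in enumerate(buffer):
        chunks[i % num_cores].append(cell)
    return chunks

cells = list(range(10))          # integers standing in for cells
chunks = split_load(cells, 4)
print([len(c) for c in chunks])  # [3, 3, 2, 2]
```

Since every cell in a buffer is independent by construction (its successors run only in the next cycle), any partition of the buffer is safe; the only concern is keeping the chunks balanced.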