Part I, II, III, IV
Previously, in Parts I and II of this series, I suggested that Tilera Corporation could blow the competition out of the water by adopting the COSA software model and dumping the processor core it is currently using for its TILE64™ multicore processor in favor of a pure MIMD vector core. The latter is somewhat similar to a superscalar processor, except that every instruction is a vector and there is no need to test for instruction dependencies, because instructions in the core's input buffer are guaranteed to be independent in the COSA model (the EPIC architecture of Intel's Itanium is probably a better comparison). The new core should be capable of executing 16 or more different instructions simultaneously. This sort of hyper-parallelism would transform the TILE64 into the fastest processor on the planet. Today, I will talk about how to optimize parallel instruction caching for improved performance. But there is more to computing than performance. The COSA heartbeat is a global virtual clock that synchronizes all operations. Synchronization enforces deterministic processing, a must for rock-solid applications and security. Please read the previous posts in the series and How to Solve the Parallel Programming Crisis before continuing.
The L1 instruction cache of every core would normally be divided into two parallel buffers A and B as seen below. L2 and data caches are not shown for clarity.
While the instructions (cells in COSA) in buffer A are being processed (hopefully, concurrently), buffer B is filled with the instructions that will be processed during the next cycle. As soon as all the instructions in buffer A are done, the buffers are swapped and the cycle begins anew. Of course, there are ways to optimize this process. Instead of just two buffers, the cache could conceivably be divided into three, four or more buffers that are processed in round-robin fashion. An instruction prefetch mechanism can be used to fill the buffers ahead of time while the core is executing the current instructions. Even sensor-dependent branches (decision branches) can be fetched ahead of time; branches that are not taken are simply discarded at the time of sensing (comparison). More detailed information on COSA sensors (comparison operators) can be found on the COSA System page.
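The buffer rotation described above can be sketched in a few lines of Python. This is only a toy software model, not a description of any real cache hardware: the class name `BufferedCore`, its methods, and the use of plain Python callables to stand in for COSA cells are all assumptions made for illustration.

```python
class BufferedCore:
    """Toy model of a core whose instruction cache is split into N buffers.

    Hypothetical names throughout; cells are modeled as zero-argument callables.
    """

    def __init__(self, first_cells, num_buffers=2):
        if num_buffers < 2:
            raise ValueError("need at least two buffers to overlap fill and execute")
        self.buffers = [[] for _ in range(num_buffers)]
        self.buffers[0] = list(first_cells)
        self.active = 0   # index of the buffer currently being executed
        self.cycle = 0    # number of buffer switches so far

    def fill_next(self, cells):
        """Load the cells for the next cycle into the buffer after the active one."""
        nxt = (self.active + 1) % len(self.buffers)
        self.buffers[nxt] = list(cells)

    def run_cycle(self):
        """Execute every cell in the active buffer, then advance round-robin."""
        # Cells in one buffer are independent, so real hardware could run them
        # concurrently; this sketch simply runs them one after another.
        results = [cell() for cell in self.buffers[self.active]]
        self.buffers[self.active] = []
        self.active = (self.active + 1) % len(self.buffers)
        self.cycle += 1
        return results


core = BufferedCore([lambda: 1 + 1, lambda: 2 * 3])
core.fill_next([lambda: 10])   # buffer B is filled while A is "executing"
first = core.run_cycle()       # runs buffer A, then switches to B
second = core.run_cycle()      # runs buffer B, then switches back to A
```

With `num_buffers` set to three or more, the same `run_cycle` method walks the buffers in round-robin order, which is the optimization mentioned above.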
The COSA Heartbeat
The COSA software model is based on the idea that everything that happens in a computer should be synchronized to a global virtual clock and that every elementary operation lasts exactly one virtual cycle. This is extremely important because it is the only way to enforce deterministic processing, a must for reliability and security. It would make the TILE64 ideal for mission-critical and safety-critical applications, especially in embedded systems. None of the current or projected multicore processors on the market support deterministic processing.
Essentially, the heartbeat is a signal that tells a core to advance to the next parallel instruction buffer. The signal must also be visible to software, because it updates a global counter that every running program can read. Sequence detectors and other cells use this counter to calculate intervals.
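As a rough illustration of how a cell might use the global counter, here is a minimal Python sketch. The names `Heartbeat` and `IntervalTimer` are hypothetical and are not taken from the COSA documentation; the point is only that an interval is the difference between two readings of the shared counter.

```python
class Heartbeat:
    """Global virtual clock: one tick per buffer switch (hypothetical model)."""

    def __init__(self):
        self.count = 0

    def tick(self):
        """Advance the virtual clock by one cycle and return the new count."""
        self.count += 1
        return self.count


class IntervalTimer:
    """A cell that measures virtual cycles elapsed between successive events."""

    def __init__(self, clock):
        self.clock = clock
        self.last = None   # counter value at the previous event, if any

    def mark(self):
        """Record an event; return cycles since the previous one (None if first)."""
        now = self.clock.count
        interval = None if self.last is None else now - self.last
        self.last = now
        return interval


clock = Heartbeat()
timer = IntervalTimer(clock)
timer.mark()              # first event, no interval yet
for _ in range(3):
    clock.tick()          # three heartbeats pass
elapsed = timer.mark()    # second event: three virtual cycles later
```

Because the interval is counted in heartbeats rather than wall-clock time, the same program yields the same value on every run, which is the deterministic behavior the heartbeat is meant to guarantee.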
In a two-buffer, single-core processor, the switch happens as soon as all the instructions in the current buffer are executed. In a multicore system, we need a mechanism that sends a heartbeat signal to all the cores on the chip so they can advance in unison. This mechanism can be as simple as an AND gate that fires when all the current instruction buffers are done. It goes without saying that no core should have to wait long for a heartbeat after finishing its current buffer. This is why precise load balancing is important.
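The AND-gate idea, and the cost of poor load balancing, can be modeled with a short Python sketch. The function name `run_beat` and the notion of measuring each core's buffer in abstract "time steps" are assumptions for illustration only; real hardware would use a wired signal rather than a loop.

```python
def run_beat(work_per_core):
    """Simulate one heartbeat across several cores.

    work_per_core: time steps each core needs to finish its current buffer.
    Returns (steps until the heartbeat fires, idle steps per core).
    """
    remaining = list(work_per_core)
    idle = [0] * len(remaining)
    steps = 0
    # The heartbeat fires only when every core reports "done" (the AND gate).
    while not all(r == 0 for r in remaining):
        steps += 1
        for i, r in enumerate(remaining):
            if r > 0:
                remaining[i] = r - 1   # core is still working on its buffer
            else:
                idle[i] += 1           # core is done and waiting for the beat
    return steps, idle


unbalanced = run_beat([3, 1, 2])   # slowest core sets the pace, others wait
balanced = run_beat([2, 2, 2])     # same total work, no core waits
```

In the unbalanced case the heartbeat is delayed until the slowest core finishes, and the other cores accumulate idle steps. Spreading the same six units of work evenly shortens the beat and removes the waiting altogether.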
In Part IV, I will go over automatic load balancing and data cache optimization.
How to Solve the Parallel Programming Crisis
Tilera’s TILE64: The Good, the Bad and the Possible, Part I, II, III