Friday, August 22, 2008

Transforming the TILE64 into a Kick-Ass Parallel Machine, Part III

Part I, II, III, IV

Abstract

Previously, in Parts I and II of this series, I suggested that Tilera Corporation could blow the competition out of the water by adopting the COSA software model and dumping the processor core it currently uses for its TILE64™ multicore processor in favor of a pure MIMD vector core. The latter is somewhat similar to a superscalar processor, except that every instruction is a vector and there is no need to test for instruction dependencies (the EPIC architecture of Intel's Itanium is probably a better comparison). The reason is that instructions in the core's input buffer are guaranteed to be independent in the COSA model. The new core should be capable of executing 16 or more different instructions simultaneously. This sort of hyper-parallelism would transform the TILE64 into the fastest processor on the planet. Today, I will talk about how to optimize parallel instruction caching for improved performance. But there is more to computing than performance. The COSA heartbeat is a global virtual clock that synchronizes all operations. Synchronization enforces deterministic processing, a must for rock-solid applications and security. Please read the previous posts in the series and How to Solve the Parallel Programming Crisis before continuing.

Instruction Caching

The L1 instruction cache of every core would normally be divided into two parallel buffers A and B as seen below. L2 and data caches are not shown for clarity.
While the instructions (cells in COSA) in buffer A are being processed (hopefully, concurrently), buffer B is filled with the instructions that will be processed during the next cycle. As soon as all the instructions in buffer A are done, the buffers are swapped and the cycle begins anew. Of course, there are ways to optimize this process. Instead of just two buffers, the cache could conceivably be divided into three, four or more buffers and processed in a round robin fashion. An instruction prefetch mechanism can be used to fill the buffers ahead of time while the core is executing the current instructions. Even sensor-dependent branches (decision branches) can be fetched ahead of time. Branches that are not taken are simply discarded at the time of sensing (comparison). More detailed information on COSA sensors (comparison operators) can be found on the COSA System page.
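The buffer-swapping scheme above can be sketched in a few lines. This is a toy software model, not Tilera hardware: the class name, the number of buffers, and the use of Python callables to stand in for instructions are all illustrative assumptions.

```python
from collections import deque

class DualBufferCache:
    """Toy model of the A/B instruction-buffer scheme described above.
    While the active buffer is being executed, a prefetcher fills the
    next buffer with the instructions for the next virtual cycle."""

    def __init__(self, num_buffers=2):
        # Two (or more, for round-robin operation) parallel buffers.
        self.buffers = [deque() for _ in range(num_buffers)]
        self.active = 0

    def prefetch(self, instructions):
        # Fill the next buffer while the active one is executing.
        nxt = (self.active + 1) % len(self.buffers)
        self.buffers[nxt].extend(instructions)

    def execute_cycle(self):
        # In the COSA model all instructions in a buffer are independent,
        # so on real hardware they could all be issued simultaneously;
        # here we just run them one after the other.
        results = [instr() for instr in self.buffers[self.active]]
        self.buffers[self.active].clear()
        # Swap: advance round-robin to the buffer filled by prefetch.
        self.active = (self.active + 1) % len(self.buffers)
        return results

cache = DualBufferCache()
cache.prefetch([lambda: 1 + 1, lambda: 2 * 3])  # fill buffer B
cache.execute_cycle()                            # buffer A is empty this cycle
print(cache.execute_cycle())                     # → [2, 6]
```

With three or four buffers instead of two, `prefetch` simply stays several cycles ahead of `execute_cycle`, which is the round-robin variant mentioned above.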

The COSA Heartbeat

The COSA software model is based on the idea that everything that happens in a computer should be synchronized to a global virtual clock and that every elementary operation lasts exactly one virtual cycle. This is extremely important because it is the only way to enforce deterministic processing, a must for reliability and security. It would make the TILE64 ideal for mission and safety-critical applications, especially in embedded systems. None of the current or projected multicore processors on the market support deterministic processing.

Essentially, the heartbeat is a signal that tells a core to advance to the next parallel instruction buffer. This signal must be made available to all the running programs because it is used to update a global counter that is accessible by all programs. The counter is used by sequence detectors and other cells to calculate intervals.

In a two-buffer, single-core processor, the switch happens as soon as all the instructions in the current buffer are executed. In a multicore system, we need a mechanism that sends a heartbeat signal to all the cores on the chip so they can advance in unison. This mechanism can be as simple as an AND gate that fires when all the current instruction buffers are done. It goes without saying that no core should have to wait long for a heartbeat after finishing its current buffer. This is why precise load balancing is important.
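The AND-gate behavior described above maps naturally onto a barrier: no core advances to its next buffer until every core has finished the current one, and the heartbeat increments the global counter once per virtual cycle. The sketch below models this with threads; the core count, cycle count, and function names are illustrative assumptions, not COSA or TILE64 specifics.

```python
import threading

NUM_CORES = 4
virtual_clock = 0  # the global counter visible to all programs

def on_heartbeat():
    # Fires exactly once per virtual cycle, when ALL cores have finished
    # their current instruction buffer (the AND gate in the text).
    global virtual_clock
    virtual_clock += 1

heartbeat = threading.Barrier(NUM_CORES, action=on_heartbeat)

def core(core_id, cycles):
    for _ in range(cycles):
        # ... execute every instruction in the current buffer here ...
        heartbeat.wait()  # block until the other cores are done, too

threads = [threading.Thread(target=core, args=(i, 3)) for i in range(NUM_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(virtual_clock)  # → 3
```

The time a core spends blocked in `wait()` is exactly the idle time that precise load balancing is meant to minimize.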

In Part IV, I will go over automatic load balancing and data cache optimization.

Related Articles:
How to Solve the Parallel Programming Crisis
Tilera’s TILE64: The Good, the Bad and the Possible, Part I, II, III

7 comments:

tbcpp said...

This is the one thing that's been bugging me for a while:

With this global timer you will run into the "speed-of-light" issue: at 2 GHz or so, the pulse of energy can't reach the far corners of the chip before the second cycle is due to execute. Current designs get around this by breaking the CPU into multiple cores, each of which can then run more or less at its own speed. But if you lock the entire processor to one clock cycle, you will be limiting the size of the chip and its overall scalability. The more you scale the chip, the slower the clock cycle will have to be.

Secondly, what about a for loop that multiplies two sets of numbers? Let's say we have a trashy multiplier (like the ones in the 386 days) that takes about 4 cycles to multiply two variables. So we have a FOR->MULTIPLY->STORE type thing. The FOR module will load numbers and pass them as messages to the MULTIPLY module; the MULTIPLY module will then pass them to the STORE. The issue here is that we're going to get a pileup of messages in the input queue of the MULTIPLY module, unless we wait to advance until each MULTIPLY is done. In that case we've just cut the speed of the processor by a factor of 4.
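The pileup in this scenario is easy to demonstrate. The toy simulation below assumes one message produced per virtual cycle and a 4-cycle multiplier, as in the comment; the variable names and the 20-cycle run length are illustrative.

```python
from collections import deque

# Toy model of the FOR -> MULTIPLY -> STORE pipeline above. FOR emits
# one message per virtual cycle, but MULTIPLY needs 4 cycles per item
# (the "trashy multiplier"), so its input queue grows without bound
# unless the heartbeat waits for MULTIPLY to drain.
MULTIPLY_LATENCY = 4
inbox = deque()
busy = 0  # cycles remaining on the multiply currently in flight

for cycle in range(20):
    inbox.append((cycle, cycle + 1))   # FOR produces every cycle
    if busy == 0 and inbox:
        a, b = inbox.popleft()          # MULTIPLY accepts one message
        busy = MULTIPLY_LATENCY         # and is busy for 4 cycles
    if busy:
        busy -= 1

print(len(inbox))  # → 15: 20 produced, only 5 consumed
```

The producer outruns the consumer by a factor of 4, which is exactly the trade-off the comment points at: either the queue grows or the heartbeat stalls.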

I'm sure I'm missing something here. But even in reading all the docs you have provided I still haven't figured this out.

Louis Savain said...

tbcpp wrote, With this global timer you will run into the "speed-of-light" issue: at 2 GHz or so, the pulse of energy can't reach the far corners of the chip before the second cycle is due to execute. Current designs get around this by breaking the CPU into multiple cores, each of which can then run more or less at its own speed. But if you lock the entire processor to one clock cycle, you will be limiting the size of the chip and its overall scalability. The more you scale the chip, the slower the clock cycle will have to be.

I think you're confusing the virtual clock with the real-time clock that is in every core. The frequency of the virtual clock will always be lower than that of the cores' internal clocks because, even if you could process all instructions in parallel, some instructions take several real-time ticks to complete. It does not matter if you have a latency of, say, 2 or 3 real-time ticks per heartbeat, because what matters is how many instructions are executed in parallel. If you process hundreds of instructions concurrently during a given virtual cycle and have to wait a few real-time ticks for the next virtual cycle, you are still running hundreds of times faster than if you were doing things sequentially. And the more cores you have, the better it gets. The latency is insignificant when compared to the parallel bandwidth.
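A back-of-the-envelope calculation makes this argument concrete. Every number below is an illustrative assumption (the core count, issue width, and heartbeat latency are not TILE64 or COSA specifications):

```python
# Hypothetical figures: 64 cores, each a 16-wide MIMD vector core,
# with a 3-real-tick wait between heartbeats.
parallel_width = 64 * 16        # instructions retired per virtual cycle
real_ticks_per_heartbeat = 3    # latency of the global virtual clock
sequential_rate = 1             # baseline: 1 instruction per real tick

parallel_rate = parallel_width / real_ticks_per_heartbeat
speedup = parallel_rate / sequential_rate
print(round(speedup, 1))  # → 341.3 instructions per tick vs. 1
```

Even with the 3-tick heartbeat latency charged against every cycle, the parallel machine still retires hundreds of instructions for each one the sequential baseline manages, which is the point being made above.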

I'll respond to the second part of your comment later.

Tom said...

tbcpp: Just to elaborate: it's not important that every core starts at exactly the same time, only that none start before any have finished the previous cycle. Cores on the outside could be given less work to not let their latency become a bottleneck for overall performance.

Louis: I'm glad you're going to address his second question, because it's been on my mind, too. I think the point is that yes, sequential operations will be slower in the COSA model, but many more can be done simultaneously to make up for it.

I'm also interested in hearing about global memory (outside of the cache), especially cases where there is a race and two components try to write to the same memory. Or there is a write and a read at once. It seems like the heartbeat needs two stages: signal processing, then memory writing. And what about locking? The COSA pages say components aren't re-entrant, which means some kind of synchronization is going on.

The more I think about it, the more enticing the model is and the more unworkable it seems. Louis: have you seen Simulink?

Louis Savain said...

Tom wrote, Just to elaborate: it's not important that every core starts at exactly the same time, only that none start before any have finished the previous cycle. Cores on the outside could be given less work to not let their latency become a bottleneck for overall performance.

I think that the travel time of the heartbeat signal is really a non-issue. Optical interconnect will be more than adequate. It is much more likely that imperfect load balancing will cause some cores to finish before others and have to wait a few cycles. But, as I said, the wait time is insignificant compared to the number of instructions that will be processed concurrently.

tbcpp wrote, Secondly, what about a for loop that multiplies two sets of numbers? Let's say we have a trashy multiplier (like the ones in the 386 days) that takes about 4 cycles to multiply two variables. So we have a FOR->MULTIPLY->STORE type thing. The FOR module will load numbers and pass them as messages to the MULTIPLY module; the MULTIPLY module will then pass them to the STORE. The issue here is that we're going to get a pileup of messages in the input queue of the MULTIPLY module, unless we wait to advance until each MULTIPLY is done. In that case we've just cut the speed of the processor by a factor of 4.

At first glance, I thought I understood your message but after re-reading it, I admit that I have no idea what the problem is. Would you rephrase it?

Tom: I think the point is that yes, sequential operations will be slower in the COSA model, but many more can be done simultaneously to make up for it.

Yes, sequencing will be slightly slower, especially if there is a lot going on concurrently and not enough cores to process it all. But this is also true of multithreading: if you have lots of threads running concurrently, they will run slower. It's true that you can change the priority of the threads and give them more of the processor's time, but I think this is very bad because it wreaks havoc with deterministic timing and thus with reliability and security.

In my opinion, we are so used to sequential processing that we don't realize that concurrency is the norm, not the exception. Once the right parallel programming environment is built, developers will be surprised to find that almost everything is parallel. Applications that absolutely must have extremely fast sequential processing will have to run on special hardware. There will be very few of those. The power of parallel processing is all around us. Just think of the brains of humans and animals. Their sequential speed is pathetic compared to transistors and yet they can do amazingly complex things.

My prediction is that AI applications will be the norm in the not too distant future. Smart programs will use zillions of cells running in parallel doing things that we never thought possible.

I'm also interested in hearing about global memory (outside of the cache), especially cases where there is a race and two components try to write to the same memory. Or there is a write and a read at once. It seems like the heartbeat needs two stages: signal processing, then memory writing. And what about locking? The COSA pages say components aren't re-entrant, which means some kind of synchronization is going on.

This is a problem that exists only in multithreading because there is no way of determining when actions are taken. In a deterministic system such as the COSA model, the timing of actions is guaranteed and if there is a conflict, the dev tool will find it and alert the programmer in order to correct it.

The more I think about it, the more enticing the model is and the more unworkable it seems.

Oh, it's workable alright. Cellular automata work; and so do neural networks, video games and simulations. They will work even better when the right parallel programming model is adopted and we start making the processors to support the model.

Louis: have you seen Simulink?

Yes. It looks really cool. I haven't played with it but, from what I've seen, I think it is too complicated. Development tools need to make use of the concept of information hiding as much as possible. There is a reason that hardware engineers use wiring harnesses and bundle a bunch of communication pathways together using a single multi-connector. There is no need to have everything explicit when you look at a system. Information should be on a need-to-see basis only. Have a good weekend, everybody.

neotoy said...

Cool. I don't always follow everything in these posts, but overall I get a sense of the concepts as they are outlined.

The point that parallel processes can quantitatively overtake power-intensive single-threaded methods is one that I think is almost universally understood, but not yet embraced by the industry. Many hands make light work, as they say.

I did have one suggestion. It may sound a tad sacrilegious given the context, but might I suggest a hybrid approach? A processor die could include a separate dedicated module for 'legacy' multithreaded processes.

Parallel MIMD is clearly the future, however it seems like at some point a transition will need to be made, and I don't see why both models can't be included on a single chip.

[REDACTED] said...

I did have one suggestion. It may sound a tad sacrilegious given the context, but might I suggest a hybrid approach? A processor die could include a separate dedicated module for 'legacy' multithreaded processes.

Yes, rather than forcing the odd program out to run on specialized sequential hardware, it seems pretty obvious that a "wrapper" could be implemented to emulate this functionality. Very well. Of course, for the first few years, developers would constantly use this wrapper to return the CPU to the mode of functioning they were comfortable with. ;)

Alpha said...

The author needs to spend some time truly understanding modern processor concepts, e.g. pipelining, superscalar execution, out-of-order execution, VLIW, and cache coherency.