Saturday, September 8, 2007

How to Design a Self-Balancing Multicore Processor, Part III

Part I, II, III


In Parts I and II of this three-part article, I argued that a processor should be seen as a device for simulating or processing vast numbers of concurrent, elementary processes, or cells. A cell is a basic entity that performs an elementary operation. Unlike other concurrent languages, the programming model used for this processor design assumes that parallelism is implicit but that sequential order must be explicitly specified. Simulating a parallel collection of cells requires the use of two processing buffers (A and B). Essentially, while the cells in buffer A are being processed, buffer B is being filled with the cells that will be processed during the next cycle. This is done in order to avoid the signal race conditions that would otherwise occur. As soon as all the cells in buffer A are processed, the two buffers are swapped (A becomes B and B becomes A) and the cycle begins anew.
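The two-buffer cycle described above can be sketched in a few lines. This is only an illustration of the scheduling discipline, not the processor design itself; the `Cell` class and its `update` method are hypothetical names invented for the example.

```python
# Sketch of the two-buffer simulation cycle (names are illustrative).
class Cell:
    """An elementary process; update() returns the cells to run next cycle."""
    def __init__(self, name, successors=()):
        self.name = name
        self.successors = list(successors)

    def update(self):
        # Perform the cell's elementary operation here, then return
        # the cells that must be processed on the NEXT cycle.
        return self.successors

def run(initial_cells, cycles):
    buffer_a = list(initial_cells)   # cells processed this cycle
    buffer_b = []                    # cells queued for the next cycle
    for _ in range(cycles):
        for cell in buffer_a:
            buffer_b.extend(cell.update())
        # Swap: B becomes the new A; the next cycle fills a fresh B.
        buffer_a, buffer_b = buffer_b, []
    return buffer_a
```

Because every cell spawned during a cycle lands in buffer B rather than buffer A, no cell can be executed in the same cycle that created it, which is what prevents the race conditions mentioned above.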

Note: Both cells and data are shown sharing the same memory space. This is for convenience only. In a real processor, they would be separate and would use separate address and data busses.

The Task Manager

As already mentioned, both buffers should reside on the processor and buffer management can be handled by relatively simple on-chip circuitry. On a multicore processor, however, the load must be distributed among the cores for fine-grain parallel processing. Note that this is all MIMD processing. Using SIMD for development is like pulling teeth. Besides, SIMD is worthless when it comes to enforcing the deterministic timing of events. There are probably several ways to design a fast mechanism to manage load balancing among multiple cores. The one that I envision calls for every core to have its own pair of processing buffers, A and B. However, the job of populating the B buffers should be that of a special on-chip controller that I call the Task Manager (TM).

Essentially, the job of filling the buffers with instructions is that of the task manager. Whenever the TM encounters new cells to be processed, it should distribute them as equally as possible amongst the B buffers. The TM must have intimate knowledge of the status (the number of items) of the A and B buffers of every core. It should also know the execution duration (in terms of clock cycles) of every cell according to its type. This way, the TM can intelligently assign cells to each core and equalize their individual loads as much as possible. The TM has only one more job to do: as soon as all the A buffers are empty, it must signal the cores to swap their buffers and repeat the cycle. One nice thing about this approach is that the TM works in parallel with the cores and does not slow their performance. Another advantage has to do with power consumption. Since the TM has perfect knowledge of the processor load at all times, it can automatically turn off a percentage of the cores, depending on the load, in order to save energy.
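The TM's assignment policy sketched above amounts to a greedy balancing rule: weight each cell by its known per-type cost and hand it to the currently least-loaded core. A minimal sketch, assuming a hypothetical cost table (the cell types and cycle counts below are invented for illustration):

```python
import heapq

# Hypothetical execution durations per cell type, in clock cycles.
CELL_COST = {"add": 1, "mul": 3, "div": 10}

def distribute(cells, num_cores):
    """Greedily assign each cell to the least-loaded core's B buffer,
    weighting by the cell's known execution duration."""
    # Min-heap of (accumulated cost, core index).
    heap = [(0, core) for core in range(num_cores)]
    buffers_b = [[] for _ in range(num_cores)]
    for cell_type in cells:
        load, core = heapq.heappop(heap)
        buffers_b[core].append(cell_type)
        heapq.heappush(heap, (load + CELL_COST[cell_type], core))
    return buffers_b
```

With this rule, no two cores' accumulated loads can differ by more than the cost of the single most expensive cell type, which is why the wait at the end of each cycle stays short when the per-cycle load is high.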

The Road Ahead

As far as I know, no other multicore architecture provides for fine-grain, self-balancing parallelism using an MIMD execution model. There is no doubt in my mind that it is the correct approach to designing the multicore architectures of the future. There are additional advantages inherent in the software model, such as fault tolerance, deterministic timing, and the automatic discovery and enforcement of data dependencies. The result is rock-solid software reliability and high performance.


dvyukov said...

In order to do good load balancing, the Task Manager (TM) has to handle every request for a new job (each of which must be added to a B buffer). Let's assume there is about one new job for every currently processing job.
So the TM has to handle N times more requests than a usual core (N being the number of cores in the system).
IMO this is not a scalable approach. With 1000 cores, the TM would have to be 1000 times more powerful than a usual core. And the TM would have to handle 1000 connections (as far as I understand, every core is directly connected to the TM).

And the TM really has to do *very* good load balancing, since all cores have to wait for the most loaded core to complete before shifting to a new system cycle.

Any thoughts?

Louis Savain said...


Thanks for the intelligent comment. You are absolutely correct. The task manager design is not scalable as shown. I have a simple solution that splits the task manager's job among the cores but, unfortunately, I cannot divulge it at this time. All I can say is that the trick is to keep long-distance communication between the cores to a minimum (nearest neighbor is best) and to eliminate it altogether whenever possible. This is possible because the cores do not have their own private caches.

As far as cores having to wait, I don't think this is a problem because load balancing will be pretty much perfect. The virtual cycle is essential because that's what enforces temporal determinism. Without deterministic timing at the instruction level, there can be no software reliability.

dvyukov said...

I'm creating a software system that is similar in some senses. It uses signals at the high and medium levels, and plain old algorithms at the low level. And it allows sufficiently fine-grained work distribution and balancing.

I think that there can be only two options:
1. Truly distributed (i.e. scalable) with "best-effort" load balancing.
2. Fully centralized (i.e. not scalable) with "perfect" load balancing.

In order to support system-wide cycles, you have to have "perfect" load balancing. So... it will not be scalable.

In my system, I also find "keeping long-distance communication between the cores to a minimum (nearest neighbor is best) and eliminating it altogether whenever possible" to be the only answer.
But in my system, "best-effort" load balancing is admissible because I don't have system-wide cycles. So a core can simply sit idle, and this doesn't affect system performance much.

So I just don't believe that you have some "simple solution" that achieves a truly distributed design with "perfect" load balancing. It's just not possible by definition...

Dmitriy V'jukov

Louis Savain said...

Dmitriy wrote: "I just don't believe that you have some 'simple solution' to achieve really distributed design with 'perfect' load-balancing. Because it's just not possible by definition..."

Well, in my opinion, 100% perfect load balancing would be possible only if all CPU instructions had equal execution durations. However, given a high enough load per cycle (which is what we want in a fine-grain parallel system), the wait periods are negligible.

Using the solution that I have in mind, the TM becomes somewhat like a ring communication network. A good metaphor is to think of it as a pair of sieves (A and B) where each core represents a hole in the sieve. During every cycle, one sieve is filled with jobs (fine-grained instructions or cells) that need to be processed. What makes this approach interesting is that the sieve needs to be shaken only a little in order for all the grains to make it through the holes. The idea is to keep the surface of the grains level. In other words, no grain has to travel far before finding an available hole to fall through. Most of them do not travel at all. This is all I am at liberty to say on the matter for the time being. Sorry.
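One way to read the sieve metaphor (this is speculation on my part, not the undisclosed design) is as nearest-neighbor load leveling on a ring: each "shake," a core pushes a job to an adjacent core with a strictly smaller load, so jobs never travel more than one hop at a time. A minimal sketch under that assumption:

```python
def level(loads, rounds):
    """Speculative reading of the 'shaken sieve': each round, every core
    pushes one job to its ring neighbor if that neighbor's load is
    strictly smaller. Communication stays strictly local (one hop)."""
    n = len(loads)
    loads = list(loads)
    for _ in range(rounds):
        moved = False
        for i in range(n):
            j = (i + 1) % n  # nearest neighbor on the ring
            if loads[i] > loads[j] + 1:
                loads[i] -= 1
                loads[j] += 1
                moved = True
        if not moved:
            break
    return loads
```

Repeated local exchanges flatten the "surface of the grains" without any core ever needing a global view, which is the property the metaphor seems to be after.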

dvyukov said...

Ok. I will be waiting for further information... or for patent application number ;)

Dmitriy V'jukov