Thursday, August 28, 2008

Heralding the Impending Death of the CPU

Ancient Model vs. Modern Necessities

The modern CPU may seem like a product of the space age, but its roots are rather ancient. British mathematician Charles Babbage first conceived of the principles behind the CPU more than 150 years ago, and he was building on ideas that were older still, such as Jacquard’s punch card-driven loom and the algorithm, a mathematical concept invented a thousand years earlier by the Persian mathematician al-Khwārizmī. Like most mathematicians of his day, Babbage longed for a reliable machine that could automate the tedious task of executing algorithms, that is, tables of instructions. Parallel computing was the furthest thing from his mind. Yet amazingly, a century and a half later, the computer industry is still clinging to Babbage’s ancient computing model in the age of multicore computers.

The Good Way vs. the Bad Way

There are two main competing approaches to parallelism. The thread-based approach, the one chosen by the computer industry for general purpose computing, calls for having multiple instruction sequences processed side by side. The other approach, vector-based parallelism, calls for having multiple parallel lists of instructions, with the lists processed one after the other.
In the figure above, the small circles represent instructions or operations to be executed by the processor. The down arrows show the direction of execution. In thread-based parallelism, each thread can potentially be processed by a separate sequential core (CPU). In vector-based parallelism, the operations are fed into the processor as a collection of elements to be processed in parallel. For that, one would need a pure MIMD (multiple instructions, multiple data) vector processor in which every instruction is an independent vector that can be processed in parallel with the others. Both approaches will increase performance but, as I have explained elsewhere, the vector-based approach is better because it is deterministic and fine-grained; and it reflects the way parallel objects normally behave in nature.

Our brains are temporal signal-processing machines. The ability to sense the temporal relationships between events is essential to our understanding of the world around us. We can make sense of the vector-based approach because we can easily determine which processes are concurrent and which are sequential, and we can make a clear distinction between predecessors and successors. This sort of predictability is the sine qua non of learning and understanding. This is part of the basis of the COSA computing model. Multithreaded applications, by contrast, are hard to understand and maintain precisely because they are temporally non-deterministic. The idea that we can solve the parallel programming crisis by holding on to a flawed and inadequate programming model is preposterous, to say the least.
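To make the temporal argument concrete, here is a minimal Python sketch of my own (not part of COSA or any vendor toolchain) that contrasts the two approaches: the thread-based run can interleave its operations differently from one execution to the next, while the vector-based run processes one list of independent operations per cycle, in an order that never changes.

```python
import threading

# Thread-based: two threads append to a shared log; the interleaving (and thus
# the final order of the log) is left to the scheduler and can vary between runs.
def run_threads():
    log = []
    def worker(name):
        for i in range(3):
            log.append((name, i))
    threads = [threading.Thread(target=worker, args=(n,)) for n in ("A", "B")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

# Vector-based: each "vector" is a list of independent operations executed during
# one cycle; cycle N always completes before cycle N+1 begins, so the order is fixed.
def run_vectors(vectors):
    log = []
    for cycle, operations in enumerate(vectors):
        for op in operations:          # conceptually parallel within the cycle
            log.append((cycle, op()))
    return log

if __name__ == "__main__":
    print("threads:", run_threads())   # order may differ from run to run
    vectors = [[lambda: "add", lambda: "mul"], [lambda: "compare"], [lambda: "store"]]
    print("vectors:", run_vectors(vectors))   # always the same, deterministic order
```

The vector-based run is predictable by construction: what happens during cycle N can never depend on how the scheduler happened to slice up cycle N.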

The World Does Not Need Another CPU

I was happy to read the recent news (see Techlogg’s analysis) that Nvidia has denied rumors that it was planning on developing its own x86 compatible CPU in order to compete against Intel and AMD. Good for them. The last thing the world needs is another x86 CPU, or any CPU for that matter. Nvidia should stick to vector processors because that is where the future of computing is. However, it will have to do something in order to counteract the fierce competition from AMD’s ATI graphics products and Intel’s upcoming Larrabee. The most sensible thing for Nvidia to do, in my opinion, is to transform its base GPU from an SIMD (single-instruction, multiple data) vector core into a pure MIMD vector core. This would make it an ideal processor core for Tilera’s TILE64™ as I suggested in my recent article on Tilera. Come to think of it, maybe Nvidia should just acquire Tilera. That would be a superb marriage of complementary technologies, in my opinion.

Conclusion

The CPU is on its deathbed. The doctors are trying their best to keep it alive but they can only postpone the inevitable for a little while longer. Soon it will die of old age but I, for one, will not be shedding any tears. Good riddance! It is not really hard to figure out what will replace it. It is not rocket science. It will take courage and fortitude more than brains. Who will be the standard bearer for the coming computer revolution? Which organization? Which company? Which country will see the writing on the wall and dominate computing for decades to come? Will it be Intel, AMD, Nvidia, or Tilera? Will it be the US, India, China, Germany, France, Spain, Sweden, Singapore, Japan or Taiwan? Who knows? Only time will tell but I sense that it won’t be long now.
Next: The Radical Future of Computing, Part I

Related Articles:

Parallel Computing: Both CPU and GPU Are Doomed
How to Solve the Parallel Programming Crisis
Transforming the TILE64 into a Kick-Ass Parallel Machine
Parallel Computing: Why the Future Is Non-Algorithmic
Why Parallel Programming Is So Hard
Parallel Computing: Why Legacy Is Not such a Big Problem
Parallel Computing, Math and the Curse of the Algorithm

Tuesday, August 26, 2008

Transforming the TILE64 into a Kick-Ass Parallel Machine, Part IV

Part I, II, III, IV

Abstract

Previously in this series, I suggested that Tilera should adopt the COSA software model and I argued that it should dump the sequential processor core it is currently using for its TILE64™ multicore processor in favor of a pure MIMD vector core. I argued that the number of functional units for every instruction should reflect the usage statistics for that instruction. This would increase performance while lowering energy consumption. I went over instruction cache optimization and the importance of the COSA heartbeat, a global synchronization mechanism that enforces deterministic processing, a must for software reliability and security. In today’s post (the last in this series), I will talk about automatic load balancing and data cache optimization. As it turns out, Tilera’s iMesh™ technology is ideally suited for both tasks. Please read the previous posts in the series before continuing.

Automatic Load Balancing

I have always maintained that an application developer should never have to worry about load balancing and scalability. Changing one’s processor to a more powerful model with more cores should automatically increase the performance of existing parallel applications without modification. An automatic self-balancing mechanism must be part of the processor’s hardware because there is a need to respond quickly to changing application loads. There is also a need for high precision in order to optimize bandwidth and minimize energy consumption.

Multiple Caches

Load balancing would not be so bad if all the cores could share a single instruction cache. Then it would simply be a matter of every core fetching and processing as many instructions as possible until the current instruction buffer is empty and then moving on to the next buffer. The picture that I have in mind, as a metaphor, is that of several farm animals drinking water from a single trough at the same time. Unfortunately, we can’t do it this way in the TILE64 because 64 cores accessing the same cache would quickly create a performance-killing bottleneck. It is better for every core to have its own cache. Besides, with the kind of pure vector core architecture that I am calling for, the core bandwidth would be at least an order of magnitude greater than, say, that of an x86, a MIPS or an ARM core. In fact, it would be faster overall than existing GPU cores because it can handle both general purpose and graphics programs with equal ease and performance.

iMesh to the Rescue

The problem with multiple caches is that you want every input buffer to have more or less an equal number of instructions (see the previous post for more on instruction buffers). If the buffers are completely independent, good luck keeping them balanced. Fortunately for the TILE64, it comes with a fast on-chip network that connects all the cores together in an 8x8 grid. If we use the previous metaphor in which an instruction cache is seen as a water trough for farm animals, then we can use pipes linking the bottoms of the troughs to represent the iMesh network.


Just as gravity and fluid dynamics would keep the water surface at the same height in every trough, our load-balancing mechanism should try to keep the number of instructions in the caches about equal. I say ‘about’ because, unlike water with its huge numbers of super fine-grain molecules, processor instructions are comparatively coarse-grained.

Fast and Tricky

There are probably several solutions to the load-balancing problem. One that comes to mind calls for a central load balancer that is independent from the instruction fetch mechanism. The balancer would use the iMesh to keep track of how many instructions are in every cache and move them from one cache to another if necessary. The method for doing this might be a little tricky because we don’t want any instruction to stray too far from its core of origin, whose cache is where its data operands most likely reside.

Another solution is to place a mini-balancer in every core. It would have access only to the caches of its nearest neighbors and would use that information to determine whether or not to exchange instructions. This too could be tricky because there is the danger of entering into never-ending back and forth oscillations, which are undesirable, to say the least. Whichever method one decides to use, it must be fast because it must perform its work during instruction prefetch and must be done by the time the cores are ready to process the instructions.
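Since neither scheme is worked out in detail here, the following Python sketch is only my rough rendering of the second idea, with invented numbers: each core compares its buffer occupancy with that of its nearest neighbors and hands off instructions only when the imbalance exceeds a threshold, which is one simple way to damp the back-and-forth oscillations mentioned above.

```python
import random

GRID = 8        # 8x8 grid of cores, as in the TILE64
THRESHOLD = 2   # minimum imbalance before instructions are handed off

def neighbors(x, y):
    """Nearest neighbors of core (x, y) on the grid."""
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < GRID and 0 <= ny < GRID:
            yield nx, ny

def balance_step(load):
    """One balancing pass; 'load' maps (x, y) to the number of queued instructions."""
    moved = 0
    for (x, y) in list(load):
        for nx, ny in neighbors(x, y):
            diff = load[(x, y)] - load[(nx, ny)]
            if diff > THRESHOLD:
                transfer = diff // 2        # never overshoot, to avoid oscillation
                load[(x, y)] -= transfer
                load[(nx, ny)] += transfer
                moved += transfer
    return moved

if __name__ == "__main__":
    load = {(x, y): random.randint(0, 40) for x in range(GRID) for y in range(GRID)}
    passes = 0
    while balance_step(load) > 0:
        passes += 1
    print(f"balanced in {passes} passes; spread = {max(load.values()) - min(load.values())}")
```

A hardware balancer would obviously not iterate like this; the point is only that local, threshold-gated transfers settle quickly without a central coordinator.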

Conclusion

The computer industry is delaying the inevitable: solving the parallel programming problem will require a radical paradigm shift in computing. The big players are all eyeing each other’s moves and none is willing to be the first to make the decisive move away from the inadequate computing models of the last century. It seems that the first steps that will trigger the revolution will have to come from an outsider, that is to say, a startup company. In my considered opinion, Tilera is rather well positioned, strategically, to make a winning move and take the computer world by storm. The writing is on the wall.

In a future article, I would like to describe some of the architectural options available to the designer of a pure MIMD vector processor.

Next: Heralding the Impending Death of the CPU

Related Articles:
How to Solve the Parallel Programming Crisis
Tilera’s TILE64: The Good, the Bad and the Possible, Part I, II, III

Friday, August 22, 2008

Transforming the TILE64 into a Kick-Ass Parallel Machine, Part III

Part I, II, III, IV

Abstract

Previously in Part I and II of this series, I suggested that Tilera Corporation could blow the competition out of the water by adopting the COSA software model and dumping the processor core it is currently using for its TILE64™ multicore processor in favor of a pure MIMD vector core. The latter is somewhat similar to a superscalar processor except that every instruction is a vector and there is no need to test for instruction dependencies (the EPIC architecture of Intel's Itanium is probably a better comparison). The reason is that instructions in the core's input buffer are guaranteed to be independent in the COSA model. The new core should be capable of executing 16 or more different instructions simultaneously. This sort of hyper parallelism would transform the TILE64 into the fastest processor on the planet. Today, I will talk about how to optimize parallel instruction caching for improved performance. But there is more to computing than performance. The COSA heartbeat is a global virtual clock that synchronizes all operations. Synchronization enforces deterministic processing, a must for rock-solid applications and security. Please read the previous posts in the series and How to Solve the Parallel Programming Crisis before continuing.

Instruction Caching

The L1 instruction cache of every core would normally be divided into two parallel buffers A and B as seen below. L2 and data caches are not shown for clarity.
While the instructions (cells in COSA) in buffer A are being processed (hopefully, concurrently), buffer B is filled with the instructions that will be processed during the next cycle. As soon as all the instructions in buffer A are done, the buffers are swapped and the cycle begins anew. Of course, there are ways to optimize this process. Instead of just two buffers, the cache could conceivably be divided into three, four or more buffers and processed in a round robin fashion. An instruction prefetch mechanism can be used to fill the buffers ahead of time while the core is executing the current instructions. Even sensor-dependent branches (decision branches) can be fetched ahead of time. Branches that are not taken are simply discarded at the time of sensing (comparison). More detailed information on COSA sensors (comparison operators) can be found on the COSA System page.
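The two-buffer scheme is easy to mock up in software. The Python sketch below is my own illustration (not actual COSA or Tilera code): a prefetcher carves the instruction stream into buffers, buffer A is drained while buffer B waits, and the two are swapped at the end of every cycle.

```python
def prefetch(program, buffer_size):
    """Hypothetical prefetcher: carve the instruction stream into parallel buffers."""
    it = iter(program)
    while True:
        buffer = [op for _, op in zip(range(buffer_size), it)]
        if not buffer:
            return
        yield buffer

def run(program, buffer_size=4):
    buffers = prefetch(program, buffer_size)
    buffer_a = next(buffers, [])    # executed during the current cycle
    buffer_b = next(buffers, [])    # filled ahead of time for the next cycle
    cycle = 0
    while buffer_a:
        for op in buffer_a:         # independent instructions, conceptually parallel
            op(cycle)
        buffer_a, buffer_b = buffer_b, next(buffers, [])   # swap and refill
        cycle += 1

if __name__ == "__main__":
    program = [lambda c, i=i: print(f"cycle {c}: instruction {i}") for i in range(10)]
    run(program)
```

With more than two buffers, the same loop simply rotates through them round robin while the prefetcher stays ahead of the core.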

The COSA Heartbeat

The COSA software model is based on the idea that everything that happens in a computer should be synchronized to a global virtual clock and that every elementary operation lasts exactly one virtual cycle. This is extremely important because it is the only way to enforce deterministic processing, a must for reliability and security. It would make the TILE64 ideal for mission and safety-critical applications, especially in embedded systems. None of the current or projected multicore processors on the market support deterministic processing.

Essentially, the heartbeat is a signal that tells a core to advance to the next parallel instruction buffer. This signal must be made available to all the running programs because it is used to update a global counter that is accessible by all programs. The counter is used by sequence detectors and other cells to calculate intervals.

In a two-buffer, single-core processor, the switch happens as soon as all the instructions in the current buffer are executed. In a multicore system, we need a mechanism that sends a heartbeat signal to all the cores on the chip so they can advance in unison. This mechanism can be as simple as an AND gate that fires when all the current instruction buffers are done. It goes without saying that no core should have to wait long for a heartbeat after finishing its current buffer. This is why precise load balancing is important.
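In software terms, the heartbeat is a barrier with an action attached: the signal fires only when every core has drained its current buffer, and it bumps the global counter that the cells use for interval arithmetic. Below is a minimal Python sketch of that AND-gate behavior, using threads merely to stand in for cores.

```python
import threading

NUM_CORES = 4
CYCLES = 3
global_cycle = 0   # the global counter mentioned above

def heartbeat():
    """Fires once per cycle, when every core has finished its buffer (the AND gate)."""
    global global_cycle
    global_cycle += 1
    print(f"heartbeat -> global cycle counter = {global_cycle}")

barrier = threading.Barrier(NUM_CORES, action=heartbeat)

def core(core_id):
    for _ in range(CYCLES):
        # ... execute every instruction in the current buffer here ...
        barrier.wait()   # advance to the next buffer only on the heartbeat

threads = [threading.Thread(target=core, args=(i,)) for i in range(NUM_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```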

In Part IV, I will go over automatic load balancing and data cache optimization.

Related Articles:
How to Solve the Parallel Programming Crisis
Tilera’s TILE64: The Good, the Bad and the Possible, Part I, II, III

Wednesday, August 20, 2008

Transforming the TILE64 into a Kick-Ass Parallel Machine, Part II

Part I, II, III, IV

Abstract

In the previous post I proposed that Tilera Corporation should abandon the sequential architecture for the cores used in its TILE64™ processor and adopt a pure MIMD vector architecture in which every instruction is a vector. The reason is that, in the COSA software model, instructions are not presented to the processor in a sequence but as an array of independent parallel elements. When you stop to think about it, what are sequential cores doing in a parallel processor in the first place? It does not make sense. In this post, I will explain how to optimize a pure MIMD vector processor for the best possible performance and energy saving.

Superscalar Architecture with a Twist

One CPU architecture that comes somewhat closer to what I mean by a pure MIMD vector processor is the superscalar architecture pioneered by the legendary Seymour Cray, except that, in the COSA model, there is no need to check for dependencies between instructions because they are guaranteed to have none. In this context, note that Intel and Hewlett Packard have experimented with explicitly parallel instruction computing, a technique that uses a parallelizing compiler to pre-package instructions for parallel execution. This was the basis for the Itanium processor. I think that was a pretty cool idea because not only is it faster, it also simplifies the processor architecture. But why use an imperfect parallelizing compiler to uncover parallelism in sequential programs when COSA programs are inherently parallel to start with? Just a thought.

At any rate, the way I understand it (my readers will correct me if I’m wrong), parallel execution in a superscalar processor is achieved by placing multiple ALUs and/or FPUs on the same die. I feel that this is an unnecessary waste of both energy and silicon real estate because only one functional unit (adder, multiplier, etc.) within any ALU or FPU can be active at any given time. This is true even if instruction pipelining is used.

Vector Optimization

What I am proposing is this: instead of using multiple ALUs and FPUs, it would be best to have a single ALU or FPU but use multiple independent functional units for every type of operation. In other words, you can have two 32-bit integer adders and two 32-bit integer multipliers within the same ALU. However, just doubling or tripling the functional units within an ALU would not serve any purpose that adding more ALUs could not serve. The reason for using multiple functional units is that some instructions are used more often than others. It would be best if the proportion of parallel functional units for a given instruction reflected the usage statistics for that instruction. For example, we might want to have more adders and multipliers than left or right shifters. The advantage of this approach is that the percentage of functional units that are active simultaneously will be higher on average. The effect is to increase overall performance while using much less energy and silicon than would be necessary otherwise. Based on this approach, I envision a processor core that can easily handle 16 or more independent instructions concurrently. In fact, the entire instruction set could, in theory, be processed concurrently since every instruction is an independent vector. (I could patent some of this stuff, I know.)
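As a back-of-the-envelope illustration of the sizing argument (all the numbers below are invented, not measured), this Python sketch allocates a fixed budget of functional units in proportion to assumed instruction-usage statistics and then estimates how many of a cycle’s independent instructions could issue at once.

```python
from collections import Counter

# Hypothetical usage statistics for a pure MIMD vector core (illustrative only).
usage = {"add": 0.30, "load": 0.25, "mul": 0.20, "store": 0.15, "shift": 0.10}
BUDGET = 16   # total number of functional units we are willing to spend

# Allocate units in proportion to how often each instruction type occurs.
units = {op: max(1, round(share * BUDGET)) for op, share in usage.items()}

def issued_per_cycle(buffer):
    """How many independent instructions can issue together, given the unit mix."""
    demand = Counter(buffer)
    return sum(min(count, units[op]) for op, count in demand.items())

if __name__ == "__main__":
    print("unit mix:", units)
    buffer = ["add"] * 8 + ["shift"] * 4 + ["load"] * 2
    print(f"issued this cycle: {issued_per_cycle(buffer)} of {len(buffer)}")
```

Skewing the mix toward the most common operations raises the average number of busy units without paying for whole extra ALUs.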

In Part III, I will write about instruction caching and the COSA heartbeat.

Monday, August 18, 2008

Transforming the TILE64 into a Kick-Ass Parallel Machine, Part I

Part I, II, III, IV

Reading Material

Please read the following articles before continuing:

How to Solve the Parallel Programming Crisis
Tilera’s TILE64: The Good, the Bad and the Possible, Part I, II, III

Abstract

Tilera Corporation’s TILE64™ processor technology can serve as a springboard for a new type of multicore processor that could catapult Tilera to the forefront of the computer universe. Should it decide to act on this advice, the entire computer industry, the world over, would come to worship at its feet. It must act quickly, however, because many others have shown a marked interest in Project COSA and the parallel programming articles on this blog. In this multi-part article, I will describe several modifications that Tilera can make to the TILE64 that will turn it into a kick-ass multicore processor second to none.

Radical Core Transformation

Assuming that Tilera wants to be the industry leader and not just another me-too multicore vendor, the first thing it must do is to change the design of its processor core from a sequential or scalar core into a pure vector core. And by pure vector core, I don’t mean a GPU. I mean a pure MIMD vector processor (see section on vector processing below) in which every single instruction is a vector that can perform its operation in parallel! That’s right. You are not going to find this kind of processor core lying around anywhere. ARM doesn’t have one and neither do MIPS, Sun Microsystems, IBM, Nvidia, Intel or anybody else.

This is a major change and it is highly doubtful that Tilera could just modify the base MIPS core that it is currently using. The best thing to do is to redesign the core from scratch. Although this transformation is not absolutely necessary in order to support the COSA programming model, the performance increase would be so tremendous that it would be foolish not to do it. In my estimation, even a single-core, pure MIMD vector processor would see, on average, at least an order of magnitude increase in performance over a conventional scalar or sequential processor. Even a superscalar architecture would look like a snail in comparison. Imagine that: a one-core, general purpose, fine-grain, deterministic, parallel processor! These are the kinds of benefits that can be obtained by adopting the COSA model. My thesis is that this is the way CPUs should have been designed from the beginning. (In a future article, I will examine the pros and cons of having multiple cores with a few vector units vs. having a single core with a huge number of vector units.)

So, if Tilera were to transform every core of its 64-core processor into a pure vector core, Intel’s much-ballyhooed 48-core Larrabee (with its ancient x86 technology and its 16 SIMD vector units per core) would look like a miserable slug in comparison. LOL.

MIMD Vector Processing

Why a pure MIMD vector processor and how is it even possible? The answer is that, in the COSA software model, instructions are not presented to the processor in a sequence but as an array of independent elements to be executed in parallel, if possible. This means that, ideally, a COSA core should be designed as an MIMD (multiple instructions, multiple data) vector processor as opposed to an SISD (single instruction, single data) scalar processor. This way, every instruction is an elementary parallel vector with its own dedicated registers. Normally a vector processor operates in an SIMD (single instruction, multiple data) mode. A graphics processor (GPU) is an excellent example of an SIMD vector processor. The problem with the GPU, however, is that its performance takes a major hit when it is presented with multiple parallel instructions because it is forced to process them sequentially, which defeats the purpose of parallelism. Not so with an MIMD vector processor. Pure parallel bliss is what it is.
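Here is the distinction in toy form, a Python sketch of my own rather than any vendor’s dispatch logic: an SIMD pass over a mixed buffer must serialize, one pass per distinct opcode, while an MIMD pass issues the whole mixed buffer together, limited only by the machine’s issue width.

```python
import math

buffer = ["add", "mul", "add", "cmp", "mul", "add", "shift", "cmp"]

def simd_cycles(buffer):
    """SIMD: every pass applies a single opcode, so mixed buffers serialize."""
    return len(set(buffer))

def mimd_cycles(buffer, width=16):
    """MIMD: independent mixed instructions issue together, limited only by issue width."""
    return max(1, math.ceil(len(buffer) / width))

print("SIMD cycles needed:", simd_cycles(buffer))   # 4 distinct opcodes -> 4 passes
print("MIMD cycles needed:", mimd_cycles(buffer))   # everything fits in one cycle
```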

In part II of this article, I will talk about vector optimization.

Saturday, August 16, 2008

Tilera’s TILE64: The Good, the Bad and the Possible, Part III

Part I, II, III

Abstract

In the previous two posts in this series I went over the good and the bad of Tilera Corporation’s TILE64™ multicore processor. I praised the TILE64’s engineering and its iMesh™ technology and I criticized its programming model and development tools. In this post, I will argue that Tilera can leapfrog over the rest of the industry by adopting a radically different parallel programming model and modifying the design of the TILE64 to support the model. Add a comprehensive suite of easy-to-use development tools and the rest of the multicore industry won’t know what hit it until it’s too late.

The Possible

Computer science professor Kunle Olukotun said recently, "If I were the computer industry, I would be panicked, because it's not obvious what the solution is going to look like and whether we will get there in time for these new machines" (source: CNN Money). The solution may not be obvious to Professor Olukotun and his team of thread-obsessed [pdf] researchers at Stanford’s Pervasive Parallelism Lab, but that does not mean that there is no solution. I have argued that the solution has been staring us in the face for decades but the computer science community is blind to it. It is blind to it because of its infatuation with the Turing Machine, a soon-to-be obsolete computing model that is woefully inadequate to the task. I will not repeat my arguments here for the sake of brevity. Those of you who are interested can click on the links listed below for further information.

How to Solve the Parallel Programming Crisis
Parallel Computing: Why the Future Is Non-Algorithmic
Nightmare on Core Street
Why Parallel Programming Is So Hard
Half a Century of Crappy Computing
Parallel Computing: The End of the Turing Madness
The COSA Saga

Holy Grail

The holy grail of the computer industry is a multicore processor that has the following qualities:
  • Easy to program using graphical tools for rapid application development. Drag and drop software composition. Programming for the masses.
  • Implicit parallelism. Sequences must be explicitly specified.
  • Automatic, transparent scalability. No need to rewrite the application when upgrading the processor. Just plug it in.
  • Automatic load balancing. The programmer/designer should concentrate on the application and should never have to think about the processor.
  • Fast, fine-grain (instruction level) parallelism using an MIMD processing model.
  • Universality and homogeneity: one processor does it all. No need for a separate graphics processor.
  • Deterministic timing for rock-solid applications that do not fail, regardless of complexity. The golden age of automation.
  • Energy efficiency. Automated energy management.

Note that there is no mention of multithreading anywhere on the list. Any company that has the wisdom and the courage to design and build this processor, together with the required development tools, will have solved the parallel programming crisis and will dominate the computer industry for decades to come. There is only one caveat. This is a radical departure from the existing paradigm. Legacy applications will not run on this processor. But who cares? The old machines can be slowly phased out and will eventually disappear altogether. Read Parallel Computing: Why Legacy Is Not Such a Big Problem for more on the legacy issue.

Tilera Can Do It

I believe that Tilera has a chance to become the next computer powerhouse by modifying its existing TILE64 processor and writing the necessary dev tools on existing machines. In my next article, I will go over some of the specific changes that Tilera can make to its processor in order to turn it into a kick-ass product that will leave the competition in the dust.

Next: Transforming the TILE64 into a Kick-Ass Parallel Machine

Thursday, August 14, 2008

Tilera’s TILE64: The Good, the Bad and the Possible, Part II

Part I, II, III

Abstract

In the previous installment of this three-part article I wrote about what I think is good about Tilera Corporation’s TILE64™ multicore processor. In this post, I will show where I think Tilera has gone wrong.

The Bad

The bad thing about the TILE64 multicore processor is that it is a pain in the ass to program. This is not something that is unique to the TILE64, however. All multicore processors currently on the market have the same problem. Right now, Tilera offers nothing new or revolutionary in the way of development tools that will make anybody stand up and take notice. C and C++ and multithreading are not going to cut it, I am sorry. I have said it before (and the industry is painfully aware of it by now), writing multithreaded parallel programs in C or C++ is like pulling teeth with a crowbar. Advertising that the TILE64 has a Linux SMP runtime environment is lame to the core. Linux is a freaking dinosaur, in my opinion, an ancient technology that somehow escaped from a computer museum of the last century. LOL. But then again, I think that all operating systems are dinosaurs.

I remember that when Tilera first announced the TILE64 about a year or so ago, the Erlang fanatics were jumping up and down, drooling at the prospect of using Tilera’s spanking new iMesh™ technology to send fat messages back and forth between their so-called lightweight processes. I remember thinking, my God, what a waste of technology. I am sure some of them contacted professor Anant Agarwal, CTO and co-founder of Tilera, and tried to convince him of the superiority of functional languages for parallel computing. I don’t know what happened but I am glad to see that Tilera does not mention Erlang or functional programming on its site. Erlang is a waste of time in my opinion. Its message-passing, copy-everything, anti-state approach to computing is not just counter-intuitive; it is a disaster as far as performance is concerned. The same goes for all functional programming models. As my readers already know, I don’t mince words when it comes to telling it like I see it.

In conclusion, I will say that Tilera, regardless of its engineering prowess, has committed the same unforgivable sin as the rest of the processor industry. Its leaders decided to design a processor before they could perfect a sensible programming model. It should be the other way around. Unless they take steps to correct that mistake, they will fail. Their only consolation is that everybody else is panicking and riding in the same sinking boat. Not much of a consolation if there are no lifeboats around, that’s for sure. I gave Agarwal a hard time about this before but my motivation for writing this series of posts is that I don’t think that all is necessarily lost.

I will add that my main interest in parallel computing comes from my research in AI. I am convinced that artificial brains will be the most prevalent computer applications in the not-too-distant future. These smart programs will require massive parallelism in the form of super fast and efficient multicore processors. It does not really matter to me which organization or company unveils the correct solution to the parallel programming crisis and/or uses it to bring to market a kick-ass product that blows everybody else out of the water. I just think that Tilera has a better chance than, say, Intel, Sun, IBM or AMD. Those guys are way too married to last century’s technologies to notice the writing on the wall.

Next: Part III, The Possible

Related articles:

How to Solve the Parallel Programming Crisis
Transforming the Tile64 into a Kick-Ass Parallel Machine

Wednesday, August 13, 2008

Tilera’s TILE64: The Good, the Bad and the Possible, Part I

Part I, II, III

Abstract

This is a three-part article in which I will examine what I think is good and bad about the TILE64 multicore processor and what I think Tilera can do in order to blow everybody out of the water. And by everybody, I mean Intel, AMD, Nvidia, MIPS, Sony, IBM, Sun Microsystems, Freescale Semiconductor, Texas Instruments, you name it. I mean everybody.

The Good

Tilera Corporation’s TILE64™ multicore processor is a marvel of engineering. It sports 64 general purpose, full-featured processor cores organized in an 8x8 grid linked together by Tilera’s super fast iMesh™ on-chip network. Each core has its own L1 and L2 caches with separate L1 partitions for instruction and data. The TILE64 cores can be individually switched into sleep mode to save energy when not in use.

One of the biggest problems in multicore processor design has to do with random access memory. Memory access has always been a problem because the processor needs data faster than the memory subsystem can deliver. The problem is even worse when there are multiple cores sharing memory because you get into bus contention problems. It is very likely that some future breakthrough in quantum tunneling or optical memory will eliminate the bottleneck but, in the meantime, the best that processor designers can do is to keep frequently accessed data in fast on-chip caches. This is all fine and dandy but a new problem arises when two or more caches contain overlapping areas of memory; if you modify the data in one, you must do so in the others. Maintaining cache coherence can quickly turn into a performance killing mess.

Tilera came up with an elegant way around this problem by arranging the cores in a grid and connecting them with a high-speed network or mesh. Tilera’s iMesh™ network makes it possible for a core to access the cache of an adjacent core or even that of a distant core, if necessary. This way, there is no problem of cache coherence. Apparently, the way it works is this: if a core needs a piece of data and the data is already in its own cache, then everything is hunky dory and there is no need to look elsewhere. If the data is not in the core’s local cache, the core uses the mesh to find it elsewhere. Obviously, in order to minimize latency as much as possible, it pays to optimize the system in such a way that cached data is as close as possible to the cores that are using it. I suspect that Tilera’s approach does not have a scalability problem; that is to say, the performance hit does not grow exponentially as you increase the number of cores. We can expect Tilera to come out with processors boasting hundreds of cores in the foreseeable future. In sum, Tilera’s TILE64™ is a beautiful thing.
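To see why placement matters, here is a toy Python model of the lookup, my own guess at the behavior rather than Tilera’s documented protocol: a core pays one unit for a local hit, a per-hop penalty for a remote hit somewhere on the mesh, and a large penalty for going off-chip.

```python
def hops(a, b):
    """Manhattan distance, in mesh hops, between two cores on the grid."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def fetch_latency(core, address, caches, local=1, per_hop=2, off_chip=100):
    """Illustrative latency for core 'core' to fetch 'address' (arbitrary units)."""
    if address in caches.get(core, set()):
        return local                                   # hit in the local cache
    holders = [c for c, lines in caches.items() if address in lines]
    if holders:
        nearest = min(holders, key=lambda c: hops(core, c))
        return local + per_hop * hops(core, nearest)   # served over the mesh
    return off_chip                                    # nobody has it: main memory

caches = {(0, 0): {0x10}, (3, 4): {0x20}, (7, 7): {0x20}}
print(fetch_latency((0, 0), 0x10, caches))   # 1: local hit
print(fetch_latency((0, 0), 0x20, caches))   # 15: remote hit served by the closer core (3, 4)
print(fetch_latency((0, 0), 0x30, caches))   # 100: off-chip access
```

The numbers are arbitrary, but the shape of the cost function is the whole argument for keeping data near the cores that use it.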

Next: Part II, The Bad

Sunday, August 10, 2008

Parallel Computing: Timing, Reliability, Safety and Security

A Computer Is Not What It Does

The COSA computing model is closely related to my ongoing research in artificial intelligence and my rejection of the GOFAI symbol manipulation approach to AI. I understood early on that it is symbol manipulation that requires intelligence and not the other way around. In other words, symbol manipulation is something that intelligence does, not what it is. Likewise, I realized that function calculation is what a computer does and not what it is. Calculating algorithmic functions is just one of the many possible behaviors that a computer can exhibit but it is not what computing is based on. This is analogous to the way human intelligence operates. We can do all sorts of things such as speaking, running, walking, and grasping but our intelligence is not based on any of those things. Some other principle is necessary.

Nature Is Parallel and Reactive

It also occurred to me that serial processing is not an inherent property of real-world objects but an emergent phenomenon that results from the way objects react to one another. A predecessor/successor sequence should thus be seen, not as an ordered list of operations to be processed by a central processor (the Turing model), but as a causal or reactive mechanism analogous to stimulus/response coupling in psychology or to sensor/effector pairing in neuroscience. I reasoned that computers and intelligent systems are types of behaving machines and they can exhibit an indefinite number of behaviors. For more on this subject, read The Hidden Nature of Computing.

Timing Is of the Essence

It did not take me long to figure out that precise timing is essential to the reliable operation of a parallel system. It turns out that deterministic timing can be easily implemented in software by ensuring that all elementary reactions have the same duration, based on a virtual system-wide clock. This is the reason that, as I explained elsewhere on this blog, I initially promoted COSA as a solution to the software reliability crisis. I still do, although I have recently changed my approach by emphasizing the inherent parallel nature of COSA programs.
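A minimal software rendering of that rule might look like the sketch below (my own, not the COSA reference implementation): a reaction that fires during cycle N can only schedule its successors for cycle N+1, so every elementary reaction lasts exactly one virtual tick no matter what it computes.

```python
class VirtualClock:
    """Effects scheduled during cycle N become visible at cycle N + 1, never sooner."""
    def __init__(self):
        self.cycle = 0
        self.pending = []    # reactions scheduled for the next cycle

    def schedule(self, reaction):
        self.pending.append(reaction)

    def tick(self):
        firing, self.pending = self.pending, []
        for reaction in firing:
            reaction(self)   # each reaction may schedule its successors
        self.cycle += 1

if __name__ == "__main__":
    clock = VirtualClock()

    def effector(clk):
        print("effector fires at cycle", clk.cycle)

    def sensor(clk):
        print("sensor fires at cycle", clk.cycle)
        clk.schedule(effector)    # the successor runs exactly one cycle later

    clock.schedule(sensor)
    clock.tick()   # sensor fires at cycle 0
    clock.tick()   # effector fires at cycle 1
```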

Task Scheduling vs. Security

I am so convinced of the crucial importance of deterministic timing that I insist that every program in a COSA system should be synchronized to a single virtual clock. I refuse to consider any suggestion that applications or components could have their own local clocks ticking at different frequencies so as to manage scheduling. I believe it would be disastrous to system-wide reliability and security. Which brings me to a message that Tom Lieber emailed to me recently:

Consider a program which is (maliciously?) designed with a large number of components continuously slinging signals around so that the number of signals to be processed for its execution is gigantic compared to the number of signals in the other components. In a process-based model, every process can still be given its fair share of the processor time via scheduling, but in COSA it seems that there is no recourse, if the only mechanism for scheduling priority is the message queue.
In my opinion, this is precisely why traditional task (or process) scheduling is bad. It allows malicious programs to infect a system by giving them a portion of the processor’s time. Traditional multitasking operating systems provide no easy way to detect intruders because the overall processor load (number of requested instructions for a given concurrent cycle) is unknown. By contrast, a denial of service attack, such as the one described by Tom, would never work in a COSA system because the system keeps track of the load being processed during every cycle. As soon as the load grows above a pre-configured safe limit, the system can either alert an administrator or automatically shut down the offending application. In fact, every COSA application can monitor its own performance (measured in number of application cycles per second) and trigger an alarm if it drops below a preset unsafe level.
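The detection logic described above fits in a few lines. In this Python sketch the limits and component names are placeholders of my own: the system tallies how many signals each component contributes to the current concurrent cycle and raises an alarm when the total load, or any single component’s share of it, exceeds a configured ceiling.

```python
from collections import Counter

MAX_SIGNALS_PER_CYCLE = 1000    # illustrative pre-configured safe limit
MAX_SHARE_PER_COMPONENT = 0.5   # no single component may claim more than half the cycle

def check_cycle(signals):
    """'signals' holds one component name per signal processed this concurrent cycle."""
    load = len(signals)
    alarms = []
    if load > MAX_SIGNALS_PER_CYCLE:
        alarms.append(f"cycle load of {load} signals exceeds the safe limit")
    for component, count in Counter(signals).items():
        if load and count / load > MAX_SHARE_PER_COMPONENT:
            alarms.append(f"{component} is flooding the cycle ({count} of {load} signals)")
    return alarms

# A component slinging signals around maliciously is visible within one cycle.
signals = ["malicious_app"] * 1200 + ["text_editor"] * 30 + ["audio_mixer"] * 20
for alarm in check_cycle(signals):
    print("ALARM:", alarm)
```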

Event Completion in Mission-Critical Applications

In many industries (e.g., aviation, power generation, transportation, health care, defense, etc.), software failure is not an option. COSA will go a long way toward eliminating the sort of accidental failures that are the bane of traditional systems by automatically eliminating blind code or hidden data dependencies. There is another way in which COSA can increase the robustness of a system even in the event of a hardware failure. Safety-critical systems, such as avionics, use a plurality of sensors to guide automated decision-making. Unlike software, hardware sensors can suffer from material fatigue. Every so often, they fail. Failure can also occur due to unforeseen circumstances in the environment. For example, a light sensor may become temporarily affected by smoke, dirt or some other obstruction.

During the testing phase of a COSA program, the system can automatically discover temporal correlations. At the discretion of the programmer, temporal watchdogs can be inserted in the program to trigger an alarm if an event (discrete signal) fails to occur at its expected time and/or to automatically generate a replacement event in order to prevent a total breakdown. In artificial intelligence, this is called pattern completion. Humans and animals could not survive without it. This anticipatory capability is what allows us to follow a conversation in a noisy room or recognize a partially occluded object. I believe it can go a long way toward increasing the robustness of complex systems even in the presence of hardware failures and sensory uncertainty.
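One way such a watchdog could be written, again as a Python sketch with invented names and numbers: the expected interval is learned during testing, and at run time a missing event either raises an alarm or is replaced by a synthetic event so that downstream logic keeps working.

```python
class TemporalWatchdog:
    """Expects its event about every 'interval' cycles (a correlation learned during
    testing). A missing event raises an alarm and, optionally, a stand-in event."""
    def __init__(self, name, interval, tolerance=1, complete=True):
        self.name = name
        self.interval = interval
        self.tolerance = tolerance
        self.complete = complete      # perform pattern completion if True
        self.last_seen = 0

    def on_event(self, cycle):
        self.last_seen = cycle        # the real sensor fired

    def on_tick(self, cycle):
        if cycle - self.last_seen > self.interval + self.tolerance:
            print(f"ALARM: {self.name} missed its expected event at cycle {cycle}")
            if self.complete:
                self.on_event(cycle)  # substitute a replacement event
                print(f"        synthetic {self.name} event generated")

if __name__ == "__main__":
    watchdog = TemporalWatchdog("light_sensor", interval=5)
    for cycle in range(1, 20):
        if cycle in (4, 9):           # the sensor goes quiet after cycle 9
            watchdog.on_event(cycle)
        watchdog.on_tick(cycle)
```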

Conclusion

The nice thing about the COSA model is that it addresses not just the parallel programming problem but the software reliability, safety and security issues as well. These are the most important issues facing the computer industry at this time. The captains of the industry must either embrace the COSA model or suffer increased financial and societal hardships. There are no two ways about it, in my opinion. COSA is the solution and it is not rocket science. More to come.

Related Article:

How to Solve the Parallel Programming Crisis

Wednesday, August 6, 2008

Parallel Programming: Intel, AMD, Nvidia, US and the World

Larrabee

There is no getting around the fact that, unless powerful domain-specific tools are built on top of GPGPU-type processors like Larrabee, programmers will find it extremely hard to write software for them. The reason is that they would need to gain intimate knowledge of the hardware.

No Solution In Sight

The problem with heterogeneous processors/GPGPUs is that they incorporate two different and incompatible approaches to parallel programming (fine-grain data-parallel SIMD and coarse-grain multithreaded MIMD). Each has its advantages and disadvantages but having to deal with both makes software development a worse pain in the butt than it already is. It does not seem that Intel, AMD, Nvidia or any of the other multicore vendors is able or willing to go beyond this intolerable incongruity by offering a universal solution that is easy to program and is equally at home in all sorts of application environments.

Don't be Too Complacent

There are many countries and organizations in the world that would not hesitate to seize a clear opportunity to dominate the computer industry in this century. Very big money hangs in the balance and the industry is at a crossroads. The parallel programming crisis is quickly getting to the point where something radical will have to rise to the surface in order to solve the problem. The market wants nothing less than a solution and it wants it yesterday. Intel and the others should not be so complacent or arrogant as to think that the heads of their engineering and research teams are the brightest in the world or that their vision of the future of parallel computing is perfect. There is nothing visionary or revolutionary about the design of Larrabee, regardless of the fine engineering that went into it. The parallel programming problem is still with us and the whole world is still searching for a solution. Intel may be a giant in the processor business but so was Xerox in the copier industry. Big is not synonymous with invincible. The industry seems to be on the verge of a seismic event that will send the unprepared crashing into the dust. Wisdom and caution are advised.

The Answer Is Staring at Us

The computer industry is still thinking with last century's brain. They must abandon the old stuff and embrace the new stuff, which, when you think about it, is not all that new. It's just that it has never been seen as a parallel programming model by the industry before. At least, not seriously. There is a way to design and program parallel computers that is 100% threadless while offering the flexibility of a CPU and the fast fine-grain deterministic parallelism of the GPU. It is based on a method that programmers have been using for decades to emulate parallelism in such applications as neural networks, cellular automata, simulations and video games. To find out more, read How to Solve the Parallel Programming Crisis. Dismiss it at your own peril.
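That decades-old method is essentially the two-buffer update loop at the heart of cellular automata, neural network and game simulations: every cell’s next state is computed from a frozen snapshot of the current state, then the buffers are swapped, so the result never depends on the order in which the cells are visited. A bare-bones Python sketch (the update rule is an arbitrary toy):

```python
def step(current):
    """Compute every cell's next state from the current snapshot, then swap buffers."""
    n = len(current)
    # Toy rule: a cell turns on when exactly one of its two neighbors is on.
    return [1 if current[(i - 1) % n] + current[(i + 1) % n] == 1 else 0
            for i in range(n)]

state = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]
for generation in range(5):
    state = step(state)
    print(generation, state)
```

Because the new state is built entirely from the old snapshot, the cells behave as if they were all updated at the same instant, which is exactly the fine-grain, deterministic parallelism being advocated here.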

Conclusion

To give my readers an idea of the wide fascination that the parallel programming problem holds for the world at large, here is an incomplete list of the countries from which readers regularly visit my blog, not necessarily in order of frequency, although the US is the clear leader:

United States, China, India, Singapore, Vietnam, Sweden, France, Italy, Finland, Iceland, Israel, Switzerland, Egypt, Iran, Turkey, Canada, Germany, Denmark, Belgium, Spain, Portugal, Greece, Norway, Ireland, United Kingdom, Netherlands, South Korea, Japan, Thailand, Malaysia, Brazil, Peru, Chile, Mexico, Colombia, Bolivia, Romania, Hungary, Poland, Russia, Latvia, Estonia, Serbia, Ukraine, Australia, New Zealand, Argentina, Philippines, Indonesia, Paraguay, Venezuela, Czech Republic, Mongolia, Taiwan, Hong Kong, Nigeria, Kenya, South Africa, Pakistan, Austria, etc.

All right, you get the picture. Theoretically, any one of those countries has a chance to dominate the computer industry in this century. And guess what? Given what I know about intellectual property laws, it would take only a minuscule investment on the part of a country or private organization to lock in the lion’s share of this business for decades to come. There is big money at stake. And I mean, BIG MONEY.

Related Article:

How To Solve the Parallel Programming Crisis

COSA Is Not Dataflow

I wrote this about two years ago but it bears repeating.

Dataflow Languages Are Not New

I frequently receive emails from people who insist that COSA is nothing but a dataflow programming model and that I am just reinventing the wheel. Dataflow is not new. I have been attracted to visual dataflow languages (e.g., LabView, Prograph, etc.) for many years. I remember thinking that the graphical representation, the powerful reuse mechanism and the flow-through style combined beautifully to create a much more intuitive and easy way to develop computer applications. People have been using dataflow languages for over twenty-five years. The problem is that, if the dataflow model were the panacea that would cure all that ails the software industry, it would have gained widespread acceptance by now; somehow it has not. I mean, shouldn’t we all be using dataflow for everything at this point? The reason is obvious; something is missing. Dataflow languages tend to be rather high-level and are not as flexible as some of the more conventional languages. Adding a new, non-composite object to a dataflow language is not easy for the average user and usually requires the use of a compiled language such as C or Java.

COSA Is Not Dataflow

Let me come right out and say it: COSA is not a dataflow programming model. COSA is a signal-based, synchronous, reactive software model in the tradition of reactive languages like Esterel, Lustre, Signal, etc... There are certainly similarities between COSA components and dataflow objects in that COSA components can be made to pass queued messages (data) to one another in a unidirectional (input/output) and asynchronous manner but that's about it. COSA is much more flexible than this. Here are a few differences:
  • Above all, COSA is synchronous at its fundamental level (the operation or instruction level). As far as I know, this is not true of dataflow languages. Heck, I don't think it's true of other reactive languages either.
  • Whereas dataflow languages are algorithmic internally, COSA's elementary cells are entirely signal-based.
  • COSA solves the data and event dependency problem.
  • COSA uses fine-grained parallelism.
  • COSA is temporally deterministic, that is to say, the execution order (concurrent or sequential) of operations in a COSA program is guaranteed. As a result, COSA programs are extremely reliable.

In conclusion, let me say that COSA is more like VHDL or Verilog, neither of which falls into the category of dataflow languages. A COSA program is comparable to a logic circuit or a pulsed neural network. So please, stop telling me that COSA is a dataflow language. It is not.

Monday, August 4, 2008

Larrabee: Intel’s Hideous Heterogeneous Beast

The Battle of the Cores

According to the latest indications from Intel, its new processor, code named Larrabee, will feature between 8 and 48 x86 cores and will be slated for both general purpose and graphics processing (source: Computer World). Compare this to Nvidia’s Tesla 10P with its 240 cores and AMD’s soon to be released Firestream 9250, which will sport a whopping 500 cores. So obviously, when it comes to number crunching, Intel’s offering does not even come close. Worse, it won’t be released until 2009 at the earliest. Intel counters that Larrabee will be compatible with existing programming languages such as C and C++ and thus take advantage of the wide familiarity that programmers already have with these languages. Well, wup dee do! That'll teach them.

Update (8/5): Please read the comments below. A reader wrote to point out that AMD and Nvidia misadvertise the number of cores built into their graphics processors.

Hideous to the Core

Larry Seiler, chief architect in Intel's visual computing group, claims that Larrabee “will combine the full programmability of the CPU with the kinds of parallelism and other special capabilities of graphics processors” (source: SlashGear). This can only mean one thing. A programmer will have the option of using a percentage of the cores for general purpose, coarse-grained, MIMD multithreaded parallelism while dedicating the rest to fine-grained, SIMD vector processing for graphics purposes. How the partitioning will be realized, or whether it will be fixed or programmable, is anyone’s guess. The fact remains that Larrabee is hideous to the core, if you'll pardon the pun. Heterogeneous processors will wreak havoc on productivity due to the increased programming difficulty of having to deal with two incompatible modes of execution. In addition, effective load balancing across all the cores becomes a nightmare to manage, if it is at all possible.

Intel’s Folly and Industry Lemmings

The following is a quote from an EETimes article:



Intel says it has a number of internal teams, projects and software-related efforts underway to speed the transition, but the tera-scale research program has been the single largest investment in Intel's technology research and has partnered with more than 400 universities, DARPA and companies such as Microsoft and HP to move the industry in this direction.
This is too horrible to even contemplate. In my considered opinion, Intel is single-handedly driving the computer industry over a cliff and the rest of the industry is cheerfully following along like a bunch of lemmings. Oh, the humanity!

Intel Needs to Go to Rehab

Intel is high on its own dope. Over the past several months, from the time I posted my Nightmare on Core Street series of articles, Intel Research has visited my blog hundreds of times. My message is simple: the industry must get rid of its addiction to multithreading and algorithmic computing and adopt a universal, non-algorithmic parallel programming model, one which can handle general purpose computing and graphics processing with equal ease, or anything else you can throw at it. The heterogeneous or hybrid approach to parallelism is absurd in the extreme. It hurts me just to think about it. Justin Rattner needs to go to computer science rehab and take all his buddies with him, in my opinion. Which reminds me of Amy Winehouse's Rehab song:



I’m sorry but the only thing that is keeping me from tearing my hair out is a little bit of humor. Besides, I like Amy Winehouse, drugs, alcohol and all. :-D

Related articles:

How to Solve the Parallel Programming Crisis
Heralding the Impending Death of the CPU
Parallel Computing: Both CPU and GPU Are Doomed

Update (8/5):

CNET has a nice article on Larrabee that explains certain details about the chip's vector and scalar units. The author is not very impressed (he compares it to a science project that got out of hand) but check it out anyway.