Wednesday, August 13, 2008

Tilera’s TILE64: The Good, the Bad and the Possible, Part I


Abstract

This is a three-part article in which I will examine what I think is good and bad about the TILE64 multicore processor and what I think Tilera can do in order to blow everybody out of the water. And by everybody, I mean Intel, AMD, Nvidia, MIPS, Sony, IBM, Sun Microsystems, Freescale Semiconductor, Texas Instruments, you name it. I mean everybody.

The Good

Tilera Corporation’s TILE64™ multicore processor is a marvel of engineering. It sports 64 general-purpose, full-featured processor cores organized in an 8x8 grid and linked together by Tilera’s super-fast iMesh™ on-chip network. Each core has its own L1 and L2 caches, with separate L1 partitions for instructions and data. The TILE64 cores can be individually switched into sleep mode to save energy when not in use.

One of the biggest problems in multicore processor design has to do with random access memory. Memory access has always been a problem because the processor needs data faster than the memory subsystem can deliver it. The problem is even worse when multiple cores share memory because they end up contending for the same bus. It is very likely that some future breakthrough in quantum tunneling or optical memory will eliminate the bottleneck but, in the meantime, the best that processor designers can do is to keep frequently accessed data in fast on-chip caches. This is all fine and dandy, but a new problem arises when two or more caches hold copies of the same area of memory: if you modify the data in one, you must update or invalidate the copies in the others. Maintaining cache coherence can quickly turn into a performance-killing mess.
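To make the coherence problem concrete, here is a toy Python sketch. It is not how any real processor is built; the `Core` class, its `peers` parameter and the address `0x10` are all names I made up for illustration. It only shows how a private cache can go stale when another core writes to the same address, and how invalidating the other copies fixes it.

```python
class Core:
    """Toy core with a private cache in front of shared memory (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory  # shared dict standing in for RAM
        self.cache = {}       # private per-core cache

    def read(self, addr):
        # On a miss, fill the cache from memory; on a hit, trust the cached copy.
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def write(self, addr, value, peers=()):
        # Write through to memory and invalidate the copy held by each listed peer.
        self.cache[addr] = value
        self.memory[addr] = value
        for peer in peers:
            peer.cache.pop(addr, None)


memory = {0x10: 1}
a, b = Core(memory), Core(memory)
a.read(0x10)
b.read(0x10)                 # both cores now cache address 0x10

a.write(0x10, 2)             # no invalidation: b keeps its old copy
stale = b.read(0x10)

a.write(0x10, 3, peers=[b])  # invalidate b's copy as part of the write
fresh = b.read(0x10)

print(stale, fresh)
```

The catch is the `peers` loop: every write has to reach every cache that might hold a copy, and that traffic is what grows into a mess as the number of cores goes up.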

Tilera came up with an elegant way around this problem by arranging the cores in a grid and connecting them with a high-speed mesh network. Tilera’s iMesh™ network makes it possible for a core to access the cache of an adjacent core or even that of a distant core, if necessary. This way, there is no problem of cache coherence. Apparently, the way it works is this: if a core needs a piece of data and the data is already in its own cache, then everything is hunky dory and there is no need to look elsewhere. If the data is not in the core’s local cache, the core uses the mesh to find it elsewhere. Obviously, in order to minimize latency, it pays to optimize the system so that cached data is as close as possible to the cores that are using it. I suspect that Tilera’s approach does not have a scalability problem; that is to say, the performance hit does not grow exponentially as you increase the number of cores. We can expect Tilera to come out with processors boasting hundreds of cores in the foreseeable future. In sum, Tilera’s TILE64™ is a beautiful thing.
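The lookup described above can be sketched in a few lines of Python. This is my own toy model, not Tilera's actual mechanism: I am assuming that the cost of reaching another tile is the number of mesh hops (the Manhattan distance between grid positions), and the `lookup` function and the `caches` table are hypothetical names made up for the example.

```python
GRID = 8  # the TILE64's 64 cores are arranged as an 8x8 grid


def tile_coords(tile_id):
    """Map a tile number 0..63 to its (row, column) position in the grid."""
    return divmod(tile_id, GRID)


def hops(src, dst):
    """Mesh hops between two tiles, assumed here to be the Manhattan distance."""
    (r1, c1), (r2, c2) = tile_coords(src), tile_coords(dst)
    return abs(r1 - r2) + abs(c1 - c2)


def lookup(requester, addr, caches):
    """Return (tile, hop_count) for the nearest tile caching addr, or None."""
    # The local cache is checked first: a hit there costs zero hops.
    if addr in caches.get(requester, ()):
        return requester, 0
    # Otherwise go out over the mesh and pick the closest tile holding the data.
    holders = [tile for tile, lines in caches.items() if addr in lines]
    if not holders:
        return None
    nearest = min(holders, key=lambda tile: hops(requester, tile))
    return nearest, hops(requester, nearest)


# Hypothetical cache contents: tile number -> set of cached addresses.
caches = {0: {0xA0}, 9: {0xB0}, 63: {0xB0}}

local = lookup(0, 0xA0, caches)   # found in tile 0's own cache
remote = lookup(0, 0xB0, caches)  # tile 9 is closer to tile 0 than tile 63 is
worst = hops(0, 63)               # corner to corner across the grid

print(local, remote, worst)
```

In this model the longest trip on an 8x8 grid is 14 hops, corner to corner, and it grows only with the sum of the grid's width and height. That is the intuition behind keeping cached data close to the cores that use it, and behind my hunch about scalability.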

Next: Part II, The Bad
