Monday, February 8, 2010

IBM Power 7 and eDRAM Cache

Welcome IBM to the world of 64 Bit Octal-Core Computing!

On February 8th, 2010, Timothy Prickett Morgan wrote about the IBM Power 7 chip launch in The Register: "Sparc T 64-threaded T2 and T2+... quad-core, eight-threaded Tukwilas... the Power7 chip has 32 threads"

It is nice to see that the trail first blazed by the first-generation OpenSPARC T1 with 32 threads is being followed by IBM Power and Intel Itanium, both applying different technology to compete with Sun's second- and third-generation 64-threaded OpenSPARC processors.

Possible Architecture Trade-offs to eDRAM in Cache

Timothy Prickett Morgan also wrote, "The effect of this eDRAM on the Power7 design, and its performance, is two-fold. First, by adding the L3 cache onto the chip..."

The use of embedded DRAM to reduce transistors, squeeze in more cores, and reduce latency was a great idea, even with the refresh logic added onto the chip!

Every benefit comes with drawbacks. The media has been silent on the possible trade-offs, which confuses me.

Static RAM has traditionally been beneficial to chip manufacturers, since it gives fast and regular access to the memory cells without having to wait for a slow refresh signal to propagate across the RAM. It is interesting that no one (and I mean NO ONE) is talking about the performance impact of CPU cores needing to wait for refresh on the eDRAM.

I wonder what the ratio of performance hit to reduction in latency was in moving to eDRAM?
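
One way to frame that question is a back-of-envelope model of how often a cache access might collide with a refresh. The sketch below is only an illustration: the function names are hypothetical, and every input (rows per bank, time to refresh one row, retention time, clock speed) is a made-up placeholder, not a published Power 7 figure. It also assumes accesses arrive at random times and that a colliding access waits, on average, half of one row refresh.

```python
def refresh_busy_fraction(rows_per_bank, row_refresh_ns, retention_us):
    """Fraction of time one bank spends refreshing, assuming every row
    is refreshed exactly once per retention period."""
    busy_ns = rows_per_bank * row_refresh_ns
    return busy_ns / (retention_us * 1000.0)


def expected_stall_cycles(busy_fraction, row_refresh_ns, clock_ghz):
    """Average extra cycles per access caused by refresh collisions.

    ASSUMPTION: accesses arrive at random times, and an access that hits
    a refreshing bank waits half a row-refresh on average.
    """
    average_wait_ns = row_refresh_ns / 2.0
    return busy_fraction * average_wait_ns * clock_ghz


if __name__ == "__main__":
    # All four inputs are hypothetical placeholders for illustration.
    busy = refresh_busy_fraction(rows_per_bank=1024,
                                 row_refresh_ns=2.0,
                                 retention_us=40.0)
    stall = expected_stall_cycles(busy, row_refresh_ns=2.0, clock_ghz=4.0)
    print(f"bank busy refreshing: {busy:.2%}")
    print(f"average refresh stall: {stall:.3f} cycles per access")
```

Plugging in measured values, if IBM ever publishes them, would show whether the refresh penalty is noise or a real cost next to the latency saved by keeping the L3 on the chip.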

Multi-ported Static RAM allows for fast (simultaneous) access from multiple cores into cache. With multi-process-heavy workloads, where data in the cache may not be simultaneously accessed from different cores or hardware strands, eDRAM may be a good fit. With multi-threaded-heavy software workloads, where the data in the cache will be accessed simultaneously by multiple cores and hardware strands, eDRAM may suffer in comparison to multi-ported SRAM due to excessive, inefficient re-loads from main memory and inefficient sharing.

I wonder what the ratio of benefit to throughput hit was for the move to eDRAM under various real-world workloads, where multi-threaded applications need to share the instructions & data in the cache?

I wonder if the performance of eDRAM will be as linear as SRAM, as the processors get loaded up? (This reminds me of the Intel 50MHz 80486 vs Intel 66MHz (33MHz bus) 80486 tradeoff from years past...)

Connection to Network Management

Network Management traditionally deals with extremely highly threaded workloads. Managing tens of thousands of devices with hundreds of thousands of managed resources often requires thousands of threads in a single process, with very regular (1-5 minute) polling intervals that demand tremendous throughput.
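
To put rough numbers on that, here is a small arithmetic sketch. The resource and thread counts are hypothetical picks from within the ranges mentioned above ("hundreds of thousands" of resources, "thousands" of threads), and the function names are made up for illustration.

```python
def polls_per_second(managed_resources, interval_seconds):
    """Sustained poll rate needed to touch every resource once per interval."""
    return managed_resources / interval_seconds


def resources_per_thread(managed_resources, threads):
    """Average number of resources each polling thread must service."""
    return managed_resources / threads


if __name__ == "__main__":
    RESOURCES = 300_000  # ASSUMPTION: illustrative count, not a measurement
    THREADS = 2_000      # ASSUMPTION: illustrative count, not a measurement

    for minutes in (1, 5):
        rate = polls_per_second(RESOURCES, minutes * 60)
        print(f"{minutes}-minute interval: {rate:.0f} polls per second")

    per_thread = resources_per_thread(RESOURCES, THREADS)
    print(f"{per_thread:.0f} resources per thread")
```

Under these assumed figures, a 1-minute interval works out to 5,000 polls per second and a 5-minute interval to 1,000, sustained around the clock, with each poll touching state shared inside a single process.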

The use of Power 7 in these types of managed-device-facing, highly threaded workloads is yet to be measured. It may be one of the most fabulous chips on the market, or it may be mediocre, for the network management space. Power is not a substantial player in the Network Management world, so I would not really expect engineers to tune the CPU for this type of workload.

I would expect that engineers tuned Power for the Database market. Network Management does require long-term storage of data, so this may be a very good back-end platform.


IBM's move to eDRAM is very interesting, almost as interesting as OpenSPARC moving to highly threaded octal cores many years ago.

Will other vendors emulate IBM in the move to eDRAM cache, the same way IBM, Intel, and AMD are moving to 64 bit octal-core as OpenSPARC did years ago?

U P D A T E ! ! !

Another article has come out discussing IBM's use of eDRAM:

First in the chain is the 32KB L1 data cache, which has seen its latency cut in half, from four cycles in the POWER6 to two cycles in POWER7. Then there's the 256KB L2, the latency of which has dropped from 26 cycles in POWER6 to eight cycles in POWER7—that's quite a reduction, and will help greatly to mitigate the impact of the shared L3's increased latency.

The POWER7's L3 is its most unique feature, and, at 32MB, it's positively gigantic. IBM was able to cram such a large L3 onto the chip by making it out of embedded DRAM (eDRAM) instead of the usual SRAM. This decision cost the cache a few cycles of latency
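
The latencies quoted above lend themselves to a rough average-memory-access-time (AMAT) estimate. This sketch is an illustration only, not something from either article: the L1 and L2 hit latencies are the quoted figures, while the miss rates and the cost of going beyond L2 are made-up placeholders, held equal for both chips purely to isolate the effect of the faster L1 and L2.

```python
def amat(levels, beyond_cycles):
    """Average memory access time in cycles.

    levels: list of (hit_latency_cycles, miss_rate), ordered from L1 outward.
    beyond_cycles: average cost of an access that misses the last listed level.
    """
    total = beyond_cycles
    # Fold from the outermost level inward: t = hit + miss_rate * (next level).
    for hit_latency, miss_rate in reversed(levels):
        total = hit_latency + miss_rate * total
    return total


# ASSUMPTIONS: hypothetical miss rates and beyond-L2 cost, not measured values.
L1_MISS = 0.05
L2_MISS = 0.30
BEYOND_L2 = 100.0

# Hit latencies taken from the quoted article: L1 4 -> 2, L2 26 -> 8 cycles.
power6 = amat([(4, L1_MISS), (26, L2_MISS)], BEYOND_L2)
power7 = amat([(2, L1_MISS), (8, L2_MISS)], BEYOND_L2)

# Extra beyond-L2 cycles POWER7 could absorb before losing the advantage.
headroom = (power6 - power7) / (L1_MISS * L2_MISS)

if __name__ == "__main__":
    print(f"POWER6-style L1/L2: {power6:.2f} cycles per access")
    print(f"POWER7-style L1/L2: {power7:.2f} cycles per access")
    print(f"beyond-L2 headroom: {headroom:.0f} cycles")
```

The point of the exercise is the shape of the trade-off, not the numbers: the faster L1 and L2 buy some headroom for a slower L3, and how much depends entirely on real miss rates, which would have to be measured per workload.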

1 comment:

  1. Preliminary Performance Analysis...

    It appears the POWER7 does not scale well, from IBM's published SPEC benchmarks.