Analyzing Efficiency of Shared and Dedicated L2 Cache in Modern Dual Cores

dipdude

Shared L2 Cache is a key microarchitectural difference between modern Intel dual-core processors, based on the improved P6+ microarchitecture and the new Intel Core microarchitecture (Intel Core Duo/Core Solo and Intel Core 2), and the competing AMD Athlon 64 X2 dual-core processors. Intel's shared cache is distributed dynamically between a processor's individual cores according to how much cached memory each one needs. AMD's dual-core processors, by contrast, give each core its own dedicated L2 Cache of a fixed size.

We can assume that the shared L2 Cache architecture may sometimes be less advantageous than the traditional architecture of a dedicated L2 Cache per core (given the same total size – for example, 2 MB shared versus 1+1 MB dedicated) because of the shared data bus and the Shared L2 Cache access arbitration. If that is true, the most reliable way to expose this drawback is to load the processor's L2 Cache as heavily as possible from both cores at once, which is much easier to do with a special test application than with real tasks, whose L2 Cache requirements are not known directly.

That's exactly what we did to check the above assumption and to compare how efficiently two cores share access to Dedicated (AMD) and Shared (Intel) L2 Cache. We used the recently developed RightMark Multi-Threaded Memory Test utility, which is included in the latest official version (3.70) of RightMark Memory Analyzer.

Here is the idea:
we create two threads, each "pinned" by default to its own core (so that the operating system does not toss the threads from core to core). Each thread allocates its own memory block of a specified size and can perform one of the following operations on it: reading, writing, reading with software prefetch (the user can vary the prefetch distance), and non-temporal store. The total amount of data to read or write is specified by the user separately for each thread. The program can start and stop each thread at any time, as well as start and stop both threads simultaneously. Test results are output on the fly: instant bandwidth (averaged over one second) and average bandwidth (averaged over the whole test run), both in MB/s. Depending on the selected data size, the application lets us analyze shared (or dedicated, in the single-thread case) access to the CPU cache as well as to system memory. Evidently, the first two access modes (reading and writing) are the ones to use for analyzing a processor's L2 Cache, while the last two (software prefetch and non-temporal store) are more useful for studying memory characteristics.
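To make this concrete, here is a minimal sketch of the same idea in C for Windows. It is our own illustration of the approach, not the actual RightMark source: two threads, each pinned to its own core with SetThreadAffinityMask(), each sweeping a private data block while its bandwidth is measured.

/* Minimal sketch (assumed structure, not the RightMark source):
   two threads, each pinned to its own core, each streaming through
   a private block while per-thread bandwidth is measured in MB/s. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    DWORD_PTR affinity;    /* core mask: 1 = core 0, 2 = core 1 */
    volatile char *buf;    /* this thread's private data block */
    size_t size;           /* block size in bytes */
    double mb_per_s;       /* measured bandwidth */
} worker_t;

#define PASSES 2000        /* sweeps over the block per measurement */

static DWORD WINAPI worker(LPVOID arg)
{
    worker_t *w = (worker_t *)arg;
    LARGE_INTEGER freq, t0, t1;
    volatile char sink = 0;
    size_t i;
    int p;

    /* Tie the thread to one core so the OS cannot migrate it. */
    SetThreadAffinityMask(GetCurrentThread(), w->affinity);

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    for (p = 0; p < PASSES; p++)
        for (i = 0; i < w->size; i += 64)  /* one read per 64-byte line; */
            sink ^= w->buf[i];             /* each pulls in a full line  */
    QueryPerformanceCounter(&t1);

    /* Count the whole block per pass, since every line gets fetched. */
    w->mb_per_s = ((double)w->size * PASSES / (1024.0 * 1024.0)) /
                  ((double)(t1.QuadPart - t0.QuadPart) / freq.QuadPart);
    return (DWORD)sink;    /* keeps the compiler from dropping the reads */
}

int main(void)
{
    worker_t w[2] = { { 1, NULL, 1u << 20, 0.0 },    /* core 0, 1 MB */
                      { 2, NULL, 1u << 20, 0.0 } };  /* core 1, 1 MB */
    HANDLE h[2];
    int n;

    for (n = 0; n < 2; n++) {
        w[n].buf = malloc(w[n].size);
        memset((void *)w[n].buf, n, w[n].size);
        h[n] = CreateThread(NULL, 0, worker, &w[n], 0, NULL);
    }
    WaitForMultipleObjects(2, h, TRUE, INFINITE);
    for (n = 0; n < 2; n++)
        printf("core %d: %.0f MB/s\n", n, w[n].mb_per_s);
    return 0;
}

With 1 MB per thread, both working sets fit into L2 Cache on either architecture (2 MB shared or 1+1 MB dedicated), so the measurement stresses the cache access path rather than system memory.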

Conclusion

Our analysis confirmed that the dedicated L2 Caches of the AMD Athlon 64 X2 cores are completely independent: this processor shows no reduction in L2 Cache bandwidth when both cores access their caches simultaneously.

When both cores simultaneously access data that fits in L2 Cache, the Shared L2 Cache of Intel Core 2 Extreme processors delivers per-core bandwidth of only 57-83% of the single-core value, depending on the access type (the largest reduction is for writing, the smallest for reading). Although such a reduction may seem significant, the absolute L2 Cache bandwidth of this processor under these conditions remains high, at 10-19 GB/s. That is, running two real single-threaded applications simultaneously (whose data fit into the processor's L2 Cache) may cost some performance, but only if these applications are highly sensitive to L2 Cache bandwidth (like our synthetic test).
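A rough sanity check on those figures: if the 83% reading result corresponds to the 19 GB/s end of the range, the implied single-thread baseline is about 19 / 0.83 ≈ 23 GB/s, and the 57% writing result implies about 10 / 0.57 ≈ 17.5 GB/s. That pairing is our inference from the ranges, not something the article states outright.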

The situation is much worse when the processor cores have to compete for Shared L2 Cache, that is, when the total amount of data processed by both threads (or by two single-threaded applications) exceeds the size of Shared L2 Cache. A core's data exchange rate then depends strongly on the amount of data that core accesses. While this amount stays relatively small, no more than 1/4 of the total L2 Cache size (1.0 – 1.25 MB in our experiment), the core's data exchange rate remains quite high, comparable to the L2 Cache bandwidth seen under single-thread access. Such an application simply "doesn't see" the other applications potentially competing for L2 Cache. As a thread's or application's cache appetite grows, i.e. as the amount of data it processes increases, its data exchange rate drops to the level of system memory bandwidth and below. Under our conditions this happens at a per-thread data block of 1.5 MB and above. The following situation is therefore quite possible: an unaware application that uses only half of Shared L2 Cache (2 MB) may lose much of its data exchange rate simply because another application (even one not critical to memory bandwidth) operates on a larger data block (3 MB). This aggressor application will not only run inefficiently itself on Intel Core 2 processors with Shared L2 Cache (as its data do not fit into its share of L2 Cache), it will also significantly reduce the efficiency of the first application, even though the L2 Cache size seems more than sufficient for it.
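Reusing worker() from the sketch above, this aggressor scenario can be reproduced by swapping in a main() with the asymmetric block sizes just described (2 MB for the victim, 3 MB for the aggressor, against the 4 MB of shared L2 in the tested Core 2):

/* Aggressor scenario: the victim's 2 MB block fits comfortably in
   half of the 4 MB shared L2, yet the 3 MB aggressor keeps evicting
   its lines. Compare the victim's MB/s alone and under contention. */
int main(void)
{
    worker_t victim    = { 1, NULL, 2u << 20, 0.0 };  /* core 0, 2 MB */
    worker_t aggressor = { 2, NULL, 3u << 20, 0.0 };  /* core 1, 3 MB */
    HANDLE h[2];

    victim.buf    = malloc(victim.size);
    aggressor.buf = malloc(aggressor.size);
    memset((void *)victim.buf, 0, victim.size);
    memset((void *)aggressor.buf, 1, aggressor.size);

    /* Pass 1: victim alone; its working set stays L2-resident. */
    h[0] = CreateThread(NULL, 0, worker, &victim, 0, NULL);
    WaitForSingleObject(h[0], INFINITE);
    printf("victim alone:          %.0f MB/s\n", victim.mb_per_s);

    /* Pass 2: both threads; the aggressor's misses continuously
       evict the victim's lines, dropping it toward memory speed. */
    h[0] = CreateThread(NULL, 0, worker, &victim, 0, NULL);
    h[1] = CreateThread(NULL, 0, worker, &aggressor, 0, NULL);
    WaitForMultipleObjects(2, h, TRUE, INFINITE);
    printf("victim with aggressor: %.0f MB/s\n", victim.mb_per_s);
    return 0;
}

Note that if the Shared L2 Cache were partitioned strictly in half, pass 2 would leave the victim at full speed, since its 2 MB block is exactly half the cache; the measured drop is the cost of the dynamic sharing policy described above.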

Thus, in our opinion, the mechanism that distributes Shared L2 Cache in Intel Core 2 Duo / Core 2 Extreme processors is not very efficient in the "conflict zone", where the cores' L2 Cache demands are more or less equal. The inefficiency lies, in effect, in how wide that conflict zone is: it spans about 2 MB, that is, half of the L2 Cache. Let's hope that future implementations of Shared L2 Cache in Intel processors will distribute it between the cores more efficiently according to their needs.

For the detailed article visit: Analyzing Efficiency of Shared and Dedicated L2 Cache in Modern Dual-Core Processors
 
Yes, and there are a bunch of new technologies under research where a shared L2 cache would be an advantage. I am working on an offshoot of one.
 