n-Way Set Associative Cache
In this configuration the memory cache is divided into several blocks (sets) containing “n” lines each.
Here we are continuing our example of a 512 KB L2 memory cache divided into 8,192 64-byte lines. On a 4-way set associative cache, the memory cache will have 2,048 blocks containing four lines each (8,192 lines / 4). On a 2-way set associative cache, it will have 4,096 blocks containing two lines each, and on a 16-way set associative cache, it will have 512 blocks containing 16 lines each. The number of blocks will vary depending on the CPU, of course.
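The division above can be checked with a few lines of Python. This is only a sketch of the arithmetic for our 512 KB example; the function name `num_sets` is made up for illustration.

```python
LINE_SIZE = 64             # bytes per cache line (from our example)
CACHE_SIZE = 512 * 1024    # 512 KB L2 memory cache (from our example)

def num_sets(cache_size, line_size, ways):
    """Number of sets = total number of lines divided by lines per set."""
    total_lines = cache_size // line_size
    return total_lines // ways

for ways in (2, 4, 16):
    print(f"{ways}-way: {num_sets(CACHE_SIZE, LINE_SIZE, ways):,} sets")
# 2-way: 4,096 sets
# 4-way: 2,048 sets
# 16-way: 512 sets
```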
The main RAM is then divided into the same number of blocks available in the memory cache. Keeping the 512 KB 4-way set associative example, the main RAM would be divided into 2,048 blocks, the same number of blocks available inside the memory cache. Each memory block is linked to a set of lines inside the cache, just like in the direct mapped cache. With 1 GB of RAM, the memory would be divided into 2,048 blocks of 512 KB each; see Figure 8.
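As a rough illustration of this mapping, the sketch below computes which set a given RAM address would be linked to. It follows the simplified model described on this page (each set covers one contiguous block of RAM) and is not meant to reproduce the indexing scheme of any particular CPU; `set_for_address` is a hypothetical helper.

```python
RAM_SIZE = 1 * 1024**3              # 1 GB of RAM (from our example)
NUM_SETS = 2048                     # 512 KB cache, 4-way, 64-byte lines
BLOCK_SIZE = RAM_SIZE // NUM_SETS   # bytes of RAM linked to each set

def set_for_address(address):
    """Set linked to a RAM address, assuming contiguous memory blocks."""
    return address // BLOCK_SIZE

print(BLOCK_SIZE // 1024)             # 512 (KB per memory block)
print(set_for_address(0))             # 0
print(set_for_address(BLOCK_SIZE))    # 1
print(set_for_address(RAM_SIZE - 1))  # 2047
```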
As you can see, the mapping is very similar to what happens with the direct mapped cache. The difference is that for each memory block there is now more than one line available in the memory cache. Each line can hold the contents of any address inside the mapped block. On a 4-way set associative cache, each set in the memory cache can hold up to four lines from the same memory block.
With this approach the problems presented by the direct mapped cache are gone (both the collision problem and the loop problem we described on the previous page). At the same time, the set associative cache is easier to implement than the fully associative cache, since its control logic is simpler. Because of that, this is nowadays the most common cache configuration, even though it provides lower performance than the fully associative one.
Of course, we still have a limited number of slots inside each memory cache set for each memory block: four in a 4-way configuration. After these four slots are taken, the cache controller will have to free one of them to store the next instruction loaded from the same memory block.
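To make this concrete, here is a minimal sketch of a single 4-way set. The text above does not say which slot the controller frees, so the sketch assumes a least recently used (LRU) policy purely for illustration; the class `CacheSet` and the line labels are hypothetical.

```python
from collections import OrderedDict

class CacheSet:
    """One set of an n-way cache, assuming LRU replacement (an assumption)."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()   # line tag -> None, least recent first

    def access(self, tag):
        """Return (hit, evicted_tag) for an access to the line 'tag'."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # mark as most recently used
            return True, None
        evicted = None
        if len(self.lines) >= self.ways:     # all slots taken: free one
            evicted, _ = self.lines.popitem(last=False)
        self.lines[tag] = None
        return False, evicted

s = CacheSet(ways=4)
for tag in ["A", "B", "C", "D", "A", "E"]:
    print(tag, s.access(tag))
# The fifth access (A) is a hit; the sixth (E) finds the set full and evicts B.
```

The first four accesses fill the four slots. When a fifth distinct line arrives, one of the existing lines has to go, which is the situation described above.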
When we increase the number of ways a set associative memory cache has (for example, from a 4-way to an 8-way configuration), we have more slots available in each set, but if we keep the same amount of cache memory, the size of each memory block also increases. Continuing our example, moving from 4-way to 8-way would divide our 1 GB of RAM into 1,024 blocks of 1 MB each. So this move would increase the number of available slots in each set, but each set would now be in charge of a bigger memory block.
There is a lot of academic discussion about the ideal balance between the number of sets and the memory block size, and there is no definitive answer. Intel and AMD use different configurations, as you will see on the next page.
So what happens if we have a bigger memory cache? Keeping the above example, if we increased the L2 memory cache from 512 KB to 1 MB (the only way to do that would be by replacing the CPU), we would have 16,384 64-byte lines in our memory cache, which would give us 4,096 sets with four lines each. Our 1 GB of RAM would be divided into 4,096 blocks of 256 KB each. So basically the size of each memory block is reduced, increasing the chance that the requested data is inside the memory cache. In other words, increasing the cache size lowers the cache miss rate.
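The effect of changing the number of ways or the cache size can be tabulated with a short script. This is only a sketch of the arithmetic used in our examples, assuming 1 GB of RAM and 64-byte lines; the function name `layout` is made up for illustration.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

def layout(cache_size, ways, ram_size=1 * GB, line_size=64):
    """Return (number of sets, bytes of RAM each set is in charge of)."""
    sets = cache_size // line_size // ways
    return sets, ram_size // sets

for cache_size, ways in ((512 * KB, 4), (512 * KB, 8), (1 * MB, 4)):
    sets, block = layout(cache_size, ways)
    print(f"{cache_size // KB:>5} KB, {ways}-way: "
          f"{sets:>5,} sets, {block // KB:>5,} KB per memory block")
#   512 KB, 4-way: 2,048 sets,   512 KB per memory block
#   512 KB, 8-way: 1,024 sets, 1,024 KB per memory block
#  1024 KB, 4-way: 4,096 sets,   256 KB per memory block
```

More ways with the same cache size means fewer sets and bigger memory blocks, while a bigger cache with the same number of ways means more sets and smaller memory blocks.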
However, increasing the memory cache doesn’t guarantee an increase in performance. A bigger memory cache ensures that more data will be cached, but the question is whether the CPU is actually using this extra data. For example, suppose a single-core CPU with a 4 MB L2 cache. If the CPU is heavily using 1 MB but not the other 3 MB (i.e., the most accessed instructions take up 1 MB, and the instructions cached in the other 3 MB are not called as often), chances are that this CPU will perform similarly to an identical CPU with a 2 MB or even a 1 MB L2 cache.
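A toy simulation can illustrate this point. The sketch below is heavily simplified and rests on several assumptions that are not part of the discussion above: the cache is modeled as a single pool of lines with LRU replacement, the sizes are scaled down to a handful of lines, and the access pattern (a small "hot" group of lines reused in every round, plus one "cold" line per round that is never reused) is invented for illustration.

```python
from collections import OrderedDict

def hit_rate(capacity_lines, trace):
    """Hit rate of a toy LRU cache holding 'capacity_lines' lines."""
    cache = OrderedDict()            # line number -> None, least recent first
    hits = 0
    for line in trace:
        if line in cache:
            hits += 1
            cache.move_to_end(line)  # mark as most recently used
        else:
            if len(cache) >= capacity_lines:
                cache.popitem(last=False)   # evict least recently used
            cache[line] = None
    return hits / len(trace)

HOT_LINES = 16    # heavily used lines (hypothetical working set)
ROUNDS = 100

trace = []
for r in range(ROUNDS):
    trace.extend(range(HOT_LINES))   # reuse the hot lines every round
    trace.append(1000 + r)           # one cold line, never used again

for capacity in (32, 64, 128):
    print(f"{capacity:>3} lines: hit rate {hit_rate(capacity, trace):.3f}")
```

With this access pattern all three capacities report the same hit rate: once the heavily used lines fit, the extra capacity only stores lines that are never requested again.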