The cache memory is high-speed memory available inside the CPU in order to speed up access to data and instructions stored in RAM memory. In this tutorial we will explain how this circuit works in an easy to follow language.
A computer is completely useless if you don’t tell the processor (i.e., the CPU) what to do. This is done through a program, which is a list of instructions telling the CPU what to do.
The CPU fetches programs from the RAM memory. The problem with the RAM memory is that when it’s power is cut, it’s contents are lost – this classifies the RAM memory as a “volatile” medium. Thus programs and data must be stored on non-volatile media (i.e., where the contents aren’t lost after your turn your PC off) if you want to have them back after you turn off your PC, like hard disk drives and optical media like CDs and DVDs.
When you double click an icon on Windows to run a program, the program, which is usually stored on the computer’s hard disk drive, is loaded into the RAM memory, and then from the RAM memory the CPU loads the program through a circuit called memory controller, which is located inside the chipset (north bridge chip) on Intel processors or inside the CPU on AMD processors. In Figure 1 we summarize this (for AMD processors please ignore the chipset drawn).
The CPU can’t fetch data directly from hard disk drives because they are too slow for it, even if you consider the fastest hard disk drive available. Just to give you some idea of what we are talking about, a SATA-300 hard disk drive – the fastest kind of hard disk drive available today for the regular user – has a maximum theoretical transfer rate of 300 MB/s. A CPU running internally at 2 GHz with 64-bit internal datapaths* will transfer data internally at 16 GB/s – over 50 times faster.
* Translation: the paths between the CPU internal circuits. This is rough math just to give you an idea, because CPUs have several different datapaths inside the CPU, each one having different lengths. For example, on AMD processors the datapath between the L2 memory cache and the L1 memory cache is 128-bit wide, while on current Intel CPUs this datapath is 256-bit wide. If you got confused don’t worry. This is just to explain that the number we published in the above paragraph isn’t fixed, but the CPU is always a lot faster than hard disk drives.
The difference in speed comes from the fact that hard disk drives are mechanical systems, which are slower than pure electronics systems, as mechanical parts have to move for the data to be retrieved (which is far slower than moving electrons around). RAM memory, on the other hand, is 100% electronic, thus faster than hard disk drives and optimally as fast as the CPU.
And here is the problem. Even the fastest RAM memory isn’t as fast as the CPU. If you take DDR2-800 memories, they transfer data at 6,400 MB/s – 12,800 MB/s if dual channel mode is used. Even though this number is somewhat close to the 16 GB/s from the previous example, as current CPUs are capable of fetching data from the L2 memory cache at 128- or 256-bit rate, we are talking about 32 GB/s or 64 GB/s if the CPU works internally at 2 GHz. Don’t worry about what the heck “L2 memory cache” is right now, we will explain it later. All we want is that you get the idea that the RAM memory is slower than the CPU.
By the way, transfer rates can be calculated using the following formula (on all examples so far “data per clock” is equal to “1”):
Transfer rate = width (number of bits) x clock rate x data per clock / 8
The problem is not only the transfer rate, i.e., the transfer speed, but also latency. Latency (a.k.a. “access time”) is how much time the memory delays in giving back the data that the CPU asked for – this isn’t instantaneous. When the CPU asks for an instruction (or data) that is stored at a given address, the memory delays a certain time to deliver this instruction (or data) back. On current memories, if it is labeled as having a CL (CAS Latency, which is the latency we are talking about) of 5, this means that the memory will deliver the asked data only after five memory clock cycles – meaning that the CPU will have to wait.
Waiting reduces the CPU performance. If the CPU has to wait five memory clock cycles to receive the instruction or data it asked for, its performance will be only 1/5 of the performance it would get if it were using a memory capable of delivering data immediately. In other words, when accessing a DDR2-800 memory with CL5, the performance the CPU gets is the same as a memory working at 160 MHz (800 MHz / 5). In the real world the performance decrease isn’t that much because memories work under a mode called burst mode where from the second data on, data can be delivered immediately, if it is stored on a contiguous address (usually the instructions of a given program are stored in sequential addresses). This is expressed as “x-1-1-1” (e.g., “5-1-1-1” for the memory in our example), meaning that the first data is delivered after five clock cycles but from the second data on data can be delivered in just one clock cycle – if it is stored on a contiguous address, like we said.