Inside the Intel Sandy Bridge Microarchitecture
By Gabriel Torres on December 30, 2010
Sandy Bridge is the name of the new microarchitecture Intel CPUs will be using starting in 2011. It is an evolution of the Nehalem microarchitecture that was first introduced in the Core i7 and also used in the Core i3 and Core i5 processors.
If you don’t follow the CPU market that closely, let’s make a quick recap. After the Pentium 4, which was based on Intel’s 7th generation microarchitecture, called Netburst, Intel decided to go back to its 6th generation microarchitecture (the same one used by the Pentium Pro, Pentium II, and Pentium III, dubbed P6), which proved to be more efficient. From the Pentium M CPU (which is a 6th generation Intel CPU), Intel developed the Core architecture, which was used on the Core 2 processor series (Core 2 Duo, Core 2 Quad, etc.). Then Intel took this architecture, tweaked it a little bit more (the main innovation was the addition of an integrated memory controller), and released the Nehalem microarchitecture, which was used on the Core i3, Core i5, and Core i7 processor series. From this microarchitecture, Intel developed the Sandy Bridge microarchitecture, which will be used by the new generation of Core i3, Core i5, and Core i7 processors to be released in 2011 and 2012.
To better understand this tutorial, we recommend reading the following tutorials, in this particular order:
The main specifications for the Sandy Bridge microarchitecture are summarized below. We will explain them in more detail in the next pages.
Let’s start our journey by talking about what is new in the way instructions are processed in the Sandy Bridge microarchitecture.
There are four instruction decoders, meaning that the CPU can decode up to four instructions per clock cycle. These decoders are in charge of decoding IA-32 (a.k.a. x86) instructions into RISC-like microinstructions (µops) that are used internally by the CPU execution units. Like previous Intel CPUs, the Sandy Bridge microarchitecture supports both macro- and micro-fusion. Macro-fusion allows the CPU to join two related x86 instructions into a single one, while micro-fusion joins two related microinstructions into a single one. In both cases, the goal is to improve performance.
What is completely new is the addition of a decoded microinstruction cache, capable of storing 1,536 microinstructions (which translates to more or less 6 kB). Intel refers to this cache as an “L0 cache.” The idea is simple. When the program that is running enters a loop (i.e., needs to repeat the same instructions several times), the CPU won’t need to decode the x86 instructions again: they will already be decoded in the cache, saving time and thus improving performance. According to Intel, this cache has an 80% hit rate, i.e., the CPU finds the microinstructions it needs there 80% of the time.
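The figures quoted above can be checked with a little arithmetic. The sketch below is only an illustration: it takes the 1,536-entry capacity, the roughly 6 kB size, and the 80% hit rate from the text, and assumes (our simplification, not something Intel states) that the hit rate applies uniformly to every microinstruction.

```python
# Figures quoted in the text above.
UOP_CACHE_ENTRIES = 1536      # decoded microinstructions the cache can hold
UOP_CACHE_BYTES = 6 * 1024    # "more or less 6 kB"
QUOTED_HIT_RATE = 0.80        # hit rate according to Intel


def bytes_per_uop(entries=UOP_CACHE_ENTRIES, size_bytes=UOP_CACHE_BYTES):
    """Average storage implied for one cached microinstruction."""
    return size_bytes / entries


def uops_needing_decoders(total_uops, hit_rate=QUOTED_HIT_RATE):
    """Microinstructions that miss the cache and still go through the decoders.

    Assumption: the hit rate applies uniformly to every microinstruction.
    """
    return total_uops - round(total_uops * hit_rate)


print(bytes_per_uop())              # 4.0
print(uops_needing_decoders(1000))  # 200
```

In other words, the quoted size works out to about four bytes per stored microinstruction, and at an 80% hit rate only about one microinstruction in five has to come from the decoders.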
Now you may be asking yourself if this is not the same idea used in the Netburst microarchitecture (i.e., Pentium 4 processors), which had a trace cache that also stored decoded microinstructions. A trace cache works differently from a microinstruction cache: it stores the instructions in the same order they were originally run. This way, when a program reaches a loop that is run, let’s say, 10 times, the trace cache will store the same instructions 10 times. Therefore, there are a lot of repeated instructions in the trace cache. The same doesn’t happen with the microinstruction cache, which stores only individual decoded instructions.
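The difference described above can be pictured with a toy model. The two functions below follow the simplified description in the text, not the real hardware; the loop body and its instruction names are made up for illustration.

```python
def trace_cache_contents(loop_body, iterations):
    """Store microinstructions in executed order, as the text describes
    for a trace cache: every iteration of the loop is stored again."""
    return [uop for _ in range(iterations) for uop in loop_body]


def uop_cache_contents(loop_body, iterations):
    """Store each decoded microinstruction once, however often it runs."""
    stored = []
    for _ in range(iterations):
        for uop in loop_body:
            if uop not in stored:
                stored.append(uop)
    return stored


# Hypothetical four-instruction loop body, run 10 times.
body = ["load", "add", "compare", "jump"]
print(len(trace_cache_contents(body, 10)))  # 40 entries
print(len(uop_cache_contents(body, 10)))    # 4 entries
```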
When the microinstruction cache is used, the CPU puts the L1 instruction cache and the decoders to “sleep,” allowing the CPU to save energy and run cooler.
The branch prediction unit was redesigned, and the Branch Target Buffer (BTB) size was doubled in comparison to Nehalem; it also now uses a compression technique to allow even more data to be stored. Branch prediction is a circuit that tries to guess the next steps of a program in advance, loading into the CPU the instructions it thinks the program will need next. If it guesses right, the CPU won’t waste time loading these instructions from memory, as they will already be inside the CPU. Increasing the size of the BTB allows this circuit to keep track of more branches and so load even more instructions in advance, improving the CPU performance.
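To see why a bigger BTB helps, here is a toy model. It is a sketch under loud assumptions: the article gives no BTB sizes or replacement policy, so the capacities (4 and 8 entries), the least-recently-used eviction, and the branch addresses below are all hypothetical.

```python
from collections import OrderedDict


class ToyBTB:
    """Fixed-capacity table mapping a branch address to its last target.

    Assumption: least-recently-used eviction; real hardware may differ.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.lookups = 0

    def predict_and_update(self, branch_addr, actual_target):
        self.lookups += 1
        if self.entries.get(branch_addr) == actual_target:
            self.hits += 1  # the stored target was the right guess
        # Record (or refresh) the target and mark it most recently used.
        self.entries[branch_addr] = actual_target
        self.entries.move_to_end(branch_addr)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry


def hit_rate(capacity, branch_stream):
    btb = ToyBTB(capacity)
    for addr, target in branch_stream:
        btb.predict_and_update(addr, target)
    return btb.hits / btb.lookups


# Hypothetical workload: six branches visited in turn, 100 times over.
stream = [(addr, addr + 64) for _ in range(100) for addr in range(0, 600, 100)]
print(hit_rate(4, stream))  # 0.0: six branches keep evicting each other
print(hit_rate(8, stream))  # 0.99: only the first pass misses
```

With the smaller table, the six branches continually push each other out, so no prediction is ever available; doubling the capacity lets all of them stay resident.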
The scheduler used in the Sandy Bridge microarchitecture is similar to the one used in the Nehalem microarchitecture, with six dispatch ports: three used by execution units and three used by memory operations.
Although this configuration is the same, the Sandy Bridge microarchitecture has more execution units: while the Nehalem microarchitecture has 12 of them, Sandy Bridge has 15 (see Figure 2). According to Intel, they were redesigned in order to improve floating-point (i.e., math operations) performance.
Each execution unit is connected to the instruction scheduler using a 128-bit datapath. In order to execute the new AVX instructions, which carry 256-bit data, instead of adding 256-bit datapaths and 256-bit units to the CPU, two execution units are “merged” (i.e., used at the same time), as you can see in Figure 3.
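The “merging” idea can be sketched in a few lines. This is only a software analogy, assuming 32-bit elements (so four per 128-bit unit and eight per 256-bit operand); the function names are ours, not Intel’s.

```python
def add_128(a, b):
    """One 128-bit unit: adds four 32-bit elements at a time."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("a 128-bit unit takes exactly four 32-bit elements")
    return [x + y for x, y in zip(a, b)]


def add_256_via_two_128(a, b):
    """A 256-bit add handled by two 128-bit units used at the same time."""
    if len(a) != 8 or len(b) != 8:
        raise ValueError("a 256-bit operand holds exactly eight 32-bit elements")
    low_half = add_128(a[:4], b[:4])    # first unit takes the lower 128 bits
    high_half = add_128(a[4:], b[4:])   # second unit takes the upper 128 bits
    return low_half + high_half


print(add_256_via_two_128([1, 2, 3, 4, 5, 6, 7, 8], [10] * 8))
# [11, 12, 13, 14, 15, 16, 17, 18]
```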
After an instruction is executed, its result isn’t copied back to the re-order buffer, as happened in previous Intel architectures; instead, the instruction is simply marked as done in a list. This way the CPU saves bits and improves efficiency.
Another difference is in the memory ports. The Nehalem microarchitecture has one load unit, one store address unit, and one store data unit, each attached to its own dispatch port. This means that Nehalem-based processors can load 128 bits of data per cycle from the L1 data cache.
In the Sandy Bridge microarchitecture, the former load unit and store address unit can each be used either as a load unit or as a store address unit. This change allows twice as much data to be loaded from the L1 data cache at the same time (using two 128-bit units at once instead of only one), thus improving performance. This way, Sandy Bridge-based processors can load 256 bits of data from the L1 data cache per cycle.
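The load figures above are simple multiplication. The sketch below is a best-case estimate, assuming every load hits the L1 data cache and every load-capable port is free on every cycle; the 4 kB buffer size is just an example.

```python
PORT_WIDTH_BITS = 128  # width of each load-capable unit, from the text


def l1_load_bits_per_cycle(load_capable_ports):
    """Peak bits loaded from the L1 data cache in one clock cycle."""
    return load_capable_ports * PORT_WIDTH_BITS


def cycles_to_load(num_bytes, load_capable_ports):
    """Best-case cycles to load num_bytes (rounded up to whole cycles)."""
    bits = num_bytes * 8
    per_cycle = l1_load_bits_per_cycle(load_capable_ports)
    return -(-bits // per_cycle)  # integer ceiling division


print(l1_load_bits_per_cycle(1), cycles_to_load(4096, 1))  # Nehalem: 128 256
print(l1_load_bits_per_cycle(2), cycles_to_load(4096, 2))  # Sandy Bridge: 256 128
```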
A while ago, AMD proposed an SSE5 instruction set. However, Intel decided to create its own implementation of what would be the SSE5 instructions, called AVX (Advanced Vector Extensions).
These instructions use the same SIMD (Single Instruction, Multiple Data) concept introduced with the MMX instruction set and used by the SSE (Streaming SIMD Extensions) instructions. This concept consists of using a single big register to store several small pieces of data and then processing all of them with a single instruction, speeding up processing.
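The SIMD concept can be imitated in software. In the sketch below, a Python integer stands in for one 128-bit register holding eight 16-bit values (our choice of sizes, for illustration), and `packed_add` adds all eight pairs in a single pass while keeping carries from leaking between neighbours. Real SIMD hardware does this in dedicated circuits; this is only an analogy.

```python
LANE_BITS = 16   # size of each small value (assumed for this example)
LANES = 8        # 8 x 16 bits = one 128-bit "register"
LANE_MASK = (1 << LANE_BITS) - 1
# Mask with only the top bit of every lane set, and its complement.
HIGH_BITS = sum(1 << (i * LANE_BITS + LANE_BITS - 1) for i in range(LANES))
LOW_BITS = ((1 << (LANE_BITS * LANES)) - 1) ^ HIGH_BITS


def pack(values):
    """Place eight 16-bit values side by side in one big integer."""
    if len(values) != LANES:
        raise ValueError("expected exactly %d values" % LANES)
    register = 0
    for i, value in enumerate(values):
        register |= (value & LANE_MASK) << (i * LANE_BITS)
    return register


def unpack(register):
    """Split the big integer back into its eight 16-bit values."""
    return [(register >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(a, b):
    """Add all lanes at once; each lane wraps around independently."""
    partial = (a & LOW_BITS) + (b & LOW_BITS)   # add without the top bits
    return partial ^ ((a ^ b) & HIGH_BITS)      # then fix up the top bits


x = pack([1, 2, 3, 4, 5, 6, 7, 65535])
y = pack([10, 20, 30, 40, 50, 60, 70, 1])
print(unpack(packed_add(x, y)))  # [11, 22, 33, 44, 55, 66, 77, 0]
```

Note that the last lane wraps around from 65535 to 0 without disturbing its neighbours, which is how packed integer additions behave.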
The AVX instruction set adds 12 new instructions and extends the 128-bit XMM registers to 256 bits (the extended registers are named YMM).
Detailed information about the new AVX instruction set can be found here (look for the Intel Advanced Vector Extensions Programming Reference).
Sandy Bridge-based processors will have a ring architecture for the internal components of the CPU to talk to each other. When a component wants to “talk” with another component, it puts the information on the ring, and the ring moves this information until it reaches its destination. Components don’t talk to each other directly; they have to use the ring. Components that use the ring include the CPU cores, each L3 memory cache (which is now called Last Level Cache, or LLC, and is not unified; see Figure 5), the system agent (integrated memory controller, PCI Express controller, power control unit, and display), and the graphics controller.
In Figure 5 you can see the ring (black line) with its “stops” (red boxes). It is important to understand that the ring is physically located over the memory caches (imagine a ski lift where each red box is a stop). Since the illustration is two-dimensional, you may have the impression that the ring wires run inside the cache, which is not the case.
Also, each last level cache isn’t tied to a particular CPU core. Any core can use any of the caches. For example, in Figure 5, we have a quad-core CPU with four last level caches. Core 1 isn’t linked to cache 1; it can use any of the caches. This also means that any CPU core can access data that is stored in any of the caches.
There are actually four rings: a data ring, a request ring, an acknowledge ring, and a snoop ring. They run at the same clock rate as the CPU internal clock. The ring is based on the QPI (QuickPath Interconnect) protocol, the same one used by socket LGA1366 CPUs to talk to the chipset.
Each component decides when to use the ring, provided it is empty, and the ring always chooses the shortest path to the destination.
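Choosing the shortest path on a ring boils down to comparing the two directions. The sketch below assumes the stops are numbered consecutively around the ring and that data can travel either way; the six-stop ring in the example is hypothetical, as the article does not give a stop count.

```python
def shortest_ring_path(source, destination, num_stops):
    """Return (hops, direction) for the shorter way around the ring.

    Assumption: stops are numbered 0..num_stops-1 in clockwise order,
    and ties are resolved in favour of the clockwise direction.
    """
    clockwise = (destination - source) % num_stops
    counter_clockwise = (source - destination) % num_stops
    if clockwise <= counter_clockwise:
        return clockwise, "clockwise"
    return counter_clockwise, "counter-clockwise"


# Hypothetical ring with six stops.
print(shortest_ring_path(0, 2, 6))  # (2, 'clockwise')
print(shortest_ring_path(0, 5, 6))  # (1, 'counter-clockwise')
```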
Turbo Boost is a technology that automatically overclocks the CPU when applications “ask” for more processing power. In the Sandy Bridge microarchitecture this technology was revised in order to allow the CPU to exceed its TDP (Thermal Design Power), i.e., to dissipate more heat than officially allowed, for up to 25 seconds. This is possible because the heatsink and components are still cool at that point. See Figure 6.
Also, the CPU cores and the graphics controller “share” the TDP between them. For example, if the graphics core isn’t dissipating a lot of heat, the unused part of its budget becomes available to the CPU cores. If applications are demanding more processing power, the cores can then run at a higher clock rate and dissipate more power than their official rating (labeled “Specified Core Power” in Figure 7). See Figure 7.
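The sharing idea can be expressed as a simple budget calculation. This is a rough sketch, not Intel’s actual power-control algorithm; the wattages in the example (a 95 W package with 75 W specified for the cores) are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
def core_power_budget(package_tdp_w, specified_core_power_w, graphics_draw_w):
    """Watts the CPU cores may use once unused graphics budget is shared.

    Assumption: whatever the package TDP leaves after the specified core
    power is the graphics share, and any unused part goes to the cores.
    """
    graphics_share_w = package_tdp_w - specified_core_power_w
    unused_graphics_w = max(0, graphics_share_w - graphics_draw_w)
    return specified_core_power_w + unused_graphics_w


# Hypothetical 95 W package with 75 W specified for the cores.
print(core_power_budget(95, 75, 5))   # 90: graphics is nearly idle
print(core_power_budget(95, 75, 20))  # 75: graphics uses its full share
```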
The graphics processor integrated in Sandy Bridge-based processors will have a DirectX 10.1 engine. As explained on the first page of this tutorial, it will be on the same silicon chip as the rest of the CPU, instead of being on a separate chip “glued” together with the CPU inside the same package.
In Figure 8, you have an overall look at the Sandy Bridge graphics processor.
The number of execution units (“processors”) will depend on the CPU (e.g. Core i5 CPUs will have more execution units than Core i3 parts). Sandy Bridge processors can have up to 12 graphics execution units.
If you pay close attention to Figure 8, you will see that “Display” and “Graphics” are in separate parts of the CPU. These can be read as “2D” and “3D,” and the separation helps the CPU save energy by turning off the graphics processor when you are not playing games.
Another important innovation is that the graphics engine can use the Last Level Cache (LLC, formerly known as L3 memory cache) to store data, especially textures. This improves 3D performance, as the graphics engine doesn’t need to go to the RAM to fetch data; it can load data directly from the cache (if it is already there, of course).