Inside the AMD Bulldozer Architecture
By Gabriel Torres on August 24, 2010
AMD is unveiling today the new processor architecture that will be used in their new CPUs starting in 2011. Codenamed Bulldozer, this architecture is completely different from the current AMD64 architecture that AMD has been using since the introduction of the very first Athlon 64 CPU back in 2003. In this tutorial we will give you an in-depth explanation of what this new architecture looks like and how it works.
For a better understanding of how the Bulldozer architecture compares to the AMD64 architecture, we suggest you read our Inside the AMD64 Architecture tutorial before continuing.
The Bulldozer architecture will inherit some features introduced with the AMD64 architecture, such as the integrated memory controller and the use of the HyperTransport bus for communication between the CPU and the chipset.
Bulldozer is the codename for the architecture, not for a specific processor. As usually happens, AMD will first release processors based on this new architecture for the server market, then for the high-end desktop market, then for the mainstream desktop segment, and finally for the entry-level market.
Although AMD didn’t disclose any specifics about the CPUs that will be launched, they mentioned that the first desktop CPUs based on the Bulldozer architecture will require a new CPU socket, called AM3+, which will also be compatible with current socket AM3 processors. Socket AM3+ CPUs, however, won’t be compatible with socket AM3 motherboards.
The Bulldozer architecture will have an equivalent of the Intel Turbo Boost technology, allowing the CPU to overclock itself if you are running CPU-intensive programs and if the thermal dissipation is still within specs.
Before talking about the internals of the Bulldozer architecture, let’s first talk about the instruction sets supported by this new architecture.
The Bulldozer architecture, besides being compatible with the standard x86 instructions, will support the following additional instruction sets: SSE4.1, SSE4.2, AVX, AES, and LWP.
But what does that mean? Let’s see.
SSE4.1 and SSE4.2
AMD CPUs will finally support the SSE4 instructions. Current AMD CPUs don’t support these instruction sets, which increase speed in multimedia applications (image and video processing) that make use of them. Instead, current AMD CPUs support a proprietary instruction set called SSE4a, which isn’t the same thing as SSE4.
AVX (Advanced Vector Extensions)
A while ago, AMD proposed an SSE5 instruction set. Because Intel decided to create its own implementation of what would be the SSE5 instructions, called AVX (Advanced Vector Extensions), AMD added AVX to the Bulldozer architecture.
The AVX instructions will also be supported by forthcoming CPUs from Intel based on their Sandy Bridge architecture, and use the same SIMD (Single Instruction, Multiple Data) concept introduced with the MMX instruction set and used by the SSE (Streaming SIMD Extensions) instructions. This concept consists of using a single big register to store several small pieces of data and then processing all of them with a single instruction, speeding up processing.
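The SIMD concept can be illustrated with a short Python sketch. It models a 128-bit register as four independent 32-bit lanes and adds two such registers in a single step, which is conceptually what a packed-add instruction does. The function names and the lane layout are our own illustrative assumptions, not a description of how the hardware is built.

```python
LANE_BITS = 32                      # width of each packed value
LANES = 4                           # 4 x 32 bits = one 128-bit register
LANE_MASK = (1 << LANE_BITS) - 1

def pack(values):
    """Pack four 32-bit unsigned values into one 128-bit integer."""
    assert len(values) == LANES
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & LANE_MASK) << (i * LANE_BITS)
    return reg

def unpack(reg):
    """Split a 128-bit integer back into its four 32-bit lanes."""
    return [(reg >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]

def packed_add(a, b):
    """Add lane by lane; each lane wraps on its own, with no carry into the next lane."""
    return pack([(x + y) & LANE_MASK for x, y in zip(unpack(a), unpack(b))])

a = pack([1, 2, 3, 0xFFFFFFFF])
b = pack([10, 20, 30, 1])
result = unpack(packed_add(a, b))   # one "instruction", four additions
```

Note that the last lane overflows and wraps around to zero without disturbing its neighbors, which is the behavior that distinguishes a packed add from one big 128-bit addition.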
The AVX instruction set adds 12 new instructions and extends the 128-bit XMM registers to 256 bits (the widened registers are called YMM).
In the Bulldozer architecture, AMD decided to add some of the instructions they had originally proposed for the SSE5 instruction set. Therefore, the AVX implementation in the Bulldozer architecture is more complete than Intel’s. These additional instructions are called the XOP and FMA4 instructions, and a detailed description can be found here. In their Bulldozer presentations AMD is announcing the AVX instruction set as “also” having the FMAC (Fused Multiply Accumulate) subset, but this subset of instructions is actually part of the XOP instructions. The “AMD 4-operand form” being announced in the AMD presentations is simply the new format used by the XOP instructions and mentioning this is also completely redundant.
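To see why a fused multiply-accumulate is more than a convenience, consider the following Python sketch. It compares a multiplication followed by a separate addition (two rounding steps) with a model of a fused operation that computes a*b + c exactly and rounds only once. The fused behavior is emulated here with exact fractions; this is an illustration of the arithmetic idea, not of AMD’s implementation, and the function names are hypothetical.

```python
from fractions import Fraction

def multiply_then_add(a, b, c):
    """Two separate operations: the product is rounded before the addition."""
    return (a * b) + c

def fused_multiply_add(a, b, c):
    """Model of a fused operation: a*b + c is computed exactly, then rounded once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

# Inputs chosen so that the exact product, 1 - 2**-60, does not fit in a double.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

separate = multiply_then_add(a, b, c)   # product rounds to 1.0, so the sum is 0.0
fused = fused_multiply_add(a, b, c)     # keeps the tiny difference: -2**-60
```

Like the four-operand instruction format mentioned above, the functions return the result in a separate destination instead of overwriting one of the three inputs.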
AES (Advanced Encryption Standard)
This instruction set is already being used in the new Intel CPUs based on the “Westmere” architecture and newer (except Core i3), and consists of six new instructions to deal specifically with encryption. Intel calls this instruction set AES-NI. A detailed description of these instructions can be found here.
LWP (Light Weight Profiling)
The LWP instructions allow programs to easily monitor software performance, which will help developers to fine-tune programs for best performance, for example. This additional instruction set has six new instructions, and a detailed description can be found here.
AMD decided to take a completely different approach in the new Bulldozer architecture. They decided to create a “dual-core” module whose two cores share some resources (the front-end engine, the floating-point unit, and the L2 memory cache, see Figure 1) and, therefore, are not completely independent from each other.
According to AMD this was done in order to optimize the CPU and, at the same time, cut costs. The optimization comes from the fact that on a typical multi-core CPU several units inside the CPU remain idle, and these units could be combined in the Bulldozer architecture. And since the CPU will have fewer units, it can be smaller, which reduces the amount of material necessary to build the CPU, reducing costs. Having fewer units also helps save energy and reduces the amount of generated heat.
So while AMD will call a CPU that has one of these modules a “dual-core” CPU, in reality the CPU isn’t a true dual-core product, because there aren’t two complete CPUs inside the product. The “dual-core” name in this case will be used for marketing purposes, to make sure the consumer understands that although this Bulldozer-based CPU isn’t a true “dual-core” model, it should perform like one.
Going further, to make a “quad-core” CPU, AMD will take two of these blocks and put them together, so while physically speaking the processor actually has two “CPUs” inside (two of the building blocks shown in Figure 1), and not four, AMD will still call it a “quad-core” product. In Figure 2, you can see what an “eight-core” CPU based on the Bulldozer architecture would look like.
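The relationship between marketed cores and physical modules can be summarized in a small Python sketch. The class and function names are hypothetical; the unit counts per module come from the description above (two integer “cores” sharing one front end, one floating-point unit, and one L2 cache).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BulldozerModule:
    """Unit counts inside one building block, as described in the text."""
    integer_cores: int = 2   # two independent integer units (the "cores")
    front_ends: int = 1      # fetch and decode, shared by both cores
    fpus: int = 1            # floating-point unit, shared
    l2_caches: int = 1       # L2 memory cache, shared

def modules_needed(marketed_cores):
    """Number of modules behind a CPU sold with the given core count."""
    per_module = BulldozerModule().integer_cores
    if marketed_cores % per_module != 0:
        raise ValueError("marketed core count must be a multiple of 2")
    return marketed_cores // per_module

def shared_unit_totals(marketed_cores):
    """Total number of each shared unit in the whole CPU."""
    n = modules_needed(marketed_cores)
    m = BulldozerModule()
    return {"front_ends": n * m.front_ends, "fpus": n * m.fpus, "l2_caches": n * m.l2_caches}

quad = modules_needed(4)            # two modules
eight = shared_unit_totals(8)       # four of each shared unit
```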
Let’s now take an in-depth look at the Fetch and Decode units used on the Bulldozer architecture.
The Fetch unit is in charge of getting the next instruction to be decoded from the RAM or memory cache. For further information we suggest you read our How a CPU Works and How the Memory Cache Works tutorials.
As shown on the previous page, the Fetch unit is shared by the two “cores” available in each Bulldozer module. The L1 instruction cache is also shared by the two “cores,” because it is an essential part of the fetch unit, but each CPU “core” has its own L1 data cache. Interestingly enough, AMD has already announced that the L1 instruction cache used in the Bulldozer architecture is a two-way set associative 64 KB cache, the same configuration used by CPUs based on the AMD64 architecture, with the obvious difference that while AMD64 CPUs have one L1 instruction cache per core, Bulldozer-based CPUs will have one L1 instruction cache for each pair of cores. However, the data cache used by each “core” will be only 16 KB, which is considerably smaller than the 64 KB per core currently used by CPUs based on the AMD64 architecture.
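To make “two-way set associative 64 KB cache” more concrete, the Python sketch below derives the number of sets from the cache size and shows how an address would be split into tag, set index, and offset. The 64-byte line size is our assumption, since it is not stated in AMD’s material discussed here, and the function names are hypothetical.

```python
LINE_BYTES = 64  # assumption: cache line size is not given in the text

def cache_geometry(size_bytes, ways, line_bytes):
    """Return (number of sets, offset bits, index bits) for a set associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 of a power of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits

def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# L1 instruction cache from the text: 64 KB, two-way set associative.
sets, offset_bits, index_bits = cache_geometry(64 * 1024, ways=2, line_bytes=LINE_BYTES)
tag, index, offset = split_address(0x12345, offset_bits, index_bits)
```

Under this assumption the cache has 512 sets, and each set can hold two lines, so two addresses with the same index but different tags can reside in the cache at the same time.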
At this moment AMD hasn’t made public the size of Bulldozer’s BTBs (Branch Target Buffers), which are small memories that list all identified branches in the program and are used by the branch prediction mechanism of the CPU.
The sizes of the TLBs (Translation Look-aside Buffers), on the other hand, have been disclosed, as you can see in Figure 3. These buffers are small memories that speed up the conversion between virtual addresses and physical addresses, used mainly by the virtual memory circuit (virtual memory, which relies on a swap file, is a technique where the CPU simulates that it has more RAM than you actually have installed by using a file on the hard drive).
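The role of a TLB can be sketched in a few lines of Python. The model below caches a handful of virtual-to-physical page translations and falls back to a page table when a translation is missing. The 4 KB page size, the four-entry capacity, the first-in-first-out replacement, and all names are assumptions made for illustration; they do not describe Bulldozer’s actual TLBs.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages

class TinyTLB:
    """A toy TLB: a small cache in front of a (slow) page table."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table   # virtual page number -> physical frame number
        self.capacity = capacity
        self.entries = {}              # cached translations, in insertion order
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        """Convert a virtual address into a physical address."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                # Evict the oldest cached translation (simple FIFO policy).
                del self.entries[next(iter(self.entries))]
            self.entries[vpn] = self.page_table[vpn]   # slow path: page table lookup
        return self.entries[vpn] * PAGE_SIZE + offset

tlb = TinyTLB({0: 7, 1: 3})
first = tlb.translate(0x0010)    # miss: page 0 maps to frame 7
second = tlb.translate(0x1004)   # miss: page 1 maps to frame 3
third = tlb.translate(0x0020)    # hit: page 0 is already cached
```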
PC programs are written using x86 instructions, but nowadays the CPU Execution unit only understands proprietary RISC-like instructions. So the Decode unit is in charge of converting the x86 instructions provided by the running program into these RISC-like microinstructions, which are the kind of instruction understood by the Execution unit of the CPU. The Bulldozer architecture has four decoders, but at this moment AMD hasn’t given much information on what kind of instructions each decoder can handle. Usually at least one of these decoders exclusively handles complex instructions, using the provided microcode ROM (in the slide “Ucode” should be read as “µcode,” or “microcode”). The decoding of complex instructions takes several clock cycles to be completed, because they are converted into several microinstructions. Simple instructions, however, are usually converted in only one clock cycle because they are translated into a single microinstruction. Usually processor manufacturers optimize their CPUs to decode the most common instructions as fast as possible, in just one clock cycle.
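The difference between simple and complex instructions can be pictured with the Python sketch below. The instruction strings, the micro-op names, and the contents of both tables are invented for illustration only; AMD has not published how Bulldozer breaks down any particular instruction.

```python
# Hypothetical decode tables: instruction strings and micro-op names are
# made up for illustration and do not reflect AMD's internal encoding.
SIMPLE_INSTRUCTIONS = {
    "add eax, ebx": ["add"],
    "mov ecx, edx": ["mov"],
}
MICROCODE_ROM = {
    "add [mem], eax": ["load", "add", "store"],
}

def decode(instruction):
    """Translate one x86 instruction into a list of micro-ops."""
    if instruction in SIMPLE_INSTRUCTIONS:
        return SIMPLE_INSTRUCTIONS[instruction]   # fast path: a single micro-op
    return MICROCODE_ROM[instruction]             # complex path: several micro-ops

program = ["mov ecx, edx", "add [mem], eax", "add eax, ebx"]
micro_ops = [op for instr in program for op in decode(instr)]
```

In this toy program three x86 instructions become five micro-ops, with the one complex instruction accounting for three of them.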
After the instructions are decoded, they are sent to the appropriate scheduler, integer or floating-point. The Bulldozer architecture has only one floating-point unit, which is shared between the two “cores” available. On the other hand, it has two completely independent integer units, the so-called “cores.”
Each integer engine has four Execution units, labeled as:
It also has a Load/Store unit (“Ld/ST”), which is in charge of loading from memory, or storing in memory, the data requested by an instruction. Usually this unit is drawn side-by-side with the units listed above, but for some reason in this presentation AMD decided to draw it separately.
The Bulldozer architecture uses an out-of-order execution engine, like AMD64 CPUs and Intel CPUs since the Pentium Pro (P6 architecture). Because not all execution units can process all kinds of instructions, if there weren’t an out-of-order engine some of the execution units would sit idle at times. Let’s say the next instruction to be executed is an integer division, but the unit that is able to process this kind of instruction is busy processing another instruction. Instead of waiting for this unit to be free, the scheduler will look for an instruction that can be executed right away in one of the other units, if they are free, of course. So the role of the scheduler is to keep all execution units as busy as possible.
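The integer division example above can be turned into a minimal Python sketch of the scheduler’s decision. The unit names (“DIV,” “ALU,” “MUL”) and the function are hypothetical, and the sketch deliberately ignores data dependencies between instructions, which a real scheduler must also respect.

```python
def pick_next(queue, busy_units):
    """Return the index of the oldest waiting instruction whose unit is free, or None."""
    for i, (_name, unit) in enumerate(queue):
        if unit not in busy_units:
            return i
    return None

# Waiting instructions in program order, each tagged with the unit it needs.
queue = [("idiv", "DIV"), ("add", "ALU"), ("imul", "MUL")]
busy = {"DIV"}                 # the divider is still working on an earlier instruction

i = pick_next(queue, busy)     # skips the division and finds the addition
issued = queue.pop(i)          # the addition executes ahead of the older division
```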
After integer instructions are executed, they are sent to the Retire unit, where the CPU will put them back in the correct order.
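Putting instructions back in order can be sketched as follows in Python: an instruction may only retire once every older instruction has also finished. The function name and the instruction labels are hypothetical, and real retire logic tracks much more state than this.

```python
def retire_in_order(program_order, completed):
    """Return the instructions that may retire now: the longest finished prefix of the program."""
    retired = []
    for instr in program_order:
        if instr not in completed:
            break                  # an older instruction is still running: stop here
        retired.append(instr)
    return retired

program = ["i0", "i1", "i2", "i3"]

# i2 and i3 finished early (out of order), but i1 is still executing.
early = retire_in_order(program, {"i0", "i2", "i3"})

# Once i1 completes, everything can retire in the original order.
later = retire_in_order(program, {"i0", "i1", "i2", "i3"})
```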
The floating point unit also has four Execution units, labeled as:
The Bulldozer architecture will have a shared L2 memory cache for each pair of “cores.” An L3 memory cache will also be available, shared between all “cores.” The L2 memory cache will be a 16-way set associative cache, with a 1,024-entry TLB (Translation Look-aside Buffer).
AMD added some interesting features for managing power in their Bulldozer architecture, the most important being “power gating,” which allows the CPU to simply cut the power from unused CPU units to save power. It can also completely turn off any unused CPU “core.” AMD also added a feature to measure CPU activity to estimate power being dissipated. The phrase “Hardware uses higher frequency when power limit allows” is an indication that AMD is adding a technology similar to Intel’s Turbo Boost, which automatically overclocks the CPU if thermal dissipation remains within specs.
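The idea behind “Hardware uses higher frequency when power limit allows” can be sketched in Python as follows. Every number here (the frequency steps, the 100 W limit, the watts-per-core figure) and the linear power model are assumptions chosen only to make the example run; AMD has not disclosed how the real mechanism estimates power or selects a clock.

```python
def make_power_estimator(active_cores, watts_per_core_per_ghz=8.0):
    """Build a crude, linear power model (an assumption, not AMD's method)."""
    def estimate(freq_mhz):
        return active_cores * watts_per_core_per_ghz * freq_mhz / 1000
    return estimate

def pick_frequency(steps_mhz, estimate_power_w, power_limit_w):
    """Pick the highest frequency whose estimated power stays within the limit."""
    allowed = [f for f in steps_mhz if estimate_power_w(f) <= power_limit_w]
    return max(allowed) if allowed else min(steps_mhz)

STEPS_MHZ = [3000, 3300, 3600]   # hypothetical frequency steps
POWER_LIMIT_W = 100              # hypothetical power limit

all_busy = pick_frequency(STEPS_MHZ, make_power_estimator(4), POWER_LIMIT_W)
half_gated = pick_frequency(STEPS_MHZ, make_power_estimator(2), POWER_LIMIT_W)
```

With four busy “cores” the model stays at the base clock, while with two “cores” power gated there is enough headroom to run the remaining ones at the highest step.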