Inside AMD64 Architecture
By Gabriel Torres on May 16, 2006
In this tutorial we will explain, in easy-to-follow language, how the AMD64 architecture – which is used by Athlon 64, Athlon 64 X2, Athlon 64 FX, Opteron, Turion 64 and some Sempron models – works. You will learn exactly how this architecture, also known as K8 or Hammer, works, so you will be able to compare it more precisely to competing architectures from Intel.
Before continuing, you need to have read our tutorial “How a CPU Works”, where we explain the basics of how a CPU works. In the present tutorial we assume that you have already read it, so if you haven’t, please take a moment to read it before continuing, otherwise you may find yourself a little bit lost. In fact, the present tutorial can be considered a sequel to our How a CPU Works tutorial. You may also be interested in reading our tutorials about other CPUs, so you can compare AMD’s architecture to Intel’s: Inside Pentium 4 Architecture, Inside Pentium M Architecture and Inside Intel Core Microarchitecture.
The main difference between AMD64 architecture and the design of other CPUs – including previous CPUs from AMD, like Athlon XP and the original Athlon – is that the memory controller is embedded in the CPU, and not in the north bridge chip (the main chip of the motherboard chipset). So, on motherboards targeted at CPUs based on AMD64 architecture, the “north bridge” chip is just a bridge between the CPU, the graphics bus of choice (AGP or PCI Express) and the south bridge chip. Since this “north bridge” is simpler to make, some manufacturers have single-chip chipset models for AMD64 CPUs.
Since the memory controller is embedded in the CPU, the memory capacity – including memory types supported and support for dual channel – is defined by the CPU and not by the north bridge (i.e., by the motherboard), as happens with CPUs based on other architectures. A side effect of this architecture is that motherboards for AMD64 CPUs show no noticeable performance difference between them, as all of them use the very same memory controller (the one inside the CPU). This statement is only true for motherboards without on-board video, as motherboards with embedded graphics have an on-board video controller, which is outside the CPU, so their performance varies depending on the video controller used.
In Figure 1, you can see the architecture used by other CPUs, while in Figure 2 you can see the architecture used by AMD64 CPUs.
We can say that the north bridge chip is embedded inside the CPU. On the motherboard you will find a bridge chip, which provides the interface between the HyperTransport bus (i.e., the CPU), the video card bus (AGP or PCI Express x16) and the south bridge. Sometimes the chipset manufacturer builds this bridge chip and the south bridge into a single chip. This is what is called a “single-chip solution”.
The memory controller embedded in the AMD64 processors can drive up to four memory modules per channel. So on a dual-channel system it can drive eight memory modules. The number of sockets actually available on the motherboard is decided by the motherboard manufacturer’s design.
The communication between AMD64 CPUs and the bridge chip is made through a bus called HyperTransport. The HyperTransport speed depends on the CPU model. Typical values are 3,200 MB/s in each direction (a.k.a. “800 MHz”, “1,600 MHz” or “6,400 MB/s”) or 4,000 MB/s in each direction (a.k.a. “1,000 MHz”, “2,000 MHz” or “8,000 MB/s”). To better understand the HyperTransport bus, please read our tutorial on the subject.
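The several names given to the same HyperTransport speed all come from one calculation. The sketch below is only an illustration: it assumes a 16-bit link in each direction and two transfers per clock cycle, neither of which is stated above, and the function name is made up for this example.

```python
def ht_bandwidth_mb_s(clock_mhz, link_width_bits=16, transfers_per_clock=2):
    """One-way HyperTransport bandwidth in MB/s.

    Assumptions (not taken from the article): a 16-bit link per
    direction and two transfers per clock cycle.
    """
    bytes_per_transfer = link_width_bits // 8
    return clock_mhz * transfers_per_clock * bytes_per_transfer


for clock in (800, 1000):
    one_way = ht_bandwidth_mb_s(clock)
    print(f"{clock} MHz clock -> {clock * 2} MT/s, "
          f"{one_way} MB/s per direction, {one_way * 2} MB/s both directions")
```

Under these assumptions, “800 MHz” is the real clock, “1,600 MHz” counts transfers per second, 3,200 MB/s is the rate in one direction and 6,400 MB/s adds both directions together.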
AMD64 CPUs can have more than one HyperTransport bus. While all AMD64 CPUs targeted to desktops and notebooks – Athlon 64, Athlon 64 FX, Athlon 64 X2, Sempron and Turion 64 – have only one HyperTransport bus, AMD64 models for servers and workstations – Opteron – can have more than one HyperTransport bus.
Opteron CPUs from the 1xx series don’t support multiprocessing and thus have only one HyperTransport bus, working just as shown in Figure 2. Opteron CPUs from the 2xx series support multiprocessing with up to two CPUs and have two HyperTransport busses. Opteron CPUs from the 8xx series support multiprocessing with up to eight CPUs and have three HyperTransport busses. These extra busses are used to interconnect the CPUs; see Figures 3 through 5.
AMD’s approach to multiprocessing is also very interesting. Since each CPU has its own memory controller, each CPU accesses its own memory modules. For example, on a quad-Opteron system with 4 GB RAM, each CPU has 1 GB RAM for itself. On a Xeon system, for example, the 4 GB would be shared by all CPUs. Also, since each CPU can drive up to four memory modules per channel, the quad-CPU system shown in Figure 4 could directly drive up to 32 memory modules (eight per CPU). It is the motherboard manufacturer, however, who defines the number of sockets available on the motherboard (i.e., saying that a quad-Opteron system can have up to 32 memory modules does not mean that all quad-Opteron systems have 32 memory sockets).
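The module and memory figures above follow from simple multiplication. Here is a minimal sketch, assuming every CPU runs in dual-channel mode and that the installed memory is split evenly between the CPUs; the constant and function names are made up for this illustration.

```python
MODULES_PER_CHANNEL = 4   # from the article: four modules per channel
CHANNELS_PER_CPU = 2      # assumption: every CPU uses dual-channel mode


def max_modules(cpu_count):
    """Upper limit of memory modules the CPUs can drive together."""
    return cpu_count * CHANNELS_PER_CPU * MODULES_PER_CHANNEL


def local_memory_gb(total_gb, cpu_count):
    """Memory attached to each CPU, assuming an even split."""
    return total_gb / cpu_count


print(max_modules(1))          # single CPU
print(max_modules(4))          # quad-Opteron system
print(local_memory_gb(4, 4))   # GB attached to each CPU
```

These numbers are only the limit imposed by the memory controllers; the actual socket count remains a motherboard design decision.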
In the figures above you can see a block labeled “I/O”. This “I/O” can represent any kind of bridge: it could be a regular south bridge; it could be an AGP or PCI Express x16 bridge for graphics; it could be a PCI-X or PCI Express bridge for general purpose add-on cards, etc.
How the HyperTransport busses are connected inside the CPU is shown in Figure 6. AMD64 CPUs have a “crossbar”, which is in charge of routing data and commands from and to the CPU, memory and the HyperTransport busses. The System Request Interface (SRI) is also known as System Request Queue (SRQ), while APIC stands for Advanced Programmable Interrupt Controller. The diagram considers a dual-core CPU.
AMD64 CPUs are available in different socket types. You must use a motherboard with the same socket type your CPU has. Different socket types exist because of the different memory controller specs available. So far you can find AMD64 CPUs using the following sockets:
On the pictures below you can see the physical difference between socket 754 and socket 939, which are the two most popular socket types for AMD64 CPUs.
When it was released with Athlon 64, AMD64 architecture brought a new 64-bit mode for x86 instructions. This mode is called x86-64 by AMD, and it expands the existing 32-bit registers into 64-bit ones. All AMD64 CPUs have sixteen 64-bit general purpose registers when running under x86-64 mode. Under this mode the CPU address bus is also expanded from 32 to 40 bits, enabling the CPU to directly access up to 1 TB of RAM (2^40). Also under this mode the CPU can access up to 256 TB of virtual memory (2^48). Virtual memory is a technique that allows the CPU to simulate more RAM than the computer has by creating a file on the hard disk drive called a swap file.
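The memory limits quoted above are just powers of two. The short sketch below reproduces them; the function name is only for illustration.

```python
GB = 2 ** 30
TB = 2 ** 40


def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits


print(addressable_bytes(32) // GB, "GB with 32 address bits")
print(addressable_bytes(40) // TB, "TB with 40 address bits (physical)")
print(addressable_bytes(48) // TB, "TB with 48 address bits (virtual)")
```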
Intel copied all these features, so they are not an exclusive feature from AMD anymore. However, while all AMD64 CPUs support the x86-64 mode (the exception goes for the early socket 754 Sempron CPUs), not all recent CPUs from Intel support it.
To use this mode, however, it is necessary to run a 64-bit operating system. Don’t expect to access more than 4 GB RAM with an Athlon 64 running regular Windows XP, for example, since regular Windows XP runs under 32-bit mode. For a more detailed explanation about the 64-bit mode, read our tutorial on this subject.
As you may have noticed, the AMD64 memory controller works with DDR or DDR2 technology. Double Data Rate technology works by transferring two pieces of data per clock cycle. So when using your Athlon 64 with DDR400/PC3200 memories, for example, the CPU accesses them at 200 MHz and not at 400 MHz (DDR and DDR2 memories are rated at double the real clock rate they use).
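The relationship between the two names of the same memory (DDR400 and PC3200) can be worked out as below. This is a sketch that assumes a 64-bit (8-byte) memory channel, a detail not stated above; the function name is hypothetical.

```python
def ddr_figures(rated_mhz, bus_width_bits=64):
    """Return (real clock in MHz, peak transfer rate in MB/s).

    Assumption (not from the article): each memory channel
    is 64 bits wide.
    """
    real_clock_mhz = rated_mhz // 2                   # two transfers per clock
    peak_mb_s = rated_mhz * (bus_width_bits // 8)     # transfers x bytes each
    return real_clock_mhz, peak_mb_s


clock, rate = ddr_figures(400)
print(f"DDR400: real clock {clock} MHz, peak {rate} MB/s per channel")
```

The peak rate obtained for DDR400 matches the “3200” in the PC3200 name.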
All AMD64 CPUs have one 64 KB L1 instruction cache and one 64 KB L1 data cache. The L2 memory cache varies according to the CPU model. On dual-core CPUs the L2 cache is separated, i.e., each core has its own L2 memory cache. On the latest Intel CPUs (Core Duo and Core 2 Duo) the CPU has only one L2 cache, which is shared by both cores (Intel claims that this approach improves performance).
A pipeline is the list of all stages a given instruction must go through in order to be fully executed. AMD64 architecture uses a 12-stage pipeline for executing integer instructions and a 17-stage pipeline for executing floating-point ones. So it takes 12 or 17 steps for a given instruction to be executed on AMD64 CPUs. AMD’s previous architecture – K7, which was used by the original Athlon, Athlon XP and some Sempron models – had a 10-stage pipeline. The Pentium 4 pipeline has 20 stages and the Pentium 4 “Prescott” pipeline has 31 stages. Intel has since reversed course, and the forthcoming Core 2 Duo processor will have a 14-stage pipeline.
Let’s study AMD64’s integer pipeline. It is based on the K7 architecture pipeline, the main difference being the decoder stages, which were broken into several different stages, probably to allow AMD64 CPUs to achieve a higher clock rate.
Here is a basic explanation of each stage, which explains how a given instruction is processed by processors based on AMD64 architecture. If you think this is too complex for you, don’t worry. This is just a summary of what we will be explaining in the next pages.
On AMD64 architecture the datapath between the L2 memory cache and the L1 data cache is 128 bits wide. On Intel’s 7th generation CPUs (Pentium 4) this datapath is 256 bits wide and on Intel’s 6th generation CPUs (Pentium Pro, Pentium II, Pentium III and Pentium M) it is 64 bits wide.
The L1 instruction cache of AMD64 processors includes pre-decode logic, i.e., each byte stored inside the L1 instruction cache has some bits to mark the start and the end of each instruction. Since x86 instructions don’t have a fixed length (they can have anything from 1 byte to 15 bytes*), the process of detecting where each instruction starts and ends is very important to the CPU decoder.
* You may find yourself quite lost by this statement, since you were always told that x86 architecture uses 32-bit (i.e., 4-byte) instructions, so further explanation is necessary in order to clarify it.
Inside the CPU, what is considered an instruction is the instruction opcode (the machine language equivalent of the assembly language instruction) plus all its required data. This is because in order to be executed, the instruction must enter the execution engine “completed”, i.e., together with all its required data. Also, the size of each x86 instruction opcode is variable and not fixed at 32 bits, as you may think. For example, an instruction like mov eax, (32-bit data), which stores the (32-bit data) into the CPU’s EAX register, is considered internally as a 40-bit instruction (mov eax translates into an 8-bit opcode plus the 32 bits from its data). Actually, having instructions with several different lengths is one of the characteristics of a CISC (Complex Instruction Set Computing) instruction set.
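To make the 40-bit example concrete, the sketch below builds the machine code for mov eax, (32-bit data) by hand. It relies on the standard x86 encoding of this instruction in 32-bit mode (opcode byte B8h followed by the 32-bit value in little-endian order); the function name is made up for this illustration.

```python
import struct


def encode_mov_eax_imm32(value):
    """Machine code for 'mov eax, imm32' in 32-bit mode:
    one opcode byte (0xB8) followed by the 32-bit immediate,
    stored least significant byte first."""
    return bytes([0xB8]) + struct.pack("<I", value)


encoded = encode_mov_eax_imm32(0x12345678)
print(encoded.hex(" "))               # b8 78 56 34 12
print(len(encoded) * 8, "bits")       # 40 bits
```

Other instructions encode to different lengths, which is why the pre-decode bits marking instruction boundaries matter so much to the decoder.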
If you want to learn more about this subject, read AMD64 Architecture Programmer’s Manual Vol. 3: General Purpose and System Instructions.
The L1 instruction cache provides 76 extra bits to the fetch unit: 52 pre-decode bits, eight parity bits and 16 branch selectors. The branch selector bits are used by the CPU to try to predict branches in the running program.
So in reality the L1 instruction cache is bigger than the 64 KB announced, since it stores pre-decode and branching information. In fact, the real size of the AMD64 L1 instruction cache is 102 KB (64 KB instruction cache + 4 KB parity + 26 KB pre-decode data + 8 KB branch data).
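The 102 KB total can be reconstructed from the 76 extra bits. The sketch below assumes those 76 bits accompany each 16-byte block of instructions; this granularity is an assumption not stated above, chosen because it reproduces the quoted sizes exactly.

```python
CACHE_BYTES = 64 * 1024   # announced L1 instruction cache size
BLOCK_BYTES = 16          # assumption: extra bits are kept per 16-byte block
EXTRA_BITS = {"pre-decode": 52, "parity": 8, "branch selectors": 16}

blocks = CACHE_BYTES // BLOCK_BYTES


def extra_kb(bits_per_block):
    """Storage in KB needed for a per-block field of the given width."""
    return blocks * bits_per_block / 8 / 1024


sizes = {name: extra_kb(bits) for name, bits in EXTRA_BITS.items()}
total_kb = CACHE_BYTES / 1024 + sum(sizes.values())

for name, kb in sizes.items():
    print(f"{name}: {kb:g} KB")
print(f"total: {total_kb:g} KB")
```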
AMD64 architecture uses a 2,048-entry BTB (Branch Target Buffer), the same size used on AMD’s previous architecture, K7. The BTB is a small memory that lists all identified branches in the program. Pentium 4’s BTB has 4,096 entries, while on 6th generation processors from Intel this buffer had 512 entries.
Another branching structure, the BHT (Branch History Table) – which AMD calls GHBC (Global History Bimodal Counter) – has 16,384 entries on AMD64 architecture, while on Pentium 4 it has 4,096 entries, the same size as the BHT found on AMD’s K7 architecture. Each two-bit entry is used to track a conditional branch in one of four states: "strongly taken", "taken", "not taken" and "strongly not taken".
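A two-bit entry with these four states is usually modeled as a saturating counter. The sketch below shows that textbook scheme; the exact update rule used inside AMD64 CPUs is not described above, so treat the class and its behavior as an assumption for illustration only.

```python
STATES = ["strongly not taken", "not taken", "taken", "strongly taken"]


class TwoBitCounter:
    """Textbook two-bit saturating counter (illustrative only)."""

    def __init__(self, state=1):
        self.state = state                    # index into STATES, 0..3

    def predict(self):
        """Predict 'taken' when in one of the two upper states."""
        return self.state >= 2

    def update(self, taken):
        """Move one step toward the observed outcome, saturating at the ends."""
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)


counter = TwoBitCounter()
for outcome in (True, True, True, False, True):   # e.g. a loop branch
    guess = counter.predict()
    counter.update(outcome)
    print(f"predicted taken={guess}, actual taken={outcome}, "
          f"now '{STATES[counter.state]}'")
```

The point of using two bits instead of one is that a single unexpected outcome, such as the exit of a loop, does not immediately flip the prediction.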
AMD CPUs have used a hybrid CISC/RISC architecture since their 5th generation CPUs (namely K5). Intel started using this approach only from their 6th generation CPUs on. The processor must accept CISC instructions, also known as x86 instructions, since all software available today is written using this kind of instruction. A RISC-only CPU couldn’t be created for the PC because it wouldn’t run the software we have available today, like Windows and Office.
So, the solution used by all processors available on the market today from both AMD and Intel is to use a CISC/RISC decoder. Internally the CPU processes RISC-like instructions, but its front-end accepts only CISC x86 instructions.
CISC x86 instructions are referred to as “instructions”, while the internal RISC instructions are referred to as “microinstructions”, “micro-ops”, “µops” or “ROPs”. AMD64 architecture has a third instruction type, called macro-op or “MOP”, which is the instruction produced by the instruction decoder. AMD64 deals internally with macro-ops. When a macro-op reaches the appropriate scheduler, it is further decoded into micro-ops, and these micro-ops are then executed. This is somewhat similar to what Intel is doing on its new Core architecture with the macro-fusion feature. However, while macro-fusion on Core-based processors only works with branch instructions, on AMD64 macro-ops are used for all instructions.
The RISC microinstructions, however, cannot be accessed directly, so we couldn’t create software based on these instructions to bypass the decoder. Also, each CPU uses its own RISC instructions, which are not publicly documented and are incompatible with microinstructions from other CPUs. I.e., AMD64 microinstructions are different from Pentium 4 microinstructions, which are different from AMD’s K7 architecture microinstructions.
Depending on the complexity of the x86 instruction, it has to be converted into several RISC microinstructions.
On AMD64 architecture x86 instructions can be converted into macro-ops in three different ways: using a simple decoder, called DirectPath Single, which translates one common x86 instruction into a single macro-op; using another simple decoder, called DirectPath Double, which translates one x86 instruction into two macro-ops; or using a complex decoder, called VectorPath, which translates one complex x86 instruction into several macro-ops. The VectorPath decoder has to call a ROM memory (called Microcode Sequencer) to convert the x86 instruction.
Here is how the AMD64 decoder works. In the Pick stage, also known as Scan, the CPU locates and separates the instructions present in its Instruction Byte Buffer, deciding which path to use: DirectPath or VectorPath.
Then comes the Decode stage, which is broken into two steps, where the x86 instructions are actually converted into macro-ops. This stage is equivalent to the Align stage found on K7 processors. The maximum decoder output rate is six macro-ops per clock cycle, three for DirectPath and three for VectorPath.
The macro-ops go to the Pack stage (which is the equivalent of the Decode 1 stage on K7 architecture), where they are packed together so that three macro-ops are sent to the next stage, pack/decode. This stage does some more decoding and sends the macro-ops to the Instruction Control Unit, which is the name given by AMD to what Intel calls the Reorder Buffer (ROB).
As mentioned, the Instruction Control Unit is the Reorder Buffer of AMD64 processors. Here the macro-ops can be picked and sent to the schedulers out-of-order, i.e., not in the same order the instructions appeared on the program that is being executed. For example, if the program has something like this:
AMD64 architecture has three integer execution engines and three floating-point execution engines. If it didn’t have an out-of-order execution engine, its floating-point engines would be idle when running this program, since the fourth instruction is also an integer instruction and can’t be executed at the same time because all three integer execution engines are already being used. Since it implements out-of-order execution, the fifth instruction, the first FP instruction, can be sent to execution together with the first one, increasing CPU performance. In fact, since it has three FPUs, both FP instructions available in this program could be dispatched at the same time. The goal of the scheduler is to keep all CPU execution engines busy all the time.
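The effect described above can be seen in a toy model. The sketch below assumes a program made of four integer instructions followed by two floating-point ones (an assumption consistent with the description), ignores data dependencies and instruction latency, and lets at most three integer and three FP instructions start per clock cycle. All names are hypothetical.

```python
PROGRAM = ["int", "int", "int", "int", "fp", "fp"]   # assumed example program
UNITS = {"int": 3, "fp": 3}                          # execution engines per type


def schedule(program, out_of_order):
    """Return a list of cycles, each listing the instructions started.

    In-order mode stops at the first instruction that finds no free
    engine; out-of-order mode keeps scanning the remaining instructions.
    """
    pending = list(program)
    cycles = []
    while pending:
        free = dict(UNITS)
        issued, remaining = [], []
        blocked = False
        for kind in pending:
            if not blocked and free[kind] > 0:
                free[kind] -= 1
                issued.append(kind)
            else:
                remaining.append(kind)
                if not out_of_order:
                    blocked = True       # later instructions must wait
        cycles.append(issued)
        pending = remaining
    return cycles


print("in-order:    ", schedule(PROGRAM, out_of_order=False))
print("out-of-order:", schedule(PROGRAM, out_of_order=True))
```

In the in-order case the FP engines sit idle during the first cycle, while in the out-of-order case both FP instructions start together with the first three integer ones.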
The reorder buffer available on AMD64 architecture has 72 entries and what is quite interesting is that each integer execution engine has its own scheduler with its own buffer (8 entries each). The FP execution units have only one 36-entry scheduler. So AMD64 has a total of four schedulers, the same amount available on Pentium 4.
The reorder buffer is also in charge of register renaming. CISC x86 architecture has only eight 32-bit registers (EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP). This number is simply too low, especially because modern CPUs can execute code out-of-order, which would “kill” the contents of a given register, crashing the program.
So, at this stage, the processor maps each register used by the program onto one of the 96 internal registers available, allowing an instruction to run at the same time as another instruction that uses the exact same standard register, or even out-of-order, i.e., this allows the second instruction to run before the first instruction even if they both use the same register.
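Register renaming can be pictured as a lookup table from the program’s register names to internal registers. The sketch below is a deliberately simplified model: it hands out a fresh internal register for every result and never recycles them, which real hardware does once instructions retire. The class and method names are invented for this illustration; only the figure of 96 internal registers comes from the text.

```python
ARCH_REGS = ["eax", "ebx", "ecx", "edx", "ebp", "esi", "edi", "esp"]


class Renamer:
    """Toy register renamer (illustrative; never frees registers)."""

    def __init__(self, internal_count=96):
        # Each x86 register starts out mapped to its own internal register.
        self.mapping = {name: i for i, name in enumerate(ARCH_REGS)}
        self.free = list(range(len(ARCH_REGS), internal_count))

    def rename(self, dest, sources):
        """Return (internal register for the result, internal source registers)."""
        src = [self.mapping[s] for s in sources]   # read the current mappings
        new = self.free.pop(0)                     # fresh register for the result
        self.mapping[dest] = new
        return new, src


r = Renamer()
first = r.rename("eax", ["ebx"])    # e.g. mov eax, ebx
second = r.rename("eax", ["ecx"])   # e.g. mov eax, ecx (same destination)
third = r.rename("edx", ["eax"])    # reads the most recent eax
print(first, second, third)
```

Because the two writes to EAX land in different internal registers, neither can overwrite the other’s result, so they are free to execute in any order.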
AMD64 architecture has 96 internal registers, while Pentium 4 has 128. On Intel’s 6th generation processors (like Pentium II and Pentium III) there were only 40 internal registers. It is interesting to note the trick AMD used on AMD64 architecture to reach those 96 registers. They simply created a result field on each one of the 72 reorder buffer entries for storing the result of each instruction (this isn’t available on Pentium 4, which needs to allocate an internal register for storing the result each time an instruction is executed). In addition, its register file (or IFFRF, Integer Future File and Register File, as AMD calls it) has 40 entries, but 16 of them store the “correct” value of each x86 register and cannot be used for renaming, leaving 24. So while the correct answer for “how many internal registers does AMD64 architecture have?” is 40, the effective number is 96 (72 + 24) due to this architectural difference.
AMD64 architecture has three integer execution units (a.k.a. ALU, Arithmetic and Logic Unit, or IEU, Integer Execution Unit), three address generation units (AGU) and three floating-point units (FPUs). It has one more integer unit than Pentium 4. The maximum instruction dispatch rate is six instructions per clock cycle, the same as on Pentium 4.
As you can see in Figure 15, there are certain FP instructions that can only be processed in a specific FPU. FPAD stands for floating-point addition instructions, like ADDPS, while FMUL stands for floating-point multiplication instructions, like MULPS (both of which, by the way, are SSE instructions).