In this tutorial we explain how the Pentium 4 works in easy-to-follow language. You will learn how its architecture works, so you will be able to compare it more precisely with previous Intel processors and with competing processors from AMD.
The Pentium 4 and the new Celeron processors use Intel’s seventh-generation architecture, also called NetBurst. Figure 1 shows an overview of it. Don’t be scared: we will explain in depth what this diagram means.
Before continuing, however, you should read our tutorial “How a CPU Works”, which explains the basics of how a CPU works. The present tutorial assumes that you have already read it; if you haven’t, please take a moment to read it first, otherwise you may find yourself a little lost. In fact, you can consider the present tutorial a sequel to “How a CPU Works”.
Here are the basic differences between the Pentium 4 architecture and the architecture of other CPUs:
- Externally, the Pentium 4 transfers four data items per clock cycle. This technique is called QDR (Quad Data Rate), and it gives the local bus an effective performance four times its actual clock rate; see the table below. In Figure 1 this is shown as “3.2 GB/s System Interface”; since this slide was produced when the very first Pentium 4 was released, it mentions the “400 MHz” system bus.
| Real Clock | Performance (Effective Clock) | Transfer Rate |
|---|---|---|
| 100 MHz | 400 MHz | 3.2 GB/s |
| 133 MHz | 533 MHz | 4.2 GB/s |
| 200 MHz | 800 MHz | 6.4 GB/s |
| 266 MHz | 1,066 MHz | 8.5 GB/s |
- The datapath between the L2 memory cache (“L2 cache and control” in Figure 1) and the L1 data cache (“L1 D-Cache and D-TLB” in Figure 1) is 256 bits wide. On previous Intel processors this datapath was only 64 bits wide, so this communication can be four times faster than on previous-generation processors running at the same clock. The datapath between the L2 memory cache (“L2 cache and control” in Figure 1) and the pre-fetch unit (“BTB & I-TLB” in Figure 1), however, remains 64 bits wide.
- The L1 instruction cache was relocated. Instead of sitting before the fetch unit, the L1 instruction cache now sits after the decode unit, under a new name: “Trace Cache”. This trace cache can hold up to 12 K microinstructions. Since each microinstruction is 100 bits wide, the trace cache holds 150 KB (12 K x 100 / 8). One of the most common mistakes people make when commenting on the Pentium 4 architecture is saying that the Pentium 4 doesn’t have any instruction cache at all. That’s absolutely not true. It is there, but with a different name and in a different location.
- The Pentium 4 has 128 internal registers, whereas Intel’s sixth-generation processors (like the Pentium II and Pentium III) had only 40. These registers are in the Register Renaming Unit (a.k.a. RAT, Register Alias Table, shown as “Rename/Alloc” in Figure 1).
- The Pentium 4 has five execution units working in parallel and two units for loading and storing data in RAM.
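The arithmetic behind two of the figures above (the QDR transfer rates and the 150 KB trace cache size) can be sketched in a few lines of Python. This is only an illustration: the function names are ours, and the 64-bit (8-byte) width of the system bus is an assumption that the text above does not state. Note also that the “133 MHz” and “266 MHz” clocks in the table are rounded nominal values, so results computed for those rows may differ slightly from the figures in the table.

```python
# Assumption (not stated in the text above): the system bus is 64 bits = 8 bytes wide.
BUS_WIDTH_BYTES = 8
# QDR (Quad Data Rate): four transfers per real clock cycle.
TRANSFERS_PER_CLOCK = 4


def effective_clock_mhz(real_clock_mhz):
    """Effective ("performance") clock of a QDR bus, in MHz."""
    return real_clock_mhz * TRANSFERS_PER_CLOCK


def transfer_rate_gb_s(real_clock_mhz):
    """Peak transfer rate in GB/s (decimal: 1 GB/s = 1000 MB/s)."""
    # MHz x bytes per transfer gives MB/s; divide by 1000 to get GB/s.
    return effective_clock_mhz(real_clock_mhz) * BUS_WIDTH_BYTES / 1000


def trace_cache_size_kb(capacity_k_uops=12, uop_width_bits=100):
    """Trace cache size in KB: K microinstructions x bits each / 8 bits per byte."""
    return capacity_k_uops * uop_width_bits / 8


if __name__ == "__main__":
    for real_clock in (100, 200):
        print(real_clock, "MHz real clock:",
              effective_clock_mhz(real_clock), "MHz effective,",
              transfer_rate_gb_s(real_clock), "GB/s")
    print("Trace cache:", trace_cache_size_kb(), "KB")
```

With these assumptions, the 100 MHz and 200 MHz rows of the table and the 12 K x 100 / 8 calculation for the trace cache can be reproduced directly.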
Of course this is just a summary for those who already have some knowledge of the architecture of other processors. If all this looks like Greek to you, don’t worry. We will explain everything you need to know about the Pentium 4 architecture in easy-to-follow language on the next pages.
- 1. Introduction
- 2. Pentium 4 Pipeline
- 3. Memory Cache and Fetch Unit
- 4. Decoder
- 5. Allocator and Register Renamer
- 6. Scheduler
- 7. Dispatch and Execution Units