Inside Pentium 4 Architecture
By Gabriel Torres on October 18, 2005


Introduction

In this tutorial we will explain you how Pentium 4 works in an easy to follow language. You will learn exactly how its architecture works so you will be able to compare it more precisely to previous processors from Intel and competitors from AMD.

Pentium 4 and new Celeron processors use Intel’s seventh generation architecture, also called Netburst. Its overall look you can see in Figure 1. Don’t get scared. We will explain deeply what this diagram is about.

In order to continue, however, you need to have read our tutorial “How a CPU Works”. In this tutorial we explain the basics about how a CPU works. In the present tutorial we are assuming that you have already read it, so if you didn’t, please take a moment to read it before continuing, otherwise you may find yourself a little bit lost. Actually we can consider the present tutorial as a sequel to our How a CPU Works tutorial.

Pentium 4 Architecture
click to enlarge
Figure 1: Pentium 4 block diagram.

Here are the basic differences between Pentium 4 architecture and the architecture from other CPUs:

Real Clock

Performance

Transfer Rate

100 MHz 

400 MHz 

3.2 GB/s

133 MHz 

533 MHz

 4.2 GB/s

200 MHz

800 MHz

6.4 GB/s

266 MHz

1,066 MHz

8.5 GB/s

Of course this is just a summary for those who already has some knowledge on the architecture from other processors. If all this look like Greek to you, don’t worry. We will explain everything you need to know about Pentium 4 architecture in an easy to follow language in the next pages.

Pentium 4 Pipeline

Pipeline is a list of all stages a given instruction must go through in order to be fully executed. On 6th generation Intel processors, like Pentium III, their pipeline had 11 stages. Pentium 4 has 20 stages! So, on a Pentium 4 processor a given instruction takes much longer to be executed then on a Pentium III, for instance! If you take the new 90 nm Pentium 4 generation processors, codenamed “Prescott”, the case is even worse because they use a 31-stage pipeline! Holy cow!

This was done in order to increase the processor clock rate. By having more stages each individual stage can be constructed using fewer transistors. With fewer transistors is easier to achieve higher clock rates. In fact, Pentium 4 is only faster than Pentium III because it works at a higher clock rate. Under the same clock rate, a Pentium III CPU would be faster than a Pentium 4 because of the size of the pipeline.

Because of that, Intel has already announced that their 8th generation processors will use Pentium M architecture, which is based on Intel’s 6th generation architecture (Pentium III architecture) and not on Netburst (Pentium 4) architecture. This arquitecture, called Core, can be studied on our Inside Core Microarchitecture tutorial.

In Figure 2, you can see Pentium 4 20-stage pipeline. So far Intel didn’t disclosure Prescott’s 31-stage pipeline, so we can’t talk about it.

Pentium 4 Architecture
click to enlarge
Figure 2: Pentium 4 pipeline.

Here is a basic explanation of each stage, which explains how a given instruction is processed by Pentium 4 processors. If you think this is too complex for you, don’t worry. This is just a summary of what we will be explaining in the next pages.

Memory Cache and Fetch Unit

Pentium 4’s L2 memory cache can be of 256 KB, 512 KB, 1 MB or 2 MB, depending on the model. L1 data cache is of 8 KB or 16 KB (on 90 nm models).

As we explained before, the L1 instruction cache was moved from before the fetch unit to after the decode unit using a new name, “trace cache”. So, instead of storing program instructions to be loaded by the fetch unit, the trace cache stores microinstructions already decoded by the decode unit. The trace cache can store up to 12 K microinstructions and since Pentium 4 microinstructions are 100-bit wide, the trace cache is of 150 KB (12,288 x 100 / 8).

The idea behind this architecture is really interesting. In the case of a loop on the program (a loop is a part of a program that needs to be repeated several times), the instructions to be executed will be already decoded, because they are stored already decoded on the trace cache. On other processors, the instructions need to be loaded from L1 instruction cache and decoded again, even if they were decoded a few moments before.

The trace cache also has its own BTB (Branch Target Buffer) of 512 entries. BTB is a small memory that lists all identified branches on the program.

As for the fetch unit, its BTB was increased to 4,096 entries. On Intel 6th generation processors, like Pentium III, this buffer was of 512 entries and on Intel 5th generation processors, like the first Pentium processor, this buffer was of 256 entries only.

In Figure 3 you see the block diagram for what we were discussing. TLB means Translation Lookaside Buffer.

Pentium 4 Architecture
click to enlarge
Figure 3: Fetch and decode units and trace cache.

Decoder

Since previous generation (6th generation), Intel processors use a hybrid CISC/RISC architecture. The processor must accept CISC instructions, also known as x86 instructions, since all software available today is written using this kind of instructions. A RISC-only CPU couldn’t be create for the PC because it wouldn’t run software we have available today, like Windows and Office.

So, the solution used by all processors available on the market today from both Intel and AMD is to use a CISC/RISC decoder. Internally the CPU processes RISC-like instructions, but its front-end accepts only CISC x86 instructions.

CISC x86 instructions are referred as “instructions” as the internal RISC instructions are referred as “microinstructions” or “µops”.

These RISC microinstructions, however, cannot be accessed directly, so we couldn’t create software based on these instructions to bypass the decoder. Also, each CPU uses its own RISC instructions, which are not public documented and are incompatible with microinstructions from other CPUs. I.e., Pentium III microinstructions are different from Pentium 4 microinstructions, which are different from Athlon 64 microinstructions.

Depending on the complexity of the x86 instruction, it has to be converted into several RISC microinstructions.

Pentium 4 decoder can decode one x86 instruction per clock cycle, as long as the instruction decodes in up to four microinstructions. If the x86 instruction to be decoded is complex and will be translated in more than four microinstructions, it is routed to a ROM memory (“Microcode ROM” in Figure 3) that has a list of all complex instructions and how they should be translated. This ROM memory is also called MIS (Microcode Instruction Sequencer).

As we said earlier, after being decoded microinstructions are sent to the trace cache, and from there they go to a microinstructions queue. The trace cache can put up to three microinstructions on the queue per clock cycle, however Intel doesn’t tell the depth (size) of this queue.

From there, the instructions go to the Allocator and Register Renamer. The queue can also deliver up to three microinstructions per clock cycle to the allocator.

Allocator and Register Renamer

What the allocator does:

After that the microinstruction goes to the register renaming stage. CISC x86 architecture has only eight 32-bit registers (EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP). This number is simply too low, especially because modern CPUs can execute code out-of-order, what would “kill” the contents of a given register, crashing the program.

So, at this stage, the processor changes the name and contents of the registers used by the program into one of the 128 internal registers available, allowing the instruction to run at the same time of another instruction that uses the exact same standard register, or even out-of-order, i.e., this allows the second instruction to run before the first instruction even if they mess with the same register.
 
It is interesting to note that Pentium 4 has actually 256 internal registers, 128 registers for integer instructions and 128 registers for floating point and SSE instructions.

Pentium 4 renamer is capable of processing three microinstructions per clock cycle.

From the renamer the microinstructions go to a queue, according to its type: memory queue, for memory-related microinstructions, or Integer/Floating Point Queue, for all other instruction types.

Pentium 4 Architecture
click to enlarge
Figure 4: Allocator and Register Renamer.

Scheduler

The scheduler has four units. It analyses each microinstruction and puts it on a scheduler unit according to its type:

So, the scheduler sorts the microinstructions according to their type. Then the scheduler can dispatch each microinstruction directly to the correct execution unit to be processed.

The scheduler is the heart of the out-of-order engine on Pentium 4. Until now all microinstructions were delivered on the same order they were decoded. On the scheduler the microinstructions can be dispatched on a totally different order to the execution engines. The goal of the scheduler it to keep all CPU execution engines busy all the time.

The execution units are connected to the scheduler through four dispatch ports, numbered 0 through 3, as you can see in Figure 5.

Pentium 4 Architecture
click to enlarge
Figure 5: Scheduler units.

Dispatch and Execution Units

As we’ve seen, Pentium 4 has four dispatch ports numbered 0 through 3. Each port is connected to one, two or three execution units, as you can see in Figure 6.

Pentium 4 Architecture
click to enlarge
Figure 6: Dispatch and execution units.

The units marked as “clock x2” can execute two microinstructions per clock cycle. Ports 0 and 1 can send two microinstructions per clock cycle to these units. So the maximum number of microinstructions that can be dispatched per clock cycle is six:

Keep in mind that complex instructions may take several clock cycles to be processed. Let’s take an example of port 1, where the complete floating point unit is located. While this unit is processing a very complex instruction that takes several clock ticks to be executed, port 1 dispatch unit won’t stall: it will keep sending simple instructions to the ALU (Arithmetic and Logic Unit) while the FPU is busy.

So, even thought the maximum dispatch rate is six microinstructions, actually the CPU can have up to seven microinstructions being processed at the same time.
 
Actually that’s why ports 0 and 1 have more then one execution unit attached. If you pay attention, Intel put on the same port one fast unit together with at least one complex (and slow) unit. So, while the complex unit is busy processing data, the other unit can keep receiving microinstructions from its corresponding dispatch port. As we mentioned before, the idea is to keep all execution units busy all the time.

The two double-speed ALUs can process two microinstructions per clock cycle. The other units need at least one clock cycle to process the microinstructions they receive. So, Pentium 4 architecture is optimized for simple instructions.

As you can see in Figure 6, dispatch ports 2 and 3 are dedicated to memory operations: load (read data from memory) and store (write data to memory), respectively. As for memory operation, it is interesting to note that port 0 is also used during store operations (see Figure 5 and the list of operations in Figure 6). On such operations, port 3 is used to send the memory address, while port 0 is used to send the data to be stored at this address. This data can be generated by either the ALU or the FPU, depending on the kind of data to be stored (integer or floating point/SSE).

In Figure 6 you have a complete list of the kinds of instructions each execution unit deals with. FXCH and LEA (Load Effective Address) are two x86 instructions. Actually Intel’s implementation for FXCH instruction on Pentium 4 caused a great deal of surprise to all experts, because on processors from previous generation (Pentium III) and processors from AMD this instruction can be executed at zero clock cycle, while on Pentium 4 it takes some clock cycles to be executed.

That’s it. If you want to compare Pentium 4 architecture to Athlon 64's, read our Inside AMD64 Architecture tutorial.

Originally at http://www.hardwaresecrets.com/article/Inside-Pentium-4-Architecture/235


© 2004-14, Hardware Secrets, LLC. All Rights Reserved.

Total or partial reproduction of the contents of this site, as well as that of the texts available for downloading, be this in the electronic media, in print, or any other form of distribution, is expressly forbidden. Those who do not comply with these copyright laws will be indicted and punished according to the International Copyrights Law.

We do not take responsibility for material damage of any kind caused by the use of information contained in Hardware Secrets.