Inside Pentium 4 Architecture

[nextpage title=”Introduction”]

In this tutorial we will explain how the Pentium 4 works in easy-to-follow language. You will learn exactly how its architecture works, so you will be able to compare it more precisely to previous processors from Intel and to competitors from AMD.

Pentium 4 and new Celeron processors use Intel’s seventh generation architecture, also called Netburst. You can see its overall layout in Figure 1. Don’t get scared. We will explain in depth what this diagram is about.

In order to continue, however, you need to have read our tutorial “How a CPU Works”, where we explain the basics of how a CPU works. In the present tutorial we assume that you have already read it, so if you haven’t, please take a moment to read it before continuing; otherwise you may find yourself a little bit lost. In fact, the present tutorial can be considered a sequel to our How a CPU Works tutorial.

Figure 1: Pentium 4 block diagram.

Here are the basic differences between the Pentium 4 architecture and the architecture of other CPUs:

Externally, Pentium 4 transfers four pieces of data per clock cycle. This technique is called QDR (Quad Data Rate) and makes the local bus perform as if it ran at four times its actual clock rate (see the table below). In Figure 1 this is shown as “3.2 GB/s System Interface”; since this slide was produced when the very first Pentium 4 was released, it mentions the “400 MHz” system bus.

Real Clock	Effective Clock	Transfer Rate
100 MHz	400 MHz	3.2 GB/s
133 MHz	533 MHz	4.2 GB/s
200 MHz	800 MHz	6.4 GB/s
266 MHz	1,066 MHz	8.5 GB/s

The datapath between the L2 memory cache (“L2 cache and control” in Figure 1) and the L1 data cache (“L1 D-Cache and D-TLB” in Figure 1) is 256 bits wide. On previous processors from Intel this datapath was only 64 bits wide, so this communication can be four times faster than on processors from previous generations running at the same clock. The datapath between the L2 memory cache and the pre-fetch unit (“BTB & I-TLB” in Figure 1), however, is still 64 bits wide.
The L1 instruction cache was relocated. Instead of sitting before the fetch unit, it now sits after the decode unit, under a new name, “Trace Cache”. This trace cache can hold up to 12 K microinstructions. Since each microinstruction is 100 bits wide, the trace cache holds 150 KB (12 K x 100 / 8). One of the most common mistakes people make when commenting on the Pentium 4 architecture is saying that Pentium 4 doesn’t have any instruction cache at all. That’s absolutely not true. It is there, but with a different name and in a different location.
Pentium 4 has 128 internal registers, while Intel’s 6th generation processors (like Pentium II and Pentium III) had only 40. These registers are in the Register Renaming Unit (a.k.a. RAT, Register Alias Table, shown as “Rename/Alloc” in Figure 1).
Pentium 4 has five execution units working in parallel and two units for loading data from and storing data to RAM.
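
The bus figures in the table above can be reproduced with simple arithmetic. The sketch below assumes a 64-bit (8-byte) data bus, which this tutorial does not state explicitly, and the function name is ours. Only the 100 MHz and 200 MHz rows are computed, since the other two rows are based on rounded clock values.

```python
# Sketch: effective clock and peak transfer rate of a quad-pumped (QDR) bus.
# Assumption: the data bus is 64 bits (8 bytes) wide.

BUS_WIDTH_BYTES = 8       # assumed 64-bit data bus
TRANSFERS_PER_CLOCK = 4   # QDR: four transfers per clock cycle


def qdr_bus(real_clock_mhz):
    """Return (effective clock in MHz, peak transfer rate in GB/s)."""
    effective_mhz = real_clock_mhz * TRANSFERS_PER_CLOCK
    gb_per_s = effective_mhz * BUS_WIDTH_BYTES / 1000  # 1 GB taken as 1000 MB
    return effective_mhz, gb_per_s


for clock in (100, 200):
    effective, rate = qdr_bus(clock)
    print(f"{clock} MHz real -> {effective} MHz effective, {rate:.1f} GB/s")
```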

Of course this is just a summary for those who already have some knowledge of the architecture of other processors. If all this looks like Greek to you, don’t worry. We will explain everything you need to know about the Pentium 4 architecture in easy-to-follow language in the next pages.

[nextpage title=”Pentium 4 Pipeline”]

A pipeline is the list of all stages a given instruction must go through in order to be fully executed. On 6th generation Intel processors, like Pentium III, the pipeline had 11 stages. Pentium 4 has 20 stages! So, on a Pentium 4 processor a given instruction takes many more clock cycles to be executed than on a Pentium III, for instance! If you take the new 90 nm Pentium 4 generation processors, codenamed “Prescott”, the case is even worse because they use a 31-stage pipeline! Holy cow!

This was done in order to increase the processor clock rate. By having more stages, each individual stage can be constructed using fewer transistors. With fewer transistors it is easier to achieve higher clock rates. In fact, Pentium 4 is only faster than Pentium III because it works at a higher clock rate. At the same clock rate, a Pentium III CPU would be faster than a Pentium 4 because of its shorter pipeline.
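
To put rough numbers on this trade-off, here is a minimal sketch. It assumes one stage per clock cycle and uses hypothetical clock rates of 1 GHz and 2 GHz. It only measures how long one instruction takes to travel through the whole pipeline, a cost that is paid every time the pipeline has to be refilled (after a mispredicted branch, for example). It is not a model of overall performance.

```python
# Sketch: time for one instruction to travel through the whole pipeline.
# Assumptions: one stage per clock cycle; clock rates are hypothetical.

def pipeline_latency_ns(stages, clock_ghz):
    """Latency in nanoseconds (one cycle lasts 1 / clock_ghz ns)."""
    return stages / clock_ghz


same_clock = 1.0  # GHz, hypothetical
print(pipeline_latency_ns(11, same_clock))  # 11-stage pipeline: 11.0 ns
print(pipeline_latency_ns(20, same_clock))  # 20-stage pipeline: 20.0 ns

# Only at a higher clock rate does the longer pipeline catch up:
print(pipeline_latency_ns(20, 2.0))         # 10.0 ns
```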

Because of that, Intel has already announced that their 8th generation processors will use the Pentium M architecture, which is based on Intel’s 6th generation architecture (Pentium III architecture) and not on the Netburst (Pentium 4) architecture. This architecture, called Core, can be studied in our Inside Core Microarchitecture tutorial.

In Figure 2, you can see Pentium 4’s 20-stage pipeline. So far Intel hasn’t disclosed Prescott’s 31-stage pipeline, so we can’t talk about it.

Figure 2: Pentium 4 pipeline.

Here is a basic explanation of each stage, which explains how a given instruction is processed by Pentium 4 processors. If you think this is too complex for you, don’t worry. This is just a summary of what we will be explaining in the next pages.

TC Nxt IP: Trace cache next instruction pointer. This stage looks at the branch target buffer (BTB) for the next microinstruction to be executed. This step takes two stages.
TC Fetch: Trace cache fetch. Loads this microinstruction from the trace cache. This step takes two stages.
Drive: Sends the microinstruction to be processed to the resource allocator and register renaming circuit.
Alloc: Allocate. Checks which CPU resources will be needed by the microinstruction – for example, the memory load and store buffers.
Rename: Each of the eight standard x86 registers used by the program is renamed to one of the 128 internal registers present on Pentium 4. This step takes two stages.
Que: Queue. The microinstructions are put in queues according to their types (for example, integer or floating point). They are held in the queue until there is an open slot of the same type in the scheduler.
Sch: Schedule. Microinstructions are scheduled to be executed according to their type (integer, floating point, etc). Before arriving at this stage, all instructions are in order, i.e., in the same order they appear in the program. At this stage, the scheduler re-orders the instructions in order to keep all execution units full. For example, if a floating point unit is about to become available, the scheduler will look for a floating point instruction to send to this unit, even if the next instruction in the program is an integer one. The scheduler is the heart of the out-of-order engine of Intel 7th generation processors. This step takes three stages.
Disp: Dispatch. Sends the microinstructions to their corresponding execution engines. This step takes two stages.
RF: Register file. The internal registers, stored in the instruction pool, are read. This step takes two stages.
Ex: Execute. Microinstructions are executed.
Flgs: Flags. The microprocessor flags are updated.
Br Ck: Branch check. Checks whether the branch taken by the program is the same as the one predicted by the branch prediction circuit.
Drive: Sends the results of this check to the branch target buffer (BTB) located at the processor’s entrance.
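
As a quick check, the steps listed above do add up to 20 pipeline stages. The sketch below simply encodes the list (steps with no stage count mentioned are taken as one stage each) and sums it.

```python
# Sketch: the 13 steps listed above and how many pipeline stages each takes.
# Steps for which the text gives no count are assumed to take one stage.

STEPS = [
    ("TC Nxt IP", 2),
    ("TC Fetch", 2),
    ("Drive", 1),
    ("Alloc", 1),
    ("Rename", 2),
    ("Que", 1),
    ("Sch", 3),
    ("Disp", 2),
    ("RF", 2),
    ("Ex", 1),
    ("Flgs", 1),
    ("Br Ck", 1),
    ("Drive", 1),
]

total_stages = sum(count for _, count in STEPS)
print(total_stages)  # 20
```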

[nextpage title=”Memory Cache and Fetch Unit”]

Pentium 4’s L2 memory cache can be 256 KB, 512 KB, 1 MB or 2 MB, depending on the model. The L1 data cache is 8 KB or 16 KB (on 90 nm models).

As we explained before, the L1 instruction cache was moved from before the fetch unit to after the decode unit and given a new name, “trace cache”. So, instead of storing program instructions to be loaded by the fetch unit, the trace cache stores microinstructions already decoded by the decode unit. The trace cache can store up to 12 K microinstructions and, since Pentium 4 microinstructions are 100 bits wide, the trace cache holds 150 KB (12,288 x 100 / 8).
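
The 150 KB figure can be worked out step by step. The sketch below uses only the two numbers given above (12 K entries, 100 bits each); the variable names are ours.

```python
# Sketch: trace cache capacity from the figures quoted in the text.

MICROINSTRUCTIONS = 12 * 1024       # 12 K entries = 12,288
BITS_PER_MICROINSTRUCTION = 100

size_bytes = MICROINSTRUCTIONS * BITS_PER_MICROINSTRUCTION // 8
size_kb = size_bytes / 1024

print(size_bytes)  # 153600
print(size_kb)     # 150.0
```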

The idea behind this architecture is really interesting. In the case of a loop in the program (a loop is a part of a program that needs to be repeated several times), the instructions to be executed will already be decoded, because they are stored in decoded form in the trace cache. On other processors, the instructions need to be loaded from the L1 instruction cache and decoded again, even if they were decoded a few moments before.
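
The benefit for loops can be illustrated with a toy model that counts how many times the decoder has to work. This is only a sketch: the trace cache is modeled as a dictionary with unlimited capacity, and the addresses and instruction names are made up for the example.

```python
# Toy model: how often the decoder runs for a loop, with and without a
# cache of already-decoded microinstructions. Not a model of real hardware.

def count_decodes(loop_body, iterations, use_trace_cache):
    """Return the number of times an instruction had to be decoded."""
    trace_cache = {}  # address -> decoded microinstructions (toy model)
    decodes = 0
    for _ in range(iterations):
        for address, instruction in loop_body:
            if use_trace_cache and address in trace_cache:
                continue  # already decoded: reuse the stored microinstructions
            decodes += 1
            if use_trace_cache:
                trace_cache[address] = f"uops({instruction})"
    return decodes


# Hypothetical four-instruction loop body, repeated 100 times.
body = [(0x10, "mov"), (0x12, "add"), (0x14, "cmp"), (0x16, "jne")]
print(count_decodes(body, 100, use_trace_cache=False))  # 400
print(count_decodes(body, 100, use_trace_cache=True))   # 4
```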

The trace cache also has its own BTB (Branch Target Buffer) with 512 entries. The BTB is a small memory that lists all branches identified in the program.

As for the fetch unit, its BTB was increased to 4,096 entries. On Intel 6th generation processors, like Pentium III, this buffer had 512 entries and on Intel 5th generation processors, like the first Pentium processor, it had only 256 entries.

In Figure 3 you see the block diagram for what we were discussing. TLB means Translation Lookaside Buffer.

Figure 3: Fetch and decode units and trace cache.

[nextpage title=”Decoder”]

Since the previous generation (6th generation), Intel processors have used a hybrid CISC/RISC architecture. The processor must accept CISC instructions, also known as x86 instructions, since all software available today is written using this kind of instruction. A RISC-only CPU couldn’t be created for the PC because it wouldn’t run the software we have available today, like Windows and Office.

So, the solution used by all processors available on the market today from both Intel and AMD is to use a CISC/RISC decoder. Internally the CPU processes RISC-like instructions, but its front-end accepts only CISC x86 instructions.

CISC x86 instructions are referred to as “instructions”, while the internal RISC instructions are referred to as “microinstructions” or “µops”.

These RISC microinstructions, however, cannot be accessed directly, so we couldn’t create software based on these instructions to bypass the decoder. Also, each CPU uses its own RISC instructions, which are not publicly documented and are incompatible with microinstructions from other CPUs. That is, Pentium III microinstructions are different from Pentium 4 microinstructions, which are different from Athlon 64 microinstructions.

Depending on its complexity, an x86 instruction may have to be converted into several RISC microinstructions.

The Pentium 4 decoder can decode one x86 instruction per clock cycle, as long as the instruction decodes into at most four microinstructions. If the x86 instruction to be decoded is complex and translates into more than four microinstructions, it is routed to a ROM memory (“Microcode ROM” in Figure 3) that has a list of all complex instructions and how they should be translated. This ROM memory is also called MIS (Microcode Instruction Sequencer).
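
The routing rule described above can be sketched as a tiny function. The function and constant names are ours, and the microinstruction counts passed in are arbitrary examples, not figures for any real x86 instruction.

```python
# Sketch of the routing rule: up to four microinstructions go through the
# decoder, anything longer is handled by the microcode ROM.

MAX_UOPS_FROM_DECODER = 4


def decode_path(uop_count):
    """Return which unit translates an instruction of the given size."""
    if uop_count < 1:
        raise ValueError("an instruction decodes into at least one microinstruction")
    if uop_count <= MAX_UOPS_FROM_DECODER:
        return "decoder"
    return "microcode ROM"


for count in (1, 4, 5, 12):  # arbitrary example sizes
    print(count, "->", decode_path(count))
```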

As we said earlier, after being decoded, microinstructions are sent to the trace cache, and from there they go to a microinstruction queue. The trace cache can put up to three microinstructions on the queue per clock cycle; however, Intel doesn’t disclose the depth (size) of this queue.

From there, the microinstructions go to the Allocator and Register Renamer. The queue can also deliver up to three microinstructions per clock cycle to the allocator.

[nextpage title=”Allocator and Register Renamer”]

What the allocator does:

Reserves one of the 126 reorder buffer (ROB) entries for the current microinstruction, in order to keep track of its completion status. This allows the microinstructions to be executed out-of-order, since the CPU will be able to put them in order again by using this table.
Reserves one of the 128 register file (RF) entries to store the data resulting from the microinstruction’s processing.
If the microinstruction is a load or a store, i.e., it will read (load) or write (store) data from/to RAM memory, it will reserve one of the 48 load buffers or one of the 24 store buffers accordingly.
Reserves an entry in the memory or general queue, depending on the kind of microinstruction it is.

After that the microinstruction goes to the register renaming stage. The CISC x86 architecture has only eight 32-bit registers (EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP). This number is simply too low, especially because modern CPUs can execute code out-of-order, which would “kill” the contents of a given register, crashing the program.

So, at this stage, the processor maps each register used by the program to one of the 128 internal registers available. This allows an instruction to run at the same time as another instruction that uses the exact same standard register, or even out-of-order, i.e., the second instruction can run before the first one even if they both use the same register.
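
A minimal sketch of the renaming idea follows. It is a toy model under stated assumptions: the first eight internal registers start out holding the eight x86 registers, every write gets a fresh internal register, and internal registers are never recycled (a real processor reuses them). The class and method names are ours, and the three instructions in the comments are made up for the example.

```python
# Toy register renamer. Assumptions: 128 internal registers, the first eight
# initially hold the x86 registers, and internal registers are never freed.

ARCH_REGS = ["EAX", "EBX", "ECX", "EDX", "EBP", "ESI", "EDI", "ESP"]


class Renamer:
    def __init__(self, internal_count=128):
        # Alias table: x86 register name -> internal register with its latest value.
        self.alias = {name: i for i, name in enumerate(ARCH_REGS)}
        self.free = list(range(len(ARCH_REGS), internal_count))

    def rename(self, dest, sources):
        """Return (internal destination, internal sources) for one microinstruction."""
        # Look up the sources before remapping the destination, so that
        # "add EBX, EAX" still reads the old value of EBX.
        internal_sources = [self.alias[s] for s in sources]
        internal_dest = self.free.pop(0)
        self.alias[dest] = internal_dest
        return internal_dest, internal_sources


r = Renamer()
first = r.rename("EAX", [])               # mov EAX, [mem]
second = r.rename("EBX", ["EBX", "EAX"])  # add EBX, EAX
third = r.rename("EAX", [])               # mov EAX, 5
print(first, second, third)
```

In this run the third instruction writes to a different internal register than the one the second instruction reads, so it could execute early without destroying the value the second instruction still needs.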

It is interesting to note that Pentium 4 actually has 256 internal registers: 128 registers for integer instructions and 128 registers for floating point and SSE instructions.

The Pentium 4 renamer is capable of processing three microinstructions per clock cycle.

From the renamer the microinstructions go to a queue, according to their type: the memory queue, for memory-related microinstructions, or the Integer/Floating Point Queue, for all other microinstruction types.

Figure 4: Allocator and Register Renamer.

[nextpage title=”Scheduler”]

The scheduler has four units. It analyses each microinstruction and puts it on a scheduler unit according to its type:

Memory scheduler unit: for memory-related microinstructions. These are the microinstructions that came from the memory microinstruction queue.
Fast scheduler unit: this unit is for simple microinstructions.
Slow / General FP scheduler unit: this unit is for other microinstructions and for complex floating point microinstructions.
Simple FP scheduler unit: this unit is for simple floating point microinstructions.

So, the scheduler sorts the microinstructions according to their type. Then the scheduler can dispatch each microinstruction directly to the correct execution unit to be processed.

The scheduler is the heart of the out-of-order engine on Pentium 4. Until now all microinstructions were delivered in the same order they were decoded. From the scheduler, the microinstructions can be dispatched to the execution engines in a totally different order. The goal of the scheduler is to keep all CPU execution engines busy all the time.
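
The reordering idea can be sketched in a few lines. This toy model only matches the type of a waiting microinstruction against the type of the unit that became free; it ignores data dependencies, which a real scheduler must also respect. The microinstruction names and the "int"/"fp" labels are made up for the example.

```python
# Toy out-of-order pick: choose the oldest waiting microinstruction whose
# type matches the execution unit that just became free.
# Assumption: data dependencies between microinstructions are ignored.

def pick_next(waiting, free_unit_type):
    """Return the index of the oldest matching microinstruction, or None."""
    for index, (_name, kind) in enumerate(waiting):
        if kind == free_unit_type:
            return index
    return None


# Microinstructions in program order (hypothetical).
waiting = [("add", "int"), ("sub", "int"), ("fmul", "fp"), ("and", "int")]

index = pick_next(waiting, "fp")
print(index)           # 2
print(waiting[index])  # ('fmul', 'fp') is sent ahead of the older integer ones
```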

The execution units are connected to the scheduler through four dispatch ports, numbered 0 through 3, as you can see in Figure 5.

Figure 5: Scheduler units.

[nextpage title=”Dispatch and Execution Units”]

As we’ve seen, Pentium 4 has four dispatch ports numbered 0 through 3. Each port is connected to one, two or three execution units, as you can see in Figure 6.

Figure 6: Dispatch and execution units.

The units marked as “clock x2” can execute two microinstructions per clock cycle. Ports 0 and 1 can send two microinstructions per clock cycle to these units. So the maximum number of microinstructions that can be dispatched per clock cycle is six:

Two microinstructions on port 0;
Two microinstructions on port 1;
One microinstruction on port 2;
One microinstruction on port 3.
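
The total of six is simply the sum of the per-port limits listed above, as this short sketch shows (the dictionary name is ours).

```python
# Sketch: maximum microinstructions dispatched per clock cycle, per port.

MAX_PER_CLOCK = {
    0: 2,  # port 0 feeds a "clock x2" unit
    1: 2,  # port 1 feeds a "clock x2" unit
    2: 1,
    3: 1,
}

max_dispatch = sum(MAX_PER_CLOCK.values())
print(max_dispatch)  # 6
```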

Keep in mind that complex instructions may take several clock cycles to be processed. Let’s take port 1 as an example, where the complete floating point unit is located. While this unit is processing a very complex instruction that takes several clock ticks to be executed, the port 1 dispatch unit won’t stall: it will keep sending simple instructions to the ALU (Arithmetic and Logic Unit) while the FPU is busy.

So, even though the maximum dispatch rate is six microinstructions per clock cycle, the CPU can actually have up to seven microinstructions being processed at the same time.

Actually that’s why ports 0 and 1 have more than one execution unit attached. If you pay attention, Intel put on the same port one fast unit together with at least one complex (and slow) unit. So, while the complex unit is busy processing data, the other unit can keep receiving microinstructions from its corresponding dispatch port. As we mentioned before, the idea is to keep all execution units busy all the time.

The two double-speed ALUs can process two microinstructions per clock cycle. The other units need at least one clock cycle to process the microinstructions they receive. So, Pentium 4 architecture is optimized for simple instructions.

As you can see in Figure 6, dispatch ports 2 and 3 are dedicated to memory operations: load (read data from memory) and store (write data to memory), respectively. As for memory operations, it is interesting to note that port 0 is also used during store operations (see Figure 5 and the list of operations in Figure 6). In such operations, port 3 is used to send the memory address, while port 0 is used to send the data to be stored at this address. This data can be generated by either the ALU or the FPU, depending on the kind of data to be stored (integer or floating point/SSE).

In Figure 6 you have a complete list of the kinds of instructions each execution unit deals with. FXCH and LEA (Load Effective Address) are two x86 instructions. Actually, Intel’s implementation of the FXCH instruction on Pentium 4 caused a great deal of surprise among experts, because on processors from the previous generation (Pentium III) and on processors from AMD this instruction can be executed in zero clock cycles, while on Pentium 4 it takes some clock cycles to be executed.

That’s it. If you want to compare Pentium 4 architecture to Athlon 64’s, read our Inside AMD64 Architecture tutorial.
