Pentium M Pipeline
Pipeline is a list of all stages a given instruction must go through in order to be fully executed. Intel didn’t disclosure Pentium M’s pipelines, so we will talk about Pentium III’s. Pentium M’s pipeline has probably more stages than Pentium III’s, but analyzing Pentium III’s will give you a good idea on how Pentium M’s architecture work.
Just to remember, Pentium 4 pipeline has 20 stages and the pipeline of newer Pentium 4 CPUs based on “Prescott” core has 31 stages!
In Figure 1, you can see Pentium III’s 11-stage pipeline.
Here is a basic explanation of each stage, which explains how a given instruction is processed by P6-class processors. If you think this is too complex for you, don’t worry. This is just a summary of what we will be explaining in the next pages.
- IFU1: Loads one line (32 bytes, i.e., 256 bits) from L1 instruction cache and stores it in the Instruction Streaming Buffer.
- IFU2: Identifies the instructions boundaries within 16 bytes (128 bits). Since x86 instructions don’t have a fixed length this stage marks where each instruction starts and ends within the loaded 16 bytes. If there is any branch instruction within these 16 bytes, its address is stored at the Branch Target Buffer (BTB), so the CPU can later use this information on its branch prediction circuit.
- IFU3: Marks to which instruction decoder unit each instruction must be sent. There are three different instruction decoder units, as we will explain later.
- DEC1: Decodes the x86 instruction into a RISC microinstruction (a.k.a. micro-op). Since the CPU has three instructions decode units, it is possible to decode up to three instructions at the same time.
- DEC2: Sends the micro-ops to the Decoded Instruction Queue, which is capable to store up to six micro-ops. If the instruction was converted in more than six micro-ops, this stage must be repeated in order to catch the missing micro-ops.
- RAT: Since P6 microarchitecture implements out-of-order execution (OOO), the value of a given register could be altered by an instruction executed before its “correct” (i.e., original) place in the program flow, corrupting the data needed by another instruction. So, to solve this kind of conflict, at this stage the original register used by the instruction is changed to one of the 40 internal registers that P6 microarchitecture has.
- ROB: At this stage three micro-ops are loaded into the Reorder Buffer (ROB). If all data necessary for the execution of a micro-op are available and if there is an open slot at the Reservation Station micro-op queue, then the micro-op is moved to this queue.
- DIS: If the micro-op wasn’t sent to the Reservation Station micro-op queue, this is done at this stage. The micro-op is sent to the proper execution unit.
- EX: The micro-op is executed at the proper execution unit. Usually each micro-op needs only one clock cycle to be executed.
- RET1: Checks at the Reorder Buffer if there is any micro-op that can be flagged as “executed”.
- RET2: When all micro-ops related to the previous x86 instruction were already removed from the Reorder Buffer and all micro-ops related to the current x86 instruction were executed, these micro-ops are removed from the Reorder Buffer and the x86 registers are updated (the inverse process done at RAT stage). The retirement process must be done in order. Up to three micro-ops can be removed from the Reorder Buffer per clock cycle.
Don’t worry if all this sounded confusing to you. We will explain all this better in the next pages.