Inside Pentium M Architecture
By Gabriel Torres on January 4, 2006


Introduction

In this tutorial we will explain how the Pentium M CPU works in easy-to-follow language. Since all new CPUs from Intel will use Pentium M’s architecture, studying it is very important to understand the architecture of Core Solo and Core Duo (Yonah) CPUs and also to understand the foundation of the forthcoming Core microarchitecture, to be used by Merom, Conroe and Woodcrest CPUs. You will learn exactly how this architecture works, so you will be able to compare it more precisely to other processors from Intel and to competitors from AMD.

Pentium M is based on Intel’s 6th generation architecture, a.k.a. P6, the same one used by Pentium Pro, Pentium II and Pentium III CPUs, and not on Pentium 4’s as you might think. It was originally targeted at mobile computers. You may think of Pentium M as an enhanced Pentium III. Be careful not to confuse Pentium M with Pentium 4 M or with Pentium III M, which are different CPUs. Read our tutorial All Pentium M Models to learn about all Pentium M versions released so far.

Pentium M is often called Centrino. Actually, Centrino is the name used when a laptop has a Pentium M CPU, an Intel 855 or 915 chipset and an Intel PRO/Wireless LAN adapter. So, if you have a laptop based on Pentium M but without Intel PRO/Wireless LAN, for example, it cannot be called Centrino.

In this tutorial we will basically explain how P6 architecture works and what’s new on Pentium M compared to Pentium III. So, in this tutorial you will also learn how Pentium Pro, Pentium II, Pentium III and Celeron (models based on P6 architecture, i.e., slot 1 and socket 370 ones) processors work.

In order to continue, however, you need to have read our tutorial “How a CPU Works”, which explains the basics of how a CPU works. In the present tutorial we assume that you have already read it, so if you haven’t, please take a moment to read it before continuing, otherwise you may find yourself a little bit lost. Actually, we can consider the present tutorial a sequel to our How a CPU Works tutorial. It is also a good idea to read our Inside Pentium 4 Architecture tutorial, just to understand how Pentium M differs from Pentium 4.

Before going further, let’s see the main differences between Pentium M and Pentium III CPUs:

Real Clock    Performance    Transfer Rate
100 MHz       400 MHz        3.2 GB/s
133 MHz       533 MHz        4.2 GB/s
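
The transfer rates in the table follow from the real bus clock. Here is a minimal sketch of that arithmetic, assuming the external bus is 64 bits (8 bytes) wide and carries four transfers per real clock cycle; both figures are our assumptions, since the table itself does not state them. The function name is made up for illustration.

```python
def bus_transfer_rate_mb_s(real_clock_mhz, transfers_per_clock=4, bus_width_bytes=8):
    """Peak transfer rate in MB/s (1 MB = 10**6 bytes).

    transfers_per_clock and bus_width_bytes are assumptions (quad-pumped,
    64-bit bus); they are not taken from the table.
    """
    return real_clock_mhz * transfers_per_clock * bus_width_bytes

rate_400 = bus_transfer_rate_mb_s(100)       # "400 MHz" bus
rate_533 = bus_transfer_rate_mb_s(400 / 3)   # the "133 MHz" clock is really 133.33 MHz

print(rate_400 / 1000, "GB/s")
print(round(rate_533 / 1000, 2), "GB/s")
```

Under these assumptions the second row works out to about 4.27 GB/s, which the table lists as 4.2 GB/s.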

Let’s now talk more in depth about Pentium M’s architecture.

Pentium M Pipeline

A pipeline is the list of all stages a given instruction must go through in order to be fully executed. Intel didn’t disclose Pentium M’s pipeline, so we will talk about Pentium III’s. Pentium M’s pipeline probably has more stages than Pentium III’s, but analyzing Pentium III’s will give you a good idea of how Pentium M’s architecture works.

As a reminder, the Pentium 4 pipeline has 20 stages and the pipeline of newer Pentium 4 CPUs based on the “Prescott” core has 31 stages!

In Figure 1, you can see Pentium III’s 11-stage pipeline.

Figure 1: Pentium III pipeline.

Here is a basic explanation of each stage, which explains how a given instruction is processed by P6-class processors. If you think this is too complex for you, don’t worry. This is just a summary of what we will be explaining in the next pages.

Don’t worry if all this sounded confusing to you. We will explain all this better in the next pages.

Memory Cache and Fetch Unit

As we mentioned, Pentium M’s L2 memory cache can be 1 MB (130 nm models, a.k.a. “Banias” core) or 2 MB (90 nm models, a.k.a. “Dothan” core), while it has two L1 memory caches, one of 32 KB for instructions and another of 32 KB for data.

The fetch unit is divided into three stages, as we explained in the previous page. In Figure 2, you can see how Pentium M’s fetch unit works.

Figure 2: Fetch unit.

As we mentioned before, the fetch unit loads one line (32 bytes = 256 bits) into its Instruction Streaming Buffer. Then the Instruction Length Decoder identifies the instruction boundaries within 16 bytes (128 bits). Since x86 instructions don’t have a fixed length, this stage marks where each instruction starts and ends within the loaded 128 bits. If there is any branch instruction within these 128 bits, its address is stored in the Branch Target Buffer (BTB), so the CPU can later use this information in its branch prediction circuit. The BTB has 512 entries.

Then the Decoder Alignment Stage marks to which instruction decoder unit each instruction must be sent. There are three different instruction decoder units, as we will explain in the next page.
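
To make the job of the Instruction Length Decoder more concrete, here is a minimal sketch of marking instruction boundaries inside one 16-byte window. It assumes the length of each instruction is already known, which is exactly the hard part in real hardware; the function name and the example lengths are made up for illustration.

```python
WINDOW_BYTES = 16  # the Instruction Length Decoder looks at 16 bytes (128 bits)

def mark_boundaries(instruction_lengths, window_bytes=WINDOW_BYTES):
    """Return (start, end) byte offsets of the instructions that fit
    completely inside one window, given their lengths in program order."""
    boundaries = []
    offset = 0
    for length in instruction_lengths:
        if offset + length > window_bytes:
            break  # crosses the window edge; this toy model simply stops here
        boundaries.append((offset, offset + length - 1))
        offset += length
    return boundaries

# Six variable-length instructions; only the first four fit in 16 bytes.
marks = mark_boundaries([1, 3, 5, 2, 6, 4])
print(marks)
```

Because the lengths vary, the number of instructions marked per window varies too, which is why this stage is needed before the decoders can be fed.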

Instruction Decoder and Register Renaming

Since the introduction of P6 architecture with Pentium Pro, Intel processors have used a hybrid CISC/RISC architecture. The processor must accept CISC instructions, also known as x86 instructions, since all software available today is written using this kind of instruction. A RISC-only CPU couldn’t be created for the PC because it wouldn’t run the software we have available today, like Windows and Office.

So, the solution used by all processors available on the market today from both Intel and AMD is to use a CISC/RISC decoder. Internally the CPU processes RISC-like instructions, but its front-end accepts only CISC x86 instructions.

CISC x86 instructions are referred to as “instructions”, while the internal RISC instructions are referred to as “microinstructions”, “micro-ops” or “µops”.

These RISC microinstructions, however, cannot be accessed directly, so we couldn’t create software based on them to bypass the decoder. Also, each CPU uses its own RISC instructions, which are not publicly documented and are incompatible with microinstructions from other CPUs. I.e., Pentium M microinstructions are different from Pentium 4 microinstructions, which are different from Athlon 64 microinstructions.

Depending on the complexity of the x86 instruction, it has to be converted into several RISC microinstructions.

The Pentium M instruction decoder works as shown in Figure 3. As you can see, there are three decoders and a Micro Instruction Sequencer (MIS). Two decoders are optimized for simple instructions, which are the most used ones. This kind of instruction is converted into just one micro-op. One decoder is optimized for complex x86 instructions, which can be converted into up to four micro-ops. If the x86 instruction is too complex, i.e., it converts into more than four micro-ops, it is sent to the Micro Instruction Sequencer, which is a ROM memory containing a list of micro-ops that should replace the given x86 instruction.

Figure 3: Instruction Decoder and Register Renaming.

The instruction decoder can convert up to three x86 instructions per clock cycle, one complex at Decoder 0 and two simple at decoders 1 and 2, feeding the Decoded Instruction Queue with up to six micro-ops per clock cycle. This maximum is reached when Decoder 0 sends four micro-ops and the other two decoders send one micro-op each, or when the MIS is used. Very complex x86 instructions that use the Micro Instruction Sequencer can take several clock cycles to be decoded, depending on how many micro-ops the conversion generates. Keep in mind that the Decoded Instruction Queue can hold only up to six micro-ops, so if the decoder plus MIS generate more than six micro-ops, another clock cycle is needed to send the micro-ops currently in the queue to the Register Allocation Table (RAT), empty the queue and accept the micro-ops that didn’t “fit” before.
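
The decoding rules above can be turned into a small toy model that counts decode cycles. This is only a sketch under simplifying assumptions of ours: instructions are decoded strictly in program order, decoders 1 and 2 are used only when the next instructions are single micro-op ones, and the MIS is assumed to deliver up to six micro-ops per cycle with nothing else decoded alongside it. The names are hypothetical.

```python
MAX_COMPLEX_UOPS = 4  # Decoder 0 handles instructions of up to four micro-ops
QUEUE_SLOTS = 6       # the Decoded Instruction Queue holds six micro-ops

def decode_cycles(uop_counts):
    """Count clock cycles needed to decode x86 instructions in program order.

    uop_counts[i] is the number of micro-ops instruction i converts into.
    """
    cycles = 0
    i = 0
    while i < len(uop_counts):
        n = uop_counts[i]
        if n > MAX_COMPLEX_UOPS:
            # Too complex: handled by the MIS (assumed six micro-ops per cycle).
            cycles += -(-n // QUEUE_SLOTS)  # ceiling division
            i += 1
            continue
        i += 1  # Decoder 0 takes this instruction
        for _ in range(2):  # decoders 1 and 2 accept only simple instructions
            if i < len(uop_counts) and uop_counts[i] == 1:
                i += 1
            else:
                break
        cycles += 1
    return cycles

best_case = decode_cycles([4, 1, 1])      # 4 + 1 + 1 micro-ops in one cycle
bad_order = decode_cycles([1, 4, 1, 1])   # complex instruction arrives second
microcoded = decode_cycles([10])          # goes through the MIS
print(best_case, bad_order, microcoded)
```

In this model the same mix of instructions can take more cycles when a complex instruction is not the first one in its group, because only Decoder 0 can handle it.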

Pentium M introduces a concept that is new to the P6 architecture, called micro-op fusion. On Pentium M the decoder unit fuses two micro-ops into one. They are de-fused only when it is time to execute them, at the execution stage.

On P6 architecture, each microinstruction is 118 bits long. Instead of working with 118-bit micro-ops, Pentium M works with 236-bit micro-ops that are in fact two 118-bit micro-ops.

Keep in mind that the micro-ops continue to be 118 bits long; what changed is that they are transported in groups of two.

The idea behind this approach was to save energy and increase performance. It is faster to send one 236-bit micro-op than two 118-bit micro-ops. Also, the CPU will consume less power, since fewer micro-ops will be circulating inside it.
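
As a rough illustration of "two 118-bit micro-ops carried as one 236-bit unit", the sketch below packs two integers into one wider integer and splits them again. The real micro-op encoding is not public, so the values here are just placeholders and the function names are ours.

```python
UOP_BITS = 118
UOP_MASK = (1 << UOP_BITS) - 1  # all ones in the low 118 bits

def fuse(first, second):
    """Carry two 118-bit micro-ops as a single 236-bit value."""
    if not (0 <= first <= UOP_MASK and 0 <= second <= UOP_MASK):
        raise ValueError("each micro-op must fit in 118 bits")
    return (first << UOP_BITS) | second

def defuse(fused):
    """Split a fused value back into its two original micro-ops."""
    return fused >> UOP_BITS, fused & UOP_MASK

fused = fuse(0x1234, 0xABCD)  # placeholder bit patterns, not real micro-ops
print(defuse(fused))
```

Nothing is lost or changed by the packing: the two micro-ops come out exactly as they went in, which matches the idea that fusion only changes how they are transported.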

Fused micro-ops are then sent to the Register Allocation Table (RAT). CISC x86 architecture has only eight 32-bit registers (EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP). This number is simply too low, especially because modern CPUs can execute code out-of-order, which would “kill” the contents of a given register, crashing the program.

So, at this stage, the processor maps each register used by the program to one of the 40 internal registers available (each one is 80 bits wide, thus accepting both integer and floating-point data). This allows an instruction to run at the same time as another instruction that uses the exact same standard register, or even out-of-order, i.e., the second instruction can run before the first one even if they both use the same register.
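
Register renaming can be sketched with a simple mapping table. This is a minimal model of the idea, not Intel’s implementation: the class and method names are hypothetical, internal registers are handed out in numeric order, and they are never freed (a real CPU recycles them when micro-ops retire).

```python
NUM_INTERNAL_REGISTERS = 40

class RegisterAllocationTable:
    def __init__(self):
        self.free = list(range(NUM_INTERNAL_REGISTERS))
        self.mapping = {}  # x86 register name -> internal register number

    def rename(self, dest, sources):
        """Rename one micro-op; returns (internal_dest, internal_sources).

        A source that was never written maps to None, meaning its value
        still lives in the ordinary x86 register.
        """
        internal_sources = [self.mapping.get(s) for s in sources]
        if not self.free:
            raise RuntimeError("no free internal register available")
        internal_dest = self.free.pop(0)
        self.mapping[dest] = internal_dest  # later readers of dest see this one
        return internal_dest, internal_sources

rat = RegisterAllocationTable()
first = rat.rename("EAX", ["EBX"])   # writes EAX
second = rat.rename("ECX", ["EAX"])  # reads the EAX written above
third = rat.rename("EAX", ["EDX"])   # writes EAX again, unrelated to the first
print(first, second, third)
```

The first and third micro-ops both write EAX but receive different internal registers, so the third one can run early without destroying the value the second one still needs.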

Reorder Buffer

So far the x86 instructions and the micro-ops resulting from them have been transferred between the CPU stages in the same order they appear in the program being run.

Arriving at the Reorder Buffer (ROB), micro-ops can be loaded and executed out-of-order by the execution units. After being executed, the micro-ops are sent back to the Reorder Buffer. Then at the Retirement stage, executed micro-ops are pulled out of the Reorder Buffer in the same order they entered it, i.e., they are removed in order. Figure 4 gives you a better idea of how this works.

Figure 4: How the Reorder Buffer works.

In Figure 4 we simplified the Reservation Station and the execution units for a better understanding of the Reorder Buffer. We will talk about these two stages in depth in the next page.

Reservation Station and Execution Units

As we mentioned before, Pentium M uses fused micro-ops (i.e., carries two micro-ops together) from the Decode Unit up to the dispatch ports located on the Reservation Station. The Reservation Station dispatches each micro-op individually (defused).

Pentium M has five dispatch ports numbered 0 through 4 located on its Reservation Station. Each port is connected to one or more execution units, as you can see in Figure 5.

Figure 5: Reservation Station and execution units.

Here is a small explanation of each execution unit found on this CPU:

Keep in mind that complex instructions may take several clock cycles to be processed. Let’s take an example of port 0, where the floating point unit (FPU) is located. While this unit is processing a very complex instruction that takes several clock ticks to be executed, port 0 won’t stall: it will keep sending simple instructions to the IEU while the FPU is busy.

So, even though the maximum dispatch rate is five microinstructions per clock cycle, the CPU can actually have up to twelve microinstructions being processed at the same time.

As we mentioned, on instructions that ask the CPU to store data at a given RAM memory address, the Store Address Unit and the Store Data Unit are used at the same time, one for calculating the address and the other for writing the data.

Actually, that’s why ports 0 and 1 have more than one execution unit attached. If you pay attention, Intel put on the same port one fast unit together with at least one complex (and slow) unit. So, while the complex unit is busy processing data, the other unit can keep receiving microinstructions from its corresponding dispatch port. As we mentioned before, the idea is to keep all execution units busy all the time.

As we explained, after each micro-op is executed, it returns to the Reorder Buffer, where its flag is set to “executed”. Then at the Retirement Stage the micro-ops that have their “executed” flag on are removed from the Reorder Buffer in their original order (i.e., the order they were decoded) and then the x86 registers are updated (the inverse step of the register renaming stage). Up to three micro-ops can be removed from the Reorder Buffer per clock cycle. After this, the instruction is fully executed.
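
The interplay between out-of-order execution and in-order retirement can be sketched as a queue with an “executed” flag per entry. This is a minimal model with hypothetical names; it ignores the register update step and everything else a real Reorder Buffer tracks.

```python
from collections import deque

RETIRE_WIDTH = 3  # up to three micro-ops leave the Reorder Buffer per clock

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest micro-op on the left

    def add(self, name):
        """Micro-ops enter in program order, not yet executed."""
        self.entries.append({"name": name, "executed": False})

    def mark_executed(self, name):
        """Execution units may finish micro-ops in any order."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["executed"] = True
                return
        raise KeyError(name)

    def retire(self):
        """One clock cycle of retirement; returns names retired, oldest first."""
        retired = []
        while (self.entries and self.entries[0]["executed"]
               and len(retired) < RETIRE_WIDTH):
            retired.append(self.entries.popleft()["name"])
        return retired

rob = ReorderBuffer()
for name in ["A", "B", "C", "D"]:
    rob.add(name)

rob.mark_executed("C")  # C and B finish before A (out-of-order)
rob.mark_executed("B")
stalled = rob.retire()  # nothing retires: the oldest micro-op, A, is not done

rob.mark_executed("A")
rob.mark_executed("D")
cycle1 = rob.retire()   # A, B and C retire together, in program order
cycle2 = rob.retire()   # D has to wait for the next clock cycle
print(stalled, cycle1, cycle2)
```

Even though B and C finished first, nothing leaves the buffer until A is done, and the fourth micro-op needs a second cycle because of the three-per-clock limit.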

Enhanced SpeedStep Technology

SpeedStep Technology was created to increase battery life and was first introduced with the Pentium III M processor. This first version of SpeedStep Technology allowed the CPU to switch between two clock frequencies on the fly: Low Frequency Mode (LFM), which maximized battery life, and High Frequency Mode (HFM), which allowed you to run your CPU at its maximum speed. The CPU had two clock multiplier ratios and what it did was change the ratio it was using. The LFM ratio was factory-locked and you couldn’t change it.

Pentium M introduced Enhanced SpeedStep Technology, which goes beyond that, by having several other clock and voltage configurations between LFM (which is fixed at 600 MHz) and HFM (which is the CPU full clock).

Just to give you a real example, the clock/voltage configuration table for a 1.6 GHz Pentium M based on 130 nm technology is the following:

Voltage     Clock
1.484 V     1.6 GHz
1.42 V      1.4 GHz
1.276 V     1.2 GHz
1.164 V     1 GHz
1.036 V     800 MHz
0.956 V     600 MHz

Each Pentium M model has its own voltage/clock table. It is very interesting to notice that it is not only about lowering the clock rate when you don’t need so much processing power from your laptop, but also about lowering its voltage, which helps a lot to lower battery consumption.
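
To see why lowering the voltage matters so much, here is a rough estimate using the common textbook approximation that dynamic power is proportional to voltage squared times clock frequency. This approximation is our assumption, not something Intel publishes for this table, and it ignores leakage and everything outside the CPU core.

```python
# (voltage in volts, clock in MHz) for the 1.6 GHz, 130 nm Pentium M above
P_STATES = [
    (1.484, 1600),
    (1.420, 1400),
    (1.276, 1200),
    (1.164, 1000),
    (1.036, 800),
    (0.956, 600),
]

def relative_dynamic_power(voltage, clock_mhz, ref=P_STATES[0]):
    """Dynamic power relative to the reference state, assuming P ~ V^2 * f."""
    ref_voltage, ref_clock = ref
    return (voltage / ref_voltage) ** 2 * (clock_mhz / ref_clock)

for voltage, clock in P_STATES:
    share = relative_dynamic_power(voltage, clock)
    print(f"{clock:>5} MHz at {voltage:.3f} V: {share:.0%} of full-speed dynamic power")

lfm_power = relative_dynamic_power(*P_STATES[-1])  # clock and voltage lowered
clock_only = P_STATES[-1][1] / P_STATES[0][1]      # clock lowered, voltage kept
```

Under this approximation, dropping to 600 MHz without touching the voltage would leave 37.5% of the dynamic power, while also dropping to 0.956 V leaves roughly 16%.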

Enhanced SpeedStep Technology works by monitoring specific MSRs (Model Specific Registers) of the CPU called Performance Counters. With this information, the CPU can lower or raise its clock/voltage depending on CPU usage. Simply put, if CPU usage increases, the CPU raises its voltage/clock; if CPU usage drops, it lowers them.

Enhanced SpeedStep was just one of several enhancements made to the Pentium M microarchitecture in order to increase battery life.

A good example is what was done to the execution units. On other processors, the same power line feeds all execution units, so it is not possible to turn off an idle execution unit on Pentium 4, for example. On Pentium M, execution units have separate power lines, making the CPU capable of turning off idle execution units. For example, Pentium M detects in advance if a given instruction is an integer one (“regular instruction”) and disables the units and datapaths not needed to process that instruction, provided they are idle, of course.
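
The "more usage, higher clock; less usage, lower clock" behavior can be sketched as a simple selection policy. The policy below is purely hypothetical: the 80% target, the function name and the way usage is measured are our assumptions for illustration, and the real decision logic is not described in this tutorial.

```python
CLOCKS_MHZ = [600, 800, 1000, 1200, 1400, 1600]  # steps of the 1.6 GHz model

def pick_clock(current_mhz, utilization, target=0.8):
    """Pick the lowest clock that would keep CPU usage at or below target.

    utilization is the fraction of time (0.0 to 1.0) the CPU was busy
    while running at current_mhz. The target value is an assumption.
    """
    demand_mhz = current_mhz * utilization  # cycles per microsecond actually used
    for clock in CLOCKS_MHZ:
        if demand_mhz <= clock * target:
            return clock
    return CLOCKS_MHZ[-1]  # demand exceeds every step: run at full clock

idle_choice = pick_clock(1600, 0.10)  # mostly idle at full speed
busy_choice = pick_clock(600, 1.00)   # saturated at the lowest step
peak_choice = pick_clock(1600, 0.95)  # heavy load at full speed
print(idle_choice, busy_choice, peak_choice)
```

Each clock step would be paired with its own voltage from the model’s table, so moving down a step lowers both at once.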

Originally at http://www.hardwaresecrets.com/article/Inside-Pentium-M-Architecture/270


© 2004-14, Hardware Secrets, LLC. All Rights Reserved.
