Since the introduction of the P6 architecture with the Pentium Pro, Intel processors have used a hybrid CISC/RISC architecture. The processor must accept CISC instructions, also known as x86 instructions, since all PC software available today is written using this kind of instruction. A RISC-only CPU couldn’t be created for the PC because it wouldn’t run the software we have available today, like Windows and Office.
So, the solution used by all processors available on the market today from both Intel and AMD is to use a CISC/RISC decoder. Internally the CPU processes RISC-like instructions, but its front-end accepts only CISC x86 instructions.
CISC x86 instructions are referred to as “instructions”, while the internal RISC instructions are referred to as “microinstructions”, “micro-ops” or “µops”.
These RISC microinstructions, however, cannot be accessed directly, so we couldn’t create software based on these instructions to bypass the decoder. Also, each CPU uses its own RISC instructions, which are not publicly documented and are incompatible with microinstructions from other CPUs. For example, Pentium M microinstructions are different from Pentium 4 microinstructions, which are different from Athlon 64 microinstructions.
Depending on its complexity, an x86 instruction may have to be converted into several RISC microinstructions.
The Pentium M instruction decoder works as shown in Figure 3. As you can see, there are three decoders and a Micro Instruction Sequencer (MIS). Two decoders are optimized for simple instructions, which are the most used ones. This kind of instruction is converted into just one micro-op. One decoder is optimized for complex x86 instructions, which can be converted into up to four micro-ops. If the x86 instruction is too complex, i.e., it converts into more than four micro-ops, it is sent to the Micro Instruction Sequencer, which is a ROM containing a list of micro-ops that should replace the given x86 instruction.
Figure 3: Instruction Decoder and Register Renaming.
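The routing rule described above can be sketched in a few lines of Python. This is only an illustrative model: the function name `route_instruction` and the string labels are hypothetical, and the thresholds (one micro-op, up to four micro-ops, more than four) are taken from the description in this article, not from Intel documentation.

```python
def route_instruction(uop_count: int) -> str:
    """Return which front-end unit would handle an x86 instruction.

    uop_count is the number of micro-ops the instruction converts into.
    Hypothetical model: thresholds follow the description in the article.
    """
    if uop_count < 1:
        raise ValueError("an instruction converts into at least one micro-op")
    if uop_count == 1:
        return "simple decoder"    # handled by one of the two simple decoders
    if uop_count <= 4:
        return "complex decoder"   # handled by the single complex decoder
    return "MIS"                   # too complex: Micro Instruction Sequencer


for n in (1, 3, 4, 7):
    print(n, "->", route_instruction(n))
```

In this model, an instruction that converts into seven micro-ops is the only one of the four examples that would go to the MIS.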
The instruction decoder can convert up to three x86 instructions per clock cycle, one complex at Decoder 0 and two simple at Decoders 1 and 2, feeding the Decoded Instruction Queue with up to six micro-ops per clock cycle. This maximum is reached when Decoder 0 sends four micro-ops and the other two decoders send one micro-op each, or when the MIS is used. Very complex x86 instructions that use the Micro Instruction Sequencer can take several clock cycles to decode, depending on how many micro-ops the conversion generates. Keep in mind that the Decoded Instruction Queue can hold only up to six micro-ops, so if more than six micro-ops are generated by the decoders plus the MIS, another clock cycle is needed to send the micro-ops currently in the queue to the Register Allocation Table (RAT), empty the queue and accept the micro-ops that didn’t “fit” before.
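The arithmetic behind these numbers can be checked with a small Python sketch. It assumes a simplified model in which the queue is completely emptied into the RAT once per clock cycle; the names `DECODER_UOPS`, `QUEUE_CAPACITY`, `peak_uops_per_cycle` and `cycles_to_forward` are hypothetical and exist only for this illustration.

```python
import math

# Maximum micro-ops each decoder can emit per cycle, as described in the text:
# Decoder 0 (complex) up to four, Decoders 1 and 2 (simple) one each.
DECODER_UOPS = (4, 1, 1)

# Capacity of the Decoded Instruction Queue, as described in the text.
QUEUE_CAPACITY = 6


def peak_uops_per_cycle() -> int:
    """Best-case number of micro-ops the three decoders emit in one cycle."""
    return sum(DECODER_UOPS)


def cycles_to_forward(total_uops: int, capacity: int = QUEUE_CAPACITY) -> int:
    """Cycles needed to pass total_uops through the queue to the RAT.

    Assumption: the queue is fully drained once per clock cycle.
    """
    if total_uops < 0:
        raise ValueError("total_uops must be non-negative")
    return math.ceil(total_uops / capacity)


print(peak_uops_per_cycle())    # 4 + 1 + 1
print(cycles_to_forward(6))     # fits in the queue at once
print(cycles_to_forward(7))     # one micro-op has to wait for the next cycle
```

Under this assumption, seven micro-ops need two cycles: six go to the RAT first, and the one that did not fit follows in the next cycle.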
Pentium M introduces a concept that is new to the P6 architecture, called micro-op fusion. On Pentium M the decoder unit fuses two micro-ops into one. They are de-fused only at the execution stage, when they are executed.
On the P6 architecture, each microinstruction is 118 bits long. Instead of working with 118-bit micro-ops, Pentium M works with 236-bit micro-ops, which are in fact two 118-bit micro-ops.
Keep in mind that the micro-ops continue to be 118 bits long; what changed is that they are transported in groups of two.
The idea behind this approach was to save energy and increase performance. It is faster to send one 236-bit micro-op than two 118-bit micro-ops. Also, the CPU consumes less power, since fewer micro-ops circulate inside it.
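To make the "two 118-bit micro-ops travelling as one 236-bit unit" idea concrete, here is a minimal Python sketch that packs and unpacks two values. The real bit layout of a fused micro-op is not publicly documented, so placing the first micro-op in the upper half is purely an assumption, and the names `fuse` and `defuse` are hypothetical.

```python
UOP_BITS = 118                    # micro-op width given in the text
UOP_MASK = (1 << UOP_BITS) - 1    # mask selecting the low 118 bits


def fuse(first: int, second: int) -> int:
    """Pack two 118-bit micro-ops into one 236-bit value.

    Assumed layout (not documented by Intel): first in the upper half,
    second in the lower half.
    """
    for uop in (first, second):
        if not 0 <= uop <= UOP_MASK:
            raise ValueError("each micro-op must fit in 118 bits")
    return (first << UOP_BITS) | second


def defuse(fused: int) -> tuple[int, int]:
    """Split a 236-bit fused value back into its two 118-bit micro-ops."""
    if not 0 <= fused < (1 << (2 * UOP_BITS)):
        raise ValueError("fused value must fit in 236 bits")
    return fused >> UOP_BITS, fused & UOP_MASK


pair = fuse(0x1234, 0xABCD)
print(defuse(pair) == (0x1234, 0xABCD))
```

The two micro-ops keep their own width and contents; only the unit that moves through the pipeline is twice as wide, which matches the description above.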
Fused micro-ops are then sent to the Register Allocation Table (RAT). The CISC x86 architecture has only eight 32-bit registers (EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP). This number is simply too low, especially because modern CPUs can execute code out of order, which would “kill” the contents of a given register, crashing the program.
So, at this stage, the processor renames each register used by the program, moving its contents to one of the 40 internal registers available (each one is 80 bits wide, thus accepting both integer and floating-point data). This allows an instruction to run at the same time as another instruction that uses the exact same standard register, or even out of order, i.e., the second instruction can run before the first one even if they both use the same register.
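Register renaming can be illustrated with a toy Python model. It is a sketch under several assumptions: the class name `RegisterAllocationTable`, its `rename` method and the numbering of the internal registers are hypothetical, and the model never frees an internal register, which a real processor does once instructions complete.

```python
ARCH_REGS = ("EAX", "EBX", "ECX", "EDX", "EBP", "ESI", "EDI", "ESP")
NUM_INTERNAL = 40   # number of internal registers given in the text


class RegisterAllocationTable:
    """Toy renaming table: maps x86 register names to internal registers."""

    def __init__(self):
        # Assumption: each x86 register starts mapped to its own internal one.
        self.mapping = {name: i for i, name in enumerate(ARCH_REGS)}
        self.free = list(range(len(ARCH_REGS), NUM_INTERNAL))

    def rename(self, dest, sources):
        """Rename one instruction that reads `sources` and writes `dest`."""
        # Sources read whichever internal register holds the latest value.
        renamed_sources = [self.mapping[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free internal register available")
        # The destination always gets a fresh internal register.
        renamed_dest = self.free.pop(0)
        self.mapping[dest] = renamed_dest
        return renamed_dest, renamed_sources


rat = RegisterAllocationTable()
first = rat.rename("EAX", ["EBX"])    # e.g. an instruction writing EAX from EBX
second = rat.rename("EAX", ["ECX"])   # a later, unrelated write to EAX
print(first, second)
```

Both instructions write EAX, yet they receive different internal registers, so in this model neither overwrites the other's result and they could run in either order.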