As mentioned, the Instruction Control Unit is the Reorder Buffer of AMD64 processors. Here the macro-ops can be picked and sent to the schedulers out-of-order, i.e., not in the same order in which the instructions appear in the program being executed. For example, suppose the program has a sequence like this:
Integer
Integer
Integer
Integer
Integer
FP
Integer
FP
The AMD64 architecture has three integer execution engines and three floating-point execution engines. Without an out-of-order execution engine, its floating-point engines would sit idle at the start of this program: the fourth instruction is also an integer instruction and can’t be executed at the same time as the first three, because all three integer execution engines are already in use. Since the processor implements out-of-order execution, the sixth instruction, which is the first FP instruction, can be sent to execution together with the first one, increasing CPU performance. In fact, since there are three FPUs, both FP instructions in this program could be dispatched at the same time. The goal of the scheduler is to keep all CPU execution engines busy all the time.
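The difference between the two behaviors can be sketched with a toy model. The code below is only an illustration under simplifying assumptions, not a description of the real hardware: every instruction takes one cycle, there are no data dependencies between instructions, and the names `UNITS`, `PROGRAM` and `schedule` are made up for this example. The only figures taken from the text are the three integer and three FP engines and the eight-instruction sequence.

```python
from collections import Counter

# Three integer and three floating-point engines, as described in the text.
UNITS = {"INT": 3, "FP": 3}

# The eight-instruction example sequence from the text.
PROGRAM = ["INT", "INT", "INT", "INT", "INT", "FP", "INT", "FP"]


def schedule(program, out_of_order):
    """Return, for each cycle, the 0-based indices of the instructions issued.

    Hypothetical toy model: one cycle per instruction, no data dependencies.
    """
    for kind in program:
        if kind not in UNITS:
            raise ValueError(f"unknown instruction kind: {kind}")

    pending = list(enumerate(program))
    cycles = []
    while pending:
        free = Counter(UNITS)          # every engine is free at the start of a cycle
        issued, waiting = [], []
        blocked = False                # only ever set in the in-order case
        for idx, kind in pending:
            if not blocked and free[kind] > 0:
                free[kind] -= 1
                issued.append(idx)
            else:
                waiting.append((idx, kind))
                if not out_of_order:
                    # In-order issue: nothing younger may pass a stalled instruction.
                    blocked = True
        cycles.append(issued)
        pending = waiting
    return cycles


if __name__ == "__main__":
    print("in-order:    ", schedule(PROGRAM, out_of_order=False))
    print("out-of-order:", schedule(PROGRAM, out_of_order=True))
```

In this model the in-order run issues only the first three integer instructions in the first cycle, leaving all three FP engines idle, while the out-of-order run also issues both FP instructions (indices 5 and 7, counting from zero) in that same first cycle.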
The reorder buffer in the AMD64 architecture has 72 entries. What is quite interesting is that each integer execution engine has its own scheduler with its own 8-entry buffer, while the FP execution units share a single 36-entry scheduler. So AMD64 has a total of four schedulers, the same number as the Pentium 4.
The reorder buffer is also in charge of register renaming. The CISC x86 architecture has only eight 32-bit general-purpose registers (EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP). This number is simply too low, especially because modern CPUs can execute code out-of-order, which could “kill” the contents of a register that an earlier instruction still needs, crashing the program.
So, at this stage, the processor maps each register used by the program onto one of the 96 internal registers available. This allows an instruction to run at the same time as another instruction that uses the exact same standard register, or even out-of-order, i.e., the second instruction can run before the first one even if both use the same register, as long as it does not need the first one's result.
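The idea of renaming can be illustrated with a small sketch. This is a generic, textbook-style rename table with a free list, written only for illustration. It is not AMD's actual mechanism, and the names `ARCH_REGS`, `Renamer` and `rename` are hypothetical. The pool size of 96 and the eight x86 register names come from the text; the sketch never returns registers to the free list, which a real design would have to do.

```python
ARCH_REGS = ["EAX", "EBX", "ECX", "EDX", "EBP", "ESI", "EDI", "ESP"]


class Renamer:
    """Toy rename table: architectural register name -> internal register index."""

    def __init__(self, num_physical=96):
        # Assumption: each architectural register starts in its own internal register.
        self.mapping = {name: i for i, name in enumerate(ARCH_REGS)}
        self.free = list(range(len(ARCH_REGS), num_physical))

    def rename(self, dest, sources=()):
        # Look up sources first, so they see the values written by older instructions.
        src_phys = [self.mapping[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free internal register (a real CPU would stall)")
        # Every write to an architectural register gets a fresh internal register.
        dest_phys = self.free.pop(0)
        self.mapping[dest] = dest_phys
        return dest_phys, src_phys


renamer = Renamer()
first = renamer.rename("EAX")                    # e.g. mov eax, [mem]
second = renamer.rename("EBX", ["EBX", "EAX"])   # e.g. add ebx, eax
third = renamer.rename("EAX")                    # e.g. mov eax, 5
```

Here the first and third instructions both write EAX, but they receive different internal registers, so the third can complete early without destroying the value that the second instruction still has to read.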
The AMD64 architecture has 96 internal registers, while the Pentium 4 has 128. Intel’s 6th-generation processors (like the Pentium II and Pentium III) had only 40 internal registers. It is interesting to note the trick AMD used in the AMD64 architecture to reach those 96 registers. They simply created a result field in each of the 72 reorder buffer entries for storing the result of each instruction (this isn’t available on the Pentium 4, which needs to allocate an internal register for storing the result each time an instruction is executed). In addition, its register file (or IFFRF, Integer Future File and Register File, as AMD calls it) has 40 entries, but since 16 of them store the “correct” value for each x86 register, those 16 cannot be used for renaming, leaving 24. So while the correct answer to “how many internal registers does the AMD64 architecture have?” is 40, the effective number is 96 (72 result fields plus 24 free register file entries) due to this architectural difference.

Figure 14: AMD64 reorder buffer and schedulers.