Inside Intel Nehalem Microarchitecture
By Gabriel Torres on August 26, 2008 Page 7 of 7

Other Features

Now that we have covered all the main features brought by the new Nehalem core, we are going to explain a little more about two other important features: Hyper-Threading and the optimization done to deal with unaligned SSE instructions.

Hyper-Threading technology allows each CPU core to be recognized as two CPUs. Thus, if you have a Core i7 with four cores, the operating system will recognize it as having eight logical CPUs. This technology is based on the fact that when a CPU core is running, certain circuits inside it are idle and can be used to run a second thread. Originally released for the Pentium 4, this is the first time the technology is available on a 6th generation Intel CPU. It is also called Simultaneous Multi-Threading (SMT). It does not provide the same performance gain as “real” CPU cores would (i.e., a CPU with eight cores is faster than a CPU with four cores and Hyper-Threading, provided that both run at the same clock rate and are based on the same architecture); however, you are gaining these extra “CPU cores” for free.
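As a rough illustration of what "recognized as two CPUs" means in practice, the sketch below asks the operating system how many logical processors it sees. It assumes a POSIX-like system (Linux, macOS, BSD) where sysconf() accepts _SC_NPROCESSORS_ONLN, which is a widely supported extension and not something taken from Intel's material. The function name logical_cpu_count is made up for this example.

```c
#include <unistd.h>

/* Returns the number of logical processors currently online, as seen by
   the operating system, or -1 if the value cannot be determined.
   Assumption: the platform supports _SC_NPROCESSORS_ONLN. */
long logical_cpu_count(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

On a four-core Core i7 with Hyper-Threading enabled, this would be expected to return 8, since the operating system counts logical CPUs and not physical cores.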

There are two kinds of SSE instructions that access memory: aligned and unaligned (also called misaligned). Aligned instructions require the requested data to start on a 16-byte (128-bit) address boundary, while unaligned instructions don’t. See Figure 9 for an illustration.

Figure 9: Aligned vs. unaligned instructions.

OK, we know this sounds cryptic, so let’s translate it into plain English.

Imagine a system with dual-channel memory. The memory controller accesses the memory 128 bits at a time, so the memory is divided into 128-bit (16-byte) blocks. Ideally, the address you request is the first address of a block, so that a single 128-bit read (or write) gets you everything you want in just one request. This is the aligned request shown at the top of Figure 9.
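The idea of an address sitting at the start of a 16-byte block can be written as simple arithmetic. The sketch below is only an illustration of the simplified model used in this article; BLOCK_SIZE, is_aligned16 and block_index are names invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 16u  /* 128 bits, the block size used in the example */

/* True when addr is the first byte of a 16-byte block. */
bool is_aligned16(uintptr_t addr)
{
    return (addr % BLOCK_SIZE) == 0;
}

/* Index of the 16-byte block that contains addr. */
uintptr_t block_index(uintptr_t addr)
{
    return addr / BLOCK_SIZE;
}
```

For instance, address 0x1000 is aligned, while 0x1008 falls in the middle of block 0x100.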

But suppose that you issue a command to read data from memory and, instead of using the first address of a block, you ask for an address in the middle of the block. Since you are requesting 128 bits of data, half of the data will be in the first block and the other half will be in the next block, as shown at the bottom of Figure 9. Because the data you requested is split across two different blocks, the memory controller has to read two memory blocks, not just one as in the previous example. On the first read you get back half of the data you want, and on the second read you get the remainder.
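To put numbers on the split described above, the following sketch counts how many 16-byte blocks a request touches and how many of its bytes land in the first block. It follows the simplified block model of this article and is not a description of any real memory controller; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16u  /* 128-bit blocks, as in the example above */

/* Number of 16-byte blocks touched by a request of `size` bytes
   starting at addr. size must be greater than zero. */
size_t blocks_touched(uintptr_t addr, size_t size)
{
    uintptr_t first = addr / BLOCK_SIZE;
    uintptr_t last = (addr + size - 1) / BLOCK_SIZE;
    return (size_t)(last - first + 1);
}

/* How many bytes of the request fall inside the first block it touches. */
size_t bytes_in_first_block(uintptr_t addr, size_t size)
{
    size_t room = BLOCK_SIZE - (size_t)(addr % BLOCK_SIZE);
    return size < room ? size : room;
}
```

A 16-byte request at 0x1000 touches one block, while the same request at 0x1008 touches two, with 8 bytes coming from each.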

Although aligned requests are more efficient, they are more difficult for programmers, who need to know how their data is laid out in memory. Because of that, most programmers end up using only unaligned instructions.
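To give an idea of the extra work alignment asks from the programmer, here is a minimal sketch of two common ways to obtain 16-byte-aligned storage in C. It assumes a C11 compiler and library (alignas from <stdalign.h> and aligned_alloc from <stdlib.h>); the helper names are invented for this example.

```c
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocates room for n floats starting on a 16-byte boundary.
   aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up to the next multiple of 16.
   The caller releases the memory with free(). */
float *alloc_floats_aligned16(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 15u) & ~(size_t)15u;
    return aligned_alloc(16, rounded);
}

/* A statically allocated buffer forced onto a 16-byte boundary. */
alignas(16) float static_buffer[4];

/* Returns nonzero when p points to a 16-byte boundary. */
int is_aligned16_ptr(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}
```

Without steps like these, the compiler only guarantees the natural alignment of a float (typically 4 bytes), which is why unaligned instructions are the easier choice.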

Previous Intel CPUs were optimized for aligned instructions; unaligned ones were translated into multiple micro-ops and ran slower. In other words, unaligned instructions were easier for the programmer but slower to execute. Nehalem-based CPUs are optimized for unaligned instructions, which now run at the same speed as aligned instructions when the data they access is actually aligned. The slide in Figure 10 summarizes this.

Figure 10: Nehalem is optimized for unaligned SSE instructions.
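In C, the two kinds of SSE load are usually reached through compiler intrinsics: _mm_load_ps (which compilers normally turn into the aligned MOVAPS instruction) and _mm_loadu_ps (the unaligned MOVUPS). The sketch below assumes an x86 compiler that ships <xmmintrin.h> and a C11 <stdalign.h>; sum4_aligned, sum4_unaligned and demo_data are hypothetical names used only for illustration.

```c
#include <stdalign.h>
#include <xmmintrin.h>  /* SSE intrinsics (x86 only) */

/* Eight sample floats, with the first one placed on a 16-byte boundary. */
alignas(16) const float demo_data[8] = {1, 2, 3, 4, 5, 6, 7, 8};

/* Sums four floats using the aligned load.
   p must be 16-byte aligned, otherwise the program may fault. */
float sum4_aligned(const float *p)
{
    float lanes[4];
    __m128 v = _mm_load_ps(p);
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Sums four floats using the unaligned load. p may be any float address. */
float sum4_unaligned(const float *p)
{
    float lanes[4];
    __m128 v = _mm_loadu_ps(p);
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Calling sum4_unaligned(demo_data + 1) is valid even though that address is only 4-byte aligned, whereas passing the same pointer to sum4_aligned would not be safe. On Nehalem, using the unaligned version on data that happens to be aligned should cost nothing extra, which is the point made in Figure 10.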


Originally at http://www.hardwaresecrets.com/article/535/7

© 2004-8, Hardware Secrets, LLC. All Rights Reserved.
