Radeon HD 2900 XT runs at 740 MHz and accesses its 512 MB of GDDR3 memory at 825 MHz (1.65 GHz DDR) through a new 512-bit memory interface, which boosts the maximum theoretical memory transfer rate to 105.6 GB/s. For comparison, Radeon X1950 XTX has a maximum memory transfer rate of 64 GB/s and GeForce 8800 GTX reaches 86.4 GB/s, while the new GeForce 8800 Ultra reaches 103.6 GB/s.
Its unified shader architecture has 320 shader units, or “streaming processors”; GeForce 8800 GTX has 128.
Figure 1 gives an overview of the architecture used by Radeon HD 2900 XT.

Figure 1: Radeon HD 2900 XT architecture.
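The 105.6 GB/s figure quoted above follows directly from the bus width and the effective (DDR) memory clock. The short sketch below reproduces that arithmetic; the function name `peak_bandwidth_gb_s` is ours, chosen only for illustration, and 1 GB is taken as 10^9 bytes.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, effective_clock_mhz: float) -> float:
    """Peak theoretical memory bandwidth in GB/s (1 GB = 10**9 bytes).

    bus_width_bits      -- width of the memory interface in bits
    effective_clock_mhz -- effective data rate in MHz (DDR = 2 x real clock)
    """
    bytes_per_transfer = bus_width_bits / 8          # bits -> bytes
    megabytes_per_second = bytes_per_transfer * effective_clock_mhz
    return megabytes_per_second / 1000               # MB/s -> GB/s


# Radeon HD 2900 XT: 512-bit interface, 825 MHz real clock, DDR doubles it.
real_clock_mhz = 825
effective_clock_mhz = 2 * real_clock_mhz             # 1650 MHz, i.e. 1.65 GHz
hd2900xt_bandwidth = peak_bandwidth_gb_s(512, effective_clock_mhz)
print(f"Radeon HD 2900 XT: {hd2900xt_bandwidth:.1f} GB/s")
```

This is a theoretical maximum only; real-world throughput depends on access patterns and on how well the memory controller keeps the bus busy.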
Figure 2 shows in more detail how it works. The dispatch unit can send up to eight shader instructions to the streaming processors and up to two vertex or texture instructions per clock cycle. As we will explain below, each of these shader instructions can actually carry up to six instructions.

Figure 2: Inside Radeon HD 2900 XT architecture.
The streaming processors are divided into four main groups (called “SIMD arrays”) with 80 processors each, and each group is connected to two ports of the dispatch unit. Each group is subdivided into 16 units, each unit containing five streaming processors and one branch processing unit. The architecture of one of these units can be seen in Figure 3.

Figure 3: Architecture of a streaming processor unit, which contains five processors.
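The hierarchy described above (four SIMD arrays, 16 units per array, five streaming processors and one branch unit per unit) can be checked against the 320 total with a few lines of arithmetic. The constant names below are ours and serve only as an illustration.

```python
# Hierarchy as described in the text; names are illustrative only.
SIMD_ARRAYS = 4            # main groups of streaming processors
UNITS_PER_ARRAY = 16       # units inside each SIMD array
PROCESSORS_PER_UNIT = 5    # streaming processors inside each unit
BRANCH_UNITS_PER_UNIT = 1  # one branch processing unit per unit

processors_per_array = UNITS_PER_ARRAY * PROCESSORS_PER_UNIT
total_processors = SIMD_ARRAYS * processors_per_array
total_units = SIMD_ARRAYS * UNITS_PER_ARRAY
total_branch_units = total_units * BRANCH_UNITS_PER_UNIT

print(f"processors per SIMD array: {processors_per_array}")
print(f"total streaming processors: {total_processors}")
print(f"total units / branch units: {total_units} / {total_branch_units}")
```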
These units are superscalar, meaning that each unit can process several instructions in parallel. All five processors handle multiply-add instructions, which are the most common instruction type, while only one (the first one in Figure 3) can also handle transcendental instructions, i.e., logarithmic, exponential and trigonometric instructions such as SIN, COS, LOG and EXP. It is interesting to note that each streaming processor is, in fact, a small 32-bit floating-point unit.
Another interesting point is that each instruction sent to a unit packs six instructions (five math instructions plus one flow control instruction) into a single one. So instead of having to send up to six separate instructions to each unit, the dispatch unit can fill all six execution units with just one big instruction. This concept is called VLIW (Very Long Instruction Word).
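To make the VLIW idea more concrete, here is a toy sketch of how independent operations could be packed into six-slot bundles: five math slots, of which only one may hold a transcendental operation, plus one flow control slot. This is not how the actual shader compiler works. The `Bundle` class, the `pack` function and the flow control opcode names (`JUMP`, `LOOP`) are hypothetical, and the sketch assumes that all operations are independent of each other, which a real compiler would have to verify.

```python
from dataclasses import dataclass, field
from typing import List, Optional

TRANSCENDENTAL = {"SIN", "COS", "LOG", "EXP"}  # handled by one processor only
FLOW_CONTROL = {"JUMP", "LOOP"}                # hypothetical opcode names
MATH_SLOTS = 5                                 # five streaming processors


@dataclass
class Bundle:
    """One very long instruction word: five math slots plus one flow slot."""
    math: List[str] = field(default_factory=list)
    flow: Optional[str] = None

    def accepts(self, opcode: str) -> bool:
        if opcode in FLOW_CONTROL:
            return self.flow is None
        if len(self.math) >= MATH_SLOTS:
            return False
        # Only one processor per unit can execute transcendental operations.
        if opcode in TRANSCENDENTAL and any(op in TRANSCENDENTAL for op in self.math):
            return False
        return True

    def add(self, opcode: str) -> None:
        if opcode in FLOW_CONTROL:
            self.flow = opcode
        else:
            self.math.append(opcode)


def pack(opcodes: List[str]) -> List[Bundle]:
    """Greedily pack independent operations, in order, into bundles."""
    bundles: List[Bundle] = []
    for opcode in opcodes:
        if not bundles or not bundles[-1].accepts(opcode):
            bundles.append(Bundle())
        bundles[-1].add(opcode)
    return bundles


# Five multiply-adds and one flow control operation fit in a single bundle.
full = pack(["MAD"] * 5 + ["JUMP"])
# Two transcendental operations need two bundles, since only one slot handles them.
split = pack(["SIN", "COS"])
print(len(full), len(split))
```

The point of the sketch is the trade-off behind VLIW: the hardware stays simple because the grouping is decided ahead of time, but any slot that cannot be filled with an independent operation is left idle for that cycle.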