Generally speaking, direct assembly coding can outperform the use of intrinsics. Nevertheless, for fairness of comparison with algorithms coded in C, we use the provided intrinsics. Our experimental results use single-precision 32-bit floating-point values as the element data type, unless otherwise mentioned. Since SSE and SSE2 registers are 128 bits wide, this choice means that S = 128/32 = 4. Our Pentium 4 machine runs at GHz, has 1 GB of Rambus RDRAM, and uses the RedHat Linux operating system. We use Intel's C++ compiler with the highest optimization level. GNU's g++ compiler gives similar results for algorithms without SIMD instructions, but g++ does not have intrinsics for Pentium SIMD instructions. In this.