In this news post, we learned that the current version of the PhysX engine is compiled with the x87 instruction set and does not use modern sets like SSE2. I found here a simple, ready-to-use matrix multiplication code sample that lets us see the speed difference between the x87 and SSE/SSE2 sets.
I compiled the following code sample with Visual C++ 2005:
#include <stdio.h>
#include <windows.h>
#pragma comment(lib, "winmm.lib")

#define DIM 4

// Multiplies mat1 by mat2 and stores the result back into mat1.
void mul_mat(double **mat1, double **mat2, int sz)
{
    int ii, jj, kk;
    for (ii = 0; ii < sz; ii++)
    {
        double temp[DIM];
        for (jj = 0; jj < sz; jj++)
        {
            temp[jj] = 0.;
            for (kk = 0; kk < sz; kk++)
                temp[jj] += mat1[ii][kk]*mat2[kk][jj];
        }
        for (jj = 0; jj < sz; jj++)
            mat1[ii][jj] = temp[jj];
    }
}

int main()
{
    const int sz = 4;
    const int num = 1000000;
    int ii, jj;

    double **mat1 = new double*[sz];
    for (ii = 0; ii < sz; ii++)
        mat1[ii] = new double[sz];

    double **mat2 = new double*[sz];
    for (ii = 0; ii < sz; ii++)
        mat2[ii] = new double[sz];

    for (ii = 0; ii < sz; ii++)
        for (jj = 0; jj < sz; jj++)
        {
            mat1[ii][jj] = double(ii)/sz * double(jj)/sz;
            mat2[ii][jj] = double(ii)/sz * double(jj)/sz;
        }

    printf("\nStarting matrix multiplication loop...");
    DWORD start = timeGetTime();
    //
    // Main matrix loop:
    //
    for (ii = 0; ii < num; ii++)
        mul_mat(mat1, mat2, sz);
    DWORD end = timeGetTime();
    printf("\nElapsed time: %d ms\n", end - start);
    return 0;
}
As the author of this code says, it's a very artificial scenario, but it shows the difference between the different math instruction sets well.
I limited the main loop to 1 million matrix multiplications, which is enough.
In the VS2005 project properties, I set Optimization to Minimize Size (/O1) and changed the Enhanced Instruction Set option (the /arch:SSE and /arch:SSE2 compiler switches) to test the different sets.
You can download the test pack here:
[download#141#image]
There are three executables in the pack: x87_test.exe, sse_test.exe and sse2_test.exe. I added a batch file for each exe in order to have a pause at the end.
Test 1 – Instruction Set: No Set or x87
– Elapsed time: 2373 ms
Test 2 – Instruction Set: SSE
– Elapsed time: 2368 ms
Test 3 – Instruction Set: SSE2
– Elapsed time: 1112 ms
No doubt, SSE2 is the way to get fast math. A recompilation of the PhysX engine with the SSE2 instruction set would be very nice. But as I said in that news post, a simple recompilation might lead to some incorrect calculation results, so NVIDIA would have to test such a recompilation first, which may take some time…
Here is the assembly output of the core of the matrix multiplication for the different instruction sets:
temp[jj] += mat1[ii][kk]*mat2[kk][jj];
Set: No Set
00431A89 mov eax,dword ptr [ii]
00431A8C mov ecx,dword ptr [mat1]
00431A8F mov edx,dword ptr [ecx+eax*4]
00431A92 mov eax,dword ptr [kk]
00431A95 mov ecx,dword ptr [mat2]
00431A98 mov eax,dword ptr [ecx+eax*4]
00431A9B mov ecx,dword ptr [kk]
00431A9E mov esi,dword ptr [jj]
00431AA1 fld qword ptr [edx+ecx*8]
00431AA4 fmul qword ptr [eax+esi*8]
00431AA7 mov edx,dword ptr [jj]
00431AAA fadd qword ptr temp[edx*8]
00431AAE mov eax,dword ptr [jj]
00431AB1 fstp qword ptr temp[eax*8]
00431AB5 jmp mul_mat+68h (431A78h)
Set: SSE Set
00431A89 mov eax,dword ptr [ii]
00431A8C mov ecx,dword ptr [mat1]
00431A8F mov edx,dword ptr [ecx+eax*4]
00431A92 mov eax,dword ptr [kk]
00431A95 mov ecx,dword ptr [mat2]
00431A98 mov eax,dword ptr [ecx+eax*4]
00431A9B mov ecx,dword ptr [kk]
00431A9E mov esi,dword ptr [jj]
00431AA1 fld qword ptr [edx+ecx*8]
00431AA4 fmul qword ptr [eax+esi*8]
00431AA7 mov edx,dword ptr [jj]
00431AAA fadd qword ptr temp[edx*8]
00431AAE mov eax,dword ptr [jj]
00431AB1 fstp qword ptr temp[eax*8]
00431AB5 jmp mul_mat+68h (431A78h)
Set: SSE2 Set
00431A91 mov eax,dword ptr [ii]
00431A94 mov ecx,dword ptr [mat1]
00431A97 mov edx,dword ptr [ecx+eax*4]
00431A9A mov eax,dword ptr [kk]
00431A9D mov ecx,dword ptr [mat2]
00431AA0 mov eax,dword ptr [ecx+eax*4]
00431AA3 mov ecx,dword ptr [kk]
00431AA6 mov esi,dword ptr [jj]
00431AA9 movsd xmm0,mmword ptr [edx+ecx*8]
00431AAE mulsd xmm0,mmword ptr [eax+esi*8]
00431AB3 mov edx,dword ptr [jj]
00431AB6 addsd xmm0,mmword ptr temp[edx*8]
00431ABC mov eax,dword ptr [jj]
00431ABF movsd mmword ptr temp[eax*8],xmm0
00431AC5 jmp mul_mat+70h (431A80h)
No Set and SSE produce the same code (x87 instructions!) but the SSE2 build really uses another code path, with SSE registers such as xmm0.
In my experience, auto-vectorization (the compiler generating SSE code automatically) doesn't really work that well. This means that even if you tell the compiler to generate SSE(2/3/4) code, you won't always get SSE instructions. SSE code needs to be written by hand (preferably with intrinsics) and often requires some modifications to adapt the code to SIMD instructions.
Only in very simple scenarios can the compiler do this automatically.
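To give an idea of what "by hand with intrinsics" means, here is a minimal sketch of an SSE2 version of the 4x4 multiply, assuming flat, row-major double[16] arrays rather than the double** layout of the test program; the function name mul_mat4_sse2 is just for illustration.

#include <emmintrin.h>  // SSE2 intrinsics

// Minimal sketch: multiplies a by b and stores the result back into a,
// like mul_mat() does, but on flat row-major double[16] matrices.
void mul_mat4_sse2(double* a, const double* b)
{
    for (int i = 0; i < 4; i++)
    {
        __m128d c01 = _mm_setzero_pd();  // result row i, columns 0-1
        __m128d c23 = _mm_setzero_pd();  // result row i, columns 2-3
        for (int k = 0; k < 4; k++)
        {
            // Broadcast a[i][k] and accumulate a[i][k] * b[k][0..3].
            __m128d aik = _mm_set1_pd(a[i*4 + k]);
            c01 = _mm_add_pd(c01, _mm_mul_pd(aik, _mm_loadu_pd(&b[k*4 + 0])));
            c23 = _mm_add_pd(c23, _mm_mul_pd(aik, _mm_loadu_pd(&b[k*4 + 2])));
        }
        _mm_storeu_pd(&a[i*4 + 0], c01);
        _mm_storeu_pd(&a[i*4 + 2], c23);
    }
}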
My results, Core i7 stock:
x87: 2090 ms
SSE: 2050 ms
SSE2: 1170 ms
I meant core i7 860 stock 😛
Some months ago I also did some tests.
I tested matrix addition, subtraction and multiplication by a scalar, for both 64- and 32-bit targets, with compiler optimizations and hand-tuned assembly.
It turned out that the fastest results came from the compiler optimizations in 64-bit mode.
Here are the specs of the machine that I used:
Processor: Intel Core 2 Duo T7700 @ 2.4GHz
Memory: 4GB DDR2 667MHz
OS: Windows 7 x64 – Professional
The compiler was the one from Visual Studio 2010 (beta 2 at the time).
And here is the source code with the screenshots of the tests:
http://paginas.fe.up.pt/~ei06120/rs/lib/exe/fetch.php?media=downloads:amun_engine-matrix_profiling.zip
PhII @3.63 GHz:
x87 : 546ms
SSE : 546ms
SSE2: 811ms
WTF ? Pls explain…
PhII X4 @ 3.63 GHz ;). Sorry…
Why doesn't anyone use SSE3/4 instead of SSE2? I mean, SSE2 is old, as are SSE and x87. Or is it because SSE2 matrix math is just as fast as SSE3/4's?
Thanks!
@Mars_999: good question! For my part, SSE and SSE2 are the only instruction sets available in VS2005. And I just looked at VS2010, and only SSE/SSE2 are available there too. Actually, I think the Intel C++ Compiler is required to get SSE3/4…
So we are forced to do hand coded assembly for SSE3/4 in MSVC++? Does GCC support it?
quote: No Set and SSE set have the same code (fp87 instructions????) but SSE2 code really uses another code path with the use of SSE registers such as xmm0.
This is normal: the SSE instruction set only supports 32-bit floating point numbers (float), not 64-bit ones (double) like SSE2. So the compiler has no choice but to use the x87 instruction set in this case.
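To make that concrete, here is a tiny illustrative snippet (not from the test program): plain SSE only offers packed single-precision arithmetic, while SSE2 adds the double-precision equivalents that the double-based matrix code needs.

#include <xmmintrin.h>  // SSE:  __m128, 4 x float
#include <emmintrin.h>  // SSE2: __m128d, 2 x double

// SSE can multiply packed floats...
__m128  mul_floats(__m128 a, __m128 b)    { return _mm_mul_ps(a, b); }
// ...but packed doubles need SSE2, hence x87 code under /arch:SSE.
__m128d mul_doubles(__m128d a, __m128d b) { return _mm_mul_pd(a, b); }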
@Mars_999: SSE3 & SSE4 are _extensions_ of the older SSE implementations, so they don't replace SSE2 instructions with faster ones; instead they add new ones. So it depends on your code whether any newer SSE is useful for you. And vector/matrix math is mostly covered by SSE & SSE2.
@Mars_999: you are not forced to use assembly for SSE3/4.x in MSVC++; you can use the intrinsic functions.
Check the intrinsics headers (xmmintrin.h for SSE, emmintrin.h for SSE2, smmintrin.h for SSE4.1).
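For example, assuming a compiler that ships smmintrin.h (MSVC 2008 or later, or GCC with -msse4.1), an SSE4.1 instruction can be used from plain C++ with no inline assembly; the helper name dot2 below is purely illustrative.

#include <smmintrin.h>  // SSE4.1 intrinsics

// _mm_dp_pd computes the dot product of two pairs of doubles in a single instruction.
double dot2(__m128d a, __m128d b)
{
    return _mm_cvtsd_f64(_mm_dp_pd(a, b, 0x31));
}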
As I noticed, MSVC 2010 (and 2008) used SSE2 even with No Set. I used very simple code, like a loop with x = sin(i*M_PI/180), and I always got SSE instructions.
I changed the code a little bit (removed the Windows part) and tested it on a puny Atom.
Instead of using the timeGetTime function I used the Linux time command; the result is the time the program spends in user space. This also counts the initialization, but that is about 0.00001% of the whole time: an acceptable error.
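(For reference, a minimal portable replacement for the timeGetTime() call could look like the sketch below; this is my own illustration, not the commenter's code, and the helper name ms_now is hypothetical.)

#include <sys/time.h>  // gettimeofday

// Returns a millisecond timestamp, roughly equivalent to timeGetTime() on Windows.
static unsigned long ms_now()
{
    timeval tv;
    gettimeofday(&tv, 0);
    return (unsigned long)tv.tv_sec * 1000UL + tv.tv_usec / 1000UL;
}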
Test1: No SSE No optimization
gcc matrixTest.cpp -o matrixNoSse -lstdc++
result -> 8.757s
Test2: No SSE with optimization
gcc matrixTest.cpp -O3 -o matrixNoSse -lstdc++
result -> 3.004s
Test3: SSE2 No optimization
gcc matrixTest.cpp -msse2 -mfpmath=sse -o matrixSse -lstdc++
result -> 5.875s
Test4: SSE2 with optimization
gcc matrixTest.cpp -O3 -msse2 -mfpmath=sse -o matrixSse -lstdc++
result -> 2.996s
Similar results if I compile enabling SSE3 and SSE4.
A big improvement without optimization, but with optimization the time is basically the same with or without SSE.
Any explanation? Probably the Atom executes the SSE instructions quite slowly.
By the way, looking at the assembly, the SSE version is a lot shorter. 🙂
AMD Athlon X2 @ 2.2 GHz
x87: 964ms
SSE: 964ms
SSE2: 1312ms
I can't see any boost here.
E7400 stock
x87: 2511 ms
SSE: 2545 ms
SSE2: 1170 ms
With a Phenom II X4 p40BE @ 3 GHz:
X87 : 668 ms
SSE : 667 ms
SSE2: 993 ms
Like Rares, I would like to understand!?
Are the Phenoms really that bad with SSE2 (and would that explain why they are not as good as the Intel CPUs in games or video compression, for example)?
Pingback: [GPU Tool] FluidMark 1.2.2 Updated With PhysX SDK 2.8.4.4 - 3D Tech News, Pixel Hacking, Data Visualization and 3D Programming - Geeks3D.com
With an Athlon (1) X2 @ 2.3 GHz (Brisbane G2):
x87: 978 ms
SSE: 996 ms
SSE2: 1287 ms
LOL?
With an Athlon II X4 @ 3.1 GHz:
x87: 630 ms
SSE: 650 ms
SSE2: 960 ms
That does not look too good… I think there is a problem in the code, because if SSE2 were really slower than x87 it would be completely unnecessary and shouldn't even be implemented on AMD CPUs. But it is implemented and used in nearly everything, and Intel disabled SSE2 usage in their compiler when the compiled program runs on an AMD CPU. Obviously, if SSE2 were really slower on AMD they wouldn't disable it, since their goal was to make programs run slower on AMD, not faster.
LOL, didn't you AMD users notice that your "slower" CPUs, even an Athlon X2 @ 2.2 GHz, are 2x faster than the "faster" Intel CPUs, such as a Core i7 860 at stock, when doing x87?
My Core 2 Duo @2.1GHz results:
x87: 3469 ms
SSE: 3416 ms
SSE2: 1569 ms