Performance – GCC & auto-vectorization bug when using unsigned int loop counters

Why the following code is faster when using int type for loop counters instead of unsigned int type ???

The code

The following code performs the multiplication of two squared matrices m1 and m2 :

double t;
double* m1 = new double[dim*dim];
double* m2 = new double[dim*dim];
double* m3 = new double[dim*dim];

// fill matrices m1 and m2 here

for(unsigned int i = 0 ; i < dim ; ++i) {
  for(unsigned int k = 0 ; k < dim ; k++) {
    t = m1[i*dim + k];
    for(unsigned int j = 0 ; j < dim ; ++j) {
      m3[i*dim + j] += t * m2[k*dim + j];
    }
  }
}

The code is compiled with gcc 4.7 (also tried 4.8) using -std=c++11 and -O3 flags and runs in around 920 ms on my Core i7.

The problem

BUT, what happens if we change the type of the integer values from unsigned int to int ???
The same code runs in around 620 ms...
Of course, an operation (arithmetical or logical) on unsigned int and on int costs the same, so the difference in performance should not be related to these use of one type or the other.

The explanation

By looking at the assembly code generated for each version, we can identify a difference in the most internal loop.
In the version using unsigned int we have the following :

	movsd	xmm0, QWORD PTR [r12+rsi*8]
	lea	rcx, [rbp+0+rcx*8]
	mulsd	xmm0, xmm1
	addsd	xmm0, QWORD PTR [rcx]
	movsd	QWORD PTR [rcx], xmm0

and in the version using int we have :

	movsd	xmm3, QWORD PTR [rax+rdx]
	movsd	xmm1, QWORD PTR [rcx+rdx]
	movhpd	xmm3, QWORD PTR [rdx+8+rax]
	movhpd	xmm1, QWORD PTR [rcx+8+rdx]
	movapd	xmm0, xmm1
	movapd	xmm1, xmm3
	mulpd	xmm1, xmm2
	addpd	xmm0, xmm1
	movlpd	QWORD PTR [rcx+rdx], xmm0
	movhpd	QWORD PTR [rcx+8+rdx], xmm0

In this last version, we note that the compiler has vectorized the multiply and add operations (mulpd, addpd), thus performing two operations at the same time, while it generates a scalar version when using unsigned int loop counters (mulsd, addsd). Note that, the compiler enables auto-vectorization by default when using the -O3 optimization flag.
This can be confirmed by enabling the verbose mode with the -ftree-vectorizer-verbose=2 flag.
When trying to compile the version using unsigned int counters we obtain the following output:

...
28: not vectorized: not suitable for gather ...
... note: vectorized 0 loops in function.

while the version using int counters we obtain:

Analyzing loop at ...
Vectorizing loop at ...
28: LOOP VECTORIZED.
... note: vectorized 1 loops in function.

Note that, when using long or unsigned long for loop counters, the compiler vectorize the inner loop in both cases...
The problem is known since 2011 in gcc 4.6 and still exists in gcc 4.8.
Bug 48052 - loop not vectorized if index is "unsigned int"

Conclusion

Use int loop counters !!! or vectorize your code by hand using intrinsics...