An introduction to auto-vectorization with LLVM

Most modern general purpose CPUs have a vector processing unit (VPU). This unit contains vector registers that can hold multiple integers or floats, and instructions that operate on vector registers. Given two vector registers containing N floats each †, the VPU can perform a single instruction to operate on these two vector registers and store the result in another vector register. Without a VPU, this operation would require N instructions operating on two floating-point registers each relying on the floating-point unit (FPU), or arithmetic logic unit (ALU) in the case of integers.

Major instruction set architectures have vector extensions, and corresponding instructions for the VPU. Examples include x86's SSE/AVX, AArch64's SVE/Neon, and the latest entrant is RISC-V's RVV. In order to generate vector instructions, the programmer could adapt their code to use standardized vector intrinsics. To illustrate, suppose the programmer had the following source code:

void saxpy_golden(size_t n, const float a, const float *x, float *y) {
  for (size_t iv = 0; iv < n; ++iv)
    y[iv] = a * x[iv] + y[iv];
}

If they wanted to use RISC-V vector intrinsics to vectorize it, this is how they would adapt the code:

void saxpy_vec(size_t n, const float a, const float *x, float *y) {
  for (size_t vl; n > 0; n -= vl, x += vl, y += vl) {
    vl = __riscv_vsetvl_e32m8(n);
    vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl);
    vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl);
    __riscv_vse32_v_f32m8(y, __riscv_vfmacc_vf_f32m8(vy, a, vx, vl), vl);
  }
}

As is evident, this process of hand-vectorizing code is very difficult and error-prone for non-trivial programs. The process is analogous to writing assembly instead of writing the source language and letting the compiler generate assembly. Indeed, compilers today auto-vectorize code, and for the purposes of illustration, let us use Clang/LLVM. First, the source language is lowered to target-independent LLVM IR by Clang, which is the frontend. Next, the vectorizers in the middle-end of LLVM operate on this LLVM IR without vectors in them, and produce LLVM IR with vectors in them. The vectorized LLVM IR is finally lowered to target-specific assembly by the backend of LLVM.

Let us take the same example of saxpy_golden, and see how the loop looks in LLVM IR without vectorization:

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %gep.x = getelementptr inbounds float, ptr %x, i64 %iv
  %load.x = load float, ptr %gep.x
  %gep.y = getelementptr inbounds float, ptr %y, i64 %iv
  %load.y = load float, ptr %gep.y
  %fmuladd = tail call float @llvm.fmuladd.f32(
    float %a, float %load.x, float %load.y)
  store float %fmuladd, ptr %gep.y
  %iv.next = add nuw i64 %iv, 1
  %exitcond = icmp eq i64 %iv.next, %n
  br i1 %exitcond, label %exit, label %loop

The first line, loop:, is like a label in C, to which branch instructions (br) can jump to. The next line is creating a new variable %iv using a phi to select either 0 when jumping from the entry block (not shown), or %iv.next when jumping from the back-edge of the loop: this is the induction variable of the loop. Since x and y are pointers (marked by the type ptr), we see getelementptr instructions that act as the address computations: so, %gep.x is %x + %iv, and %gep.y is %y + %iv. The load instructions derefence the pointer: so, %load.x is x[iv] and load.y is y[iv]; both are of type f32, or float. @llvm.fmuladd.f32 is the computation, which does a a * x[iv] + y[iv], and the store writes the result %fmuladd back into %gep.y. The last three instructions make the loop block a loop: %iv.next = %iv + 1 is the "next value" of the loop induction variable, and %exitcond is a boolean (marked by the type i1, integer with bitwidth 1) that acts as the exit-condition of the loop. %exitcond is computed as an integer-equal-comparison between %iv.next and %n, where n is the total number of iterations of the loop. Finally, there is a branch instruction that jumps to either the exit block (not shown), or back to the loop block (this is the back-edge of the loop), depending on the exit condtion.

This is how it changes after vectorization:

loop.preaheder:
  %n.and.minus4 = and i64 %n, -4
  %a.vec.tmp = insertelement <4 x float> poison, float %a, i64 0
  %a.vec = shufflevector <4 x float> %a.vec.tmp,
    <4 x float> poison, <4 x i32> zeroinitializer
  br label %loop

loop:
  %iv = phi i64 [ 0, %loop.preaheder ], [ %iv.next, %loop ]
  %gep.x = getelementptr inbounds float, ptr %x, i64 %iv
  %load.x = load <4 x float>, ptr %gep.x
  %gep.y = getelementptr inbounds float, ptr %y, i64 %iv
  %load.y = load <4 x float>, ptr %gep.y
  %fmuladd = tail call <4 x float> @llvm.fmuladd.v4f32(
    <4 x float> %a.vec, <4 x float> %load.x, <4 x float> %load.y)
  store <4 x float> %fmuladd, ptr %gep.y
  %iv.next = add nuw i64 %iv, 4
  %exitcond = icmp eq i64 %iv.next, %n.and.minus4
  br i1 %exitcond, label %exit, label %loop

The loop now has a loop.preheader block containing an %n.and.minus4 to handle the case when n is not a multiple of four. It also contains an insertelement and shufflevector, which essentially fill the vector %a.vec (that has type <4 x float>) with four copies of %a. The difference in the loop is that %iv.next is now %iv + 4, since it operates on four floats at a time, cutting the number of iterations by a factor of four. The load instructions are now loading four floats at a time, and the vector variant of @llvm.fmuladd.f32, which is @llvm.fmuladd.v4f32, does a %a.vec * %load.x + %load.y, operating on a vector of four floats at a time.

The vectorizer in LLVM is quite sophisticated, and it is fascinating how it manages to vectorize non-trivial examples. To walk through a simple but non-trivial example, consider:

for (size_t i = 0; i < n; ++i)
  y[i] += a * x[i] + i % 2 ? 0 : y[i];

Inspecting the output, we can see what the vectorizer has done. First, we see a new phi:

%vec.ind = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %vector.ph ],
  [ %vec.ind.next, %vector.body ]

It has created a new vector induction variable vector, corresponding to i, with four elements, since it has decided to vectorize with a factor of 4. Since %vec.ind is an integer, and the vectors are float, it has done a integer-to-floating-point conversion:

%10 = uitofp <4 x i64> %9 to <4 x float>

For the conditional itself, it has used a select to choose either 0 or the loaded value:

%15 = select <4 x i1> %12, <4 x float> zeroinitializer,
  <4 x float> %wide.load1

For the rest of this article, let us inspect some factors that could get in the way when the loop is expected to be vectorized, omitting cases where the loop is either obviously impossible or unprofitable to vectorize.

  1. The iteration count. In most real-world code, the exact iteration count is unknown and the auto-vectorizers can deal with this. Auto-vectorizers can also deal with non-power-of-2 iteration counts by inserting a scalar epilogue with the remaining iterations. However, if the iteration count of the loop is computable and too small, the loop might get unrolled instead of vectorized.
  2. Aliasing information. If there isn't sufficient information to determine whether memory accesses in the loop are aliased to each other, runtime checks may be generated, or auto-vectorization might fail completely because insertion of runtime checks make it unprofitable to vectorize.

To illustrate, let's modify saxpy_golden as follows:

for (size_t i = 0; i < 3; ++i)
  y[i] += a * x[i];

Then, running clang -O3 -fno-unroll-loops -mllvm -vectorize-loops=false -emit-llvm -S test.c -o test.ll followed by opt -passes=loop-vectorize -debug-only=loop-vectorize test.ll outputs:

LV: Found a loop with a very small trip count.
  This loop is worth vectorizing only if no
  scalar iteration overheads are incurred.
LV: Found trip count: 3
LV: Not vectorizing

The overhead it is talking about is runtime checks it needs to add. Adding restrict to the pointers would remove the overhead, and allow it to be vectorized:

void saxpy_golden(const float a, const float *restrict x, float *restrict y)

Even if we were to ramp up the iteration count to 17, unrolling the loop would be preferred over vectorization. Forcing it not to unroll loops using -fno-unroll-loops as before results in successful vectorization. The final output contains a scalar version of the loop as well, marked by scalar.ph, which is required as 17 isn't divisible by 4.

  1. The memory access pattern. A good rule of thumb write loops with uniform row-major access, and avoid complex indexing arithmetic.

The vectorizer has no issues with the following case:

void test(size_t n, size_t m, const float a,
          const float **restrict x, float **restrict y) {
  for (size_t i = 0; i < n; ++i)
    for (size_t j = i; j < m; ++j)
      y[i][j] += a * x[i][j];
}

Now, let's change the assignment statement in the loop to:

y[i][j] += a * x[j][i];

This is unfortunately not vectorized by LLVM, but there is no reason this is not theoretically vectorizable. The reason for this is that an analysis that is used to prove legality of vectorization in LLVM, called LoopAccessAnalysis, is unable to find the bounds of the memory access. Running opt -passes=loop-vectorize test.ll -disable-output -debug-only=loop-accesses outputs:

LAA: We can't vectorize because we can't find the array bounds.
  1. Conditionals within the loop. Certain conditionals are possible to vectorize by either using select, or masking values in the vector using the condition. There is no way to tell if a certain conditional will be vectorized in advance, and depends on whether the pattern is implemented in LLVM's vectorizer.

LLVM is able to vectorize the example discussed previously:

for (size_t i = 0; i < n; ++i)
  y[i] += a * x[i] + i % 2 ? 0 : y[i];

Unfortunately, it is unable to vectorize this:

size_t rdx = 3;
for (size_t i = 0; i < n; ++i)
  rdx = (y[i] > x[i]) ? i : rdx;
return rdx;

The reason is as good as "it is unimplemented":

LV: Not vectorizing: Found an unidentified PHI
  %10 = phi i64 [ %16, %8 ], [ 3, %.preheader ]
  1. Whether the loop is canonical. If, for instance, the loop has multiple exits, LLVM cannot auto-vectorize it.

LLVM struggles with early-exit loops, although there is no reason this is not possible to vectorize:

size_t i = 0;
for (; i < n; ++i)
  if (y[i] > x[i]) {
    y[i] = x[i];
    break;
  }
return i;

The reason is as good as "it is unimplemented":

LV: Found an outside user for :   %.lcssa = phi float [ %11, %6 ]
LV: Not vectorizing: could not determine number of loop iterations.
  1. Function calls within the loop. Except for calls to LLVM intrinsics (like the @llvm.fmuladd in the example discussed previously), the auto-vectorizer will fail if there are function calls within the loop that it didn't inline.

LLVM cannot vectorize:

for (size_t i = 0; i < n; ++i)
  printf("%f\n", a * x[i]);

AArch64 and RISC-V don't include a vector instruction for printf in their instruction set, and this is a fundamental limitation of the instruction set.

LV: Not vectorizing: Found a non-intrinsic callsite
  %13 = tail call i32 (ptr, ...) @printf(
    ptr noundef nonnull dereferenceable(1) @.str,
    double noundef %12)

An example of a vector instruction that's included in both AArch64 and RISC-V is lrint, and vectorizing a program with it will use the vector variant of the @llvm.lrint intrinsic in the target-independent LLVM IR. The output can be inspected on Godbolt.

void test(size_t n, const float a, const float *x, long *y) {
  for (size_t i = 0; i < n; ++i)
    y[i] += lrint(a * x[i]);
}