Deficiencies in LLVM's LoopVectorize

To establish common terminology, let us first broadly sketch the pieces of infrastructure that LoopVectorize depends on. First, there is ScalarEvolution that is a core middle-end analysis for analyzing scalars that evolve in loops, i.e. scalars that depend on the loop induction variable. Next, we need an analysis to determine whether or not something is legal to vectorize, depending on the memory access patterns, and to insert runtime checks in ambiguous cases: this is LoopAccessAnalysis. Next, if the loop is legal to vectorize, the VPlan infrastructure kicks in that makes several candidate plans for vectorizing the loop, querying the CostModel analysis to find out the cost of each plan. Finally, LoopVectorize chooses the best plan from the candidate plans, and if the vectorization is profitable with this plan, it executes the VPlan, emitting the final LLVM IR.

Now, LoopVectorize is quite sophisticated, and is worked on actively. It performs roughly on-par with GCC’s equivalent transform. However, it suffers from several deficiencies, generating sub-optimal vector code, or missing opportunities for vectorization.

The reason for sub-optimal code is the logic within VPlan, or within LoopVectorize. In this category, there is one big deficiency: LoopVectorize cannot vectorize outer loops.

The reasons for missed opportunities broadly fall into five buckets:

Missing patterns in LoopVectorize. For example, loops with conditionals involving the induction variable aren't vectorized.
Deficiencies in ScalarEvolution. For example, std::reverse isn't vectorized.
Deficiencies in LoopAccessAnalysis. The analysis relies on ad-hoc logic on ScalarEvolution expressions to determine dependent accesses, which is certainly sub-optimal.
Lack of supporting transforms. GCC has a transform which distributes a loop to remove dependent accesses, so it can be vectorized. LLVM's LoopDistribute is just a stub that has hardly been worked on since it was checked in, and is turned off by default.
Certain patterns that arise due to the ordering of transforms, that are very difficult to fix. See this dicussion for an example.

There is also one big project in progress that could have a significant impact on several parts of LLVM, especially LoopAccessAnalysis. The goal of the project is to get a Presburger library within LLVM, and replace the ad-hoc logic in LoopAccessAnalysis with a principled approach to computing dependencies. The library would operate on affine expressions (that is to say, linear expressions with integer division, and this arithmetic is called Presburger arithmetic) with a constraints (say, equal to 0), and would solve systems of these constraints. To learn more about Presburger arithmetic, see this tutorial.