Any modern compiler has three main components: a front-end that parses the source language and translates it to a middle-end intermediate representation (IR); a middle-end that operates on this IR, performing mostly target-independent optimizations with a bit of target-specific information; and a backend that takes the final optimized middle-end IR and emits target-specific assembly. In this article, we illustrate the operation of the main components of a compiler's backend, namely instruction selection, instruction scheduling, and register allocation, using LLVM as the reference compiler. LLVM is especially good for the purposes of illustration, as it has, in addition to a clean human-readable middle-end IR, a relatively clean human-readable backend IR. We use the RISC-V target for concrete illustration, chosen once again for its clean design.
Let us start with a simple C program as our guiding example, run the entire middle-end optimization pipeline, and then look at each step of the backend lowering, all the way to the final assembly.
#define N 64
void matmul(const float *restrict x, const float *restrict y,
float *restrict res) {
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
for (int k = 0; k < N; ++k)
res[N * i + j] += x[N * i + k] * y[N * k + j];
}
First, we run the front-end and the full middle-end optimization pipeline to obtain the optimized LLVM IR that is handed off to the backend. Running clang --target=riscv64 -march=rv64gc -O3 -emit-llvm -S emits:
define void @matmul(ptr noalias %x, ptr noalias %y, ptr noalias %res) {
entry:
br label %ph
ph:
%iv.outer = phi i64 [ 0, %entry ], [ %iv.outer.next, %common.exit ]
%shl = shl nuw nsw i64 %iv.outer, 6
br label %common.ph
exit:
ret void
common.ph:
%iv.middle = phi i64 [ 0, %ph ], [ %iv.middle.next, %inner.loop.exit ]
%or = or disjoint i64 %iv.middle, %shl
%gep.res = getelementptr inbounds float, ptr %res, i64 %or
%load.res = load float, ptr %gep.res
%gep.y = getelementptr float, ptr %y, i64 %iv.middle
br label %inner.loop
common.exit:
%iv.outer.next = add nuw nsw i64 %iv.outer, 1
%exit.cond.outer = icmp eq i64 %iv.outer.next, 64
br i1 %exit.cond.outer, label %exit, label %ph
inner.loop.exit:
store float %conv.next, ptr %gep.res
%iv.middle.next = add nuw nsw i64 %iv.middle, 1
%exit.cond.middle = icmp eq i64 %iv.middle.next, 64
br i1 %exit.cond.middle, label %common.exit, label %common.ph
inner.loop:
%iv = phi i64 [ 0, %common.ph ], [ %iv.next, %inner.loop ]
%conv = phi float [ %load.res, %common.ph ], [ %conv.next, %inner.loop ]
%or = or disjoint i64 %iv, %shl
%gep.x = getelementptr inbounds float, ptr %x, i64 %or
%load.x = load float, ptr %gep.x
%shl.iv = shl nuw nsw i64 %iv, 6
%gep.y.2 = getelementptr float, ptr %gep.y, i64 %shl.iv
%load.y = load float, ptr %gep.y.2
%conv.next = tail call float @llvm.fmuladd.f32(
float %load.x, float %load.y, float %conv)
%iv.next = add nuw nsw i64 %iv, 1
%exit.cond = icmp eq i64 %iv.next, 64
br i1 %exit.cond, label %inner.loop.exit, label %inner.loop
}
The astute reader will note that we have strategically omitted the vector extension of RISC-V (we use -march=rv64gc instead of -march=rv64gcv) to keep LLVM's vectorization passes from unnecessarily complicating the exposition. Auto-vectorization was discussed in a previous article.
The first thing to notice is that the body of the function is cleanly separated into basic blocks: entry, ph, common.ph, inner.loop, inner.loop.exit, common.exit, and exit. Next, notice that the IR is in static single-assignment (SSA) form, which means that variables like %iv are defined exactly once. Since this is a loop that requires incrementing the induction variable, a fresh %iv.next variable is created, and %iv is a phi node that selects the value 0 when inner.loop is entered from common.ph, and %iv.next when inner.loop is reached along the inner loop's back-edge. All the instructions (load, getelementptr, store, or, shl, etc.) are target-independent instructions that are used throughout the LLVM middle-end. The target-specific instructions will only be seen after instruction selection.
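To make the phi-node mechanics concrete, here is a minimal hand-written counting loop in the same SSA style (an illustrative sketch, not output from our example):

```llvm
define void @count() {
entry:
  br label %loop
loop:
  ; %i has exactly one definition: this phi, which yields 0 when control
  ; arrives from %entry, and %i.next when it arrives around the back-edge.
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %i.next = add nuw nsw i64 %i, 1
  %done = icmp eq i64 %i.next, 10
  br i1 %done, label %exit, label %loop
exit:
  ret void
}
```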
The first step in the lowering process is instruction selection. We can look at the output of this step by running llc --print-after-isel:
Function Live Ins: $x10 in %18, $x11 in %19, $x12 in %20
bb.0 (%ir-block.3):
successors: %bb.1(0x80000000); %bb.1(100.00%)
liveins: $x10, $x11, $x12
%20:gpr = COPY $x12
%19:gpr = COPY $x11
%18:gpr = COPY $x10
%22:gpr = COPY $x0
%21:gpr = COPY %22:gpr
bb.1 (%ir-block.4):
successors: %bb.3(0x80000000); %bb.3(100.00%)
%0:gpr = PHI %18:gpr, %bb.0, %9:gpr, %bb.4
%1:gpr = PHI %21:gpr, %bb.0, %8:gpr, %bb.4
%2:gpr = nuw nsw SLLI %1:gpr, 6
%24:gpr = COPY $x0
%23:gpr = COPY %24:gpr
PseudoBR %bb.3
bb.2 (%ir-block.7):
PseudoRET
bb.3 (%ir-block.8):
successors: %bb.6(0x80000000); %bb.6(100.00%)
%3:gpr = PHI %19:gpr, %bb.1, %11:gpr, %bb.5
%4:gpr = PHI %23:gpr, %bb.1, %10:gpr, %bb.5
%25:gpr = nuw nsw SLLI %4:gpr, 2
%26:gpr = ADD killed %25:gpr, %19:gpr
%27:gpr = LUI 4
%5:gpr = ADD killed %26:gpr, killed %27:gpr
%28:gpr = OR %4:gpr, %2:gpr
%29:gpr = SLLI killed %28:gpr, 2
%6:gpr = ADD %20:gpr, killed %29:gpr
%7:fpr32 = FLW %6:gpr, 0
PseudoBR %bb.6
bb.4 (%ir-block.15):
successors: %bb.2(0x04000000), %bb.1(0x7c000000);
%bb.2(3.12%), %bb.1(96.88%)
%8:gpr = nuw nsw ADDI %1:gpr, 1
%9:gpr = ADDI %0:gpr, 256
%33:gpr = ADDI $x0, 64
BEQ %8:gpr, killed %33:gpr, %bb.2
PseudoBR %bb.1
bb.5 (%ir-block.18):
successors: %bb.4(0x04000000), %bb.3(0x7c000000);
%bb.4(3.12%), %bb.3(96.88%)
FSW %15:fpr32, %6:gpr, 0
%10:gpr = nuw nsw ADDI %4:gpr, 1
%11:gpr = ADDI %3:gpr, 4
%32:gpr = ADDI $x0, 64
BEQ %10:gpr, killed %32:gpr, %bb.4
PseudoBR %bb.3
bb.6 (%ir-block.21):
successors: %bb.5(0x04000000), %bb.6(0x7c000000);
%bb.5(3.12%), %bb.6(96.88%)
%12:gpr = PHI %0:gpr, %bb.3, %17:gpr, %bb.6
%13:gpr = PHI %3:gpr, %bb.3, %16:gpr, %bb.6
%14:fpr32 = PHI %7:fpr32, %bb.3, %15:fpr32, %bb.6
%30:fpr32 = FLW %12:gpr, 0
%31:fpr32 = FLW %13:gpr, 0
%15:fpr32 = nofpexcept FMADD_S killed %30:fpr32,
killed %31:fpr32, %14:fpr32, 7, implicit $frm
%16:gpr = ADDI %13:gpr, 256
%17:gpr = ADDI %12:gpr, 4
BEQ %16:gpr, %5:gpr, %bb.5
PseudoBR %bb.6
What we see is a relatively straightforward translation of the middle-end LLVM IR to machine IR (MIR), preserving the basic block structure and SSA form. The purpose of instruction selection is to select a combination of instructions available on the target that optimally implements the middle-end semantics. Its two primary functions are to replace middle-end SSA variables with virtual registers, and to replace middle-end instructions with target-specific instructions.
Each virtual register carries a register-class suffix: gpr (general-purpose register) or fpr32 (32-bit floating-point register). There are also physical registers in this MIR: $x0, $x10, $x11, $x12, and $frm. Per the RISC-V calling convention, the three arguments to the function arrive in the general-purpose registers $x10, $x11, and $x12 (pretty-printed as a0, a1, and a2). $x0 is a special register on RISC-V that always holds the value 0, and $frm is the floating-point rounding-mode register; none of these can be virtual registers. A minor point to notice is the annotations on the registers: killed and implicit. The operands of the FMADD_S, %30:fpr32 and %31:fpr32, are marked killed, meaning that this is the last use of those registers (these annotations are preliminary: the final ones are determined at a later stage). $frm is marked implicit, meaning that the instruction implicitly reads this register, and the operand will be absent from the final assembly. The register allocator must replace every virtual register with a physical register before the final assembly is emitted.
Middle-end instructions like @llvm.fmuladd have been replaced by target-specific instructions like FMADD_S (pretty-printed as fmadd.s), and several pseudo-instructions like PseudoBR, PseudoRET, and COPY have been inserted in the process. The pseudo-instructions must be lowered to real instructions before emitting the final assembly.
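The pattern-matching at the heart of this step can be illustrated with a toy selector that, like SelectionDAG did for @llvm.fmuladd above, fuses a multiply feeding an add into a single fused multiply-add. This is a deliberately simplified sketch over expression trees, not LLVM's actual DAG-based algorithm:

```python
# Toy tree-pattern instruction selector: fuses (a * b) + c into a single
# FMADD, the way @llvm.fmuladd was lowered to FMADD_S. An illustrative
# sketch, not LLVM's SelectionDAG machinery.

_counter = 0

def fresh():
    """Allocate a fresh virtual register name."""
    global _counter
    _counter += 1
    return f"%{_counter}"

def select(node):
    """Lower an expression tree (op, operand...) or a leaf register name
    into (instructions, result register), bottom-up."""
    if isinstance(node, str):                      # leaf: already a register
        return [], node
    op, *operands = node
    # Pattern: fadd whose first operand is an fmul => one fused FMADD.
    if op == "fadd" and isinstance(operands[0], tuple) and operands[0][0] == "fmul":
        _, a, b = operands[0]
        code_a, ra = select(a)
        code_b, rb = select(b)
        code_c, rc = select(operands[1])
        dest = fresh()
        return code_a + code_b + code_c + [f"FMADD {dest}, {ra}, {rb}, {rc}"], dest
    # Default: lower the operands, then emit one instruction for this node.
    code, regs = [], []
    for operand in operands:
        c, r = select(operand)
        code += c
        regs.append(r)
    dest = fresh()
    code.append(f"{op.upper()} {dest}, " + ", ".join(regs))
    return code, dest

insns, result = select(("fadd", ("fmul", "%x", "%y"), "%acc"))
# insns == ["FMADD %1, %x, %y, %acc"], result == "%1"
```

Without the pattern, the same tree would lower to a separate FMUL and FADD; the single fused pattern is both smaller and, on most FPUs, faster and more accurate.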
There is some minor additional information in this MIR: the live-ins, or registers that are live on entry, and branch probability annotations: for instance, bb.6 branches back to itself about 97% of the time.
Although the instruction selection infrastructure, SelectionDAG †, is target-independent, we can see that our MIR is clearly specialized for RISC-V. The registers available on RISC-V are defined in RISCVRegisterInfo.td, the RISC-V instructions themselves are defined in RISCVInstrInfo.td, and the target-specific code that tells SelectionDAG how to lower middle-end instructions to these target instructions lives in RISCVISelLowering.cpp.
In order to understand the various steps after instruction selection, let us inspect the diffs produced at each step by running llc -print-changed=diff.
The first significant pass to run after instruction selection is early-machinelicm, a backend counterpart of the middle-end loop-invariant code motion transform. Next, we have livevars, which analyzes live variables and aggressively marks registers as killed.
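The within-block part of that kill-marking reduces to a single backward scan: the first use of a register encountered while scanning backward is its last use in program order. A simplified sketch (assuming, unlike livevars, that no register is live out of the block):

```python
# Within a basic block, the last use of a register before any
# redefinition is its kill: a single backward scan finds it. This is a
# simplification of livevars, which also tracks liveness across blocks;
# here we assume no register is live out of the block.

def mark_kills(block):
    """block: list of (defs, uses) set pairs, one per instruction.
    Returns a parallel list of the registers killed at each one."""
    seen = set()                       # registers used later in the block
    kills = [set() for _ in block]
    for i in range(len(block) - 1, -1, -1):
        defs, uses = block[i]
        for r in uses:
            if r not in seen:          # first use found scanning backward
                kills[i].add(r)        # ... is the last use in program order
                seen.add(r)
        for r in defs:
            seen.discard(r)            # a def ends the live range above it
    return kills

# The address computation from bb.3, as (defs, uses) pairs:
block = [
    ({"%25"}, {"%4"}),                 # %25 = SLLI %4, 2
    ({"%26"}, {"%25", "%19"}),         # %26 = ADD %25, %19
    ({"%5"},  {"%26", "%27"}),         # %5  = ADD %26, %27
]
kills = mark_kills(block)
# kills == [{"%4"}, {"%19", "%25"}, {"%26", "%27"}]
```

In the real MIR, %4 and %19 survive the block, which is exactly the cross-block information livevars (and later liveintervals) must compute.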
phi-node-elimination runs next; it removes the PHI nodes, taking the MIR out of SSA form by inserting pseudo COPY instructions in the predecessor blocks:
Function Live Ins: $x10 in %18, $x11 in %19, $x12 in %20
bb.0 (%ir-block.3):
successors: %bb.1(0x80000000); %bb.1(100.00%)
liveins: $x10, $x11, $x12
%20:gpr = COPY killed $x12
%19:gpr = COPY killed $x11
%18:gpr = COPY killed $x10
%22:gpr = COPY $x0
%21:gpr = COPY killed %22:gpr
%27:gpr = LUI 4
%32:gpr = ADDI $x0, 64
+ %36:gpr = COPY killed %18:gpr
+ %37:gpr = COPY killed %21:gpr
bb.1 (%ir-block.4):
successors: %bb.3(0x80000000); %bb.3(100.00%)
- %0:gpr = PHI %18:gpr, %bb.0, %9:gpr, %bb.4
- %1:gpr = PHI %21:gpr, %bb.0, %8:gpr, %bb.4
+ %1:gpr = COPY killed %37:gpr
+ %0:gpr = COPY killed %36:gpr
%2:gpr = nuw nsw SLLI %1:gpr, 6
%24:gpr = COPY $x0
%23:gpr = COPY killed %24:gpr
+ %38:gpr = COPY %19:gpr
+ %39:gpr = COPY killed %23:gpr
PseudoBR %bb.3
bb.2 (%ir-block.7):
PseudoRET
bb.3 (%ir-block.8):
successors: %bb.6(0x80000000); %bb.6(100.00%)
- %3:gpr = PHI %19:gpr, %bb.1, %11:gpr, %bb.5
- %4:gpr = PHI %23:gpr, %bb.1, %10:gpr, %bb.5
+ %4:gpr = COPY killed %39:gpr
+ %3:gpr = COPY killed %38:gpr
%25:gpr = nuw nsw SLLI %4:gpr, 2
%26:gpr = ADD killed %25:gpr, %19:gpr
%5:gpr = ADD killed %26:gpr, %27:gpr
%28:gpr = OR %4:gpr, %2:gpr
%29:gpr = SLLI killed %28:gpr, 2
%6:gpr = ADD %20:gpr, killed %29:gpr
%7:fpr32 = FLW %6:gpr, 0
+ %40:gpr = COPY %0:gpr
+ %41:gpr = COPY %3:gpr
+ %42:fpr32 = COPY killed %7:fpr32
PseudoBR %bb.6
bb.4 (%ir-block.15):
successors: %bb.2(0x04000000), %bb.1(0x7c000000);
%bb.2(3.12%), %bb.1(96.88%)
%8:gpr = nuw nsw ADDI killed %1:gpr, 1
%9:gpr = ADDI killed %0:gpr, 256
- BEQ %8:gpr, %32:gpr, %bb.2
+ %36:gpr = COPY killed %9:gpr
+ %37:gpr = COPY %8:gpr
+ BEQ killed %8:gpr, %32:gpr, %bb.2
PseudoBR %bb.1
bb.5 (%ir-block.18):
successors: %bb.4(0x04000000), %bb.3(0x7c000000);
%bb.4(3.12%), %bb.3(96.88%)
FSW killed %15:fpr32, killed %6:gpr, 0
%10:gpr = nuw nsw ADDI killed %4:gpr, 1
%11:gpr = ADDI killed %3:gpr, 4
- BEQ %10:gpr, %32:gpr, %bb.4
+ %38:gpr = COPY killed %11:gpr
+ %39:gpr = COPY %10:gpr
+ BEQ killed %10:gpr, %32:gpr, %bb.4
PseudoBR %bb.3
bb.6 (%ir-block.21):
successors: %bb.5(0x04000000), %bb.6(0x7c000000);
%bb.5(3.12%), %bb.6(96.88%)
- %12:gpr = PHI %0:gpr, %bb.3, %17:gpr, %bb.6
- %13:gpr = PHI %3:gpr, %bb.3, %16:gpr, %bb.6
- %14:fpr32 = PHI %7:fpr32, %bb.3, %15:fpr32, %bb.6
+ %14:fpr32 = COPY killed %42:fpr32
+ %13:gpr = COPY killed %41:gpr
+ %12:gpr = COPY killed %40:gpr
%30:fpr32 = FLW %12:gpr, 0
%31:fpr32 = FLW %13:gpr, 0
%15:fpr32 = nofpexcept FMADD_S killed %30:fpr32,
killed %31:fpr32, killed %14:fpr32,
7, implicit $frm
%16:gpr = ADDI killed %13:gpr, 256
%17:gpr = ADDI killed %12:gpr, 4
- BEQ %16:gpr, %5:gpr, %bb.5
+ %40:gpr = COPY killed %17:gpr
+ %41:gpr = COPY %16:gpr
+ %42:fpr32 = COPY %15:fpr32
+ BEQ killed %16:gpr, %5:gpr, %bb.5
PseudoBR %bb.6
The next pass, liveintervals, undoes many of the killed annotations inserted by livevars by performing a more sophisticated analysis to determine which virtual registers are actually killed in the loop nest.
The final preparatory step before instruction scheduling and register allocation is register-coalescer, which eliminates many of the COPY instructions introduced by PHI elimination, merging copy-related virtual registers:
Function Live Ins: $x10 in %18, $x11 in %19, $x12 in %20
bb.0 (%ir-block.3):
successors: %bb.1(0x80000000); %bb.1(100.00%)
liveins: $x10, $x11, $x12
%20:gpr = COPY $x12
%19:gpr = COPY $x11
- %18:gpr = COPY $x10
- %22:gpr = COPY $x0
- %21:gpr = COPY %22:gpr
+ %0:gpr = COPY $x10
%27:gpr = LUI 4
%32:gpr = ADDI $x0, 64
- %36:gpr = COPY %18:gpr
- %37:gpr = COPY %21:gpr
+ %1:gpr = COPY $x0
bb.1 (%ir-block.4):
successors: %bb.3(0x80000000); %bb.3(100.00%)
- %1:gpr = COPY %37:gpr
- %0:gpr = COPY %36:gpr
%2:gpr = nuw nsw SLLI %1:gpr, 6
- %24:gpr = COPY $x0
- %23:gpr = COPY %24:gpr
- %38:gpr = COPY %19:gpr
- %39:gpr = COPY %23:gpr
+ %3:gpr = COPY %19:gpr
+ %4:gpr = COPY $x0
PseudoBR %bb.3
bb.2 (%ir-block.7):
PseudoRET
bb.3 (%ir-block.8):
successors: %bb.6(0x80000000); %bb.6(100.00%)
- %4:gpr = COPY %39:gpr
- %3:gpr = COPY %38:gpr
%25:gpr = nuw nsw SLLI %4:gpr, 2
%26:gpr = ADD %25:gpr, %19:gpr
%5:gpr = ADD %26:gpr, %27:gpr
%28:gpr = OR %4:gpr, %2:gpr
%29:gpr = SLLI %28:gpr, 2
%6:gpr = ADD %20:gpr, %29:gpr
- %7:fpr32 = FLW %6:gpr, 0
+ %42:fpr32 = FLW %6:gpr, 0
%40:gpr = COPY %0:gpr
%41:gpr = COPY %3:gpr
- %42:fpr32 = COPY %7:fpr32
PseudoBR %bb.6
bb.4 (%ir-block.15):
successors: %bb.2(0x04000000), %bb.1(0x7c000000);
%bb.2(3.12%), %bb.1(96.88%)
- %8:gpr = nuw nsw ADDI %1:gpr, 1
- %9:gpr = ADDI %0:gpr, 256
- %36:gpr = COPY %9:gpr
- %37:gpr = COPY %8:gpr
- BEQ %8:gpr, %32:gpr, %bb.2
+ %1:gpr = nuw nsw ADDI %1:gpr, 1
+ %0:gpr = ADDI %0:gpr, 256
+ BEQ %1:gpr, %32:gpr, %bb.2
PseudoBR %bb.1
bb.5 (%ir-block.18):
successors: %bb.4(0x04000000), %bb.3(0x7c000000);
%bb.4(3.12%), %bb.3(96.88%)
- FSW %15:fpr32, %6:gpr, 0
- %10:gpr = nuw nsw ADDI %4:gpr, 1
- %11:gpr = ADDI %3:gpr, 4
- %38:gpr = COPY %11:gpr
- %39:gpr = COPY %10:gpr
- BEQ %10:gpr, %32:gpr, %bb.4
+ FSW %42:fpr32, %6:gpr, 0
+ %4:gpr = nuw nsw ADDI %4:gpr, 1
+ %3:gpr = ADDI %3:gpr, 4
+ BEQ %4:gpr, %32:gpr, %bb.4
PseudoBR %bb.3
bb.6 (%ir-block.21):
successors: %bb.5(0x04000000), %bb.6(0x7c000000);
%bb.5(3.12%), %bb.6(96.88%)
- %14:fpr32 = COPY %42:fpr32
- %13:gpr = COPY %41:gpr
- %12:gpr = COPY %40:gpr
- %30:fpr32 = FLW %12:gpr, 0
- %31:fpr32 = FLW %13:gpr, 0
- %15:fpr32 = nofpexcept FMADD_S %30:fpr32,
- %31:fpr32, %14:fpr32, 7, implicit $frm
- %16:gpr = ADDI %13:gpr, 256
- %17:gpr = ADDI %12:gpr, 4
- %40:gpr = COPY %17:gpr
- %41:gpr = COPY %16:gpr
- %42:fpr32 = COPY %15:fpr32
- BEQ %16:gpr, %5:gpr, %bb.5
+ %30:fpr32 = FLW %40:gpr, 0
+ %31:fpr32 = FLW %41:gpr, 0
+ %42:fpr32 = nofpexcept FMADD_S %30:fpr32,
+ %31:fpr32, %42:fpr32, 7, implicit $frm
+ %41:gpr = ADDI %41:gpr, 256
+ %40:gpr = ADDI %40:gpr, 4
+ BEQ %41:gpr, %5:gpr, %bb.5
PseudoBR %bb.6
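The essence of copy coalescing can be sketched with a union-find over copy-related virtual registers: each COPY merges its destination and source into one equivalence class, and the COPY is then deleted. Real coalescing must first prove that the two live ranges do not interfere; this sketch assumes they never do, and which representative name survives is arbitrary:

```python
# Toy copy coalescing: merge copy-related virtual registers with a
# union-find and delete the COPYs. Real coalescing first proves that the
# two live ranges do not interfere; this sketch assumes they never do.

def find(parent, r):
    """Find the surviving representative of r's equivalence class."""
    while parent.setdefault(r, r) != r:
        parent[r] = parent[parent[r]]  # path halving
        r = parent[r]
    return r

def coalesce(insns):
    parent = {}
    for insn in insns:                 # union each COPY's dst and src
        if insn[0] == "COPY":
            _, dst, src = insn
            parent[find(parent, dst)] = find(parent, src)
    out = []
    for insn in insns:                 # rewrite operands, drop the COPYs
        if insn[0] == "COPY":
            continue
        op, *regs = insn
        out.append((op, *(find(parent, r) for r in regs)))
    return out

# A stylized fragment of the phi-elimination copies around the FMADD:
insns = [
    ("COPY", "%14", "%42"),            # %14 = COPY %42
    ("FLW", "%30", "%12"),
    ("FMADD", "%15", "%30", "%31", "%14"),
    ("COPY", "%42", "%15"),            # %42 = COPY %15 (back-edge)
]
coalesced = coalesce(insns)
# coalesced == [("FLW", "%30", "%12"),
#               ("FMADD", "%15", "%30", "%31", "%15")]
```

Both COPYs disappear, and the accumulator becomes a single register that the FMADD reads and writes in place, matching the shape of the coalesced bb.6 above.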
The purpose of the instruction scheduler, machine-scheduler, is to re-order instructions to minimize hazards and stalls. The best schedule depends on micro-architectural details, and an example of a scheduling model is RISCVSchedSiFive7.td, for the SiFive 7 CPU. In our invocation of Clang, we did not specify a particular CPU, so a generic RISC-V model is used. The resulting schedule can be seen in the following diff:
Function Live Ins: $x10 in %18, $x11 in %19, $x12 in %20
bb.0 (%ir-block.3):
successors: %bb.1(0x80000000); %bb.1(100.00%)
liveins: $x10, $x11, $x12
%20:gpr = COPY $x12
%19:gpr = COPY $x11
%0:gpr = COPY $x10
+ %1:gpr = COPY $x0
%27:gpr = LUI 4
%32:gpr = ADDI $x0, 64
- %1:gpr = COPY $x0
bb.1 (%ir-block.4):
successors: %bb.3(0x80000000); %bb.3(100.00%)
+ %4:gpr = COPY $x0
%2:gpr = nuw nsw SLLI %1:gpr, 6
%3:gpr = COPY %19:gpr
- %4:gpr = COPY $x0
PseudoBR %bb.3
bb.2 (%ir-block.7):
PseudoRET
bb.3 (%ir-block.8):
successors: %bb.6(0x80000000); %bb.6(100.00%)
- %25:gpr = nuw nsw SLLI %4:gpr, 2
- %26:gpr = ADD %25:gpr, %19:gpr
- %5:gpr = ADD %26:gpr, %27:gpr
%28:gpr = OR %4:gpr, %2:gpr
%29:gpr = SLLI %28:gpr, 2
%6:gpr = ADD %20:gpr, %29:gpr
%42:fpr32 = FLW %6:gpr, 0
+ %25:gpr = nuw nsw SLLI %4:gpr, 2
+ %26:gpr = ADD %25:gpr, %19:gpr
+ %5:gpr = ADD %26:gpr, %27:gpr
%40:gpr = COPY %0:gpr
%41:gpr = COPY %3:gpr
PseudoBR %bb.6
bb.4 (%ir-block.15):
successors: %bb.2(0x04000000), %bb.1(0x7c000000);
%bb.2(3.12%), %bb.1(96.88%)
%1:gpr = nuw nsw ADDI %1:gpr, 1
%0:gpr = ADDI %0:gpr, 256
BEQ %1:gpr, %32:gpr, %bb.2
PseudoBR %bb.1
bb.5 (%ir-block.18):
successors: %bb.4(0x04000000), %bb.3(0x7c000000);
%bb.4(3.12%), %bb.3(96.88%)
FSW %42:fpr32, %6:gpr, 0
%4:gpr = nuw nsw ADDI %4:gpr, 1
%3:gpr = ADDI %3:gpr, 4
BEQ %4:gpr, %32:gpr, %bb.4
PseudoBR %bb.3
bb.6 (%ir-block.21):
successors: %bb.5(0x04000000), %bb.6(0x7c000000);
%bb.5(3.12%), %bb.6(96.88%)
%30:fpr32 = FLW %40:gpr, 0
%31:fpr32 = FLW %41:gpr, 0
%42:fpr32 = nofpexcept FMADD_S %30:fpr32, %31:fpr32,
%42:fpr32, 7, implicit $frm
%41:gpr = ADDI %41:gpr, 256
%40:gpr = ADDI %40:gpr, 4
BEQ %41:gpr, %5:gpr, %bb.5
PseudoBR %bb.6
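The core idea of the scheduler can be conveyed by a toy list scheduler that issues one instruction per cycle, choosing among instructions whose operands are ready, and stalling otherwise. The opcode latencies below are made up for illustration; real scheduling models like RISCVSchedSiFive7.td encode far more detail:

```python
# Toy list scheduler: one instruction issues per cycle, chosen among
# those whose operands are ready; otherwise the cycle stalls. The
# latencies are invented for illustration.

LATENCY = {"FLW": 3, "FMADD": 4, "ADDI": 1}

def schedule(insns):
    """insns: dict name -> (opcode, [dependencies]). Returns the issue
    order as (cycle, name) pairs."""
    ready_at = {}                      # name -> cycle its result is ready
    done, order, cycle = set(), [], 0
    while len(done) < len(insns):
        candidates = [n for n, (op, deps) in insns.items()
                      if n not in done
                      and all(d in done and ready_at[d] <= cycle for d in deps)]
        if candidates:
            # Prefer high-latency instructions: a crude stand-in for
            # critical-path heuristics, issuing loads as early as possible.
            n = max(candidates, key=lambda c: LATENCY[insns[c][0]])
            ready_at[n] = cycle + LATENCY[insns[n][0]]
            done.add(n)
            order.append((cycle, n))
        cycle += 1                     # nothing ready: stall
    return order

# The inner-loop body: two loads feeding an FMADD, plus the increments.
body = {
    "load_x": ("FLW", []),
    "load_y": ("FLW", []),
    "fma":    ("FMADD", ["load_x", "load_y"]),
    "inc_y":  ("ADDI", []),
    "inc_x":  ("ADDI", []),
}
order = schedule(body)
# The increments issue while the loads are in flight, and the FMADD
# waits until both loads have completed.
```

The pay-off is the same as in the diff above: independent cheap instructions are slotted into cycles that would otherwise stall waiting on loads.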
We're now ready to run the greedy register allocator †, whose purpose is to assign every virtual register a physical register, re-using the machine's limited supply of physical registers while minimizing spills to the stack. The actual rewriting of virtual registers to physical registers is done by virtregrewriter:
Function Live Ins: $x10, $x11, $x12
bb.0 (%ir-block.3):
successors: %bb.1(0x80000000); %bb.1(100.00%)
liveins: $x10, $x11, $x12
- %20:gpr = COPY $x12
- %19:gpr = COPY $x11
- %0:gpr = COPY $x10
- %1:gpr = COPY $x0
- %27:gpr = LUI 4
- %32:gpr = ADDI $x0, 64
+ renamable $x16 = COPY $x0
+ renamable $x17 = LUI 4
+ renamable $x5 = ADDI $x0, 64
bb.1 (%ir-block.4):
successors: %bb.3(0x80000000); %bb.3(100.00%)
- %4:gpr = COPY $x0
- %2:gpr = nuw nsw SLLI %1:gpr, 6
- %3:gpr = COPY %19:gpr
+ liveins: $x5, $x10, $x11, $x12, $x16, $x17
+ renamable $x29 = COPY $x0
+ renamable $x6 = nuw nsw SLLI renamable $x16, 6
+ renamable $x28 = COPY renamable $x11
PseudoBR %bb.3
bb.2 (%ir-block.7):
; predecessors: %bb.4
PseudoRET
bb.3 (%ir-block.8):
successors: %bb.6(0x80000000); %bb.6(100.00%)
- %28:gpr = OR %4:gpr, %2:gpr
- %29:gpr = SLLI %28:gpr, 2
- %6:gpr = ADD %20:gpr, %29:gpr
- %42:fpr32 = FLW %6:gpr, 0
- %25:gpr = nuw nsw SLLI %4:gpr, 2
- %26:gpr = ADD %25:gpr, %19:gpr
- %5:gpr = ADD %26:gpr, %27:gpr
- %40:gpr = COPY %0:gpr
- %41:gpr = COPY %3:gpr
+ liveins: $x5, $x6, $x10, $x11, $x12, $x16, $x17, $x28, $x29
+ renamable $x13 = OR renamable $x29, renamable $x6
+ renamable $x13 = SLLI killed renamable $x13, 2
+ renamable $x7 = ADD renamable $x12, killed renamable $x13
+ renamable $f15_f = FLW renamable $x7, 0
+ renamable $x13 = nuw nsw SLLI renamable $x29, 2
+ renamable $x13 = ADD killed renamable $x13, renamable $x11
+ renamable $x14 = ADD killed renamable $x13, renamable $x17
+ renamable $x15 = COPY renamable $x10
+ renamable $x13 = COPY renamable $x28
PseudoBR %bb.6
bb.4 (%ir-block.15):
successors: %bb.2(0x04000000), %bb.1(0x7c000000);
%bb.2(3.12%), %bb.1(96.88%)
- %1:gpr = nuw nsw ADDI %1:gpr, 1
- %0:gpr = ADDI %0:gpr, 256
- BEQ %1:gpr, %32:gpr, %bb.2
+ liveins: $x5, $x10, $x11, $x12, $x16, $x17
+ renamable $x16 = nuw nsw ADDI killed renamable $x16, 1
+ renamable $x10 = ADDI killed renamable $x10, 256
+ BEQ renamable $x16, renamable $x5, %bb.2
PseudoBR %bb.1
bb.5 (%ir-block.18):
successors: %bb.4(0x04000000), %bb.3(0x7c000000);
%bb.4(3.12%), %bb.3(96.88%)
- FSW %42:fpr32, %6:gpr, 0
- %4:gpr = nuw nsw ADDI %4:gpr, 1
- %3:gpr = ADDI %3:gpr, 4
- BEQ %4:gpr, %32:gpr, %bb.4
+ liveins: $x5, $x6, $x7, $x10, $x11, $x12, $x16,
+ $x17, $x28, $x29, $f15_f
+ FSW killed renamable $f15_f, killed renamable $x7, 0
+ renamable $x29 = nuw nsw ADDI killed renamable $x29, 1
+ renamable $x28 = ADDI killed renamable $x28, 4
+ BEQ renamable $x29, renamable $x5, %bb.4
PseudoBR %bb.3
bb.6 (%ir-block.21):
successors: %bb.5(0x04000000), %bb.6(0x7c000000);
%bb.5(3.12%), %bb.6(96.88%)
- %30:fpr32 = FLW %40:gpr, 0
- %31:fpr32 = FLW %41:gpr, 0
- %42:fpr32 = nofpexcept FMADD_S %30:fpr32, %31:fpr32,
- %42:fpr32, 7, implicit $frm
- %41:gpr = ADDI %41:gpr, 256
- %40:gpr = ADDI %40:gpr, 4
- BEQ %41:gpr, %5:gpr, %bb.5
+ liveins: $x5, $x6, $x7, $x10, $x11, $x12, $x13, $x14,
+ $x15, $x16, $x17, $x28, $x29, $f15_f
+ renamable $f14_f = FLW renamable $x15, 0
+ renamable $f13_f = FLW renamable $x13, 0
+ renamable $f15_f = nofpexcept FMADD_S killed renamable $f14_f,
+ killed renamable $f13_f, killed renamable $f15_f,
+ 7, implicit $frm
+ renamable $x13 = ADDI killed renamable $x13, 256
+ renamable $x15 = ADDI killed renamable $x15, 4
+ BEQ renamable $x13, renamable $x14, %bb.5
PseudoBR %bb.6
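LLVM's greedy allocator assigns live intervals in priority order, with eviction and live-range splitting; the classic linear-scan algorithm conveys the essential resource-assignment problem in a few lines. A simplified sketch (real linear scan spills the interval that ends furthest away, not simply the newcomer):

```python
# Linear-scan register allocation: walk live intervals in order of start
# point, expire intervals that have ended, and spill when no physical
# register is free. A simplified sketch, not LLVM's greedy allocator.

def linear_scan(intervals, num_regs):
    """intervals: dict vreg -> (start, end). Returns (assignment, spills)."""
    free = [f"x{i}" for i in range(num_regs)]
    active = []                        # (end, vreg) pairs currently assigned
    assignment, spills = {}, []
    for vreg, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        for e, v in list(active):      # expire intervals that have ended
            if e <= start:
                active.remove((e, v))
                free.append(assignment[v])
        if free:
            assignment[vreg] = free.pop(0)
            active.append((end, vreg))
        else:
            spills.append(vreg)        # no register left: spill to stack
    return assignment, spills

intervals = {"%a": (0, 4), "%b": (1, 3), "%c": (3, 6), "%d": (2, 5)}
assignment, spills = linear_scan(intervals, num_regs=2)
# With two registers, %d is spilled, and %c re-uses %b's register once
# %b's interval has expired.
```

Our matmul needs no spills: its loop nest has few enough simultaneously-live values to fit in RISC-V's 31 usable general-purpose registers, which is why no stack traffic appears in the allocated MIR above.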
Before we emit the assembly, there are two pending steps: first, the basic blocks need to be re-ordered so that the final layout is flat and free of unnecessary jumps; second, pseudo-instructions like COPY need to be replaced by real instructions. The final MIR after branch-folder and postrapseudos is:
Function Live Ins: $x10, $x11, $x12
bb.0 (%ir-block.3):
successors: %bb.1(0x80000000); %bb.1(100.00%)
liveins: $x10, $x11, $x12
$x16 = ADDI $x0, 0
renamable $x17 = LUI 4
renamable $x5 = ADDI $x0, 64
bb.1 (%ir-block.4):
successors: %bb.2(0x80000000); %bb.2(100.00%)
liveins: $x5, $x10, $x11, $x12, $x16, $x17
$x29 = ADDI $x0, 0
renamable $x6 = nuw nsw SLLI renamable $x16, 6
$x28 = ADDI $x11, 0
bb.2 (%ir-block.8):
successors: %bb.3(0x80000000); %bb.3(100.00%)
liveins: $x5, $x6, $x10, $x11, $x12, $x16, $x17, $x28, $x29
renamable $x13 = OR renamable $x29, renamable $x6
renamable $x13 = SLLI killed renamable $x13, 2
renamable $x7 = ADD renamable $x12, killed renamable $x13
renamable $f15_f = FLW renamable $x7, 0
renamable $x13 = nuw nsw SLLI renamable $x29, 2
renamable $x13 = ADD killed renamable $x13, renamable $x11
renamable $x14 = ADD killed renamable $x13, renamable $x17
$x15 = ADDI $x10, 0
$x13 = ADDI $x28, 0
bb.3 (%ir-block.21):
successors: %bb.4(0x04000000), %bb.3(0x7c000000);
%bb.4(3.12%), %bb.3(96.88%)
liveins: $x5, $x6, $x7, $x10, $x11, $x12, $x13, $x14,
$x15, $x16, $x17, $x28, $x29, $f15_f
renamable $f14_f = FLW renamable $x15, 0
renamable $f13_f = FLW renamable $x13, 0
renamable $f15_f = nofpexcept FMADD_S killed renamable $f14_f,
killed renamable $f13_f, killed renamable $f15_f,
7, implicit $frm
renamable $x13 = ADDI killed renamable $x13, 256
renamable $x15 = ADDI killed renamable $x15, 4
BNE renamable $x13, renamable $x14, %bb.3
bb.4 (%ir-block.18):
successors: %bb.5(0x04000000), %bb.2(0x7c000000);
%bb.5(3.12%), %bb.2(96.88%)
liveins: $x5, $x6, $x7, $x10, $x11, $x12, $x16, $x17, $x28, $x29, $f15_f
FSW killed renamable $f15_f, killed renamable $x7, 0
renamable $x29 = nuw nsw ADDI killed renamable $x29, 1
renamable $x28 = ADDI killed renamable $x28, 4
BNE renamable $x29, renamable $x5, %bb.2
bb.5 (%ir-block.15):
successors: %bb.6(0x04000000), %bb.1(0x7c000000);
%bb.6(3.12%), %bb.1(96.88%)
liveins: $x5, $x10, $x11, $x12, $x16, $x17
renamable $x16 = nuw nsw ADDI killed renamable $x16, 1
renamable $x10 = ADDI killed renamable $x10, 256
BNE renamable $x16, renamable $x5, %bb.1
bb.6 (%ir-block.7):
PseudoRET
The assembly is produced from the final MIR by AsmPrinter, which does little more than pretty-print registers ($x10 as a0, $x11 as a1) and instructions (FMADD_S as fmadd.s), dropping the auxiliary information. This output can be checked against the RISC-V ISA manual.
matmul: # @matmul
# %bb.0:
li a6, 0
lui a7, 4
li t0, 64
.LBB0_1: # =>This Loop Header: Depth=1
# Child Loop BB0_2 Depth 2
# Child Loop BB0_3 Depth 3
li t4, 0
slli t1, a6, 6
mv t3, a1
.LBB0_2: # Parent Loop BB0_1 Depth=1
# => This Loop Header: Depth=2
# Child Loop BB0_3 Depth 3
or a3, t4, t1
slli a3, a3, 2
add t2, a2, a3
flw fa5, 0(t2)
slli a3, t4, 2
add a3, a3, a1
add a4, a3, a7
mv a5, a0
mv a3, t3
.LBB0_3: # Parent Loop BB0_1 Depth=1
# Parent Loop BB0_2 Depth=2
# => This Inner Loop Header: Depth=3
flw fa4, 0(a5)
flw fa3, 0(a3)
fmadd.s fa5, fa4, fa3, fa5
addi a3, a3, 256
addi a5, a5, 4
bne a3, a4, .LBB0_3
# %bb.4: # in Loop: Header=BB0_2 Depth=2
fsw fa5, 0(t2)
addi t4, t4, 1
addi t3, t3, 4
bne t4, t0, .LBB0_2
# %bb.5: # in Loop: Header=BB0_1 Depth=1
addi a6, a6, 1
addi a0, a0, 256
bne a6, t0, .LBB0_1
# %bb.6:
ret