A tour of the LLVM backend

Any modern compiler has three main components: a front-end that parses the source language and translates it into an intermediate representation (IR); a middle-end that operates on this IR, performing mostly target-independent optimizations with a bit of target-specific information; and a backend that takes the final optimized IR and emits target-specific assembly. In this article, we illustrate the operation of the main components of a compiler's backend, namely instruction selection, instruction scheduling, and register allocation, using LLVM as the reference compiler. LLVM is especially well-suited for the purposes of illustration, as it has, in addition to a clean human-readable middle-end IR, a relatively clean human-readable backend IR. We use the RISC-V target for concrete illustration, chosen once again for its clean design.
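Concretely, in LLVM, the front-end and middle-end are driven by clang, while the backend stages can be exercised separately through llc, LLVM's command-line code generator. Schematically, the division of labor looks as follows (a sketch; the file names are placeholders, and the -mattr feature string mirrors -march=rv64gc):

clang --target=riscv64 -march=rv64gc -O3 -emit-llvm -S input.c -o input.ll  # front-end + middle-end
llc -mtriple=riscv64 -mattr=+m,+a,+f,+d,+c input.ll -o input.s              # backend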

Let us start with a simple C program as our guiding example, and have a look at each step of the backend lowering, after the entire middle-end optimization pipeline has run, all the way down to the final assembly.

#define N 64

void matmul(const float *restrict x, const float *restrict y,
            float *restrict res) {
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      for (int k = 0; k < N; ++k)
        res[N * i + j] += x[N * i + k] * y[N * k + j];
}

First, we run the front-end and the full middle-end optimization pipeline to obtain the optimized LLVM IR that's handed off to the backend. Running clang --target=riscv64 -march=rv64gc -O3 -emit-llvm -S emits:

define void @matmul(ptr noalias %x, ptr noalias %y, ptr noalias %res) {
entry:
  br label %ph

ph:
  %iv.outer = phi i64 [ 0, %entry ], [ %iv.outer.next, %common.exit ]
  %shl = shl nuw nsw i64 %iv.outer, 6
  br label %common.ph

exit:
  ret void

common.ph:
  %iv.middle = phi i64 [ 0, %ph ], [ %iv.middle.next, %inner.loop.exit ]
  %or = or disjoint i64 %iv.middle, %shl
  %gep.res = getelementptr inbounds float, ptr %res, i64 %or
  %load.res = load float, ptr %gep.res
  %gep.y = getelementptr float, ptr %y, i64 %iv.middle
  br label %inner.loop

common.exit:
  %iv.outer.next = add nuw nsw i64 %iv.outer, 1
  %exit.cond.outer = icmp eq i64 %iv.outer.next, 64
  br i1 %exit.cond.outer, label %exit, label %ph

inner.loop.exit:
  store float %conv.next, ptr %gep.res
  %iv.middle.next = add nuw nsw i64 %iv.middle, 1
  %exit.cond.middle = icmp eq i64 %iv.middle.next, 64
  br i1 %exit.cond.middle, label %common.exit, label %common.ph

inner.loop:
  %iv = phi i64 [ 0, %common.ph ], [ %iv.next, %inner.loop ]
  %conv = phi float [ %load.res, %common.ph ], [ %conv.next, %inner.loop ]
  %or.inner = or disjoint i64 %iv, %shl
  %gep.x = getelementptr inbounds float, ptr %x, i64 %or.inner
  %load.x = load float, ptr %gep.x
  %shl.iv = shl nuw nsw i64 %iv, 6
  %gep.y.2 = getelementptr float, ptr %gep.y, i64 %shl.iv
  %load.y = load float, ptr %gep.y.2
  %conv.next = tail call float @llvm.fmuladd.f32(
               float %load.x, float %load.y, float %conv)
  %iv.next = add nuw nsw i64 %iv, 1
  %exit.cond = icmp eq i64 %iv.next, 64
  br i1 %exit.cond, label %inner.loop.exit, label %inner.loop
}

The astute reader will note that we have strategically omitted the vector extension of RISC-V (we use -march=rv64gc instead of -march=rv64gcv) to keep LLVM's vectorization passes from unnecessarily complicating the exposition. Auto-vectorization has been discussed previously.

The first thing to notice is that the body of the function is cleanly separated into basic blocks: entry, ph, common.ph, inner.loop, inner.loop.exit, common.exit, and exit. Next, notice that the IR is in static single-assignment (SSA) form, which means that variables like %iv are only ever defined once. Since the loop must increment its induction variable, a fresh %iv.next variable is created, and %iv is a phi node that selects between the value 0, when entering inner.loop from common.ph, and %iv.next, when arriving via the inner loop's back-edge. All the instructions (load, getelementptr, store, or, shl, etc.) are target-independent instructions that are used throughout the LLVM middle-end. Target-specific instructions will only appear after instruction selection.
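To distill the idea of SSA, here is a minimal stand-alone loop (a constructed example, not taken from our matmul IR): %i has exactly one definition, a phi merging the initial value 0 from entry with %i.next coming around the back-edge.

define i64 @count() {
entry:
  br label %loop

loop:
  ; the only definition of %i is this phi
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %i.next = add nuw nsw i64 %i, 1
  %done = icmp eq i64 %i.next, 64
  br i1 %done, label %exit, label %loop

exit:
  ret i64 %i.next
}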

The first step in the lowering process is instruction selection. We can look at the output of this step by running llc with the --print-after-isel flag; assuming the optimized IR above is saved as matmul.ll, a typical invocation (flag spellings can vary slightly across LLVM versions) is:
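llc -mtriple=riscv64 -mattr=+m,+a,+f,+d,+c --print-after-isel matmul.ll -o /dev/null

This prints the machine IR just after the instruction selection pass: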

Function Live Ins: $x10 in %18, $x11 in %19, $x12 in %20

bb.0 (%ir-block.3):
  successors: %bb.1(0x80000000); %bb.1(100.00%)
  liveins: $x10, $x11, $x12
  %20:gpr = COPY $x12
  %19:gpr = COPY $x11
  %18:gpr = COPY $x10
  %22:gpr = COPY $x0
  %21:gpr = COPY %22:gpr

bb.1 (%ir-block.4):
  successors: %bb.3(0x80000000); %bb.3(100.00%)
  %0:gpr = PHI %18:gpr, %bb.0, %9:gpr, %bb.4
  %1:gpr = PHI %21:gpr, %bb.0, %8:gpr, %bb.4
  %2:gpr = nuw nsw SLLI %1:gpr, 6
  %24:gpr = COPY $x0
  %23:gpr = COPY %24:gpr
  PseudoBR %bb.3

bb.2 (%ir-block.7):
  PseudoRET

bb.3 (%ir-block.8):
  successors: %bb.6(0x80000000); %bb.6(100.00%)
  %3:gpr = PHI %19:gpr, %bb.1, %11:gpr, %bb.5
  %4:gpr = PHI %23:gpr, %bb.1, %10:gpr, %bb.5
  %25:gpr = nuw nsw SLLI %4:gpr, 2
  %26:gpr = ADD killed %25:gpr, %19:gpr
  %27:gpr = LUI 4
  %5:gpr = ADD killed %26:gpr, killed %27:gpr
  %28:gpr = OR %4:gpr, %2:gpr
  %29:gpr = SLLI killed %28:gpr, 2
  %6:gpr = ADD %20:gpr, killed %29:gpr
  %7:fpr32 = FLW %6:gpr, 0
  PseudoBR %bb.6

bb.4 (%ir-block.15):
  successors: %bb.2(0x04000000), %bb.1(0x7c000000);
              %bb.2(3.12%), %bb.1(96.88%)
  %8:gpr = nuw nsw ADDI %1:gpr, 1
  %9:gpr = ADDI %0:gpr, 256
  %33:gpr = ADDI $x0, 64
  BEQ %8:gpr, killed %33:gpr, %bb.2
  PseudoBR %bb.1

bb.5 (%ir-block.18):
  successors: %bb.4(0x04000000), %bb.3(0x7c000000);
              %bb.4(3.12%), %bb.3(96.88%)
  FSW %15:fpr32, %6:gpr, 0
  %10:gpr = nuw nsw ADDI %4:gpr, 1
  %11:gpr = ADDI %3:gpr, 4
  %32:gpr = ADDI $x0, 64
  BEQ %10:gpr, killed %32:gpr, %bb.4
  PseudoBR %bb.3

bb.6 (%ir-block.21):
  successors: %bb.5(0x04000000), %bb.6(0x7c000000);
              %bb.5(3.12%), %bb.6(96.88%)
  %12:gpr = PHI %0:gpr, %bb.3, %17:gpr, %bb.6
  %13:gpr = PHI %3:gpr, %bb.3, %16:gpr, %bb.6
  %14:fpr32 = PHI %7:fpr32, %bb.3, %15:fpr32, %bb.6
  %30:fpr32 = FLW %12:gpr, 0
  %31:fpr32 = FLW %13:gpr, 0
  %15:fpr32 = nofpexcept FMADD_S killed %30:fpr32,
              killed %31:fpr32, %14:fpr32, 7, implicit $frm
  %16:gpr = ADDI %13:gpr, 256
  %17:gpr = ADDI %12:gpr, 4
  BEQ %16:gpr, %5:gpr, %bb.5
  PseudoBR %bb.6

What we see is a relatively straightforward translation of the middle-end LLVM IR to machine IR (MIR), preserving the basic-block structure and SSA form. The purpose of instruction selection is to select a combination of instructions available on the target that best implements the middle-end semantics. Its two primary functions are to replace middle-end SSA variables with virtual registers, and to replace middle-end instructions with target-specific instructions. For instance, the getelementptr feeding the load of res has become an SLLI by 2 (scaling the index by sizeof(float)) followed by an ADD of the base pointer, and the load itself has become an FLW.

Each virtual register is annotated with a register class, either gpr (general-purpose register) or fpr32 (32-bit floating-point register). There are also physical registers in this MIR: $x0, $x10, $x11, $x12, and $frm. Per the RISC-V calling convention, the three arguments to the function arrive in the general-purpose registers $x10, $x11, and $x12 (pretty-printed as a0, a1, and a2). $x0 is a special register on RISC-V that always holds the value 0, and $frm is the floating-point rounding-mode register. None of these can be virtual registers. A minor point to notice is the annotations on the registers, killed and implicit. The operands of the FMADD_S, %30:fpr32 and %31:fpr32, are marked as killed, meaning that this is the last use of those registers (these annotations are preliminary: the final ones will be determined at a later stage). The register $frm is marked as implicit, meaning that FMADD_S reads it even though it is not an explicit operand; this operand will be absent from the final assembly. The virtual registers must be assigned physical registers by the register allocator before the final assembly can be emitted.

Middle-end instructions like @llvm.fmuladd have been replaced by target-specific instructions like FMADD_S (pretty-printed as fmadd.s), and several pseudo-instructions like PseudoBR, PseudoRET, and COPY have been inserted in the process. The pseudo-instructions must be lowered to real instructions before emitting the final assembly.

There is some minor additional information in this MIR: the live-ins, i.e. the registers that are live on entry to the function, and branch-probability information: for instance, bb.6 is its own successor 96.88% of the time, reflecting the fact that the inner loop iterates many times per entry.

Although the instruction selection infrastructure, SelectionDAG †, is target-independent, we can see that our MIR is clearly specialized for RISC-V. The registers available on RISC-V are defined in RISCVRegisterInfo.td, the RISC-V instructions themselves are described in RISCVInstrInfo.td, and the target-specific C++ code that customizes how middle-end constructs are lowered through SelectionDAG lives in RISCVISelLowering.cpp.
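To give a flavor of these TableGen files, here are slightly simplified excerpts (paraphrased; consult the actual .td files for the precise definitions):

// RISCVRegisterInfo.td: x10 is integer register number 10, with ABI name a0
def X10 : RISCVReg<10, "x10", ["a0"]>, DwarfRegNum<[10]>;

// RISCVInstrInfo.td: addi is a register-immediate ALU instruction...
def ADDI : ALU_ri<0b000, "addi">;

// ...and a selection pattern maps the target-independent add node onto it
// whenever the right-hand operand is a 12-bit immediate
def : PatGprSimm12<add, ADDI>;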

In order to understand the various steps after instruction selection, let us inspect the diff produced by each pass by running llc with -print-changed=diff.
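For instance (again assuming the IR is saved as matmul.ll):

llc -mtriple=riscv64 -mattr=+m,+a,+f,+d,+c -print-changed=diff matmul.ll -o /dev/null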

The first significant pass to run after instruction selection is early-machinelicm, a backend version of the middle-end loop-invariant code motion transform; in our example, it hoists the constant materializations LUI 4 and ADDI $x0, 64 out of the loop-nest and into bb.0, which is where they appear in the next diff. Next, we have livevars, which analyzes live variables, and aggressively marks registers as killed.

phi-node-elimination runs next, which removes the PHI nodes by inserting pseudo COPY instructions into the predecessor blocks, taking the MIR out of SSA form (note, for instance, that %36 is now defined in both bb.0 and bb.4):

  Function Live Ins: $x10 in %18, $x11 in %19, $x12 in %20

  bb.0 (%ir-block.3):
   successors: %bb.1(0x80000000); %bb.1(100.00%)
   liveins: $x10, $x11, $x12
   %20:gpr = COPY killed $x12
   %19:gpr = COPY killed $x11
   %18:gpr = COPY killed $x10
   %22:gpr = COPY $x0
   %21:gpr = COPY killed %22:gpr
   %27:gpr = LUI 4
   %32:gpr = ADDI $x0, 64
+  %36:gpr = COPY killed %18:gpr
+  %37:gpr = COPY killed %21:gpr

 bb.1 (%ir-block.4):
   successors: %bb.3(0x80000000); %bb.3(100.00%)
-  %0:gpr = PHI %18:gpr, %bb.0, %9:gpr, %bb.4
-  %1:gpr = PHI %21:gpr, %bb.0, %8:gpr, %bb.4
+  %1:gpr = COPY killed %37:gpr
+  %0:gpr = COPY killed %36:gpr
   %2:gpr = nuw nsw SLLI %1:gpr, 6
   %24:gpr = COPY $x0
   %23:gpr = COPY killed %24:gpr
+  %38:gpr = COPY %19:gpr
+  %39:gpr = COPY killed %23:gpr
   PseudoBR %bb.3

 bb.2 (%ir-block.7):
   PseudoRET

 bb.3 (%ir-block.8):
   successors: %bb.6(0x80000000); %bb.6(100.00%)
-  %3:gpr = PHI %19:gpr, %bb.1, %11:gpr, %bb.5
-  %4:gpr = PHI %23:gpr, %bb.1, %10:gpr, %bb.5
+  %4:gpr = COPY killed %39:gpr
+  %3:gpr = COPY killed %38:gpr
   %25:gpr = nuw nsw SLLI %4:gpr, 2
   %26:gpr = ADD killed %25:gpr, %19:gpr
   %5:gpr = ADD killed %26:gpr, %27:gpr
   %28:gpr = OR %4:gpr, %2:gpr
   %29:gpr = SLLI killed %28:gpr, 2
   %6:gpr = ADD %20:gpr, killed %29:gpr
   %7:fpr32 = FLW %6:gpr, 0
+  %40:gpr = COPY %0:gpr
+  %41:gpr = COPY %3:gpr
+  %42:fpr32 = COPY killed %7:fpr32
   PseudoBR %bb.6

 bb.4 (%ir-block.15):
   successors: %bb.2(0x04000000), %bb.1(0x7c000000);
               %bb.2(3.12%), %bb.1(96.88%)
   %8:gpr = nuw nsw ADDI killed %1:gpr, 1
   %9:gpr = ADDI killed %0:gpr, 256
-  BEQ %8:gpr, %32:gpr, %bb.2
+  %36:gpr = COPY killed %9:gpr
+  %37:gpr = COPY %8:gpr
+  BEQ killed %8:gpr, %32:gpr, %bb.2
   PseudoBR %bb.1

 bb.5 (%ir-block.18):
   successors: %bb.4(0x04000000), %bb.3(0x7c000000);
               %bb.4(3.12%), %bb.3(96.88%)
   FSW killed %15:fpr32, killed %6:gpr, 0
   %10:gpr = nuw nsw ADDI killed %4:gpr, 1
   %11:gpr = ADDI killed %3:gpr, 4
-  BEQ %10:gpr, %32:gpr, %bb.4
+  %38:gpr = COPY killed %11:gpr
+  %39:gpr = COPY %10:gpr
+  BEQ killed %10:gpr, %32:gpr, %bb.4
   PseudoBR %bb.3

 bb.6 (%ir-block.21):
   successors: %bb.5(0x04000000), %bb.6(0x7c000000);
               %bb.5(3.12%), %bb.6(96.88%)
-  %12:gpr = PHI %0:gpr, %bb.3, %17:gpr, %bb.6
-  %13:gpr = PHI %3:gpr, %bb.3, %16:gpr, %bb.6
-  %14:fpr32 = PHI %7:fpr32, %bb.3, %15:fpr32, %bb.6
+  %14:fpr32 = COPY killed %42:fpr32
+  %13:gpr = COPY killed %41:gpr
+  %12:gpr = COPY killed %40:gpr
   %30:fpr32 = FLW %12:gpr, 0
   %31:fpr32 = FLW %13:gpr, 0
   %15:fpr32 = nofpexcept FMADD_S killed %30:fpr32,
               killed %31:fpr32, killed %14:fpr32,
               7, implicit $frm
   %16:gpr = ADDI killed %13:gpr, 256
   %17:gpr = ADDI killed %12:gpr, 4
-  BEQ %16:gpr, %5:gpr, %bb.5
+  %40:gpr = COPY killed %17:gpr
+  %41:gpr = COPY %16:gpr
+  %42:fpr32 = COPY %15:fpr32
+  BEQ killed %16:gpr, %5:gpr, %bb.5
   PseudoBR %bb.6

The next pass, liveintervals, undoes many of the killed annotations inserted by livevars, performing a more sophisticated analysis to determine where each virtual register in the loop-nest is actually killed.

The final preparatory step before instruction scheduling and register allocation is register-coalescer, which eliminates many of the COPY instructions by merging the live ranges of the source and destination registers:

 Function Live Ins: $x10 in %18, $x11 in %19, $x12 in %20

 bb.0 (%ir-block.3):
   successors: %bb.1(0x80000000); %bb.1(100.00%)
   liveins: $x10, $x11, $x12
   %20:gpr = COPY $x12
   %19:gpr = COPY $x11
-  %18:gpr = COPY $x10
-  %22:gpr = COPY $x0
-  %21:gpr = COPY %22:gpr
+  %0:gpr = COPY $x10
   %27:gpr = LUI 4
   %32:gpr = ADDI $x0, 64
-  %36:gpr = COPY %18:gpr
-  %37:gpr = COPY %21:gpr
+  %1:gpr = COPY $x0

 bb.1 (%ir-block.4):
   successors: %bb.3(0x80000000); %bb.3(100.00%)
-  %1:gpr = COPY %37:gpr
-  %0:gpr = COPY %36:gpr
   %2:gpr = nuw nsw SLLI %1:gpr, 6
-  %24:gpr = COPY $x0
-  %23:gpr = COPY %24:gpr
-  %38:gpr = COPY %19:gpr
-  %39:gpr = COPY %23:gpr
+  %3:gpr = COPY %19:gpr
+  %4:gpr = COPY $x0
   PseudoBR %bb.3

 bb.2 (%ir-block.7):
   PseudoRET

 bb.3 (%ir-block.8):
   successors: %bb.6(0x80000000); %bb.6(100.00%)
-  %4:gpr = COPY %39:gpr
-  %3:gpr = COPY %38:gpr
   %25:gpr = nuw nsw SLLI %4:gpr, 2
   %26:gpr = ADD %25:gpr, %19:gpr
   %5:gpr = ADD %26:gpr, %27:gpr
   %28:gpr = OR %4:gpr, %2:gpr
   %29:gpr = SLLI %28:gpr, 2
   %6:gpr = ADD %20:gpr, %29:gpr
-  %7:fpr32 = FLW %6:gpr, 0
+  %42:fpr32 = FLW %6:gpr, 0
   %40:gpr = COPY %0:gpr
   %41:gpr = COPY %3:gpr
-  %42:fpr32 = COPY %7:fpr32
   PseudoBR %bb.6

 bb.4 (%ir-block.15):
   successors: %bb.2(0x04000000), %bb.1(0x7c000000);
               %bb.2(3.12%), %bb.1(96.88%)
-  %8:gpr = nuw nsw ADDI %1:gpr, 1
-  %9:gpr = ADDI %0:gpr, 256
-  %36:gpr = COPY %9:gpr
-  %37:gpr = COPY %8:gpr
-  BEQ %8:gpr, %32:gpr, %bb.2
+  %1:gpr = nuw nsw ADDI %1:gpr, 1
+  %0:gpr = ADDI %0:gpr, 256
+  BEQ %1:gpr, %32:gpr, %bb.2
   PseudoBR %bb.1

 bb.5 (%ir-block.18):
   successors: %bb.4(0x04000000), %bb.3(0x7c000000);
               %bb.4(3.12%), %bb.3(96.88%)
-  FSW %15:fpr32, %6:gpr, 0
-  %10:gpr = nuw nsw ADDI %4:gpr, 1
-  %11:gpr = ADDI %3:gpr, 4
-  %38:gpr = COPY %11:gpr
-  %39:gpr = COPY %10:gpr
-  BEQ %10:gpr, %32:gpr, %bb.4
+  FSW %42:fpr32, %6:gpr, 0
+  %4:gpr = nuw nsw ADDI %4:gpr, 1
+  %3:gpr = ADDI %3:gpr, 4
+  BEQ %4:gpr, %32:gpr, %bb.4
   PseudoBR %bb.3

 bb.6 (%ir-block.21):
   successors: %bb.5(0x04000000), %bb.6(0x7c000000);
               %bb.5(3.12%), %bb.6(96.88%)
-  %14:fpr32 = COPY %42:fpr32
-  %13:gpr = COPY %41:gpr
-  %12:gpr = COPY %40:gpr
-  %30:fpr32 = FLW %12:gpr, 0
-  %31:fpr32 = FLW %13:gpr, 0
-  %15:fpr32 = nofpexcept FMADD_S %30:fpr32,
-              %31:fpr32, %14:fpr32, 7, implicit $frm
-  %16:gpr = ADDI %13:gpr, 256
-  %17:gpr = ADDI %12:gpr, 4
-  %40:gpr = COPY %17:gpr
-  %41:gpr = COPY %16:gpr
-  %42:fpr32 = COPY %15:fpr32
-  BEQ %16:gpr, %5:gpr, %bb.5
+  %30:fpr32 = FLW %40:gpr, 0
+  %31:fpr32 = FLW %41:gpr, 0
+  %42:fpr32 = nofpexcept FMADD_S %30:fpr32,
+              %31:fpr32, %42:fpr32, 7, implicit $frm
+  %41:gpr = ADDI %41:gpr, 256
+  %40:gpr = ADDI %40:gpr, 4
+  BEQ %41:gpr, %5:gpr, %bb.5
   PseudoBR %bb.6

The purpose of the instruction scheduler, machine-scheduler, is to re-order instructions to minimize hazards and stalls. A good schedule depends on micro-architectural details, which LLVM encodes in per-CPU scheduling models; an example is RISCVSchedSiFive7.td, for the SiFive 7 series of cores. In our invocation of Clang, we did not specify a particular CPU, so a generic RISC-V model is used.
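To schedule for a specific core instead, one would pass -mcpu (sifive-u74 below is one of the RISC-V CPUs LLVM knows about; the available names vary across LLVM versions):

llc -mtriple=riscv64 -mcpu=sifive-u74 --print-after=machine-scheduler matmul.ll -o /dev/null

With the generic model, the scheduler's re-ordering shows up in the following diff: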

 Function Live Ins: $x10 in %18, $x11 in %19, $x12 in %20

 bb.0 (%ir-block.3):
   successors: %bb.1(0x80000000); %bb.1(100.00%)
   liveins: $x10, $x11, $x12
   %20:gpr = COPY $x12
   %19:gpr = COPY $x11
   %0:gpr = COPY $x10
+  %1:gpr = COPY $x0
   %27:gpr = LUI 4
   %32:gpr = ADDI $x0, 64
-  %1:gpr = COPY $x0

 bb.1 (%ir-block.4):
   successors: %bb.3(0x80000000); %bb.3(100.00%)
+  %4:gpr = COPY $x0
   %2:gpr = nuw nsw SLLI %1:gpr, 6
   %3:gpr = COPY %19:gpr
-  %4:gpr = COPY $x0
   PseudoBR %bb.3

 bb.2 (%ir-block.7):
   PseudoRET

 bb.3 (%ir-block.8):
   successors: %bb.6(0x80000000); %bb.6(100.00%)
-  %25:gpr = nuw nsw SLLI %4:gpr, 2
-  %26:gpr = ADD %25:gpr, %19:gpr
-  %5:gpr = ADD %26:gpr, %27:gpr
   %28:gpr = OR %4:gpr, %2:gpr
   %29:gpr = SLLI %28:gpr, 2
   %6:gpr = ADD %20:gpr, %29:gpr
   %42:fpr32 = FLW %6:gpr, 0
+  %25:gpr = nuw nsw SLLI %4:gpr, 2
+  %26:gpr = ADD %25:gpr, %19:gpr
+  %5:gpr = ADD %26:gpr, %27:gpr
   %40:gpr = COPY %0:gpr
   %41:gpr = COPY %3:gpr
   PseudoBR %bb.6

 bb.4 (%ir-block.15):
   successors: %bb.2(0x04000000), %bb.1(0x7c000000);
               %bb.2(3.12%), %bb.1(96.88%)
   %1:gpr = nuw nsw ADDI %1:gpr, 1
   %0:gpr = ADDI %0:gpr, 256
   BEQ %1:gpr, %32:gpr, %bb.2
   PseudoBR %bb.1

 bb.5 (%ir-block.18):
   successors: %bb.4(0x04000000), %bb.3(0x7c000000);
               %bb.4(3.12%), %bb.3(96.88%)
   FSW %42:fpr32, %6:gpr, 0
   %4:gpr = nuw nsw ADDI %4:gpr, 1
   %3:gpr = ADDI %3:gpr, 4
   BEQ %4:gpr, %32:gpr, %bb.4
   PseudoBR %bb.3

 bb.6 (%ir-block.21):
   successors: %bb.5(0x04000000), %bb.6(0x7c000000);
               %bb.5(3.12%), %bb.6(96.88%)
   %30:fpr32 = FLW %40:gpr, 0
   %31:fpr32 = FLW %41:gpr, 0
   %42:fpr32 = nofpexcept FMADD_S %30:fpr32, %31:fpr32,
               %42:fpr32, 7, implicit $frm
   %41:gpr = ADDI %41:gpr, 256
   %40:gpr = ADDI %40:gpr, 4
   BEQ %41:gpr, %5:gpr, %bb.5
   PseudoBR %bb.6

We're now ready to run the greedy register allocator †, whose purpose is to assign every virtual register a physical register, re-using the limited number of physical registers on the machine while minimizing spills to the stack (our loop-nest is small enough that nothing needs to be spilled).
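greedy is merely the default: llc lets us swap in one of LLVM's other allocators via a flag, a convenient way to compare their decisions (again assuming matmul.ll):

llc -mtriple=riscv64 -mattr=+m,+a,+f,+d,+c -regalloc=basic matmul.ll -o matmul.s  # also: fast, greedy, pbqp

The actual rewriting of virtual registers to their assigned physical registers is done by virtregrewriter: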

 Function Live Ins: $x10, $x11, $x12

 bb.0 (%ir-block.3):
   successors: %bb.1(0x80000000); %bb.1(100.00%)
   liveins: $x10, $x11, $x12
-  %20:gpr = COPY $x12
-  %19:gpr = COPY $x11
-  %0:gpr = COPY $x10
-  %1:gpr = COPY $x0
-  %27:gpr = LUI 4
-  %32:gpr = ADDI $x0, 64
+  renamable $x16 = COPY $x0
+  renamable $x17 = LUI 4
+  renamable $x5 = ADDI $x0, 64

 bb.1 (%ir-block.4):
   successors: %bb.3(0x80000000); %bb.3(100.00%)
-  %4:gpr = COPY $x0
-  %2:gpr = nuw nsw SLLI %1:gpr, 6
-  %3:gpr = COPY %19:gpr
+  liveins: $x5, $x10, $x11, $x12, $x16, $x17
+  renamable $x29 = COPY $x0
+  renamable $x6 = nuw nsw SLLI renamable $x16, 6
+  renamable $x28 = COPY renamable $x11
   PseudoBR %bb.3

 bb.2 (%ir-block.7):
 ; predecessors: %bb.4

   PseudoRET

 bb.3 (%ir-block.8):
   successors: %bb.6(0x80000000); %bb.6(100.00%)
-  %28:gpr = OR %4:gpr, %2:gpr
-  %29:gpr = SLLI %28:gpr, 2
-  %6:gpr = ADD %20:gpr, %29:gpr
-  %42:fpr32 = FLW %6:gpr, 0
-  %25:gpr = nuw nsw SLLI %4:gpr, 2
-  %26:gpr = ADD %25:gpr, %19:gpr
-  %5:gpr = ADD %26:gpr, %27:gpr
-  %40:gpr = COPY %0:gpr
-  %41:gpr = COPY %3:gpr
+  liveins: $x5, $x6, $x10, $x11, $x12, $x16, $x17, $x28, $x29
+  renamable $x13 = OR renamable $x29, renamable $x6
+  renamable $x13 = SLLI killed renamable $x13, 2
+  renamable $x7 = ADD renamable $x12, killed renamable $x13
+  renamable $f15_f = FLW renamable $x7, 0
+  renamable $x13 = nuw nsw SLLI renamable $x29, 2
+  renamable $x13 = ADD killed renamable $x13, renamable $x11
+  renamable $x14 = ADD killed renamable $x13, renamable $x17
+  renamable $x15 = COPY renamable $x10
+  renamable $x13 = COPY renamable $x28
   PseudoBR %bb.6

 bb.4 (%ir-block.15):
   successors: %bb.2(0x04000000), %bb.1(0x7c000000);
               %bb.2(3.12%), %bb.1(96.88%)
-  %1:gpr = nuw nsw ADDI %1:gpr, 1
-  %0:gpr = ADDI %0:gpr, 256
-  BEQ %1:gpr, %32:gpr, %bb.2
+  liveins: $x5, $x10, $x11, $x12, $x16, $x17
+  renamable $x16 = nuw nsw ADDI killed renamable $x16, 1
+  renamable $x10 = ADDI killed renamable $x10, 256
+  BEQ renamable $x16, renamable $x5, %bb.2
   PseudoBR %bb.1

 bb.5 (%ir-block.18):
   successors: %bb.4(0x04000000), %bb.3(0x7c000000);
               %bb.4(3.12%), %bb.3(96.88%)
-  FSW %42:fpr32, %6:gpr, 0
-  %4:gpr = nuw nsw ADDI %4:gpr, 1
-  %3:gpr = ADDI %3:gpr, 4
-  BEQ %4:gpr, %32:gpr, %bb.4
+  liveins: $x5, $x6, $x7, $x10, $x11, $x12, $x16,
+           $x17, $x28, $x29, $f15_f
+  FSW killed renamable $f15_f, killed renamable $x7, 0
+  renamable $x29 = nuw nsw ADDI killed renamable $x29, 1
+  renamable $x28 = ADDI killed renamable $x28, 4
+  BEQ renamable $x29, renamable $x5, %bb.4
   PseudoBR %bb.3

 bb.6 (%ir-block.21):
   successors: %bb.5(0x04000000), %bb.6(0x7c000000);
               %bb.5(3.12%), %bb.6(96.88%)
-  %30:fpr32 = FLW %40:gpr, 0
-  %31:fpr32 = FLW %41:gpr, 0
-  %42:fpr32 = nofpexcept FMADD_S %30:fpr32, %31:fpr32,
-              %42:fpr32, 7, implicit $frm
-  %41:gpr = ADDI %41:gpr, 256
-  %40:gpr = ADDI %40:gpr, 4
-  BEQ %41:gpr, %5:gpr, %bb.5
+  liveins: $x5, $x6, $x7, $x10, $x11, $x12, $x13, $x14,
+           $x15, $x16, $x17, $x28, $x29, $f15_f
+  renamable $f14_f = FLW renamable $x15, 0
+  renamable $f13_f = FLW renamable $x13, 0
+  renamable $f15_f = nofpexcept FMADD_S killed renamable $f14_f,
+                     killed renamable $f13_f, killed renamable $f15_f,
+                     7, implicit $frm
+  renamable $x13 = ADDI killed renamable $x13, 256
+  renamable $x15 = ADDI killed renamable $x15, 4
+  BEQ renamable $x13, renamable $x14, %bb.5
   PseudoBR %bb.6

Before we emit the assembly, there are two pending steps: first, the basic blocks need to be re-ordered so that the final layout is flat and free of unnecessary jumps; second, pseudo-instructions like COPY need to be lowered to real instructions (on RISC-V, a COPY between integer registers becomes an ADDI with immediate 0). Both are visible below: each BEQ-plus-unconditional-branch pair has been replaced by a single BNE with a fall-through, and the remaining COPYs have become ADDIs. The final MIR after branch-folder and postrapseudos is:

Function Live Ins: $x10, $x11, $x12

bb.0 (%ir-block.3):
  successors: %bb.1(0x80000000); %bb.1(100.00%)
  liveins: $x10, $x11, $x12
  $x16 = ADDI $x0, 0
  renamable $x17 = LUI 4
  renamable $x5 = ADDI $x0, 64

bb.1 (%ir-block.4):
  successors: %bb.2(0x80000000); %bb.2(100.00%)
  liveins: $x5, $x10, $x11, $x12, $x16, $x17
  $x29 = ADDI $x0, 0
  renamable $x6 = nuw nsw SLLI renamable $x16, 6
  $x28 = ADDI $x11, 0

bb.2 (%ir-block.8):
  successors: %bb.3(0x80000000); %bb.3(100.00%)
  liveins: $x5, $x6, $x10, $x11, $x12, $x16, $x17, $x28, $x29
  renamable $x13 = OR renamable $x29, renamable $x6
  renamable $x13 = SLLI killed renamable $x13, 2
  renamable $x7 = ADD renamable $x12, killed renamable $x13
  renamable $f15_f = FLW renamable $x7, 0
  renamable $x13 = nuw nsw SLLI renamable $x29, 2
  renamable $x13 = ADD killed renamable $x13, renamable $x11
  renamable $x14 = ADD killed renamable $x13, renamable $x17
  $x15 = ADDI $x10, 0
  $x13 = ADDI $x28, 0

bb.3 (%ir-block.21):
  successors: %bb.4(0x04000000), %bb.3(0x7c000000);
              %bb.4(3.12%), %bb.3(96.88%)
  liveins: $x5, $x6, $x7, $x10, $x11, $x12, $x13, $x14,
           $x15, $x16, $x17, $x28, $x29, $f15_f
  renamable $f14_f = FLW renamable $x15, 0
  renamable $f13_f = FLW renamable $x13, 0
  renamable $f15_f = nofpexcept FMADD_S killed renamable $f14_f,
                     killed renamable $f13_f, killed renamable $f15_f,
                     7, implicit $frm
  renamable $x13 = ADDI killed renamable $x13, 256
  renamable $x15 = ADDI killed renamable $x15, 4
  BNE renamable $x13, renamable $x14, %bb.3

bb.4 (%ir-block.18):
  successors: %bb.5(0x04000000), %bb.2(0x7c000000);
              %bb.5(3.12%), %bb.2(96.88%)
  liveins: $x5, $x6, $x7, $x10, $x11, $x12, $x16, $x17, $x28, $x29, $f15_f
  FSW killed renamable $f15_f, killed renamable $x7, 0
  renamable $x29 = nuw nsw ADDI killed renamable $x29, 1
  renamable $x28 = ADDI killed renamable $x28, 4
  BNE renamable $x29, renamable $x5, %bb.2

bb.5 (%ir-block.15):
  successors: %bb.6(0x04000000), %bb.1(0x7c000000);
              %bb.6(3.12%), %bb.1(96.88%)
  liveins: $x5, $x10, $x11, $x12, $x16, $x17
  renamable $x16 = nuw nsw ADDI killed renamable $x16, 1
  renamable $x10 = ADDI killed renamable $x10, 256
  BNE renamable $x16, renamable $x5, %bb.1

bb.6 (%ir-block.7):
  PseudoRET

The assembly is produced from the final MIR by AsmPrinter, which does little more than pretty-print the registers ($x10 as a0, $x11 as a1) and the instructions (FMADD_S as fmadd.s), dropping auxiliary information along the way. This output can be checked against the RISC-V ISA manual.

matmul:                                 # @matmul
# %bb.0:
  li a6, 0
  lui   a7, 4
  li t0, 64
.LBB0_1:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB0_2 Depth 2
                                        #       Child Loop BB0_3 Depth 3
  li t4, 0
  slli  t1, a6, 6
  mv t3, a1
.LBB0_2:                                #   Parent Loop BB0_1 Depth=1
                                        # =>  This Loop Header: Depth=2
                                        #       Child Loop BB0_3 Depth 3
  or a3, t4, t1
  slli  a3, a3, 2
  add   t2, a2, a3
  flw   fa5, 0(t2)
  slli  a3, t4, 2
  add   a3, a3, a1
  add   a4, a3, a7
  mv a5, a0
  mv a3, t3
.LBB0_3:                                #   Parent Loop BB0_1 Depth=1
                                        #     Parent Loop BB0_2 Depth=2
                                        # =>    This Inner Loop Header: Depth=3
  flw   fa4, 0(a5)
  flw   fa3, 0(a3)
  fmadd.s fa5, fa4, fa3, fa5
  addi  a3, a3, 256
  addi  a5, a5, 4
  bne   a3, a4, .LBB0_3
# %bb.4:                                #   in Loop: Header=BB0_2 Depth=2
  fsw   fa5, 0(t2)
  addi  t4, t4, 1
  addi  t3, t3, 4
  bne   t4, t0, .LBB0_2
# %bb.5:                                #   in Loop: Header=BB0_1 Depth=1
  addi  a6, a6, 1
  addi  a0, a0, 256
  bne   a6, t0, .LBB0_1
# %bb.6:
  ret
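As a final sanity check, the printed assembly can be fed back through llvm-mc, LLVM's standalone assembler, to confirm that every instruction encodes cleanly (the -mattr string again enables the floating-point extensions that flw, fsw, and fmadd.s require):

llvm-mc -triple=riscv64 -mattr=+f,+d -filetype=obj matmul.s -o matmul.o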