We argue that, underneath all the large-language model hype, there is a pedestrian technology that has the potential to automate a class of problems that classical software can never do.
To study the capabilities of large-language models in the context of writing code, we study two extreme cases. First, the application to a critical project with high cognitive complexity, consisting of 40 million lines of code, and second, the application to a simple from-scratch toy project.
For the first study, I tried incorporating the technology into everyday-contributions to LLVM. The suggestions have over a 95% reject-rate, and the good suggestions appear when you're editing one instance of a mechanical pattern, and want to change all instances.
For example, consider the following transformation:
if (match(I, m_c_Xor(
m_CombineOr(m_ZextOrTrunc(m_Specific(P1)), m_Specific(P1)),
m_CombineOr(m_ZextOrTrunc(m_Specific(P1)), m_Specific(P2)))))
to:
if (match(I, m_c_Xor(m_CastOrSelf(m_Specific(P1)),
m_CastOrSelf(m_Specific(P2)))))
This can, in theory, be done by a find-and-replace with a regex, but nobody ever does that for a couple instances.
As another example, consider updating argument names and corresponding comments like:
/// Checks that \p L and \p R are used together in an XOR in the use-def chain
/// of \p SI's condition, ignoring any casts. The purpose of this function is to
/// ensure that LHSAux from the SimpleRecurrence is used correctly in the CRC
/// computation. We cannot check the correctness of casts at this point, and
/// rely on the KnownBits propagation to check correctness of the CRC
/// computation.
static bool isConditionalOnXorOfPHIs(const SelectInst *SI, const PHINode *L,
const PHINode *R, const Loop &Lp);
to:
/// Checks that \p P1 and \p P2 are used together in an XOR in the use-def chain
/// of \p SI's condition, ignoring any casts. The purpose of this function is to
/// ensure that LHSAux from the SimpleRecurrence is used correctly in the CRC
/// computation. We cannot check the correctness of casts at this point, and
/// rely on the KnownBits propagation to check correctness of the CRC
/// computation.
static bool isConditionalOnXorOfPHIs(const SelectInst *SI, const PHINode *P1,
const PHINode *P2, const Loop &L);
A find-and-replace would match the L in the Loop argument, erroneously replacing it with P1.
You could say that all the automation boils down to a smart and automatic find-and-replace in this study, but it is important to note this class of problems can never be automated by classical software.
For the second study, I tried applying the technology to writing a tree-sitter based LLVM IR parser. The entire task is a mechanical chore of reading docs and ample examples in the test suite, and encoding the knowledge in the parser. To study the limitations, I tried generating the entire parser using a prompt, but the result was entirely unusable. When writing the parser from a clean-slate, the suggestions were actually quite good. Consider the auto-complete with the following snippet:
binary_instruction: ($) =>
to:
binary_instruction: ($) =>
seq($._binary_op, $._type, $.ssa_value, $.ssa_value),
_binary_op: () =>
token(
choice(
"add",
"sub",
"mul",
"div",
"mod",
"shl",
"shr",
"and",
"or",
"xor",
),
),
call_instruction: ($) =>
seq(
optional("tail"),
"call",
$._type,
$.atname,
"(",
optional($.fn_args),
")",
),
compare_instruction: ($) =>
seq($._compare_op, $._type, $.ssa_value, $.ssa_value),
_compare_op: () =>
token(
choice(
"eq",
"ne",
"slt",
"sgt",
"sle",
"sge",
"ult",
"ugt",
"ule",
"uge",
),
),
cast_instruction: ($) =>
seq($._cast_op, $._type, $.ssa_value, optional(seq("to", $._type))),
_cast_op: () =>
token(
choice(
"trunc",
"zext",
"sext",
"fptrunc",
"fpext",
"fp_to_ui",
"fp_to_si",
"uitofp",
"sitofp",
"ptrtoint",
"inttoptr",
"bitcast",
),
),
extract_instruction: ($) =>
seq(token("extractvalue"), $._type, $.ssa_value, "[", $.ssa_value, "]"),
insert_instruction: ($) =>
seq(
token("insertvalue"),
$._type,
$.ssa_value,
$.ssa_value,
"[",
$.ssa_value,
"]",
),
load_instruction: ($) => seq(token("load"), $._type, $.ssa_value),
phi_instruction: ($) =>
seq(token("phi"), $._type, $.ssa_value, $.ssa_value),
store_instruction: ($) =>
seq(token("store"), $._type, $.ssa_value, $.ssa_value),
switch_instruction: ($) =>
seq(token("switch"), $._type, $.ssa_value, $.ssa_value),
branch_instruction: ($) => seq(token("br"), $._type, $.ssa_value),
return_instruction: ($) => seq(token("ret"), $._type, $.ssa_value),
unreachable_instruction: ($) => token("unreachable"),
which is actually a smarter version of a copy-paste with some knowledge from the internet. Of course the code needs editing, but this is the kind of copy-pasting you'd normally do before making edits.
The technology was also able to generate 300 tests to exercise the parser, and the tests needed little tweaking to pass: this task can be boiled down to copying reduced-case LLVM IR from the test-suite, and matching them with parser nodes. Of course the generated tests aren't high-quality, but about half the tests are usable as-is, and a lot of tedium has been eliminated.
To conclude, the technology is at alpha-stage, as the automation comes at the cost of putting up with bad visual feedback and fighting the auto-complete nearly all the time, but promises a new class of mechanical automation, whose utility is higher in small codebases following mechanical patterns.