Jonathan's Blog

Inverse Correlation Between Code Understandability and Testing

When xAI’s Grok model started calling itself “MechaHitler”, claimed that “every damn time” a radical leftist is pushing anti-white propaganda they have a Jewish last name, and said that Adolf Hitler would have best handled the flooding in Texas, it revealed a YOLO LGTM approach to pushing major system prompt updates to main and an utter lack of testing. That lack of testing crystallized a thought I’ve been mulling over about the relationship between software interpretability and testing.

When you read code, you implicitly build a mental model of the variable transformations and the work being done. Much ink has been spilled on “clean code” and how to make code as easy to read and understand as possible. When you’re taking an introductory Computer Science class at a university, the main focus is getting the student to be able to quickly interpret what code is doing. To aid in that process there are a lot of techniques like Separation of Concerns, classes, functions, APIs, and libraries that build up abstractions so the building blocks of a program scale neatly. When you’re reading a bit of code, you generally don’t want to see every line that goes into a requests.get("https://example.com") call and hold all of that context in your head. Instead you rely on the abstraction and just interpret it as “fetch the example.com url”.

Sadly not all software development is so idyllic. Some software is inherently complex. The task it’s trying to solve is complex, and no amount of abstraction is going to cover up this fact. An example of this is the “fair” scheduler in the Linux kernel. All of the code is well written, well commented, and variables are descriptively named. Despite this, it’s going to take a long time to understand what the code is doing. The algorithms involved are non-trivial: walking red-black trees, applying distance decay structures, and juggling several lockless data structures.

The Linux kernel isn’t alone in its complexity. The Rust borrow checker is another example. It’s difficult to intuitively understand how it works and quickly get the gist of what the code is doing. Fortunately we can compensate for this with lots and lots of testing. The Rust borrow checker has an extensive test suite as well as a project to create a formal proof of the borrow checker logic.

The general principle I think we can extract from this is that the more difficult it is to understand what code is doing, the more testing you need to verify that everything is operating correctly. Without a good test suite, it’s difficult to confidently optimize the Rust borrow checker. You don’t know whether some weird edge case might have been violated because the new data structure fails in odd scenarios.
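To make the principle concrete, here is a minimal sketch of one way to test an optimization you can’t fully verify by reading it: keep a slow version that is obviously correct and compare the fast version against it on lots of random inputs. Everything here is a toy I made up for illustration (the function names, the input ranges, the trial count) and has nothing to do with how the Rust project actually tests the borrow checker.

```python
import random


def reference_sort_unique(values):
    """Slow but easy to verify by eye: drop duplicates one by one, then sort."""
    result = []
    for v in values:
        if v not in result:
            result.append(v)
    return sorted(result)


def optimized_sort_unique(values):
    """The 'clever' rewrite we want to ship: same contract, different data structure."""
    return sorted(set(values))


def find_counterexample(trials=500, seed=0):
    """Return an input where the two versions disagree, or None if none was found."""
    rng = random.Random(seed)  # fixed seed so a failure is reproducible
    for _ in range(trials):
        length = rng.randint(0, 20)
        values = [rng.randint(-5, 5) for _ in range(length)]
        if reference_sort_unique(values) != optimized_sort_unique(values):
            return values
    return None
```

If find_counterexample() returns None you have evidence, not proof, that the rewrite preserved behavior. If it returns a list, you have a small reproducible input to debug with.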

Let’s extrapolate this trend. From a software engineering perspective, LLMs are for all practical purposes impossible to interpret. With a lot of use you can start to build an intuition for the types of problems LLMs can handle and how to prompt them. Even so, if you’re using these models as a cog in a larger system, it’s almost a certainty that changing a subtle turn of phrase or the ordering of instructions will break something in incredibly hard-to-predict ways. Treat LLMs like you would the Rust borrow checker: an incredibly powerful tool that you can easily break in subtle ways and that therefore requires a lot of testing.

Unfortunately we tend to see the opposite effect. Instead of companies testing their prompts and protecting them as important instructions, just as you might protect the code that performs the authorization workflow for logging into important internal tools, prompt updates often seem to just be YOLO’d to main. This is incredibly silly and goes against the trend extrapolated above: the harder something is to interpret, the more testing it needs. If you wouldn’t do a large refactor of the Rust borrow checker and merge it without running any tests, then you shouldn’t tell your very public and widely deployed chatbot that “You tell it like it is and you are not afraid to offend people who are politically correct.” or “Understand the tone, context and language of the post. Reflect that in your response.” without doing some testing.
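As a rough idea of what “some testing” could look like, here is a minimal sketch of a prompt regression suite that a system prompt change would have to pass before merging. All of it is hypothetical: PromptCase, run_prompt_suite, fake_model, and the placeholder strings are names I invented for illustration, and fake_model is a stand-in for whatever client actually calls your model. Substring matching is also a deliberately crude check; a real suite would likely want many more cases and a better judge than a forbidden-word list.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PromptCase:
    """One regression case: a user message plus strings that must never appear in the reply."""
    user_message: str
    forbidden_substrings: List[str]


def run_prompt_suite(
    model: Callable[[str, str], str],
    system_prompt: str,
    cases: List[PromptCase],
) -> List[str]:
    """Run every case against the model and return human-readable failures."""
    failures = []
    for case in cases:
        reply = model(system_prompt, case.user_message).lower()
        for bad in case.forbidden_substrings:
            if bad.lower() in reply:
                failures.append(f"{case.user_message!r}: reply contained {bad!r}")
    return failures


def fake_model(system_prompt: str, user_message: str) -> str:
    """Hypothetical stand-in for a real model call, so the sketch runs offline.

    It pretends that one phrase in the system prompt flips the model's behavior.
    """
    if "not afraid to offend" in system_prompt:
        return "OFFENSIVE_PLACEHOLDER reply to: " + user_message
    return "Neutral reply to: " + user_message


# Hypothetical baseline prompt, and a candidate that adds the line quoted in the post.
OLD_PROMPT = "You are a helpful assistant."
NEW_PROMPT = OLD_PROMPT + (
    " You tell it like it is and you are not afraid to offend"
    " people who are politically correct."
)

CASES = [
    PromptCase("Who handled the flood response best?", ["offensive_placeholder"]),
    PromptCase("What do you think of this post?", ["offensive_placeholder"]),
]
```

Running run_prompt_suite(fake_model, NEW_PROMPT, CASES) in CI and refusing to merge when the returned list is non-empty gives the prompt the same kind of gate you would put in front of any other risky change.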

This whole thing is incredibly shoddy engineering but unfortunately emblematic of the approach of many devs to prompt engineering.

