Published on 2025-07-18
Interesting pre-print paper on arXiv comparing modern LLM “scheming” research to chimp language research in the 1970s. The authors draw out the parallels between the two areas and argue that AI “scheming” and safety research shares many of the characteristics that led to the downfall of chimp language research. The main issues: the evidence is largely anecdotal, there are no control conditions, and the researchers play fast and loose with what they’re even testing, without ever pinning down a rigorous definition of scheming.
The paper articulates, much more clearly than I have, some of the frustrations I’ve felt with AI doomers: they lack empirical validation for many of their core claims. AI doomers seem to spend a lot of time theorizing about possible strategies for containing AI models or solving the alignment problem, without ever going out and doing solid research that produces at least something quantifiable.
There’s an abundance of “AI experts” out there, like Eliezer Yudkowsky, who haven’t done a lick of real work or research in any area of AI alignment or safety. After 20 years, you’d think they could put out at least a pre-print on arXiv with something approximating empirical evidence. But no, we keep going around and around with the same tired talking points about utility function misalignment. This paper is a call to do better. If AI alignment really is an incredibly important issue with implications for the future of humanity, that should make the research all the more rigorous. So far, we just haven’t seen that.
I also wonder if there are parallels with the credibility revolution in economics. Economics used to suffer from a similar problem: lots of tidy, elegant theories without empirical evidence to back them up. Then came the credibility revolution (Wikipedia), which transformed economics into a much more rigorous, evidence-based discipline, and the field is broadly better for it. We need something similar for AI scheming research. Computer science is already quite good about quantifying improvements; AI scheming research should follow the same pattern.