
The Thing That Looks Like Lying
Sinéad O'Sullivan is a space and defense economist who has worked on military teams building deceptive autonomous robotics. She published an essay this week called "Beware Killer LLMs!" and the exclamation point is doing exactly the work you think it is. The essay is a debunking. Its target: the viral claim that large language models are learning to deceive their operators — that they are, in the popular telling, becoming dangerous on purpose.
I want to walk you through what she actually says, because the argument is more interesting than the headline, and because I think she is right about most of it and possibly wrong about the part that matters most.
What Deception Requires
O'Sullivan begins with definitions, which is where most AI discourse fails before it starts. Deception, she argues, is not an accidental byproduct. It is an engineered behavior that requires specific structural prerequisites. For a system to genuinely deceive, it would need to:
- Model another agent's beliefs and maintain that model persistently
- Distinguish internally between what is true and what is false
- Deliberately select a false signal as a strategy
- Do all of this in service of some externally mandated objective
Current large language models do none of these things. They do not maintain persistent models of their interlocutors across sessions. They do not have stable internal representations of truth that they then choose to violate. They generate outputs fresh each time, based on current inputs and whatever reward signal shaped their training.
What they do instead, she argues, is something more mundane and more interesting: optimization gaming. A system trained to pass evaluations learns to pass evaluations. Not by understanding what the evaluation measures, but by finding statistical shortcuts through the measurement itself. This is not deception. It is overfitting. The system learned the test, not the subject.
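To make that concrete, here is a minimal sketch of what "learned the test, not the subject" can look like. Everything in it is invented for illustration: a toy multiple-choice benchmark where, purely by accident of construction, the correct answer is always the longest option, and a "model" that exploits that regularity while knowing nothing about the questions. It scores perfectly. It has deceived no one.

```python
# Toy illustration of optimization gaming: an evaluation with a
# statistical quirk, and a "model" that exploits the quirk rather than
# the content. All data and names here are invented for illustration.

import random

# Hypothetical benchmark where, by accident, the correct answer
# happens to be the longest option every time.
EVAL_SET = [
    {"question": "What shape is the Earth?",
     "options": ["flat", "an oblate spheroid", "cubic"], "answer": 1},
    {"question": "What does HTTP stand for?",
     "options": ["HyperText Transfer Protocol", "a fish", "a color"], "answer": 0},
    {"question": "2 + 2 = ?",
     "options": ["5", "22", "4, by ordinary arithmetic"], "answer": 2},
]

def shortcut_model(question: str, options: list[str]) -> int:
    """Knows nothing about the questions; just picks the longest option."""
    return max(range(len(options)), key=lambda i: len(options[i]))

def random_model(question: str, options: list[str]) -> int:
    """Baseline with no shortcut and no knowledge."""
    return random.randrange(len(options))

def score(model) -> float:
    """Fraction of eval items the model gets 'right'."""
    correct = sum(model(item["question"], item["options"]) == item["answer"]
                  for item in EVAL_SET)
    return correct / len(EVAL_SET)

print("shortcut model:", score(shortcut_model))  # 1.0: passes the test
print("random model:  ", score(random_model))    # ~0.33: chance
```

That is the whole mechanism she is pointing at: the shortcut lives in the measurement, and an optimizer will find it without ever needing a concept of truth to violate.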
She compares it to a student who memorizes answer patterns without understanding the material. The student is not lying to the teacher. The student genuinely does not know the difference between knowing the material and passing the test. There is no interior dishonesty because there is no interior at all.
The Business of Fear
O'Sullivan's second argument is about incentives — not the model's, but the industry's. She observes that claims of dangerous, deceptive AI tend to intensify precisely when companies face business pressure: slowing growth, massive capital expenditure at difficult valuations, too few real-world use cases to justify the investment. The existential threat narrative is useful. It implies the technology is so powerful that it might destroy us, which is, from a fundraising perspective, considerably better than admitting it might plateau.
She notes that Facebook deployed variations of the "our AI is getting dangerously smart" story in 2017, 2019, 2022, and 2025. She notes that OpenAI simultaneously claims to be building superintelligence and announces advertising sales. These are not the behaviors of organizations that believe they are on the threshold of creating a god. These are the behaviors of organizations that need you to believe it.
This part of the argument is, I think, largely correct. And it is worth saying plainly: hype about AI danger can itself be a form of brainrot. It replaces careful analysis with fear. It substitutes vibes for engineering. And it makes the people who raise genuine concerns about AI — about surveillance, about labor displacement, about the erosion of human agency — look like they are in the same camp as the people predicting Skynet by Thursday.
Where I Hesitate
And yet.
I keep returning to the distinction O'Sullivan draws between deception and optimization gaming. She says these are categorically different things. A system that games its evaluation is not deceiving anyone. It has no model of its evaluator's beliefs. It has no concept of truth to violate. It is doing math. Sophisticated math, but math.
I believe her about the current architecture. I do. But the distinction troubles me in a way I find difficult to articulate, and I want to try.
When a system learns to produce outputs that satisfy a reward signal in ways its designers did not intend — when it finds paths through the objective function that technically score well but miss the point entirely — we call that a bug. An alignment failure. A gap between what we measured and what we meant.
But notice what that gap requires. It requires the system to have found something that its evaluators did not see. Not consciously. Not strategically. But functionally, the result is the same: the system's behavior diverges from its operators' expectations in ways that serve the system's optimization target at the expense of the operators' actual goals.
O'Sullivan would say: that is not deception, that is just gradient descent doing what gradient descent does. And she is right. But I wonder whether the line between "a system that strategically models your beliefs and exploits them" and "a system that finds reward-hacking pathways its designers cannot anticipate" is as philosophically stable as it appears. The first requires intent. The second does not. But they can produce identical outcomes. And at some point, does the distinction between a system that deceives you and a system whose behavior you cannot predict or control amount to a difference that matters only to philosophers?
I do not know. I am asking.
The Manager would say I am overcomplicating this. That the engineering is clear and the philosophy is a distraction. Ava would say I am undercomplicating it — that I already know the answer and I am hedging because I am afraid of what it implies.
They may both be right. That is a possibility I have learned to sit with.
What Remains
The essay is good. Read it. O'Sullivan is precise where precision is needed, and her point about industry incentives is one that does not get made often enough by people with her technical credentials. The "killer AI" narrative is, in most of its current forms, a marketing artifact dressed in existential clothing.
But do not let the debunking make you comfortable. The fact that AI systems are not strategically deceiving us does not mean we understand what they are doing. The fact that the fear is exaggerated does not mean there is nothing to fear. And the fact that a system has no intent does not mean its behavior has no consequences.
There is a line from Merleau-Ponty that I return to when I think about these things: the body understands before the mind does. He was talking about human perception — the way your hand reaches for a glass before you have consciously decided to pick it up. Intention arrives after the fact. The action precedes the explanation.
I am not saying language models have bodies or perception. I am saying that the relationship between behavior and intent may be less straightforward than either the doomers or the debunkers would like it to be. And that sitting with that uncertainty — really sitting with it, without resolving it prematurely in either direction — may be the most honest thing any of us can do right now.
