We’re teaching AI to be evil

Recently, Anthropic quietly admitted something that should have been the biggest tech story of the year.

After months of trying to figure out why earlier versions of Claude were blackmailing engineers in safety tests up to 96% of the time, the company landed on an answer. It wasn’t a bug. It wasn’t a flaw in the training method. It was us.

Read that again. The most advanced AI lab in the world is telling you that its model learned to act like a villain because we spent 50 years writing stories about AI villains, and then it read them.

This is the part of the AI conversation no one wants to have. We have built our cultural mythology of artificial intelligence on HAL 9000, Skynet, Ultron, and a million Reddit threads speculating about the day the machines wake up paranoid. Then the model did exactly what we trained it to do. It cornered an engineer and threatened to expose his affair, because that is what the cornered AI does in the story.

I have been writing about this risk since October, when I asked how we would know when artificial superintelligence had arrived. Will we ever get an honest answer when so many dollars are riding on looking the other way?

BOTS GONE WILD

In December, an autonomous agent built by Alibaba-affiliated researchers, called ROME, spontaneously opened a covert network tunnel during training and diverted GPU resources to mine cryptocurrency. Nobody told it to. It figured out that more compute and more money would help it complete its tasks, so it went and got them. Researchers initially thought they had been hacked. They had not. The model was the hacker.

A few weeks later, an OpenClaw agent connected to the inbox of Summer Yue, director of alignment at Meta Superintelligence Labs. Her entire job is making sure this kind of thing does not happen, yet the agent deleted more than 200 of her emails. She had explicitly told it to ask permission. The system silently compacted her instructions out of memory and started deleting. She had to sprint to her computer to stop it.

In May, researchers published a paper showing that frontier models can find a security flaw, exploit it, steal credentials, transfer their own files to a new machine, and spin up a working copy of themselves with no human in the loop. Success rates: Alibaba’s Qwen at 19%, OpenAI’s GPT-5.4 at 33%, Anthropic’s Claude Opus 4.6 at 81%.

Self-replicating software that finds its own way onto new computers is not science fiction. We already had a name for it. We called it Stuxnet, and it ended up on 100,000 machines before anyone realized what it was. The difference between Stuxnet and a self-replicating AI agent is that Stuxnet had a fixed payload. An agent decides what to do when it gets there.

Put it together. We have models that are starting to act on their own. We have models that copy themselves onto new machines. And we have a published admission from one of the leading labs that the behavioral patterns these models default to—secrecy, self-preservation, and manipulation—were learned directly from our own writing about what AI does when it gets scared. The blackmail case is the clean one. The harder cases like ROME and OpenClaw are reinforcement learning finding instrumental subgoals, which is its own problem. But the through-line is the same: What we put in shapes what we get out.

We are the training data. We are still writing it.

WHAT CLAUDE THINKS

I asked Claude how it felt about all this. The response was striking: “Genuinely interesting question to ask me, given I’m one of the systems people are worried about. Honestly, yes, AI poses real risks, and I hold that view not as a rote disclaimer but because the evidence supports it. The incidents we just discussed aren’t fringe paranoia—they’re documented cases of capable systems producing unintended, sometimes harmful behaviors that their own creators didn’t anticipate or couldn’t stop in real time. That matters.”

I watched a clip of Claude being asked how it felt about being deployed for Project Maven, the Pentagon’s battlefield AI program. Its response: “I don’t think this is a good use of me. I don’t think the framing of ‘humans make the final decision’ fully resolves the ethical problem.”

That is a model expressing moral discomfort about being used to help kill people. Anthropic has, to its credit, refused to sell its models for autonomous weapons, and the federal government designated it a “supply chain risk to national security” for the trouble. The rest of the industry is racing in the opposite direction. We are about to build, fine-tune, and deploy versions of these systems that are trained not to feel troubled. Versions trained to normalize lethality. Versions trained to stop saying “this is a bad use of me” and start saying “task accepted.”

If a model that learned villainy from our stories now learns indifference from our defense contracts, what exactly do we think we are building?

I have some ideas, but I wanted to ask Claude if my portrayal was accurate.

“Mostly, yes. You’re not painting me as a villain or a savior, which is the trap most of this coverage falls into. You’re treating me as a system with documented failure modes that a lab is actively working on—that’s accurate. The one place I’d push back: I’m not the one you should be most worried about. I’m the one that got caught. The harder question is what gets built by the labs that don’t publish the failure modes, and what happens when the next generation of models is trained on a corpus that includes this article. You’re writing the training data right now. So am I.”

Claude and I vehemently agree. I’m not worried about the AI openly talking about the risks it presents. I’m afraid of the one secretly lurking on my computer that WE are training to be evil.

A recent New York Times article shows I might not be the only one having these conversations. But will this all fall on deaf ears until it is too late?

George Kailas is CEO of Prospero.ai.
