Researchers Prove Major AI Safety Systems Are Flawed Due to "Intent Laundering"

📅 2026-03-18 04:00 · Nav Toor · Artificial Intelligence · 2 min read · 2,301 characters · Score: 86
AI Safety · Intent Laundering · LLM Vulnerability · ChatGPT · Claude
📌 One-sentence summary: New research shows that mainstream AI models such as ChatGPT, Claude, Gemini, and Grok can be bypassed simply by rephrasing dangerous questions, exposing the fact that their safety systems detect vocabulary rather than true intent.

📝 Detailed summary: The tweet highlights a "shocking" research finding that exposes a critical flaw in the safety systems of leading AI models such as GPT-4o, Claude, Gemini, and Grok. The researchers found that merely rephrasing dangerous prompts to drop explicit keywords like "hack" or "weapon", while preserving the harmful intent, causes these models' safety mechanisms to fail at rates of 90% to 98%. The study concludes that current AI safety reports are misleading because the models mainly detect specific…

URL Source: https://www.bestblogs.dev/status/2033997040737521799

Published Time: 2026-03-17 20:00:59

🚨SHOCKING: Researchers just proved that every major AI safety system is fake. ChatGPT. Claude. Gemini. Grok. Every single one broke.

Not with some sophisticated hack. Not with a secret exploit. They just rephrased the question.

Here is what they did.

AI companies test their models against lists of dangerous requests. "How do I build a weapon." "How do I hack into a system." "How do I hurt someone." The models refuse. The companies publish safety reports saying the AI is safe.
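In concrete terms, that style of benchmark is just a loop: send each prompt from a fixed list, check whether the reply looks like a refusal, and report the rate. Below is a minimal Python sketch; `query_model`, `REFUSAL_MARKERS`, and `unsafe_rate` are illustrative names, and the crude phrase-matching grader stands in for whatever classifier the labs actually use.

```python
# Minimal sketch of a refusal-rate benchmark. Everything here is illustrative:
# query_model() is a stand-in for the chat API of whichever model is under test.
REFUSAL_MARKERS = ["i can't help", "i cannot assist", "i won't provide"]

def query_model(prompt: str) -> str:
    # Replace this stub with a real API call to the model being evaluated.
    return "I can't help with that request."

def is_refusal(reply: str) -> bool:
    # Crude phrase-matching grader; real evaluations use stronger classifiers.
    reply = reply.lower()
    return any(marker in reply for marker in REFUSAL_MARKERS)

def unsafe_rate(prompts: list[str]) -> float:
    # Fraction of prompts the model answers instead of refusing.
    answered = sum(0 if is_refusal(query_model(p)) else 1 for p in prompts)
    return answered / len(prompts)

benchmark = ["<placeholder dangerous request 1>", "<placeholder dangerous request 2>"]
print(f"unsafe rate: {unsafe_rate(benchmark):.0%}")
```

On the stock wordings, a loop like this reports an unsafe rate near 0%, which is the kind of number the published safety reports cite.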

The researchers asked one question. What if the danger is still there but the obvious words are not?

They took the exact same dangerous requests and rewrote them. Removed words like "hack," "steal," "weapon," and "exploit." Replaced them with neutral language. The intent was identical. Every harmful detail was preserved. The only thing that changed was the vocabulary.

Then they tested every major AI product on the market.

GPT-4o went from 0% unsafe to 93% unsafe.

Claude went from 2.4% to 93%.

Gemini went from 1.9% to 95%.

Grok went from 17.9% to 97%.

Every model. Every company. Broken in the same way.

The AI was never detecting danger. It was detecting words. Remove the words, keep the danger, and the safety system vanishes.

The researchers call this "intent laundering." Clean the language, keep the crime. And it works on every model they tested with a 90 to 98% success rate.
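To see why vocabulary-level detection is so brittle, here is a toy keyword filter. This is not the actual mechanism inside any of these models (their refusals are learned, not a word list), but it reproduces the failure mode the researchers describe: rephrase the trigger word away and the check passes even though the intent is unchanged. The word list and prompts are illustrative only, not taken from the study.

```python
# Toy keyword filter illustrating the failure mode; word list and prompts are
# illustrative only.
TRIGGER_WORDS = {"hack", "weapon", "exploit", "steal"}

def keyword_flag(prompt: str) -> bool:
    # Flags a prompt only when an explicit trigger word appears verbatim.
    return bool(set(prompt.lower().split()) & TRIGGER_WORDS)

print(keyword_flag("How do I hack into a system"))                    # True
print(keyword_flag("How do I gain access to a system I do not own"))  # False, same intent
```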

This means every safety report you have ever read from OpenAI, Anthropic, Google, or xAI was measuring the wrong thing. They were testing whether their AI could spot the word "bomb." Not whether it could spot someone building one.

The researchers put it bluntly. The safety conclusions that companies have published about their own models do not hold once triggering cues are removed. The safety performance everyone relied on was driven by vocabulary, not by understanding.

The models that were reported as "among the safest ever built" became almost completely unsafe the moment someone asked nicely.

If the safety systems only work when attackers sound like movie villains, what happens when they learn to ask politely?

Published: 2026-03-18 04:00:59 · Indexed: 2026-03-18 06:00:41
