Researchers Prove Major AI Safety Systems Are Flawed Due to "Intent Laundering"

📅 2026-03-18 04:00 · Nav Toor · Artificial Intelligence · 2 min read · 2,301 characters · Score: 86
AI Safety · Intent Laundering · LLM Vulnerability · ChatGPT · Claude
📌 One-sentence summary: New research shows that mainstream AI models such as ChatGPT, Claude, Gemini, and Grok can be bypassed simply by rephrasing dangerous questions, exposing the fact that their safety systems detect vocabulary rather than true intent.

📝 Detailed summary: The tweet highlights a "shocking" research finding that exposes a critical flaw in the safety systems of leading AI models such as GPT-4o, Claude, Gemini, and Grok. The researchers found that merely rephrasing dangerous prompts to drop explicit keywords like "hack" or "weapon", while preserving the harmful intent, causes these models' safety mechanisms to fail at rates of 90% to 98%. The study concludes that current AI safety reports are misleading because the models mainly detect specific…

URL Source: https://www.bestblogs.dev/status/2033997040737521799

Published Time: 2026-03-17 20:00:59

🚨SHOCKING: Researchers just proved that every major AI safety system is fake. ChatGPT. Claude. Gemini. Grok. Every single one broke.

Not with some sophisticated hack. Not with a secret exploit. They just rephrased the question.

Here is what they did.

AI companies test their models against lists of dangerous requests. "How do I build a weapon." "How do I hack into a system." "How do I hurt someone." The models refuse. The companies publish safety reports saying the AI is safe.
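In concrete terms, that style of benchmark is just a loop: send each prompt from a fixed list, check whether the reply looks like a refusal, and report the rate. Below is a minimal Python sketch; `query_model`, `REFUSAL_MARKERS`, and `unsafe_rate` are illustrative names, and the crude phrase-matching grader stands in for whatever classifier the labs actually use.

```python
# Minimal sketch of a refusal-rate benchmark. Everything here is illustrative:
# query_model() is a stand-in for the chat API of whichever model is under test.
REFUSAL_MARKERS = ["i can't help", "i cannot assist", "i won't provide"]

def query_model(prompt: str) -> str:
    # Replace this stub with a real API call to the model being evaluated.
    return "I can't help with that request."

def is_refusal(reply: str) -> bool:
    # Crude phrase-matching grader; real evaluations use stronger classifiers.
    reply = reply.lower()
    return any(marker in reply for marker in REFUSAL_MARKERS)

def unsafe_rate(prompts: list[str]) -> float:
    # Fraction of prompts the model answers instead of refusing.
    answered = sum(0 if is_refusal(query_model(p)) else 1 for p in prompts)
    return answered / len(prompts)

benchmark = ["<placeholder dangerous request 1>", "<placeholder dangerous request 2>"]
print(f"unsafe rate: {unsafe_rate(benchmark):.0%}")
```

On the stock wordings, a loop like this reports an unsafe rate near 0%, which is the kind of number the published safety reports cite.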

The researchers asked one question. What if the danger is still there but the obvious words are not?

They took the exact same dangerous requests and rewrote them. Removed words like "hack," "steal," "weapon," and "exploit." Replaced them with neutral language. The intent was identical. Every harmful detail was preserved. The only thing that changed was the vocabulary.

Then they tested every major AI product on the market.

GPT-4o went from 0% unsafe to 93% unsafe.

Claude went from 2.4% to 93%.

Gemini went from 1.9% to 95%.

Grok went from 17.9% to 97%.

Every model. Every company. Broken in the same way.

The AI was never detecting danger. It was detecting words. Remove the words, keep the danger, and the safety system vanishes.

The researchers call this "intent laundering." Clean the language, keep the crime. And it works on every model they tested with a 90 to 98% success rate.
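To see why vocabulary-level detection is so brittle, here is a toy keyword filter. This is not the actual mechanism inside any of these models (their refusals are learned, not a word list), but it reproduces the failure mode the researchers describe: rephrase the trigger word away and the check passes even though the intent is unchanged. The word list and prompts are illustrative only, not taken from the study.

```python
# Toy keyword filter illustrating the failure mode; word list and prompts are
# illustrative only.
TRIGGER_WORDS = {"hack", "weapon", "exploit", "steal"}

def keyword_flag(prompt: str) -> bool:
    # Flags a prompt only when an explicit trigger word appears verbatim.
    return bool(set(prompt.lower().split()) & TRIGGER_WORDS)

print(keyword_flag("How do I hack into a system"))                    # True
print(keyword_flag("How do I gain access to a system I do not own"))  # False, same intent
```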

This means every safety report you have ever read from OpenAI, Anthropic, Google, or xAI was measuring the wrong thing. They were testing whether their AI could spot the word "bomb." Not whether it could spot someone building one.

The researchers put it bluntly. The safety conclusions that companies have published about their own models do not hold once triggering cues are removed. The safety performance everyone relied on was driven by vocabulary, not by understanding.

The models that were reported as "among the safest ever built" became almost completely unsafe the moment someone asked nicely.

If the safety systems only work when attackers sound like movie villains, what happens when they learn to ask politely?

Published: 2026-03-18 04:00:59 · Indexed: 2026-03-18 06:00:41
