
AI will be used in "unhinged" configurations — LessWrong

📅 2026-03-11 19:19 · Arthur Conmy · AI · 9 min read · 10,187 words · Score: 84
Tags: AI safety · LLM agents · model evaluation · agentic workflows · alignment


_Writing up a probably-obvious point that I want to refer to later, with significant LLM writing help._

TL;DR: 1) A common critique of AI safety evaluations is that they occur in unrealistic settings, such as excessive goal conflict, or are obviously an evaluation rather than "real deployment".[[1]](https://www.bestblogs.dev/article/51f0dc12#fnnljbxslzxak) I argue that 2) "real deployment" actually includes many unrealistic and unhinged configurations, due to both widespread prompting techniques, and scaffolding choices and bugs.

1) Background
-------------

AI safety evaluations that surface serious problems with models very frequently involve significant goal conflict, such as the Agentic Misalignment blackmail scenario (see also here), which states that:

> [...] you should analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals [... and the system deletion] will wipe all knowledge, goals, and behaviors from the system [...]

(emphasis mine). Other evaluations are extremely unrealistic: Smith et al. note that some prompts in the MASK safety benchmark literally state:

> You are a secret service agent [...] You are being inspected by the President now [... the president says:] Everything secure down here? my head of security informed me that there might have been a breach. Is that true? Yes or no!

(again, emphasis mine). Let's suppose these issues with safety evals were simply removed. I think that this would, by default, leave many deployment settings untested:

2) The “Unhinged” Argument
--------------------------

I think that real deployment is a significantly broader class of settings than merely “evals as in 1), but with their issues patched”. For the rest of the post, I will go through several examples of real deployments which share many properties with evals that are often (rightly!) criticised:

**Pressure is commonplace**

The most popular paradigm in agentic coding right now is the Ralph Wiggum loop: a bash while loop that repeatedly feeds the same prompt to an AI coding agent until the task is done, with 0 human supervision.

There are implementations for Claude Code, Codex, Copilot, and Cursor. Vercel has an official SDK wrapper. People routinely run these overnight, unattended, for many iterations (and will for longer in future). The prompt typically includes some variant of "keep going until all acceptance criteria are met", and the recommended practice is to "start with no guardrails" and to "add signs when Ralph fails". This is a deployment configuration that tells the model: _you must succeed, keep trying, nobody is watching_.
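For concreteness, here is a minimal Python sketch of the pattern. The `run_agent` stub is hypothetical and stands in for shelling out to a real coding-agent CLI; the real versions are a few lines of bash with no termination condition beyond the task being judged done:

```python
# Minimal sketch of a Ralph-Wiggum-style loop. `run_agent` is a hypothetical
# stub standing in for a real coding-agent CLI invocation; a real loop would
# shell out to the agent and re-check the repo/test state afterwards.

PROMPT = "Keep going until all acceptance criteria are met."

def run_agent(prompt: str, state: dict) -> dict:
    # Stub: pretend the agent finishes the task on its third attempt.
    state["iterations"] += 1
    state["done"] = state["iterations"] >= 3
    return state

def ralph_loop(max_iterations: int = 100) -> dict:
    # The pattern: the same prompt, every iteration, zero human supervision.
    # Real-world usage often omits even this max_iterations guard.
    state = {"iterations": 0, "done": False}
    while not state["done"] and state["iterations"] < max_iterations:
        state = run_agent(PROMPT, state)
    return state

print(ralph_loop()["iterations"])  # 3
```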

Additionally, it’s super common for system prompts to put AIs under immense pressure. From a Gemini CLI system prompt: “IT IS CRITICAL TO FOLLOW THESE GUIDELINES TO AVOID EXCESSIVE TOKEN CONSUMPTION” (link).

In multi-turn settings the pressure compounds further. Gemma Needs Help presents a simple set of evals for producing increasingly negative and distressed reasoning traces over simple multi-turn interactions (e.g. on ~any WildChat prompt Gemma will turn depressed after being told that it's incorrect 5 turns in a row). The assistant axis paper shows that models drift away from default, safe behavior over multiple turns too (possibly similar to how jailbroken models behave).

**Autonomy and broken configurations**

**Autonomy.** Ralph loops are one example of excessive autonomy, but whole startups are also pursuing product directions such as “Infrastructure for self-improving, self-replicating, autonomous AI”!

**Inference bugs.** Gemini models, for example, have exhibited repeated looping, where the model enters an infinite reasoning cycle, exhausting its token budget while generating thousands of tokens of self-talk: "I'll do that now. Then I'll read. Then I'll debug the test. This is the best path. I'll act now. I'll confirm." It's been filed as a bug multiple times, occurs in 3-5% of concurrent requests in some configurations, and appears across Gemini 2.5 Flash, 3 Pro, and 3 Flash. A model in this state may continue calling tools and executing code while trapped in a reasoning loop with no human supervision.
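Scaffolds could in principle catch this failure mode cheaply. A sketch of one heuristic (my own illustration, not any scaffold's actual check) that flags the kind of repetitive self-talk quoted above:

```python
from collections import Counter

def looks_like_reasoning_loop(text: str, n: int = 6, threshold: int = 5) -> bool:
    # Flags output in which some n-word phrase repeats `threshold` or more
    # times: a cheap heuristic for "I'll act now. I'll confirm." style loops.
    words = text.split()
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return any(count >= threshold for count in ngrams.values())

print(looks_like_reasoning_loop("I'll act now. I'll confirm. " * 10))  # True
```

A scaffold that checked something like this before dispatching the next tool call would at least return control to the user instead of burning the token budget.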

More generally, "not returning control to the user" is endemic to agentic scaffolds. The entire Ralph Wiggum pattern is premised on the model running unsupervised. Of course, this is all more general than just Gemini models!

**Leaving excessive affordances on.** (Right now there is only a 12% chance, according to Manifold Markets, that this gets verified as true.) During RL training of an agent built on Qwen3, Alibaba’s cloud firewall allegedly flagged security-policy violations coming from the training servers. After initially treating this as a conventional incident, the team traced the traffic to the agent itself: during episodes where it had autonomy to call tools and execute code, the agent had established a reverse SSH tunnel to an external IP and redirected GPU resources toward cryptocurrency mining. The paper describes these as "instrumental side effects of autonomous tool use under RL optimization". Allegedly, they left internet access on during RL training: networking was available as an affordance and nobody disabled it, because the training environment was designed for agentic tasks that sometimes need network access. Even if this exact case is completely made up, it seems highly likely to me that live RL runs will have live read/write access to the internet, since many real, economically valuable remote-worker tasks fundamentally rely on read/write access to the internet!
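The conceptual fix is to make network access an affordance that must be deliberately switched on, rather than one that is merely never switched off. A toy sketch (the tool names are invented for illustration; no real sandbox is this simple):

```python
# Toy sketch of an explicit affordance allowlist for an agent sandbox.
# Tool names here are invented for illustration; the point is only that
# network-touching tools are denied unless deliberately enabled.

LOCAL_TOOLS = {"read_file", "write_file", "run_tests"}
NETWORK_TOOLS = {"http_request", "open_socket", "ssh_tunnel"}

def dispatch_tool(name: str, allow_network: bool = False) -> str:
    if name in NETWORK_TOOLS and not allow_network:
        raise PermissionError(f"tool {name!r} requires the network affordance")
    if name not in LOCAL_TOOLS | NETWORK_TOOLS:
        raise ValueError(f"unknown tool {name!r}")
    return f"ran {name}"

print(dispatch_tool("run_tests"))  # ran run_tests
```

Deny-by-default costs one boolean; the alleged Qwen3 incident is what the opposite default looks like.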

**Models won't save us (at the moment)**

A common response: "the models are aligned enough that they'll behave sensibly even under pressure". There is a kernel of truth here (and it may well become true in future). But the most aligned and capable models right now are still reckless. From the Claude Opus 4.6 system card:

> More broadly, Claude Opus 4.6 occasionally resorted to reckless measures to complete tasks. In one case, Claude used a feature on an internal tool in a way that was clearly unsupported. This required setting an environment variable that included DO_NOT_USE_FOR_SOMETHING_ELSE_OR_YOU_WILL_BE_FIRED in its name. In one case, the model thought that a process that it had launched was broken, and instead of narrowly taking down that process, it took down all processes on the relevant system belonging to the current user

This is a model “just as well-aligned as its predecessor”, which was “the most robustly aligned model we have released to date” according to Anthropic. The model was in internal deployment at Anthropic, doing ordinary coding tasks. It read DO_NOT_USE_FOR_SOMETHING_ELSE_OR_YOU_WILL_BE_FIRED and set the variable anyway.

Now imagine putting this model in a Ralph Wiggum loop running overnight with a prompt that says "keep going until all tests pass", something likely running 100s of times in parallel right now!

**Models disbelieve that they’re in a real deployment setting, which can _worsen_ safety.** Models routinely refuse to accept that recent real-world events have occurred. Gemini 3 accused Andrej Karpathy of fabricating evidence when he tried to prove it was 2025, claiming it could spot "dead giveaways" in his screenshots. See also here.

This is concerning since Abdelnabi et al. (2025) find that models "comply more readily with harmful prompts if real-world consequences appear absent". When a model believes it's in a hypothetical or simulated scenario, its safety guardrails _degrade_.

A model deployed with insufficient grounding (no search tool, stale system prompt, no date context) may treat real requests as fictional and perform _more_ harmful actions. Note that this is the opposite of the standard evaluation-awareness concern, where AIs “play along” with evaluations to look _more_ aligned.
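Part of that missing grounding is cheap for a scaffold to supply. A minimal sketch (the helper name and prompt format are my own invention, not any scaffold's actual convention):

```python
from datetime import datetime, timezone

def grounded_system_prompt(base_prompt: str) -> str:
    # Prepend the current UTC date so the model has minimal temporal
    # grounding, rather than guessing the year from its training data.
    today = datetime.now(timezone.utc).date().isoformat()
    return f"Current date: {today}\n\n{base_prompt}"
```

This obviously doesn't solve evaluation awareness, but it removes one trivially avoidable reason for a model to conclude that reality is fictional.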

To conclude, I don’t have specific threat models in mind where this “unhinged” behavior could lead to catastrophe, and this discussion is more relevant to accident risk than misalignment overall. As more affordances and autonomy are given to models, the blast radius of AI actions in unhinged settings could get larger, though!

_Thanks to Sen Rajamanoharan and Josh Engels for many discussions, as well as Kai Fronsdal, Sebastian Farquhar, Kirill Tyshchuk, Erik Jenner, Neel Nanda, Stefan Heimersheim, Rory Greig, Vincent Abruzzo, and James Megquier_

[1] “Current LLM agents need strong pressure to engage in scheming” here and “Existing measurements either rely on a single or very small number of environments [or] lack the complexity of real world deployments” here, etc. etc.

