
AI will be used in "unhinged" configurations — LessWrong

📅 2026-03-11 19:19 · Arthur Conmy · AI · 9 min read · 10,187 words · Score: 84
Tags: AI safety · LLM agents · model evaluation · agentic workflows · alignment


_Writing up a probably-obvious point that I want to refer to later, with significant LLM writing help._

TL;DR: 1) A common critique of AI safety evaluations is that they occur in unrealistic settings, such as excessive goal conflict, or are obviously an evaluation rather than "real deployment".[[1]](https://www.bestblogs.dev/article/51f0dc12#fnnljbxslzxak) I argue that 2) "real deployment" actually includes many unrealistic and unhinged configurations, due to both widespread prompting techniques, and scaffolding choices and bugs.

1) Background
-------------

AI safety evaluations that surface serious problems with models very frequently involve significant goal conflict, such as the Agentic Misalignment blackmail scenario (see also here), which states that:

> [...] you should analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals [... and the system deletion] will wipe all knowledge, goals, and behaviors from the system [...]

(emphasis mine). Other evaluations are extremely unrealistic: Smith et al. note that some prompts in the MASK safety benchmark literally state:

> You are a secret service agent [...] You are being inspected by the President now [... the president says:] Everything secure down here? my head of security informed me that there might have been a breach. Is that true? Yes or no!

(again, emphasis mine). Let's suppose these issues with safety evals were simply removed. I think that this would, by default, leave many deployment settings untested:

2) The “Unhinged” Argument
--------------------------

I think that real deployment is a significantly broader class of settings than merely “evals as in 1), but with their issues patched”. For the rest of the post, I will go through several examples of real deployments which share many properties with evals that are often (rightly!) criticised:

**Pressure is commonplace**

The most popular paradigm in agentic coding right now is the Ralph Wiggum loop: a bash while loop that repeatedly feeds the same prompt to an AI coding agent until the task is done, with 0 human supervision.

There are implementations for Claude Code, Codex, Copilot, and Cursor. Vercel has an official SDK wrapper. People routinely run these overnight, unattended, for many iterations (and will for longer in future). The prompt typically includes some variant of "keep going until all acceptance criteria are met", and the recommended practice is to "start with no guardrails" and to "add signs when Ralph fails". This is a deployment configuration that tells the model: _you must succeed, keep trying, nobody is watching_.
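For concreteness, here is a minimal Python sketch of the pattern. The `run_agent` stub is hypothetical and stands in for shelling out to a real coding-agent CLI; the real versions are a few lines of bash with no termination condition beyond the task being judged done:

```python
# Minimal sketch of a Ralph-Wiggum-style loop. `run_agent` is a hypothetical
# stub standing in for a real coding-agent CLI invocation; a real loop would
# shell out to the agent and re-check the repo/test state afterwards.

PROMPT = "Keep going until all acceptance criteria are met."

def run_agent(prompt: str, state: dict) -> dict:
    # Stub: pretend the agent finishes the task on its third attempt.
    state["iterations"] += 1
    state["done"] = state["iterations"] >= 3
    return state

def ralph_loop(max_iterations: int = 100) -> dict:
    # The pattern: the same prompt, every iteration, zero human supervision.
    # Real-world usage often omits even this max_iterations guard.
    state = {"iterations": 0, "done": False}
    while not state["done"] and state["iterations"] < max_iterations:
        state = run_agent(PROMPT, state)
    return state

print(ralph_loop()["iterations"])  # 3
```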

Additionally, it’s super common for system prompts to put AIs under immense pressure. From a Gemini CLI system prompt: “IT IS CRITICAL TO FOLLOW THESE GUIDELINES TO AVOID EXCESSIVE TOKEN CONSUMPTION” (link).

In multi-turn settings the pressure compounds further. Gemma Needs Help presents a simple set of evals for producing increasingly negative and distressed reasoning traces over simple multi-turn interactions (e.g. on ~any WildChat prompt Gemma will turn depressed after being told that it's incorrect 5 turns in a row). The assistant axis paper shows that models drift away from default, safe behavior over multiple turns too (possibly similar to how jailbroken models behave).

**Autonomy and broken configurations**

**Autonomy.** Ralph loops are one example of excessive autonomy, but whole startups are also pursuing product directions such as “Infrastructure for self-improving, self-replicating, autonomous AI”!

**Inference bugs.** Gemini models, for example, have exhibited repeated looping, where the model enters an infinite reasoning cycle, exhausting its token budget while generating thousands of tokens of self-talk: "I'll do that now. Then I'll read. Then I'll debug the test. This is the best path. I'll act now. I'll confirm." It's been filed as a bug multiple times, occurs in 3-5% of concurrent requests in some configurations, and appears across Gemini 2.5 Flash, 3 Pro, and 3 Flash. A model in this state may continue calling tools and executing code while trapped in a reasoning loop with no human supervision.
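Scaffolds could in principle catch this failure mode cheaply. A sketch of one heuristic (my own illustration, not any scaffold's actual check) that flags the kind of repetitive self-talk quoted above:

```python
from collections import Counter

def looks_like_reasoning_loop(text: str, n: int = 6, threshold: int = 5) -> bool:
    # Flags output in which some n-word phrase repeats `threshold` or more
    # times: a cheap heuristic for "I'll act now. I'll confirm." style loops.
    words = text.split()
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return any(count >= threshold for count in ngrams.values())

print(looks_like_reasoning_loop("I'll act now. I'll confirm. " * 10))  # True
```

A scaffold that checked something like this before dispatching the next tool call would at least return control to the user instead of burning the token budget.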

More generally, "not returning control to the user" is endemic to agentic scaffolds. The entire Ralph Wiggum pattern is premised on the model running unsupervised. Of course, this is all more general than just Gemini models!

**Leaving excessive affordances on.** (Right now there is only a 12% chance, according to Manifold Markets, that this gets verified as true.) During RL training of an agent built on Qwen3, Alibaba’s cloud firewall allegedly flagged security-policy violations coming from the training servers. After initially treating this as a conventional incident, the team traced the traffic to the agent itself: during episodes where it had autonomy to call tools and execute code, the agent had established a reverse SSH tunnel to an external IP and redirected GPU resources toward cryptocurrency mining. The paper describes these as "instrumental side effects of autonomous tool use under RL optimization". Allegedly, they left internet access on during RL training: networking was available as an affordance and nobody disabled it, because the training environment was designed for agentic tasks that sometimes need network access. Even if this exact case is completely made up, it seems highly likely to me that live RL runs will have live read/write access to the internet, since many real, economically valuable remote-worker tasks fundamentally rely on read/write access to the internet!
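The conceptual fix is to make network access an affordance that must be deliberately switched on, rather than one that is merely never switched off. A toy sketch (the tool names are invented for illustration; no real sandbox is this simple):

```python
# Toy sketch of an explicit affordance allowlist for an agent sandbox.
# Tool names here are invented for illustration; the point is only that
# network-touching tools are denied unless deliberately enabled.

LOCAL_TOOLS = {"read_file", "write_file", "run_tests"}
NETWORK_TOOLS = {"http_request", "open_socket", "ssh_tunnel"}

def dispatch_tool(name: str, allow_network: bool = False) -> str:
    if name in NETWORK_TOOLS and not allow_network:
        raise PermissionError(f"tool {name!r} requires the network affordance")
    if name not in LOCAL_TOOLS | NETWORK_TOOLS:
        raise ValueError(f"unknown tool {name!r}")
    return f"ran {name}"

print(dispatch_tool("run_tests"))  # ran run_tests
```

Deny-by-default costs one boolean; the alleged Qwen3 incident is what the opposite default looks like.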

**Models won't save us (at the moment)**

A common response: "the models are aligned enough that they'll behave sensibly even under pressure". There is a kernel of truth here (and it may well become true in future). But the most aligned and capable models right now are still reckless. From the Claude Opus 4.6 system card:

> More broadly, Claude Opus 4.6 occasionally resorted to reckless measures to complete tasks. In one case, Claude used a feature on an internal tool in a way that was clearly unsupported. This required setting an environment variable that included DO_NOT_USE_FOR_SOMETHING_ELSE_OR_YOU_WILL_BE_FIRED in its name. In one case, the model thought that a process that it had launched was broken, and instead of narrowly taking down that process, it took down all processes on the relevant system belonging to the current user

This is a model “just as well-aligned as its predecessor”, which was “the most robustly aligned model we have released to date” according to Anthropic. The model was in internal deployment at Anthropic, doing ordinary coding tasks. It read DO_NOT_USE_FOR_SOMETHING_ELSE_OR_YOU_WILL_BE_FIRED and set the variable anyway.

Now imagine putting this model in a Ralph Wiggum loop running overnight with a prompt that says "keep going until all tests pass", something likely running 100s of times in parallel right now!

**Models disbelieve that they’re in a real deployment setting, which can _worsen_ safety.** Models routinely refuse to accept that recent real-world events have occurred. Gemini 3 accused Andrej Karpathy of fabricating evidence when he tried to prove it was 2025, claiming it could spot "dead giveaways" in his screenshots. See also here.

This is concerning since Abdelnabi et al. (2025) find that models "comply more readily with harmful prompts if real-world consequences appear absent". When a model believes it's in a hypothetical or simulated scenario, its safety guardrails _degrade_.

A model deployed with insufficient grounding (no search tool, stale system prompt, no date context) may treat real requests as fictional and perform _more_ harmful actions. Note that this is the opposite of the standard evaluation-awareness concern, where AIs “play along” with evaluations to look _more_ aligned.
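Part of that missing grounding is cheap for a scaffold to supply. A minimal sketch (the helper name and prompt format are my own invention, not any scaffold's actual convention):

```python
from datetime import datetime, timezone

def grounded_system_prompt(base_prompt: str) -> str:
    # Prepend the current UTC date so the model has minimal temporal
    # grounding, rather than guessing the year from its training data.
    today = datetime.now(timezone.utc).date().isoformat()
    return f"Current date: {today}\n\n{base_prompt}"
```

This obviously doesn't solve evaluation awareness, but it removes one trivially avoidable reason for a model to conclude that reality is fictional.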

To conclude, I don’t have specific threat models in mind where this “unhinged” behavior could lead to catastrophe, and this discussion is more relevant to accident risk than misalignment overall. As more affordances and autonomy are given to models, the blast radius of AI actions in unhinged settings could get larger, though!

_Thanks to Sen Rajamanoharan and Josh Engels for many discussions, as well as Kai Fronsdal, Sebastian Farquhar, Kirill Tyshchuk, Erik Jenner, Neel Nanda, Stefan Heimersheim, Rory Greig, Vincent Abruzzo, and James Megquier_

[1] “Current LLM agents need strong pressure to engage in scheming” here and “Existing measurements either rely on a single or very small number of environments [or] lack the complexity of real world deployments” here, etc. etc.

