
Cycle-Consistent Activation Oracles — LessWrong

📅 2026-03-12 10:58 · slavachalnev · Artificial Intelligence · 12 min read · 14,515 chars · Score: 84
Mechanistic interpretability · Activation oracles · Cycle consistency · LLM internals · GRPO

TL;DR: I train a model to translate LLM activations into natural language, using cycle consistency as a training signal (activation → description → reconstructed activation). The outputs are often plausible, but they are very lossy and are usually guesses about the context surrounding the activation, not good descriptions of the activation itself. This is an interim report with some early results.

Overview
--------

I think Activation Oracles (Karvonen et al., 2025) are a super exciting research direction. Humans didn't evolve to read messy activation vectors, whereas ML models are great at this sort of thing.

An activation oracle is trained to answer specific questions about an LLM activation (e.g. "is the sentiment of this text positive or negative?" or "what are the previous 3 tokens?"). I wanted to try something different: train a model to _translate_ activations into natural language.

The main problem to solve here is the lack of training data. There's no labeled dataset of activations paired with their descriptions. So how do we get around this?

One idea is to use cycle consistency: if you translate from language A to language B and back to A, you should end up approximately where you started.

[Image 1: Cycle diagram: activation vector → decoder → text tokens → encoder → reconstructed vector]

I train a decoder that takes an activation and generates text, and an encoder that takes text and reconstructs the original activation. The cosine distance between the original and reconstructed activation is the training signal for both models.
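To make the loop concrete, here's a minimal sketch of the round trip and its score. `cycle_score`, `decode_fn`, and `encode_fn` are hypothetical stand-ins for the two LoRA fine-tunes, not the actual training code:

```python
import torch
import torch.nn.functional as F

def cycle_score(activation, decode_fn, encode_fn):
    """One activation -> text -> activation round trip.

    decode_fn and encode_fn are toy stand-ins for the decoder and
    encoder fine-tunes; their interfaces here are assumptions."""
    text = decode_fn(activation)        # natural-language description
    reconstruction = encode_fn(text)    # pooled middle-layer hidden state
    # Cosine similarity between the two vectors is the shared
    # training signal for both models (1.0 = a perfect round trip).
    return F.cosine_similarity(activation, reconstruction, dim=-1).item()

# A lossless "cycle" scores 1.0; any real cycle scores below it.
vec = torch.randn(1, 8)
score = cycle_score(vec, lambda a: "toy description", lambda t: vec.clone())
```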

Here's the rest of the post:

* Setup: I train an AO-style decoder and a separate encoder on middle-layer Qwen3-8B activations. I first do a supervised warmup on LatentQA and token prediction, followed by a cycle-consistency stage with GRPO.
* Example outputs: I show that the decoder often gets the broad topic and local context right, but the outputs don't look like good descriptions of what the model is thinking.
* Problems with this approach: Cycle consistency does not force the decoder to produce faithful explanations, only text that helps the encoder reconstruct the vector. Additionally, the decoder can do its own reasoning, and the text bottleneck is very lossy.
* Evals: I compare this method to linear probes for classification tasks. Probes always do better because the text representation is very lossy.
* Next steps: Ideas for improving the setup.

Setup
-----

I take a single activation vector from the middle layer of Qwen3-8B and train a model to map it into natural language. The decoder follows Karvonen et al.'s activation oracle architecture: a LoRA fine-tune of the base model that receives an activation injected at layer 1 and generates a text description. [[1]](https://www.bestblogs.dev/article/2191b837#fn-WDEmmfmc3PYPrYLHq-1) The encoder is a separate LoRA fine-tune that maps text back into activation space. Given a text description, it extracts hidden states from the same middle layer and pools them into a single vector; this reconstructed vector is then compared to the original activation.

Training happens in two stages:

Stage 1: Supervised warmup. The decoder is first trained on ~50k examples, using a simplified version of the Activation Oracles supervision: a mixture of LatentQA data (behavioral QA) and token prediction (predict surrounding tokens given a FineWeb activation).

Stage 2: Cycle consistency via GRPO. Starting from the stage 1 checkpoint, I run 4k steps of GRPO (Group Relative Policy Optimization). For each activation, the decoder samples K=4 candidate descriptions. Each is scored by how well the encoder reconstructs the original activation (cosine similarity). These scores are group-normalized into advantages for policy gradient updates. I apply a KL penalty against the base model to keep outputs coherent. [[2]](https://www.bestblogs.dev/article/2191b837#fn-WDEmmfmc3PYPrYLHq-2)
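The group-normalization step is simple enough to sketch. This is a generic GRPO advantage computation under the setup stated above (K=4 candidates per activation, scored by reconstruction cosine similarity), not the author's training code:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one activation's K sampled descriptions.

    rewards: (K,) cosine similarities from the encoder round trip.
    Each candidate is scored relative to its own group's mean and std,
    so no separate value network is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# K = 4 candidates: the best round trip gets the largest positive
# advantage, the worst the most negative; advantages sum to ~0.
r = torch.tensor([0.85, 0.78, 0.80, 0.73])
adv = grpo_advantages(r)
```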

Example outputs
---------------

For each token in a text, I take the middle-layer activation and have the decoder describe it. I uploaded example outputs for a variety of texts where you can click on any token and see the decoder's description. Take a look!

A word problem example. Here are a few of the descriptions:

[Image 2: token-by-token decoder descriptions for the word problem]

* .: the decoder says "he" instead of "she" and guesses 5, 6, or 10 apples.
* She: says Sarah, not Alice, and gives her marbles instead of apples.
* 3: knows we're subtracting apples from a collection.
* .: knows that we had some number, subtracted a number, then added a number.
* 1: gets the first digit of the final answer right.

Sentence boundaries. Most of the time, the decoder seems to only pick up on the immediately preceding context. However, at full stops, it often produces something more like a summary of the preceding sentence. See e.g. the second full stop in the example above. This is what you'd expect if the model stores summary context at sentence boundaries.

A common pattern is that the decoder keeps the structure of what's happening but substitutes different entities. The She token above is a good example: the actual text is about Alice giving apples to Bob, but the decoder describes Sarah giving away marbles. It has "female character gives counted objects to someone" but not the particular character or objects.

There's a spectrum of interpretability approaches, from rigorous but narrow (e.g. probes) all the way to just reading chain-of-thought. This sits somewhere in between (probably closer to the CoT end).

Problems with this approach
---------------------------

Cycle consistency doesn't imply good descriptions. The decoder doesn't need to describe what the activation contains, it just needs to produce text that _prompts_ the encoder into a similar internal state to the original activation. In the extreme case, you can imagine the decoder reproducing the original prompt exactly, which would make the encoder's job trivially easy. The pressure to do this is especially strong because the encoder is just a finetune of the original model. What gets rewarded here is not "say what the activation means," but "say something that makes the model land in a similar state." When you read the decoder's outputs, you see that the descriptions gesture at the right topic and context and try to guess what tokens preceded the activation. The descriptions do contain information about the activation, but that information is also in the original prompt!

The decoder does its own thinking. If the decoder picks up enough context from the activation to guess what the input text was, it can reason forward and report conclusions that weren't in the activation at all. You can't tell whether the decoder read something from the activation or worked it out itself. (Standard AOs also have this issue.)

Say we want to examine an activation partway through a math calculation and we'd like to know what the model has computed so far. What intermediate results is it tracking? If the decoder can "read" the preceding tokens, it might just solve the math problem itself while writing the description. So we won't know whether or not the original activation contained the solution.

Lossy bottleneck. We're compressing a high-dimensional continuous vector into 128 tokens of text. The reconstruction cosine similarity is around 0.8, so a lot of information is lost. So if you know in advance what question you want to ask about an activation, it's going to be better to train a probe than to read the text description.

Evals
-----

Retrieval
---------

As a sanity check: are the descriptions specific enough to identify which activation they came from? For each activation in a pool of 1000, I generated a description, encoded it back, and checked if the nearest match was the original. Top-1 accuracy was 95.7%, top-5 was 99.1%. Note that this mostly confirms that the cycle is working, not that the descriptions are good.
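A sketch of this retrieval check, with random vectors standing in for real activations and their reconstructions. Nearest-neighbour matching by cosine similarity is an assumption, chosen to be consistent with the cycle's training signal:

```python
import torch
import torch.nn.functional as F

def retrieval_accuracy(originals, reconstructions, k=1):
    """Top-k retrieval: for each reconstructed vector, is its nearest
    original (by cosine similarity) the one it came from?

    originals, reconstructions: (N, d) tensors, row i paired with row i."""
    sims = F.normalize(reconstructions, dim=-1) @ F.normalize(originals, dim=-1).T
    topk = sims.topk(k, dim=-1).indices                   # (N, k) candidate ids
    targets = torch.arange(len(originals)).unsqueeze(-1)  # (N, 1) true ids
    return (topk == targets).any(dim=-1).float().mean().item()

# Sanity check: a noiseless "cycle" retrieves every vector perfectly.
x = torch.randn(10, 8)
acc = retrieval_accuracy(x, x.clone())
```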

Classification
--------------

For six binary classification tasks, I compared a linear probe on the raw activation against the decoder: an LLM reads the decoder's description and answers the same yes/no question. [[3]](https://www.bestblogs.dev/article/2191b837#fn-WDEmmfmc3PYPrYLHq-3) Last tok uses the activation at the final token; mean averages over all token positions.

| Dataset | Probe (last tok) | Decoder (last tok) | Probe (mean) | Decoder (mean) |
| --- | --- | --- | --- | --- |
| SST-2 (sentiment) | 88.5% | 72.5% | 87.0% | 79.5% |
| Bias in Bios (gender) | 81.0% | 61.3% | 96.2% | 70.8% |
| AG News (sports/non-sports) | 94.2% | 91.5% | 98.8% | 97.8% |
| Tense (past/non-past) | 97.5% | 94.5% | 98.5% | 91.2% |
| Singular/Plural | 96.2% | 88.5% | 96.7% | 63.4% |
| Language ID | 99.5% | 50.7% | 99.2% | 54.8% |
| Language ID (multilingual ckpt) | 99.5% | 78.5% | 99.2% | 85.5% |

The decoder can't do Language ID at all (50.7-54.8%, near chance). I suspected this was because the decoder was only trained on English data, so I trained a separate checkpoint with multilingual text in the training corpus (the "multilingual ckpt" row), which brings it up to 78.5-85.5%. That's a good improvement, but it also shows that English-to-multilingual generalisation didn't come for free.
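For reference, the probe baseline is just logistic regression on raw activations. A minimal sketch on toy, linearly separable data; hyperparameters are illustrative, not the ones behind the table:

```python
import torch
import torch.nn.functional as F

def train_linear_probe(acts, labels, epochs=200, lr=0.1):
    """Logistic-regression probe on raw activations: the baseline the
    decoder is compared against. acts: (N, d), labels: (N,) in {0, 1}."""
    w = torch.zeros(acts.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.binary_cross_entropy_with_logits(acts @ w + b, labels.float())
        loss.backward()
        opt.step()
    return lambda x: (x @ w + b > 0).long()  # hard 0/1 predictions

# Toy "activations" where feature 0 determines the label: the
# probe should fit this almost perfectly.
torch.manual_seed(0)
acts = torch.randn(64, 16)
labels = (acts[:, 0] > 0).long()
probe = train_linear_probe(acts, labels)
acc = (probe(acts) == labels).float().mean().item()
```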

Arithmetic
----------

Inspired by the number prediction eval in _Current Activation Oracles Are Hard to Use_, this eval feeds expressions like "4 * 7 = " into the original model and takes the activation at the trailing space. Decoder performance (n=50 per tier):

| Difficulty | Exact match | First digit in desc |
| --- | --- | --- |
| Simple (1-digit answers) | 46% | 52% |
| Medium (2-digit answers) | 0% | 52% |
| Harder (3-digit answers) | 0% | 24% |

The decoder mostly just says numbers like 5, 10, 12, 15 regardless of the actual answer.

How much of this is the decoder's fault vs the original activation not containing the answer? These activations come from layer 18 (middle of the network), so the model may not have computed the answer by then. I checked this by training a linear probe at the same position, which gets 100% on the answer's first digit for medium-difficulty problems. So the information is there but the decoder doesn't extract it.

What I want to try next
-----------------------

The core problem is that cycle consistency rewards the decoder for guessing context, not for producing good descriptions. I don't have a good solution to this, but here are some things I want to try.

A different encoder. The current encoder is a finetune of the same model, looking at activations in the same layer, which makes it especially easy for the decoder to score well by just guessing the original text. I want to try replacing it with a separate generative model over activations (e.g. something like Ameisen et al., 2025), which might make guessing context less effective. But it's not clear whether this actually helps, or how to make it text-conditioned in a way that doesn't recreate the same problem.

Continuing AO training during cycle consistency. I want to try running AO-style supervision alongside cycle training. I'd guess the choice of tasks matters a lot: if the decoder starts out producing descriptions, the encoder would learn to reconstruct from descriptions, and cycle consistency would reinforce that. If it starts out guessing the original text, that gets reinforced instead.

If you have ideas for training signals that push toward better descriptions and not just better reconstruction, I'd love to hear them.

Appendix: Other training ideas
------------------------------

Per-token credit assignment. GRPO treats the encoder as a black-box scoring function for whole descriptions, but the encoder is differentiable and contains signal about how each token in the description affects reconstruction. Per-token rewards or gradient-based credit assignment might help the decoder learn more efficiently.

Backward cycle (text → encode → decode ≈ text): Pass text through the encoder to get an activation vector, inject it into the decoder, and train the decoder to reconstruct the original text. This is fully differentiable and requires no generation. The problem is that there's no text that resembles activation descriptions, and using the decoder's own outputs as training data would be circular.

Soft forward cycle (activation → decode → encode ≈ activation, but differentiable): I tried using straight-through Gumbel-Softmax as a surrogate gradient for the decoder, so the encoder's reconstruction loss could train the decoder more directly. I haven't gotten this to work yet.
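For the soft forward cycle, the straight-through trick looks roughly like this: `F.gumbel_softmax` with `hard=True` emits one-hot samples in the forward pass while routing gradients through the soft distribution, so a downstream reconstruction loss can reach the decoder's logits. A sketch of the idea, not the author's implementation:

```python
import torch
import torch.nn.functional as F

def straight_through_tokens(logits, tau=1.0):
    """Straight-through Gumbel-Softmax over the decoder's vocabulary:
    forward pass emits hard one-hot 'tokens', backward pass uses the
    soft softmax distribution as a surrogate gradient."""
    return F.gumbel_softmax(logits, tau=tau, hard=True)

# (batch, seq, vocab) logits -> one-hot samples that still carry grad.
logits = torch.randn(2, 5, 11, requires_grad=True)
one_hot = straight_through_tokens(logits)
assert torch.allclose(one_hot.sum(-1), torch.ones(2, 5))  # one token per slot
one_hot.sum().backward()           # a stand-in for the reconstruction loss
assert logits.grad is not None     # gradient reaches the decoder's logits
```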

  • Injection follows AO: the activation is added to the hidden state at layer 1. The encoder's pooling head is a learned linear layer that scores each token and takes the weighted average of hidden states. Both adapters are LoRA rank 64, alpha 128. ↩︎
  • Batch size 32, learning rate 1e-5 with linear warmup and decay, temperature 1.0. KL divergence is computed against the bare original model (no LoRA) on the generated text only (no oracle prompt or activation injection). ↩︎
  • The LLM judge is GPT-5. It receives the decoder's description along with per-token encoder attention weights (showing which positions the encoder found most informative) and a yes/no question like "Does this text express positive sentiment?" or "Is this text about sports?". SST-2, Bias in Bios, AG News, and Language ID (WiLI-2018) are from HuggingFace. Tense and Singular/Plural are taken from the Activation Oracles repo. ↩︎
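The encoder's pooling head described in footnote 1 can be sketched as follows; dimensions are illustrative:

```python
import torch

class AttnPool(torch.nn.Module):
    """Learned pooling: a linear layer scores each token position,
    softmax over positions, then a weighted average of hidden states.
    This matches the footnote's description; sizes are toy values."""
    def __init__(self, d_model: int):
        super().__init__()
        self.score = torch.nn.Linear(d_model, 1)

    def forward(self, hidden):                       # (batch, seq, d_model)
        weights = self.score(hidden).softmax(dim=1)  # (batch, seq, 1)
        return (weights * hidden).sum(dim=1)         # (batch, d_model)

# A sequence of 7 hidden states is pooled into one vector per example.
pooled = AttnPool(16)(torch.randn(3, 7, 16))
```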