LessWrong (30+ Karma)
Episode

“Building Better Activation Oracles” by ceselder, jan_bauer, Niclas Luick, Adam Karvonen, Neel Nanda

Work done for our MATS 10.0 Sprint project, mentored by Neel Nanda and Adam Karvonen.

Hugging Face, GitHub

TL;DR: We improved the original Activation Oracle (AO) training regime by training on on-policy rollouts, improving the conversational dataset, feeding more layers (following the approach by Niclas Luick), and making a small change to the injection formula. We also open-source our evals, called AOBench, which we believe are currently the most comprehensive evaluation of AO quality. The capability improvements are marginal, but the quality-of-life improvements are quite substantial. If you want to play around with the new AOs, we recommend you use this one. If you want to try them live, we will host them for a week on ao.celeste.computer. Alternatively, you can self-host our web interface.

Activation Oracles (AOs) by Karvonen et al. are fine-tuned LLMs that can receive the original target LLM's activations as input and answer natural language questions about them. However, they are plagued by various issues, which limit their usefulness as an off-the-shelf tool for interpretability research. For our MATS Sprint, we set out to work on these issues.

Issues with current Activation Oracles

In Current activation oracles [...]

---

Outline:

(01:41) Issues with current Activation Oracles

(02:37) Our approaches to improving AO training

(02:41) A better conversational finetune than LatentQA

(03:43) Our solution: Questions about unarticulated completions

(05:12) Layer choice/feeding multiple layers to the AO

(06:21) Training on on-policy data

(07:56) Steering strength

(09:57) Summary of AOBench evaluations

(12:09) Outlook

(14:28) Further advice for training Activation Oracles

(17:11) Appendix

(17:14) Practical notes on evaluating AOs

(21:03) Experiments using post-training

(22:25) Other differences compared to Karvonen et al.

(22:54) AOBench details

---

First published:
June 4th, 2026

Source:
https://www.lesswrong.com/posts/heXwuDRfbQQgB5JLP/building-better-activation-oracles

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Flowchart showing a language model's chain-of-thought process with an activation oracle predicting the answer 97. Flowchart showing the conversion of Cartesian coordinates (0,3) to polar coordinates using an LLM. Diagram illustrating the residual stream activation at a token. Bar graph comparing steering strength methods. Bar graphs showing per-evaluation AOBench scores across multiple metrics for different model configurations. Several further bar and line graphs.