“Building Better Activation Oracles” by ceselder, jan_bauer, Niclas Luick, Adam Karvonen, Neel Nanda - LessWrong (30+ Karma)

Work done for our MATS 10.0 Sprint project - mentored by Neel Nanda and Adam Karvonen

Hugging Face, GitHub

TL;DR: We have improved the original Activation Oracle (AO) training regime by training on on-policy rollouts, improving the conversational dataset, feeding in more layers (following the approach by Niclas Luick), and making a small change to the injection formula. We also open-source our evals, AObench, which we believe are currently the most comprehensive evaluation of AO quality. The capability improvements are marginal, but the quality-of-life improvements are quite substantial. If you want to play around with the new AOs, we recommend you use this one. If you want to try them live, we will host them for a week on ao.celeste.computer. Alternatively, you can self-host our web interface.

Activation Oracles (AOs) by Karvonen et al. are fine-tuned LLMs that receive a target LLM's activations as input and answer natural-language questions about them. However, they are plagued by various issues that limit their usefulness as an off-the-shelf tool for interpretability research. For our MATS Sprint, we set out to work on these issues.

Issues with current Activation Oracles

In Current activation oracles [...]

---

Outline:

(01:41) Issues with current Activation Oracles

(02:37) Our approaches to improving AO training

(02:41) A better conversational finetune than LatentQA

(03:43) Our solution: Questions about unarticulated completions

(05:12) Layer choice/feeding multiple layers to the AO

(06:21) Training on on-policy data

(07:56) Steering strength

(09:57) Summary of AOBench evaluations