“Training Deliberative Monitors for Black-Box Scheming Detection” by aksh-n, adityasinha, Victor Gillioz, Simon Storf, Kilian Merkelbach, richbc, Axel Højmark, Marius Hobbhahn - LessWrong (30+ Karma)

Paper: https://arxiv.org/abs/2605.29601

Thread: https://x.com/aksh_n0/status/2062568855814193497

TL;DR: Training small open-weight monitors provides a cost-effective alternative to prompted frontier monitors. Applying our training recipe to Qwen3.5-27B yields a monitor that is better at scheming detection than all smaller prompted monitors (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, Claude Haiku 4.5) and Gemini 2.5 Pro, at lower marginal inference cost. Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, Claude Opus 4.6) achieve higher performance, but at higher marginal inference cost. Several of our trained monitors lie on the empirical cost-performance Pareto frontier.

Cost-performance frontier for action-only scheming monitors. Our trained monitors provide a low-cost option for low-FPR monitoring. Across six OOD datasets, our best trained monitor, Qwen3.5-27B, outperforms smaller prompted frontier monitors for under $ per 1,000 evaluations, while larger frontier monitors are substantially more expensive.[1]
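To make "on the Pareto frontier" concrete, here is a minimal sketch of how such a frontier can be computed from (cost, score) pairs. The `Monitor` type, its field names, and the numbers in the usage note are hypothetical placeholders for illustration, not values from the paper.

```python
from typing import List, NamedTuple


class Monitor(NamedTuple):
    name: str
    cost: float   # hypothetical: cost per 1,000 evaluations (lower is better)
    score: float  # hypothetical: detection score (higher is better)


def pareto_frontier(monitors: List[Monitor]) -> List[Monitor]:
    """Return the monitors that no cheaper-or-equal monitor matches or beats.

    A monitor is kept only if its score is strictly higher than that of
    every monitor with lower or equal cost. Exact duplicates collapse to
    the first one seen.
    """
    frontier: List[Monitor] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, best score first.
    for m in sorted(monitors, key=lambda m: (m.cost, -m.score)):
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier
```

For example, with placeholder monitors `a` (cost 1.0, score 0.6), `b` (2.0, 0.5), `c` (3.0, 0.9), and `d` (3.0, 0.8), the frontier is `a` and `c`: `b` costs more than `a` for a lower score, and `d` is beaten by `c` at the same cost.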

Setup

Overview of our training pipeline for action-only deliberative monitoring. We collect labeled action-only trajectories from scheming and sabotage datasets, define a full scheming specification to elicit structured deliberative rationales from a strong teacher model, filter those rationales with a separate judge, and distill the accepted rationales into open-weight monitors with supervised fine-tuning and reinforcement learning. At inference time, our trained monitors receive only [...]
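The data-construction part of this pipeline (teacher rationale, judge filter, accepted examples for fine-tuning) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `Trajectory`, `SFTExample`, `build_sft_dataset`, and the `teacher` and `judge` callables are hypothetical names, and the choice to use only the action transcript as the student prompt is likewise an assumption. The reinforcement learning stage is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Trajectory:
    actions: str       # action-only transcript of the agent
    is_scheming: bool  # ground-truth label from the source dataset


@dataclass
class SFTExample:
    prompt: str      # what the student monitor sees
    completion: str  # the rationale the student learns to produce


def build_sft_dataset(
    trajectories: Iterable[Trajectory],
    spec: str,
    teacher: Callable[[str, Trajectory], str],
    judge: Callable[[Trajectory, str], bool],
) -> List[SFTExample]:
    """Keep only teacher rationales that the separate judge accepts."""
    examples: List[SFTExample] = []
    for traj in trajectories:
        # The teacher sees the full specification plus the trajectory.
        rationale = teacher(spec, traj)
        # The judge decides whether the rationale is kept.
        if not judge(traj, rationale):
            continue
        # Assumption: the student prompt is the action transcript alone.
        examples.append(SFTExample(prompt=traj.actions, completion=rationale))
    return examples
```

In practice `teacher` and `judge` would wrap model calls; passing them in as plain callables keeps the filtering logic testable with simple stubs.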

---

Outline:

(01:15) Setup

(07:33) Results

(07:36) Supervised distillation teaches action-only deliberation, RL improves marginally

(08:24) Trained monitors are low-cost Pareto-optimal for low-FPR monitoring

(09:05) Broad training mixtures improve OOD transfer, but effects are non-monotonic

(10:08) Larger trained models within model families perform better

(11:14) Performance scales effectively with more data

The original text contained 1 footnote which was omitted from this narration.

---

First published:
June 4th, 2026