Paper: https://arxiv.org/abs/2605.29601
Thread: https://x.com/aksh_n0/status/2062568855814193497
TL;DR: Training small open-weight monitors provides a cost-effective alternative to prompted frontier monitors. Applying our training recipe to Qwen3.5-27B results in a monitor better at scheming detection than all smaller prompted monitors (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, Claude Haiku 4.5) and Gemini 2.5 Pro, while achieving lower marginal inference cost. Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, Claude Opus 4.6) achieve higher performance but at roughly higher marginal inference cost. Several of our trained monitors are positioned on the empirical cost-performance Pareto frontier.
Cost-performance frontier for action-only scheming monitors. Our trained monitors provide a low-cost option for low-FPR monitoring. Across six OOD datasets, our best trained monitor, Qwen3.5-27B, outperforms smaller prompted frontier monitors for under $ per 1,000 evaluations, while larger frontier monitors are substantially more expensive.[1]
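As a rough illustration of what sitting on an empirical cost-performance Pareto frontier means, here is a minimal Python sketch. It is not the authors' code: the `Monitor` type, the `pareto_frontier` helper, the monitor names, and all numbers are hypothetical placeholders, assuming only that lower cost and higher detection score are both preferred.

```python
from typing import NamedTuple


class Monitor(NamedTuple):
    name: str
    cost: float   # hypothetical cost per 1,000 evaluations (lower is better)
    score: float  # hypothetical detection score (higher is better)


def pareto_frontier(monitors: list[Monitor]) -> list[Monitor]:
    """Return the monitors that no other monitor dominates.

    A monitor is dominated if another one is at least as cheap and at least
    as accurate, and strictly better on one of the two axes.
    """
    # Sort by ascending cost; break cost ties by descending score.
    ordered = sorted(monitors, key=lambda m: (m.cost, -m.score))
    frontier: list[Monitor] = []
    best_score = float("-inf")
    for m in ordered:
        # Keep a monitor only if it beats every monitor that costs no more.
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


# Placeholder values for illustration only; these are not results from the paper.
candidates = [
    Monitor("monitor_a", cost=1.0, score=0.60),
    Monitor("monitor_b", cost=2.0, score=0.55),  # dominated by monitor_a
    Monitor("monitor_c", cost=3.0, score=0.80),
    Monitor("monitor_d", cost=9.0, score=0.90),
]
frontier_names = [m.name for m in pareto_frontier(candidates)]
print(frontier_names)
```

In this toy example, a cheap monitor can be on the frontier even when a more expensive one scores higher, which is the sense in which a lower-cost trained monitor and a stronger but pricier prompted monitor can both be Pareto-optimal.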
Setup
Overview of our training pipeline for action-only deliberative monitoring. We collect labeled action-only trajectories from scheming and sabotage datasets, define a full scheming specification to elicit structured deliberative rationales from a strong teacher model, filter those rationales with a separate judge, and distill the accepted rationales into open-weight monitors with supervised fine-tuning and reinforcement learning. At inference time, our trained monitors receive only [...]
---
Outline:
(01:15) Setup
(07:33) Results
(07:36) Supervised distillation teaches action-only deliberation, RL improves marginally
(08:24) Trained monitors are low-cost Pareto-optimal for low-FPR monitoring
(09:05) Broad training mixtures improve OOD transfer, but effects are non-monotonic
(10:08) Larger trained models within model families perform better
(11:14) Performance scales effectively with more data
The original text contained 1 footnote which was omitted from this narration.
---
First published:
June 4th, 2026
Source:
https://www.lesswrong.com/posts/BmA5KwzZhLk2rykoz/training-deliberative-monitors-for-black-box-scheming
---
Narrated by TYPE III AUDIO.
---