LessWrong (30+ Karma)

“Training Deliberative Monitors for Black-Box Scheming Detection” by aksh-n, adityasinha, Victor Gillioz, Simon Storf, Kilian Merkelbach, richbc, Axel Højmark, Marius Hobbhahn


Paper: https://arxiv.org/abs/2605.29601

Thread: https://x.com/aksh_n0/status/2062568855814193497

TL;DR: Training small open-weight monitors provides a cost-effective alternative to prompted frontier monitors. Applying our training recipe to Qwen3.5-27B results in a monitor better at scheming detection than all smaller prompted monitors (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, Claude Haiku 4.5) and Gemini 2.5 Pro, while achieving lower marginal inference cost. Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, Claude Opus 4.6) achieve higher performance but at roughly higher marginal inference cost. Several of our trained monitors are positioned on the empirical cost-performance Pareto frontier.
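To make "positioned on the empirical cost-performance Pareto frontier" concrete, here is a minimal sketch of how such a frontier can be computed from (cost, score) pairs. The `Monitor` type, its field names, and the numbers in the usage note are hypothetical placeholders, not results from the paper.

```python
from typing import NamedTuple


class Monitor(NamedTuple):
    name: str
    cost: float   # hypothetical cost per 1,000 evaluations (lower is better)
    score: float  # hypothetical detection score (higher is better)


def pareto_frontier(monitors: list[Monitor]) -> list[Monitor]:
    """Return the monitors that no other monitor dominates.

    A monitor is dominated if another one is at least as cheap and at least
    as accurate, and strictly better on one of the two axes. Exact duplicates
    are kept only once.
    """
    frontier: list[Monitor] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, best score first.
    for m in sorted(monitors, key=lambda m: (m.cost, -m.score)):
        # A more expensive monitor earns its place only by scoring higher
        # than everything cheaper than (or as cheap as) it.
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier
```

For example, with placeholder entries `Monitor("A", 1.0, 0.6)`, `Monitor("B", 2.0, 0.5)`, `Monitor("C", 5.0, 0.9)`, and `Monitor("D", 5.0, 0.8)`, the function returns A and C: B costs more than A while scoring lower, and D is beaten by C at the same cost.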

Cost-performance frontier for action-only scheming monitors. Our trained monitors provide a low-cost option for low-FPR monitoring. Across six OOD datasets, our best trained monitor, Qwen3.5-27B, outperforms smaller prompted frontier monitors for under $ per 1,000 evaluations, while larger frontier monitors are substantially more expensive.[1]
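One common way to score low-FPR monitoring is a partial AUROC: the area under the ROC curve restricted to small false-positive rates. The sketch below is an illustration under assumptions, since this summary does not define the metric the authors use. The 20% cutoff, the normalization by the cutoff, and the function name `partial_auroc` are all choices made here for the example.

```python
def partial_auroc(labels: list[bool], scores: list[float], max_fpr: float = 0.2) -> float:
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr.

    With this (assumed) normalization a perfect monitor scores 1.0 and a
    chance-level monitor scores max_fpr / 2.
    """
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Build ROC points by sweeping the threshold from high to low scores.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Examples with tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))

    # Trapezoidal integration, clipping the last segment at max_fpr.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area / max_fpr
```

Restricting the area to low FPR reflects the deployment setting described above: a monitor that flags too many benign trajectories is expensive to act on, so only the left edge of the ROC curve matters.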

Setup

Overview of our training pipeline for action-only deliberative monitoring. We collect labeled action-only trajectories from scheming and sabotage datasets, define a full scheming specification to elicit structured deliberative rationales from a strong teacher model, filter those rationales with a separate judge, and distill the accepted rationales into open-weight monitors with supervised fine-tuning and reinforcement learning. At inference time, our trained monitors receive only [...]
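The data-construction part of this pipeline (teacher rationale, judge filter, accepted examples for distillation) can be sketched as follows. This is an illustration only: `Trajectory`, `SftExample`, `build_sft_dataset`, and the `teacher` and `judge` callables are hypothetical names, and whether the teacher or judge sees the ground-truth label is an assumption made here, not something this summary states.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trajectory:
    actions: str       # action-only transcript
    is_scheming: bool  # ground-truth label from the source dataset


@dataclass
class SftExample:
    prompt: str      # what the monitor sees
    completion: str  # deliberative rationale the monitor learns to produce


def build_sft_dataset(
    trajectories: list[Trajectory],
    teacher: Callable[[str, bool], str],
    judge: Callable[[str, str, bool], bool],
) -> list[SftExample]:
    """Keep only teacher rationales that the separate judge accepts."""
    examples: list[SftExample] = []
    for t in trajectories:
        # Assumption: the teacher is shown the label so its rationale
        # argues toward the correct verdict.
        rationale = teacher(t.actions, t.is_scheming)
        # Assumption: the judge returns True to accept a rationale.
        if judge(t.actions, rationale, t.is_scheming):
            examples.append(SftExample(prompt=t.actions, completion=rationale))
    return examples
```

In practice `teacher` and `judge` would wrap calls to language models; passing them in as plain callables keeps the filtering logic testable with stubs. The resulting examples would then feed the supervised fine-tuning stage, with reinforcement learning applied afterwards.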

---

Outline:

(01:15) Setup

(07:33) Results

(07:36) Supervised distillation teaches action-only deliberation, RL improves marginally

(08:24) Trained monitors are low-cost Pareto-optimal for low-FPR monitoring

(09:05) Broad training mixtures improve OOD transfer, but effects are non-monotonic

(10:08) Larger trained models within model families perform better

(11:14) Performance scales effectively with more data

The original text contained 1 footnote which was omitted from this narration.

---

First published:
June 4th, 2026

Source:
https://www.lesswrong.com/posts/BmA5KwzZhLk2rykoz/training-deliberative-monitors-for-black-box-scheming

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Scatter plot titled [...]

Flowchart showing machine learning pipeline with transcripts, rationale generation, quality control, SFT, RL, monitoring, and evaluation components.

Two heatmap tables comparing per-stage model performance improvements across different benchmarks.

Performance comparison table titled [...]

Two graphs showing training data mixes by misalignment type and mean-OOD pAUROC@20 scores.

Scatter plot titled [...]

Two line graphs titled [...]


The podcast and its cover art on this page belong to LessWrong. The podcast's content is created by LessWrong and not by, or together with, Poddtoppen.