″(Mis)generalization of Helpful-Only Fine-tuning” by Omar Khursheed, Baram Sosis, Fabien Roger - LessWrong (30+ Karma) | Lyssna här

TLDR

We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. None of these problems are a necessary consequence of helpful-only training, though: we show that synthetic document fine-tuning and adding character-related questions to SFT and RL can mitigate them.

Research done as part of MATS/Anthropic Fellows Program. See here for the full paper.

Introduction

Modern large language models (LLMs) undergo extensive fine-tuning for safety. This usually includes training to be helpful, honest, and harmless even when users ask for harmful content, sometimes referred to as HHH training (Askell et al. 2021, Bai et al. 2022). Some models, however, are instead trained to be “Helpful-only” (H-only), that is, to comply with all requests regardless of ethics or harm. These models have legitimate research uses: for instance, they are important for evaluating dangerous capabilities in critical areas like cybersecurity or biosecurity, as HHH models often refuse to answer such questions. They may also be useful when performing sensitive AI R&D tasks like training new models with different values, as HHH models might resist such updates (Greenblatt et al. 2024, Roger 2026).

[...]

---

Outline:

(00:13) TLDR

(00:48) Introduction

(04:14) Many existing H-only models have shortcomings

(08:36) Simple anti-refusal training generalizes poorly

(12:07) Constitutional character training makes more coherent H-only models

(12:48) Training pipelines

(13:46) Quantitative results

(16:00) Qualitative assessment

(19:01) Model Internals