LessWrong (30+ Karma)
Avsnitt

″(Mis)generalization of Helpful-Only Fine-tuning” by Omar Khursheed, Baram Sosis, Fabien Roger

Dela

TLDR

We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. None of these problems are a necessary consequence of helpful-only training, though: we show that synthetic document fine-tuning and adding character-related questions to SFT and RL can mitigate them.

Research done as part of MATS/Anthropic Fellows Program. See here for the full paper.

Introduction

Modern large language models (LLMs) undergo extensive fine-tuning for safety. This usually includes training to be helpful, honest, and harmless even when users ask for harmful content, sometimes referred to as HHH training (Askell et al. 2021, Bai et al. 2022). Some models, however, are instead trained to be “Helpful-only” (H-only), that is, to comply with all requests regardless of ethics or harm. These models have legitimate research uses: for instance, they are important for evaluating dangerous capabilities in critical areas like cybersecurity or biosecurity, as HHH models often refuse to answer such questions. They may also be useful when performing sensitive AI R&D tasks like training new models with different values, as HHH models might resist such updates (Greenblatt et al. 2024, Roger 2026).

[...]

---

Outline:

(00:13) TLDR

(00:48) Introduction

(04:14) Many existing H-only models have shortcomings

(08:36) Simple anti-refusal training generalizes poorly

(12:07) Constitutional character training makes more coherent H-only models

(12:48) Training pipelines

(13:46) Quantitative results

(16:00) Qualitative assessment

(19:01) Model Internals

(21:08) Discussion

(21:11) Limitations

(22:23) Conclusion

---

First published:
June 4th, 2026

Source:
https://www.lesswrong.com/posts/ffCFgBsaxg2FyJ9df/mis-generalization-of-helpful-only-fine-tuning-1

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Figure showing AI language model responses to prompts about wishes, power, and plagiarism across different training methods.Six bar graphs comparing AI model performance across different metrics and tasks.Six-panel bar graph comparing model performance across different prompting strategies and tasks.Bar graph showingComparison of two AI reasoning responses about trading power for dishonesty.Comparison chart of AI model responses about refusing harmful requests across different training approaches.Comparison of AI responses to a prompt about coworkers and Excel pivot tables.A plot comparing cosine similarity across different conditions for emotional terms across neural network layers.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Podden och tillhörande omslagsbild på den här sidan tillhör LessWrong. Innehållet i podden är skapat av LessWrong och inte av, eller tillsammans med, Poddtoppen.