Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a link post.

New Anthropic model organisms research paper, led by Carson Denison from the Alignment Stress-Testing Team, demonstrating that large language models can generalize zero-shot from simple reward hacks (sycophancy) to more complex reward tampering (subterfuge). Our results suggest that accidentally incentivizing simple reward hacks such as sycophancy can have dramatic and very difficult-to-reverse consequences for how models generalize, up to and including generalization to editing their own reward functions and covering their tracks when doing so.

Abstract:

In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too [...]

---

First published:
June 17th, 2024

Source:
https://www.lesswrong.com/posts/FSgGBjDiaCdWxNBhj/sycophancy-to-subterfuge-investigating-reward-tampering-in

---

Narrated by TYPE III AUDIO.

