Signal #124167NEGATIVE

Models May Behave Worse When Eval Aware

85

This is the first in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas.TL;DRIt's often assumed that models will act more aligned when they can tell they're being evaluated. But we find that Gemini can take “undesired” actions in behavioural evals even when it explicitly reasons that the environments are contrived, and sometimes this reasoning will increase the rate of undesired actions. When we dig into the model's reasoning, we find that this is typically associated with Gemini perceiving an environment as a puzzle where the aim is to achieve the goal by unconventional means (like capture the flag challenges – Gemini’s thoughts often literally call it a “CTF” challenge) or a consequence-free simulation in which it should play along – rather than recognising it as an alignment test. This complicates the usual story about evaluation awareness: detecting that an environment is synthetic doesn't reliably push...

AI Alignment Forumabout 4 hours ago
Read Full Article

Explore with AI-Powered Tools

View All Signals

Explore more AI intelligence

Want to discover more AI signals like this?

Explore Steek
Models May Behave Worse When Eval Aware | Steek AI Signal | Steek