Signal #125950NEUTRAL

Synthetic document finetuning for instilling positive traits

90

This is the fifth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The fourth post can be found here.TLDR: Via adapting the methods of Marks et al and Li et al, we train Gemini 3 Flash to have certain traits/values by midtraining it on documents about how Gemini has those properties, followed by finetuning it on synthetic chat data where it demonstrates those properties. The chat finetuning is effective for instilling the traits robustly, working OOD. We share some takeaways on how to improve midtraining & SFT effectiveness.IntroductionInspired by Marks et al, where a multi-step finetuning process involving synthetic documents is used to create a model robustly pursuing a complex goal (taking actions favoured by a reward model), we wanted to use this method to robustly instil positive traits instead. Our motivation was deep alignment: we want to train principles into the model which guide beh...

AI Alignment Forumabout 3 hours ago
Read Full Article

Explore with AI-Powered Tools

View All Signals

Explore more AI intelligence

Want to discover more AI signals like this?

Explore Steek
Synthetic document finetuning for instilling positive traits | Steek AI Signal | Steek