John Schulman - Language Models, Search, and Overoptimization

Abstract

Language models can be dramatically improved by reward models, which predict the quality of a sample. Two approaches for combining language and reward models are searching (at test time) and fine-tuning (at training time, usually via policy gradient algorithms). These methods are key for ‘aligning’ language models – getting them to behave in a maximally helpful and truthful way. The key challenge is that if you search or optimize too much against a fixed reward model, you’ll exploit the errors in the reward model and get worse performance; this is called ‘overoptimization’. I’ll show some elementary but interesting calculations that quantify the optimization strength of different search procedures, and I’ll discuss some recent empirical results on overoptimization involving large language models.
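To illustrate the kind of elementary calculation the abstract alludes to, here is a minimal Python sketch (not from the talk itself) of best-of-n search against a reward model, together with the standard closed-form expression log(n) - (n-1)/n for the KL divergence between the best-of-n distribution and the base sampling distribution, a common way to quantify the optimization strength of this search procedure. The functions `sample_fn` and `reward_fn` are hypothetical stand-ins for a real language model sampler and reward model.

```python
import math

def best_of_n(prompt, n, sample_fn, reward_fn):
    """Test-time search: draw n samples from the base language model
    and return the one the reward model scores highest.
    `sample_fn` and `reward_fn` are placeholders for a real LM sampler
    and a learned reward model."""
    samples = [sample_fn(prompt) for _ in range(n)]
    return max(samples, key=lambda s: reward_fn(prompt, s))

def best_of_n_kl(n):
    """Closed-form KL divergence between the best-of-n distribution and
    the base sampling distribution (assuming no ties in reward):
    log(n) - (n - 1)/n. Larger KL means stronger optimization pressure
    against the reward model, and hence more risk of overoptimization."""
    return math.log(n) - (n - 1) / n

if __name__ == "__main__":
    # Optimization strength grows only logarithmically with n.
    for n in (1, 4, 16, 64, 256):
        print(f"best-of-{n}: KL = {best_of_n_kl(n):.3f} nats")
```

The logarithmic growth of this KL is one reason best-of-n search applies relatively gentle optimization pressure compared with reinforcement-learning fine-tuning, which can push the policy arbitrarily far from the base model.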

Location
SEC 1.413.