Why do large learning rates often produce better results? Why do “infinitely wide” networks trained using kernel methods tend to underperform ordinary networks? In this talk I will argue that these questions are related. Existing kernel-based theory can explain the dynamics of networks trained with small learning rates. However, optimal performance is often achieved at large learning rates, where we find qualitatively different dynamics that converge to flat minima. The distinction between the small and large learning rate phases becomes sharp at infinite width, and is reminiscent of nonperturbative phase transitions that appear in physical systems.
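As a rough illustration of the large-learning-rate claim (not part of the talk's own analysis), the toy sketch below runs plain gradient descent on a one-dimensional loss with a sharp minimum and a flat minimum; the specific curvatures, learning rates, and starting point are assumptions chosen only to make the effect visible.

```python
import numpy as np

# Toy 1D loss with a sharp minimum at x = -1 (second derivative 100)
# and a flat minimum at x = +1 (second derivative 2).
def loss(x):
    return min(50.0 * (x + 1.0) ** 2, (x - 1.0) ** 2)

def grad(x):
    # Gradient of whichever branch is currently the smaller one.
    if 50.0 * (x + 1.0) ** 2 < (x - 1.0) ** 2:
        return 100.0 * (x + 1.0)   # sharp basin
    return 2.0 * (x - 1.0)         # flat basin

def run_gd(lr, x0=-1.2, steps=200):
    # Plain gradient descent from a point inside the sharp basin.
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

for lr in (0.005, 0.05):
    x_final = run_gd(lr)
    basin = "sharp (curvature 100)" if abs(x_final + 1) < 0.5 else "flat (curvature 2)"
    print(f"lr={lr:<6} final x={x_final:+.3f} -> {basin} minimum")
```

With the small step size, gradient descent stays in the sharp basin it starts in; once the step size exceeds that basin's stability threshold (roughly 2 divided by the local curvature), the iterates oscillate out of it and settle in the flat minimum instead. This toy dynamic is only meant to echo, qualitatively, the large-learning-rate behavior described in the abstract.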