Tengyu Ma - Explaining the Regularization Effect of Initial Large Learning Rate


Algorithms in deep learning have a regularization effect: different optimizers with different hyper-parameters, run on the same training objective, converge to different local minima with different test performance. A concrete example is that a relatively large initial learning rate often leads to better final generalization. Towards explaining this phenomenon, we devise a setting in which a two-layer network trained with a large initial learning rate and annealing provably generalizes better than the same network trained with a small learning rate from the start. I will also discuss potential practical mitigation strategies, inspired by the theory, to address the lack of algorithmic regularization, including adding noise to activations or adding explicit regularization. Based on joint work with Yuanzhi Li and Colin Wei. https://arxiv.org/abs/1907.04595 https://arxiv.org/abs/1905.03684
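The two training regimes contrasted in the abstract can be sketched as learning-rate schedules. This is a minimal illustrative sketch, not the exact setup from the papers: the function names, learning-rate values, and annealing point are all hypothetical choices for illustration.

```python
def annealed_lr(step, total_steps, large_lr=0.1, small_lr=0.001,
                anneal_frac=0.5):
    """Large initial learning rate, annealed to a small one partway through.

    Illustrative values only; the actual schedule in the papers may differ.
    """
    if step < anneal_frac * total_steps:
        return large_lr
    return small_lr


def small_lr(step, total_steps, lr=0.001):
    """Baseline regime: a small learning rate from the very start."""
    return lr


# Example: the annealed schedule over 10 training steps
print([annealed_lr(s, 10) for s in range(10)])
# → [0.1, 0.1, 0.1, 0.1, 0.1, 0.001, 0.001, 0.001, 0.001, 0.001]
```

The claim of the talk is that, in the setting analyzed, SGD under the first schedule reaches a solution with provably better test performance than SGD under the second, even though both minimize the same training objective.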

Seminar Talk
60 Oxford St, Room 330. Cambridge, Massachusetts