What is the relationship between task geometry, network architecture, and emergent feature learning dynamics in nonlinear deep networks? I will describe the Gated Deep Linear Network framework, which schematizes how pathways of information flow impact learning dynamics within an architecture. Because of the gating, these networks can compute nonlinear functions of their input. We derive an exact reduction and, for certain cases, exact solutions to the dynamics of learning. The reduction takes the form of a neural race with an implicit bias towards shared representations, which then governs the model's ability to systematically generalize, multi-task, and transfer. We give examples where depth enables qualitatively new representations to be learned, and where the rich feature learning dynamics described by the neural race reduction permit systematic generalization while lazy learning dynamics do not. Together, these results begin to shed light on the links between architecture, initialization, and network performance.
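To make the gating mechanism concrete, the following is a minimal sketch of a two-layer gated deep linear network. All names and the particular gating scheme here are illustrative assumptions, not the framework's exact construction: each layer is purely linear, but multiplicative gates that depend on the input switch pathways on and off, so the end-to-end map is nonlinear even though every active pathway is linear.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # first linear layer (hidden dim 4, input dim 3)
W2 = rng.standard_normal((2, 4))   # second linear layer (output dim 2)

def gates(x):
    # Hypothetical input-dependent binary gates on the hidden units:
    # the pathway is open only when the first input component is positive.
    return float(x[0] > 0) * np.ones(4)

def forward(x):
    h = gates(x) * (W1 @ x)        # gates multiplicatively select the pathway
    return W2 @ h                  # linear readout of the gated hidden layer

x_pos = np.array([1.0, 0.5, -0.2])   # gates open: network acts as W2 @ W1
x_neg = np.array([-1.0, 0.5, -0.2])  # gates closed: output is identically zero
print(forward(x_pos))
print(forward(x_neg))
```

Because the active network is linear within each gating regime, its learning dynamics can be analyzed with deep-linear-network tools regime by regime, which is the intuition behind the exact reduction described above.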