I will describe a personal perspective on a few key problems in learning theory at the moment. Several different architectures that perform well have emerged in addition to CNNs, such as transformers, perceivers and MLP mixers. Is there a common motif to all of them and to their good performance? A natural conjecture is that these architectures are well-suited for the approximation, learning and optimization of input-output mappings that can be represented by ‘sparse’ functions that are effectively low-dimensional. In particular, I will discuss ‘sparse’ target functions that are compositional, with a function graph whose nodes each have dimensionality at most $k$, with $k \ll d$, where $d$ is the dimensionality of the function domain.
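As an illustrative sketch (the particular constituent function and tree structure here are my own choices, not taken from the text), such a compositional target function might compose two-dimensional constituents along a binary tree, so that every node of the function graph has dimensionality $k = 2$ even though the overall domain has dimensionality $d = 8$:

```python
# Sketch of a 'sparse' compositional target function: the input
# dimensionality is d = 8, but every node in the function graph
# depends on at most k = 2 variables (k << d).

def h(a, b):
    # A generic two-dimensional constituent function (hypothetical choice).
    return (a * b + a + b) / 3.0

def f(x):
    """Binary-tree composition over an input vector of length d = 8.

    Each level pairs up its inputs and applies the low-dimensional
    constituent h, so no node ever sees more than k = 2 arguments.
    """
    assert len(x) == 8
    level = list(x)
    while len(level) > 1:
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

Here f is a function of eight variables, yet its graph is a depth-3 tree of two-dimensional nodes; the conjecture is that functions of this compositional form are the ones these architectures approximate, learn and optimize efficiently.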