Morgane Austern - Theory for some graph embedding methods such as node2vec, and efficient concentration inequalities.


Network data is ubiquitous, and many examples can be found in domains ranging from biology to social sciences. Learning from graph data is increasingly important with tasks of interest including node classification, node clustering, and link prediction. A frequent approach begins by learning an Euclidean embedding of the network. Over the last decade node2vec and its successors have become popular methods to do so. However, despite their strong empirical performance, they are not well understood theoretically. In the first part of this talk, we will see how those methods can be rephrased through the lens of empirical risk minimization. We then characterize, under the assumption that the graph is exchangeable, the asymptotic distribution of the learned embedding vectors and prove that it asymptotically decouples. We characterize what the embedding vectors represent and how well these methods perform on downstream tasks. This asymptotic theory allows us to question the role of the hyperparameters and some of the heuristics under which the algorithms are built. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency. The proof of those results relies on concentration inequalities, inequalities that are known to be loose. Their main strength however is that they provide us with finite sample guarantees. Limit theorems, such as the central limit theorem offer much tighter bounds but that are only asymptotically valid. In the second part of this talk, we will see how we can propose new bounds that are valid in finite samples, are significantly tighter than Bernstein and Azuma, and are asymptotically tight. We will discuss the application of these concentration inequalities in machine learning and reinforcement learning.

SEC 1.413