An Illustrated Guide to Automatic Sparse Differentiation

Published: 23 Jan 2025, Last Modified: 26 Feb 2025 · ICLR 2025 Blogpost Track · CC BY 4.0
Blogpost Url: https://d2jud02ci9yv69.cloudfront.net/2025-04-28-sparse-autodiff-95/blog/sparse-autodiff/
Abstract: In numerous applications of machine learning, Hessians and Jacobians exhibit sparsity, a property that can be leveraged to vastly accelerate their computation. While the use of automatic differentiation in machine learning is ubiquitous, automatic sparse differentiation (ASD) remains largely unknown. This post introduces ASD, explaining its key components and their roles in the computation of both sparse Jacobians and Hessians. We conclude with a practical demonstration showcasing the performance benefits of ASD.

First-order optimization is ubiquitous in machine learning (ML), but second-order optimization is much less common. The intuitive reason is that high-dimensional vectors (gradients) are cheap, whereas high-dimensional matrices (Hessians) are expensive. Luckily, in numerous applications of ML to science or engineering, Hessians and Jacobians exhibit sparsity: most of their coefficients are known to be zero. Leveraging this sparsity can vastly accelerate automatic differentiation (AD) for Hessians and Jacobians while decreasing its memory requirements. Yet, while traditional AD is available in many high-level programming languages like Python and Julia, automatic sparse differentiation (ASD) is not as widely used. One reason is that the underlying theory was developed outside of the ML research ecosystem, by people more familiar with low-level programming languages. With this blog post, we aim to shed light on the inner workings of ASD, bridging the gap between the ML and AD communities by presenting well-established techniques from the latter field.

We start with a short introduction to traditional AD, covering the computation of Jacobians in both forward and reverse mode. We then dive into the two primary components of ASD: sparsity pattern detection and matrix coloring. Having described the computation of sparse Jacobians, we move on to sparse Hessians. We conclude with a practical demonstration of ASD, providing performance benchmarks and guidance on when to use ASD over AD.
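To make the core idea concrete, here is a minimal sketch of how sparsity plus coloring reduces the cost of a Jacobian. It assumes JAX for the forward-mode JVPs; the toy function `f`, the hand-supplied parity coloring, and all variable names are illustrative choices of ours, not code from the post (the demonstration in the post relies on the authors' own packages). The point is only that a Jacobian with a known bidiagonal sparsity pattern needs 2 JVPs instead of one per column.

```python
import jax
import jax.numpy as jnp

# Toy function with a known bidiagonal Jacobian:
# output i depends only on inputs i and i+1.
def f(x):
    return x[:-1] * x[1:]

n = 6
x = jnp.arange(1.0, n + 1)

# Structural coloring (assumed known here): columns of the same parity
# never share a nonzero row of the Jacobian, so they can be probed together.
colors = [j % 2 for j in range(n)]  # 2 colors instead of n columns

# One JVP (forward-mode AD) per color: seed with the sum of the
# basis vectors belonging to that color.
compressed = []
for c in range(2):
    seed = jnp.array([1.0 if colors[j] == c else 0.0 for j in range(n)])
    _, jvp_out = jax.jvp(f, (x,), (seed,))
    compressed.append(jvp_out)

# Decompression: row i has nonzeros only in columns i and i+1, which carry
# different colors, so each Jacobian entry can be read off directly.
jacobian_entries = {}
for i in range(n - 1):
    for j in (i, i + 1):
        jacobian_entries[(i, j)] = float(compressed[colors[j]][i])

print(jacobian_entries)  # e.g. (0, 0) -> x[1] = 2.0 and (0, 1) -> x[0] = 1.0
```

In a real ASD pipeline the sparsity pattern and the coloring would be computed automatically (sparsity pattern detection and matrix coloring, the two components discussed in the post), rather than written down by hand as in this sketch.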
Conflict Of Interest: We use our own open source code in our demonstration of ASD. We have contacted the ICLR Blogpost Track organizers about this and have received their green light. We append the full conversation with the ICLR organizers for full transparency. Importantly, this means that the software citations in the demonstration section can be used to identify the authors of the post (we do not cite any of our published papers, only the relevant code repositories).

(Authors, November 16, 2024, 20:06) Dear ICLR Blogpost Track organizers, My coauthors and I intend to submit an illustrated introductory guide to sparse automatic differentiation (AD). This research domain is largely unknown to the ML community and ties in well with last year's spotlight submission "How to compute Hessian-vector products?". We are also the developers of open source sparse AD software and are working on several publications in this field (although none have been submitted yet). That is why we write to you to obtain feedback on potential conflicts of interest. Our blog post's goal is to introduce the theory behind sparse AD, with custom illustrations created specifically for this purpose. These theoretical aspects are not tied to any specific package or method; instead, they provide an intuitive overview of core concepts and mainstream approaches in the field. However, we are wondering whether to add a last section that demonstrates the practical use and performance benefits of sparse AD. This section would leverage our own software packages, as they seem to be the only generic sparse AD implementations available in high-level languages (including Python, R, and Julia). We recognize that this part could be seen as self-promotion. If desired, this demonstration could therefore be omitted. Could you please advise whether either the theoretical content or the practical demonstration would constitute a conflict of interest? Best regards, [First author's name]

(ICLR Organizers, November 16, 2024, 20:09) Hi [First author's name], the topic sounds very interesting and underrepresented. You can include both aspects! Best, Leo on behalf of the blogpost organizers
Submission Number: 59
