Numerical optimization is the key to training modern machine learning models. Most optimization methods, like stochastic gradient descent or Gauss-Newton, minimize an energy function using gradient information. However, computing the gradient by hand is model-specific, difficult, and time-consuming. With the increasing popularity of deep learning, there was a need for reliable algorithms that compute the exact derivatives of a complex model (e.g. a deep neural network) automatically and quickly. Automatic differentiation (autodiff) computes exact function derivatives directly from a high-level implementation of the function in programming languages like C++ and Python.
What is Automatic Differentiation?
It is important to note that autodiff is not numerical or symbolic differentiation. Autodiff provides exact derivative values (like symbolic differentiation) at point-wise inputs (like numerical differentiation). However, autodiff never explicitly expands or computes the expressions for the derivatives. Instead, it overloads the standard mathematical operators, such as addition and multiplication, and augments each number with an infinitesimal part $\epsilon$ such that $\epsilon \neq 0$ but $\epsilon^2 = 0$ (such augmented numbers are known as dual numbers). The concept is very similar to complex numbers, except for the $\epsilon^2 = 0$ part. The main benefit is the convenience of computing exact derivatives without deriving symbolic expressions.
Consider, for example, the function $f(x) = x^2$. Evaluated at $x = 3$, we have $f(3) = 9$ and $f'(3) = 6$. Using auto-differentiation, we evaluate the function at $x = 3 + \epsilon$ to get $f(3 + \epsilon) = (3 + \epsilon)^2 = 9 + 6\epsilon + \epsilon^2 = 9 + 6\epsilon$, since we have $\epsilon^2 = 0$. The scalar part is the function value and the coefficient of $\epsilon$ is the derivative. In this particular example, $\epsilon$ was a scalar value; however, for a function with $n$ inputs, $\epsilon$ is a vector of size $n$, and the $i$-th component of the infinitesimal part denotes the derivative of $f$ w.r.t. the $i$-th input.
Implementing Auto-differentiation
The standard math library in C++ or Python does not support floating-point numbers with an infinitesimal part. Furthermore, standard math operators like addition do not operate on numbers with infinitesimal parts. So we need to implement two things:
1. Numbers with infinitesimal parts, say a class with a scalar part and an infinitesimal part.
2. Overloads of all standard math functions that accept these numbers and propagate the infinitesimal parts to their outputs.
For example, multiplication becomes $(a + b\epsilon)(c + d\epsilon) = ac + (ad + bc)\epsilon$, since the $\epsilon^2$ term vanishes.
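As a rough sketch of such an implementation (not the Ceres code; the class name and members here are illustrative), a minimal dual-number type with overloaded addition and multiplication could look like this in C++:

```cpp
#include <iostream>

// Minimal dual number: scalar value `a` plus infinitesimal part `b`
// (the coefficient of epsilon).
struct Dual {
  double a;  // function value
  double b;  // derivative part
};

// (a + b*eps) + (c + d*eps) = (a + c) + (b + d)*eps
Dual operator+(const Dual& x, const Dual& y) {
  return Dual{x.a + y.a, x.b + y.b};
}

// (a + b*eps) * (c + d*eps) = a*c + (a*d + b*c)*eps, since eps^2 = 0
Dual operator*(const Dual& x, const Dual& y) {
  return Dual{x.a * y.a, x.a * y.b + x.b * y.a};
}

int main() {
  // Evaluate f(x) = x*x + x at x = 3, seeding the infinitesimal part with 1.
  Dual x{3.0, 1.0};
  Dual f = x * x + x;
  std::cout << "f(3)  = " << f.a << "\n";  // 12
  std::cout << "f'(3) = " << f.b << "\n";  // 7
  return 0;
}
```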
Many optimization/machine learning libraries support auto-differentiation. In C++, the Ceres-Solver ships with an auto-differentiation engine named Jet. For convenience, I have extracted the Jet files and put them in their own repository on GitHub. Eigen also has an unsupported autodiff module. In Python, all major ML libraries, including TensorFlow and PyTorch, support autodiff. There are even standalone libraries like Autograd.
Autodiff in common libraries
For example, say I want to compute the derivatives of the Rosenbrock function. The Rosenbrock function is defined as

$$f(x, y) = (a - x)^2 + b\,(y - x^2)^2$$

This is a particularly interesting function: it has a global minimum at $(a, a^2)$ (the point $(1, 1)$ for the common choice $a = 1$, $b = 100$), which lies inside a long, narrow valley that makes it a popular test problem for optimization algorithms.
If I want to compute the derivatives of this function w.r.t. x and y, all I have to do is implement the function itself in a high-level language like C++ or Python. For example, if I were to use C++ and the Jet autodiff library, the function would look roughly like the sketch below.
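This is a sketch rather than the original listing; it assumes the standard parameters $a = 1$, $b = 100$ and is templated so the same code works with plain doubles and with Jet types:

```cpp
// Rosenbrock function, templated on the scalar type so it can be
// evaluated with doubles or with ceres::Jet for auto-differentiation.
template <typename T>
T Rosenbrock(const T& x, const T& y) {
  const double a = 1.0;
  const double b = 100.0;
  return (a - x) * (a - x) + b * (y - x * x) * (y - x * x);
}
```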
Similarly, in Python, the implementation is just an ordinary function.
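A sketch, again assuming the standard parameters $a = 1$, $b = 100$:

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    # Works with plain floats as well as with autodiff number types.
    return (a - x) ** 2 + b * (y - x ** 2) ** 2
```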
Ceres-Solver Jet (C++)
In C++, the derivatives of the Rosenbrock function can be computed using Jets as follows (a complete example is provided here).
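The listing below is a sketch of such a driver rather than the post's original code; it relies only on the public ceres::Jet interface (the scalar value in .a and the derivative vector in .v):

```cpp
#include <iostream>
#include "ceres/jet.h"

// Same templated Rosenbrock as above.
template <typename T>
T Rosenbrock(const T& x, const T& y) {
  const double a = 1.0;
  const double b = 100.0;
  return (a - x) * (a - x) + b * (y - x * x) * (y - x * x);
}

int main() {
  // A Jet with 2 infinitesimal components: one per input variable.
  using Jet2 = ceres::Jet<double, 2>;

  // Jet(value, k) seeds the k-th derivative component with 1.
  Jet2 x(2.0, 0);
  Jet2 y(3.0, 1);

  Jet2 f = Rosenbrock(x, y);
  std::cout << "f(2, 3) = " << f.a << "\n";     // function value
  std::cout << "df/dx   = " << f.v[0] << "\n";  // derivative w.r.t. x
  std::cout << "df/dy   = " << f.v[1] << "\n";  // derivative w.r.t. y
  return 0;
}
```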
PyTorch (Python)
Auto-differentiation in PyTorch is stupidly easy. We just have to add a requires_grad flag to our input variable, and Torch will take care of the rest.
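A minimal sketch of such a script (the input values and variable names are illustrative):

```python
import torch

def rosenbrock(x, y, a=1.0, b=100.0):
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

# Pack both inputs into a single tensor so x.grad holds the full gradient.
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = rosenbrock(x[0], x[1])

y.backward()     # back-propagate from the output y to the input x
print(y.item())  # function value
print(x.grad)    # gradient: [df/dx, df/dy]
```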
Notice how, at the end, before reading x.grad we had to call the backward() method on our output. The backward function back-propagates the derivatives from the output y to the input variable x. This is in contrast to Ceres-Solver, where there was no need to backpropagate: Ceres-Solver's Jet is a forward-mode auto-differentiation engine, whereas PyTorch uses reverse-mode (backward) auto-differentiation.
Autograd (Python)
Auto-differentiation in Autograd is also very straightforward.
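A sketch using Autograd's grad function (the input values are illustrative):

```python
import autograd.numpy as np
from autograd import grad

def rosenbrock(xy, a=1.0, b=100.0):
    # Single array argument so grad() returns the full gradient at once.
    return (a - xy[0]) ** 2 + b * (xy[1] - xy[0] ** 2) ** 2

rosenbrock_grad = grad(rosenbrock)  # a new function that evaluates the gradient

point = np.array([2.0, 3.0])
print(rosenbrock(point))       # function value
print(rosenbrock_grad(point))  # array([df/dx, df/dy])
```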
TensorFlow (Python)
Finally, in TensorFlow, the same derivatives can be computed as well.
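A sketch using TensorFlow 2's tf.GradientTape API (the original listing may have targeted a different TensorFlow version; the input values are illustrative):

```python
import tensorflow as tf

def rosenbrock(x, y, a=1.0, b=100.0):
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

x = tf.Variable(2.0)
y = tf.Variable(3.0)

# Record the forward computation on a tape, then query the gradients.
with tf.GradientTape() as tape:
    f = rosenbrock(x, y)

dfdx, dfdy = tape.gradient(f, [x, y])
print(f.numpy(), dfdx.numpy(), dfdy.numpy())
```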
Autodiff limitations
Autodiff comes with its own set of limitations and penalties. Autodiff computes the exact derivative at a given point, so it is implicitly assumed that both the left and right derivatives exist and agree at that point.
For example, consider the function $f(x) = |x|$. We have $f'(x) = -1$ for $x < 0$ and $f'(x) = +1$ for $x > 0$. Using autodiff, the derivative at $x = 0$ is not defined.
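A quick illustration in PyTorch (the value returned at the kink is a library convention, not a true derivative):

```python
import torch

# f(x) = |x| has left derivative -1 and right derivative +1 at x = 0,
# so the derivative there does not exist.
x = torch.tensor(0.0, requires_grad=True)
y = torch.abs(x)
y.backward()
print(x.grad)  # the library returns a conventional (subgradient) value
               # rather than signaling that the derivative is undefined
```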
In comparison to an analytically computed Jacobian, autodiff is a lot slower and more memory-hungry, which can prohibit its use in real-time embedded applications like computer vision and robotics. I absolutely recommend autodiff for rapid prototyping; however, for shipping a production system it is better to compute the Jacobian using symbolic methods (such as the SymPy package) and then replace the autodiff code with the analytical function. For example, Sophus uses symbolic methods to compute the derivatives and replaces autodiff with the resulting analytical code.
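As a sketch of that workflow (the SymPy usage here is illustrative, not Sophus's actual pipeline), the Rosenbrock Jacobian can be derived symbolically once and then pasted into production code:

```python
import sympy as sp

x, y = sp.symbols('x y')
a, b = 1, 100
f = (a - x) ** 2 + b * (y - x ** 2) ** 2

# Symbolic partial derivatives; mathematically equal to
# df/dx = -2*(a - x) - 4*b*x*(y - x**2) and df/dy = 2*b*(y - x**2).
dfdx = sp.simplify(sp.diff(f, x))
dfdy = sp.simplify(sp.diff(f, y))
print(dfdx)
print(dfdy)
```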
Online resources
Havard Berland has a deck of slides on autodiff, which is one of the best practical introductions to the topic. Recently, Baydin et al. surveyed automatic differentiation in machine learning in JMLR. The Ceres-Solver tutorials also cover auto-differentiation, as well as numerical and analytical differentiation, in Ceres.
In the follow-up post, I'll talk about how to use the chain rule in the autodiff framework.