Numerical optimization is the key to training modern machine learning models. Most optimization methods, like stochastic gradient descent or Gauss-Newton, minimize an energy function using gradient information. However, computing the gradient by hand is model-specific, difficult, and time-consuming. With the increasing popularity of deep learning, there was a need for reliable algorithms that compute the exact derivatives of a complex model (e.g. a deep neural network) automatically and quickly. Automatic differentiation (autodiff) computes exact function derivatives directly from a high-level implementation of the function in programming languages like C++ and Python.
What is Automatic Differentiation?
It is important to note that autodiff is not numerical or symbolic differentiation. Autodiff provides exact derivative values (like symbolic differentiation) at point-wise inputs (like numerical differentiation). However, autodiff never explicitly expands or computes the expressions for the derivatives. Instead, it overloads the standard mathematical operators, such as addition and multiplication, and augments each number with an infinitesimal part $\epsilon$ such that $\epsilon \neq 0$ but $\epsilon^2 = 0$ (such augmented numbers are known as dual numbers). The concept is very similar to complex numbers, except for the $\epsilon^2 = 0$ part. The main benefit is the convenience of computing exact derivatives without deriving symbolic expressions.
Consider, for example, the function $f(x) = x^2$. Evaluated at $x = 3$, we have $f(3) = 9$ and $f'(3) = 6$. Using auto-differentiation, we evaluate the function at $x = 3 + \epsilon$ to get $f(3 + \epsilon) = (3 + \epsilon)^2 = 9 + 6\epsilon + \epsilon^2 = 9 + 6\epsilon$, since we have $\epsilon^2 = 0$. The scalar part is the function value and the coefficient of $\epsilon$ is the derivative. In this particular example, $\epsilon$ was a scalar value; however, for a function with $n$ inputs, $\epsilon$ is a vector of size $n$, and the $i$-th component of the infinitesimal part denotes the derivative of $f$ w.r.t. the $i$-th input.
Implementing Auto-differentiation
The standard math library in C++ or Python does not support floating-point numbers with an infinitesimal part. Furthermore, standard math operators like addition do not operate on numbers with infinitesimal parts. So we need to implement two things:
1. Numbers with infinitesimal parts, say a class with a scalar part and an infinitesimal part.
2. Overloads of all standard math functions that accept these numbers and propagate the infinitesimal parts to their outputs.
For example, multiplication becomes $(a + b\epsilon)(c + d\epsilon) = ac + (ad + bc)\epsilon$, since the $\epsilon^2$ term vanishes.
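As a rough sketch of such an implementation (not the Ceres code; the class name and members here are illustrative), a minimal dual-number type with overloaded addition and multiplication could look like this in C++:

```cpp
#include <iostream>

// Minimal dual number: scalar value `a` plus infinitesimal part `b`
// (the coefficient of epsilon).
struct Dual {
  double a;  // function value
  double b;  // derivative part
};

// (a + b*eps) + (c + d*eps) = (a + c) + (b + d)*eps
Dual operator+(const Dual& x, const Dual& y) {
  return Dual{x.a + y.a, x.b + y.b};
}

// (a + b*eps) * (c + d*eps) = a*c + (a*d + b*c)*eps, since eps^2 = 0
Dual operator*(const Dual& x, const Dual& y) {
  return Dual{x.a * y.a, x.a * y.b + x.b * y.a};
}

int main() {
  // Evaluate f(x) = x*x + x at x = 3, seeding the infinitesimal part with 1.
  Dual x{3.0, 1.0};
  Dual f = x * x + x;
  std::cout << "f(3)  = " << f.a << "\n";  // 12
  std::cout << "f'(3) = " << f.b << "\n";  // 7
  return 0;
}
```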
Many optimization/machine learning libraries support auto-differentiation. In C++, the Ceres-Solver ships with an auto-differentiation engine named Jet. For convenience, I have extracted the Jet files and put them in their own repository on GitHub. Eigen also has an unsupported autodiff module. In Python, all major ML libraries, including TensorFlow and PyTorch, support autodiff. There are even standalone libraries like Autograd.
Autodiff in common libraries
For example, say I want to compute the derivatives of the Rosenbrock function. The Rosenbrock function is defined as

$$f(x, y) = (a - x)^2 + b\,(y - x^2)^2$$

This is a particularly interesting function: it has a global minimum at $(a, a^2)$ (the point $(1, 1)$ for the common choice $a = 1$, $b = 100$), which lies inside a long, narrow valley that makes it a popular test problem for optimization algorithms.
If I want to compute the derivatives of this function w.r.t. x and y, all I have to do is implement the function itself in a high-level language like C++ or Python. For example, if I were to use C++ and the Jet autodiff library, the function would look roughly like the sketch below.
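This is a sketch rather than the original listing; it assumes the standard parameters $a = 1$, $b = 100$ and is templated so the same code works with plain doubles and with Jet types:

```cpp
// Rosenbrock function, templated on the scalar type so it can be
// evaluated with doubles or with ceres::Jet for auto-differentiation.
template <typename T>
T Rosenbrock(const T& x, const T& y) {
  const double a = 1.0;
  const double b = 100.0;
  return (a - x) * (a - x) + b * (y - x * x) * (y - x * x);
}
```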
Similarly, in Python, the implementation is just an ordinary function.
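A sketch, again assuming the standard parameters $a = 1$, $b = 100$:

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    # Works with plain floats as well as with autodiff number types.
    return (a - x) ** 2 + b * (y - x ** 2) ** 2
```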
Ceres-Solver Jet (C++)
In C++, the derivatives of the Rosenbrock function can be computed using Jets as follows (a complete example is provided here).
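The listing below is a sketch of such a driver rather than the post's original code; it relies only on the public ceres::Jet interface (the scalar value in .a and the derivative vector in .v):

```cpp
#include <iostream>
#include "ceres/jet.h"

// Same templated Rosenbrock as above.
template <typename T>
T Rosenbrock(const T& x, const T& y) {
  const double a = 1.0;
  const double b = 100.0;
  return (a - x) * (a - x) + b * (y - x * x) * (y - x * x);
}

int main() {
  // A Jet with 2 infinitesimal components: one per input variable.
  using Jet2 = ceres::Jet<double, 2>;

  // Jet(value, k) seeds the k-th derivative component with 1.
  Jet2 x(2.0, 0);
  Jet2 y(3.0, 1);

  Jet2 f = Rosenbrock(x, y);
  std::cout << "f(2, 3) = " << f.a << "\n";     // function value
  std::cout << "df/dx   = " << f.v[0] << "\n";  // derivative w.r.t. x
  std::cout << "df/dy   = " << f.v[1] << "\n";  // derivative w.r.t. y
  return 0;
}
```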
PyTorch (Python)
Auto-differentiation in PyTorch is stupidly easy. We just have to add a requires_grad flag to our input variable, and Torch will take care of the rest.
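A minimal sketch of such a script (the input values and variable names are illustrative):

```python
import torch

def rosenbrock(x, y, a=1.0, b=100.0):
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

# Pack both inputs into a single tensor so x.grad holds the full gradient.
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = rosenbrock(x[0], x[1])

y.backward()     # back-propagate from the output y to the input x
print(y.item())  # function value
print(x.grad)    # gradient: [df/dx, df/dy]
```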
Notice how, at the end, before reading x.grad we had to call the backward() method on our output. The backward function back-propagates the derivatives from the output y to the input variable x. This is in contrast to Ceres-Solver, where there was no need to backpropagate: Ceres-Solver's Jet is a forward-mode auto-differentiation engine, whereas PyTorch uses reverse-mode (backward) auto-differentiation.
Autograd (Python)
Auto-differentiation in Autograd is also very straightforward.
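A sketch using Autograd's grad function (the input values are illustrative):

```python
import autograd.numpy as np
from autograd import grad

def rosenbrock(xy, a=1.0, b=100.0):
    # Single array argument so grad() returns the full gradient at once.
    return (a - xy[0]) ** 2 + b * (xy[1] - xy[0] ** 2) ** 2

rosenbrock_grad = grad(rosenbrock)  # a new function that evaluates the gradient

point = np.array([2.0, 3.0])
print(rosenbrock(point))       # function value
print(rosenbrock_grad(point))  # array([df/dx, df/dy])
```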
TensorFlow (Python)
Finally, in TensorFlow, the same derivatives can be computed as well.
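A sketch using TensorFlow 2's tf.GradientTape API (the original listing may have targeted a different TensorFlow version; the input values are illustrative):

```python
import tensorflow as tf

def rosenbrock(x, y, a=1.0, b=100.0):
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

x = tf.Variable(2.0)
y = tf.Variable(3.0)

# Record the forward computation on a tape, then query the gradients.
with tf.GradientTape() as tape:
    f = rosenbrock(x, y)

dfdx, dfdy = tape.gradient(f, [x, y])
print(f.numpy(), dfdx.numpy(), dfdy.numpy())
```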
Autodiff limitations
Autodiff comes with its own set of limitations and penalties. Autodiff computes the exact derivative at a given point, so it is implicitly assumed that both the left and right derivatives exist and agree at that point.
For example, consider the function $f(x) = |x|$. We have $f'(x) = -1$ for $x < 0$ and $f'(x) = +1$ for $x > 0$. Using autodiff, the derivative at $x = 0$ is not defined.
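A quick illustration in PyTorch (the value returned at the kink is a library convention, not a true derivative):

```python
import torch

# f(x) = |x| has left derivative -1 and right derivative +1 at x = 0,
# so the derivative there does not exist.
x = torch.tensor(0.0, requires_grad=True)
y = torch.abs(x)
y.backward()
print(x.grad)  # the library returns a conventional (subgradient) value
               # rather than signaling that the derivative is undefined
```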
In comparison to an analytically computed Jacobian, autodiff is a lot slower and more memory-hungry, which can prohibit its use in real-time embedded applications like computer vision and robotics. I absolutely recommend autodiff for rapid prototyping; however, for shipping a production system it is better to compute the Jacobian using symbolic methods (such as the SymPy package) and then replace the autodiff code with the analytical function. For example, Sophus uses symbolic methods to compute the derivatives and replaces autodiff with the resulting analytical code.
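As a sketch of that workflow (the SymPy usage here is illustrative, not Sophus's actual pipeline), the Rosenbrock Jacobian can be derived symbolically once and then pasted into production code:

```python
import sympy as sp

x, y = sp.symbols('x y')
a, b = 1, 100
f = (a - x) ** 2 + b * (y - x ** 2) ** 2

# Symbolic partial derivatives; mathematically equal to
# df/dx = -2*(a - x) - 4*b*x*(y - x**2) and df/dy = 2*b*(y - x**2).
dfdx = sp.simplify(sp.diff(f, x))
dfdy = sp.simplify(sp.diff(f, y))
print(dfdx)
print(dfdy)
```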
Online resources
Havard Berland has a deck of slides on autodiff, which is one of the best practical introductions to the topic. Recently, Baydin et al. surveyed automatic differentiation in machine learning in JMLR. The Ceres-Solver tutorials also cover auto-differentiation, as well as numerical and analytical differentiation, in Ceres.
In the follow-up post, I'll talk about how to use the chain rule in the autodiff framework.