Francesco Montanari

Automatic differentiation (or autodiff) is one of the most unfairly neglected numerical algorithms for scientific computing. Fortunately, widely popular machine learning libraries recently started to adopt it (like TensorFlow, for gradients computation).

Automatic differentiation refers to a set of computational procedures to calculate automatically exact (or better, at working precision) derivatives of any function (including any computer program), using at most a small constant factor more arithmetic operations than the original function. Hence, it differs from:

Finite differences: fast method, but introduces large numerical errors depending on the choice of the finite step. Although the dependence can be slightly relieved with some algorithm (e.g., the five-point stencil), errors are significantly larger than working precision and require careful iterative accuracy tests. Finite differences are unreliable in the vicinity of discontinuities.

Symbolic differentiation: the chain rule is a mechanical procedure easy to apply programmatically. It relies on the fact that a function is written in terms of other elementary mathematical functions whose derivative is known. However, symbolic differentiation of arbitrary programs is less straightforward. Furthermore, it soon becomes inefficient. For instance, let's consider the iteration n=4 of the logistic map defined by l_n+1 = 4l_n(1-l_n) and l₁ = x (code scripted in Maxima):

(%i1) f(x) := 64 * x * (1 - x) * (1 - 2*x)^2 * (1 - 8*x + 8*x^2)^2;
                                             2               2 2
(%o1)          f(x) := 64 x (1 - x) (1 - 2 x)  (1 - 8 x + 8 x )

The symbolic derivative obtained through systematic application of the chain rule is:

(%i2) diff(f(x), x);
                                     2           2
(%o2) (- 256 (1 - 2 x) (1 - x) x (8 x  - 8 x + 1) )
               2       2           2               2             2           2
 - 64 (1 - 2 x)  x (8 x  - 8 x + 1)  + 64 (1 - 2 x)  (1 - x) (8 x  - 8 x + 1)
                2                          2
 + 128 (1 - 2 x)  (1 - x) x (16 x - 8) (8 x  - 8 x + 1)

Let's simplify the result:

(%i3) ratsimp(%o2);
                 7            6           5           4           3          2
(%o3) (- 131072 x ) + 458752 x  - 638976 x  + 450560 x  - 168960 x  + 32256 x
                                                                  - 2688 x + 64

This shows that symbolic differentiation can produce large symbolic expressions that take correspondingly long to evaluate (problem known as expression swell).

Automatic differentiation solve the problems described above.

Overview using dual numbers

In its simpler form (called forward-mode automatic differentiation), the idea is straightforward. Given a real number x, define its dual x+x_dε such that

(x+x_dε) + (y+y_dε) = (x+y) + (x_d + y_d)ε
(x+x_dε) (y+y_dε) = xy + (xy_d + yx_d )ε

In other words, terms of order ε² (or larger) are set to zero by definition. Then, the Taylor-expansion of a scalar function f: R → R reads:

f(x+x_dε) = f(x) + f'(x)x_dε

Setting arbitrarily x_d = 1, the coefficients of the ε term provide the first derivative of the function, f'(x).

Take a polynomial p(x) = p₀ + p₁x + p₂x² as an example. The dual version reads:

p(x+x_dε) = p₀ + p₁(x+x_dε) + p₂(x+x_dε)²
= p₀ + p₁x + p₂x² + (p₁ + 2p₂x)x_dε
= p(x) + p'(x)x_dε

Hence, given the dual version of the polynomial p(x) it is easy to extract automatically its derivative p'(x) = p₁ + 2p₂x as the coefficient of the ε term when x_d = 1. In procedural terms, given a function f the idea is to advance as:

extract-eps(dual(f)(x + 1ε))

Every computer program eventually executes a sequence of elementary arithmetic operations (addition, multiplication, exponential, etc.). Defining how dual numbers operate on such basic operations and functions then allows to take derivatives at working precision of any computer program. Automatic derivatives are well defined also close to discontinuities, and at most only a small factor more operations are required compared to evaluating the original function.

For instance, if a language defines a double type, a new adouble type can be defined to operate on dual numbers overloading all fundamental operations (if the language allows operator overloading, like C++). Conceptually equivalent, but practically different procedures are also possible. Some library rely on source-to-source transformation, i.e., it reads a given source file, analyzes and transforms it automatically writing a new source file (this can be done also with languages not allowing operator overloading, like C). Yet another approach is to write a programming language with autodiff primitives (being acquainted with some Lisp dialect can ease this task).

It is easy to generalize the method to higher-order derivatives, and to functions f: Rⁿ → R^m. However, while the procedure described above is efficient if n ≥ m, in the case n < m a different approach called reverse-mode automatic differentiation is most suited.

Limitations

Implementing automatic differentiation is rather involved and faces subtle issues (for instance, perturbation confusion), so it is preferable to rely on existing libraries rather than implementing it from scratch. Of course, the algorithm complexity depends on the specific programming language.

Given a programming language with powerful enough abstraction, relatively simple implementations allow impressive results. This is the case for Lisp dialects. (See scmutils, implemented in Scheme and described in the notation appendix of Structure and Interpretation of Classical Mechanics; note that scmutils also implements symbolic differentiation, not to be confused.)

Most commonly used, but less powerful languages (e.g., C/C++) require significantly more intricated design patterns. Several libraries are available. E.g., ADOL-C allows arbitrary differentiation order using operator over-loading in C++, both for forward and reverse modes. However, its usage requires to write C/C++ code in a peculiar form (recording operation on a so-called tape structure) that assumes a good understanding of the underlying implementation (the documentation is clear, but has a rather steep learning curve). Furthermore, all operators and function must be defined in terms of dual numbers. External functions need to be registered first and, while this is a welcome feature of the library, it is difficult to use. The alternative is to re-implement a function from scratch, that can be easy (e.g., a power operator) or cumbersome (e.g., a Bessel function). Other libraries like ADEPT are easier to use and have a friendly documentation, but are limited to first derivatives. More libraries for different programming languages are listed at autodiff.org.

Outlook

Automatic differentiation solves all the main issues related to finite differences and symbolic differentiation. The usage problems outlined above are more related to autodiff not being widely used, rather than with intrinsic limitations of the algorithm. (And to the lack of abstraction in specific languages, but this is a fundamental problem not specific to autodiff.) Fortunately, advances in machine learning and AI are bringing deserved attention to it.

One powerful and user-friendly implementation is available (unsurprisingly) in Python. Autograd computes arbitrary order reverse-mode autodiff with a simple and intuitive interface. Registering external functions is fairly easy, and several Numpy and Scipy functions are already available anyway.