Multivariate Calculus


Introduction

The purpose of this document is to quickly refresh (presumably) known notions in multivariate differential calculus such as differentials, directional derivatives, the gradient, and the Hessian. These notions will be used heavily in our course. Even though this quick reminder may seem redundant or trivial to most of you (I hope), I still suggest at least skimming through it, as it may present less common interpretations of very familiar definitions and properties. And even if you discover nothing new in this document, it will at least serve to introduce the notation.

Notation

In our course, we will deal exclusively with real functions. A scalar function will be denoted as $f:\mathbb{R}^n \to \mathbb{R}$, $f(\mathbf{x})$, or simply $f$. A vector-valued function will be denoted in bold, as $\mathbf{f}:\mathbb{R}^n \to \mathbb{R}^m$, or component-wise as $\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \dots, f_m(\mathbf{x}))$. A scalar function of a matrix variable, $f:\mathbb{R}^{m \times n} \to \mathbb{R}$, will be denoted as $f(\mathbf{A})$, and a matrix-valued function of a vector, $\mathbf{F}:\mathbb{R}^n \to \mathbb{R}^{m \times k}$, as $\mathbf{F}(\mathbf{x})$. Derivatives of a scalar function of one variable will be denoted as $f'(x)$, $f''(x)$, etc. An $n$-times continuously differentiable function will be said to be $\mathcal{C}^n$ ($f \in \mathcal{C}^n$). In most cases, we will tacitly assume that a function is sufficiently smooth for at least the first-order derivative to exist.

First-order derivative of a function of one variable

Before proceeding to multivariate functions, let us remind ourselves of a few basic notions of univariate calculus. A $\mathcal{C}^1$ function $f(x)$ can be approximated linearly around some point $x = x_0$. Incrementing the argument by $dx$, the function itself changes by the amount that we denote by $\Delta f = f(x_0 + dx) - f(x_0)$, while the linear approximation changes by the amount denoted by $df$. For a sufficiently small $dx$ (more formally, in the limit $|dx| \to 0$), it can be shown that $\Delta f = df + o(dx)$.[1] This means that for an infinitesimally small increment $dx$, the linear approximation of the function becomes exact. In this limit, $df$ is called the differential of $f$, and the slope of the linear approximation is called the first-order derivative of $f$, denoted $\frac{df}{dx} = f'(x_0)$. Another way to express this fact is through the first-order Taylor expansion of $f$ around $x_0$:

$$f(x_0 + dx) = f(x_0) + f'(x_0)\,dx + O(dx^2),$$

which essentially says that a linear function whose value at $x_0$ matches $f(x_0)$, and whose slope matches that of $f$ (expressed by $f'(x_0)$), approximates $f$ around $x_0$ up to a second-order error.
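
As a small illustration (my own sketch, not part of the original derivation, assuming NumPy and an arbitrary test function of my choosing), the following snippet checks numerically that the error of the linear approximation indeed shrinks like $dx^2$:

```python
# Sketch: first-order Taylor approximation of f(x) = sin(x) around x0.
# The gap between the true change and the linear prediction is O(dx^2).
import numpy as np

f = np.sin
f_prime = np.cos
x0 = 0.7

for dx in [1e-1, 1e-2, 1e-3]:
    delta_f = f(x0 + dx) - f(x0)      # actual change of the function
    df = f_prime(x0) * dx             # change of the linear approximation
    print(dx, abs(delta_f - df))      # error decreases roughly 100x per step
```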

Gradient

We can extend the previous discussion straightforwardly to the $n$-dimensional case. Let $f$ now be a $\mathcal{C}^1$ function on $\mathbb{R}^n$. The surface the function creates in $\mathbb{R}^{n+1}$ can be approximated by an $n$-dimensional tangent plane (the multidimensional analog of the linear approximation). Fixing a point $\mathbf{x}_0$ and making a small step $d\mathbf{x} = (dx_1, \dots, dx_n)$ (note that now $d\mathbf{x}$ is a vector), it can be shown that the change in the value of the linear approximation is given by

$$df = \frac{\partial f}{\partial x_1}\,dx_1 + \dots + \frac{\partial f}{\partial x_n}\,dx_n,$$

where $\frac{\partial f}{\partial x_i}$ denotes the partial derivative of $f$ at $\mathbf{x}_0$. The latter formula is usually known as the total differential. Arranging the partial derivatives into a vector $\mathbf{g} = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)$, the total differential can be expressed as the inner product $df = \langle \mathbf{g}, d\mathbf{x} \rangle$. The object $\mathbf{g}$ appearing in the inner product is called the gradient of $f$ at point $\mathbf{x}_0$, and will be denoted by $\nabla f(\mathbf{x}_0)$ (the symbol $\nabla$, graphically a rotated capital delta, is pronounced "nabla", from the Grecized Hebrew "nevel" for "harp"; it is sometimes called the del operator). While we can simply define the gradient as the vector of partial derivatives, we will see that the definition through the inner product can often be more useful.
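
As a quick sanity check (a sketch of mine, assuming NumPy and a hypothetical test function `f` with a known analytic gradient), the vector of partial derivatives can be approximated by central finite differences and compared against the analytic gradient:

```python
# Sketch: compare the analytic gradient of a hypothetical test function
# f(x) = sum(x^2) + sin(x_0) with central finite differences of its partials.
import numpy as np

def f(x):
    return np.sum(x**2) + np.sin(x[0])

def grad_f(x):
    g = 2 * x                # gradient of sum(x^2)
    g[0] += np.cos(x[0])     # plus the derivative of sin(x_0)
    return g

def numerical_gradient(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)   # central difference for df/dx_i
    return g

x0 = np.array([0.3, -1.2, 2.0])
print(np.allclose(grad_f(x0), numerical_gradient(f, x0), atol=1e-5))  # True
```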

Directional derivative

In this course, we will often encounter situations where we are interested in the behavior of a function along a line (formally, we say that $f(\mathbf{x})$ is restricted to the one-dimensional affine subspace $L = \{\mathbf{x}_0 + \alpha \mathbf{r} : \alpha \in \mathbb{R}\}$, where $\mathbf{x}_0$ is some fixed point and $\mathbf{r}$ is a fixed direction). Let us define a new function of a single variable $\alpha$, $\varphi(\alpha) = f(\mathbf{x}_0 + \alpha \mathbf{r})$. Note that we can find the first-order derivative of $\varphi$, arriving at the following important notion:

$$f'_{\mathbf{r}}(\mathbf{x}_0) = \left. \frac{d}{d\alpha} f(\mathbf{x}_0 + \alpha \mathbf{r}) \right|_{\alpha=0} = \varphi'(0),$$

which is called the directional derivative of $f$ at $\mathbf{x}_0$ in the direction $\mathbf{r}$.

The same way a derivative measures the rate of change of a function, a directional derivative measures the rate of change of a multivariate function when we make a small step in a particular direction.

Denoting $\mathbf{g} = \nabla f(\mathbf{x}_0)$ and using our definition of the gradient through the inner product, we can write

$$d\varphi = df = \mathbf{g}^\top d\mathbf{x} = \mathbf{g}^\top (d\alpha\,\mathbf{r}) = d\alpha\,(\mathbf{g}^\top \mathbf{r}).$$

Identifying in the latter quantity a product of $d\alpha$ with the scalar $\mathbf{g}^\top \mathbf{r}$, we can say that $\mathbf{g}^\top \mathbf{r}$ is the gradient of $\varphi(\alpha)$ at $\alpha = 0$, which coincides with the first-order derivative, $\varphi'(0) = \mathbf{g}^\top \mathbf{r}$, as $\varphi$ is a function of a single variable. We can summarize this result as follows:

Property. The directional derivative of $f$ at $\mathbf{x}_0$ in the direction $\mathbf{r}$ is obtained by projecting the gradient at $\mathbf{x}_0$ onto the direction $\mathbf{r}$: $f'_{\mathbf{r}}(\mathbf{x}_0) = \mathbf{r}^\top \nabla f(\mathbf{x}_0)$.
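
To make this property concrete, here is a minimal numerical sketch (assuming NumPy and reusing the same hypothetical test function as above): the derivative of $\varphi(\alpha) = f(\mathbf{x}_0 + \alpha \mathbf{r})$ at $\alpha = 0$, estimated by finite differences, should match $\mathbf{r}^\top \nabla f(\mathbf{x}_0)$.

```python
# Sketch: the directional derivative phi'(0) of phi(alpha) = f(x0 + alpha*r)
# equals the projection r^T grad f(x0) of the gradient onto the direction r.
import numpy as np

def f(x):
    return np.sum(x**2) + np.sin(x[0])

def grad_f(x):
    g = 2 * x
    g[0] += np.cos(x[0])
    return g

x0 = np.array([0.3, -1.2, 2.0])
r = np.array([1.0, 2.0, -0.5])
eps = 1e-6

phi_prime_0 = (f(x0 + eps * r) - f(x0 - eps * r)) / (2 * eps)  # numerical phi'(0)
print(phi_prime_0, r @ grad_f(x0))                             # the two values agree
```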

Hessian

In the case of a function of a single variable, we saw that the differential of $f$ was given by $df = f'(x)\,dx$. However, the first-order derivative $f'(x)$ is itself a function of $x$, and we can again express its differential as $df' = f''(x)\,dx$, where $f''(x)$ denotes the second-order derivative. This notion can be extended to the multivariate case. Recall our definition of the gradient through the inner product, $df = \mathbf{g}^\top d\mathbf{x}$. Thinking of the gradient as a vector-valued function on $\mathbb{R}^n$, $\mathbf{g}(\mathbf{x}) = (g_1(\mathbf{x}), \dots, g_n(\mathbf{x}))$, we can write

$$\begin{cases} dg_1 = \mathbf{h}_1^\top d\mathbf{x} \\ \quad\vdots \\ dg_n = \mathbf{h}_n^\top d\mathbf{x}, \end{cases}$$

with each $\mathbf{h}_i$ being the gradient of the $i$-th component of the gradient vector $\mathbf{g}$,

$$\mathbf{h}_i = \left( \frac{\partial g_i}{\partial x_1}, \dots, \frac{\partial g_i}{\partial x_n} \right) = \left( \frac{\partial^2 f}{\partial x_1 \partial x_i}, \dots, \frac{\partial^2 f}{\partial x_n \partial x_i} \right).$$

Denoting by $\mathbf{H} = (\mathbf{h}_1, \dots, \mathbf{h}_n)$ the matrix whose columns are the vectors $\mathbf{h}_i$, we can write compactly $d\mathbf{g} = \mathbf{H}^\top d\mathbf{x}$. The $n \times n$ matrix $\mathbf{H}$ containing all the second-order partial derivatives of $f$ as its elements is called the Hessian of $f$ at point $\mathbf{x}$, and is also denoted[2] as $\nabla^2 f(\mathbf{x})$. We tacitly assumed that $f$ is $\mathcal{C}^2$ in order for the second-order derivatives to exist. A nice property of $\mathcal{C}^2$ functions is that partial differentiation is commutative, meaning that the order of taking second-order partial derivatives can be interchanged: $h_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i} = h_{ji}$. Algebraically, this implies that the Hessian matrix is symmetric, $\mathbf{H} = \mathbf{H}^\top$, and we can write

$$d\mathbf{g} = \mathbf{H}\,d\mathbf{x}.$$
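
The following short sketch (my own, assuming NumPy and reusing the hypothetical gradient `grad_f` from the earlier example) builds the Hessian column by column from finite differences of the gradient, following $d\mathbf{g} \approx \mathbf{H}\,d\mathbf{x}$, and verifies that the resulting matrix is indeed symmetric:

```python
# Sketch: estimate the Hessian of f(x) = sum(x^2) + sin(x_0) column by column
# from finite differences of its gradient, and check that it is symmetric.
import numpy as np

def grad_f(x):
    g = 2 * x
    g[0] += np.cos(x[0])
    return g

def numerical_hessian(grad, x, eps=1e-6):
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        H[:, i] = (grad(x + e) - grad(x - e)) / (2 * eps)  # i-th column: d g / d x_i
    return H

x0 = np.array([0.3, -1.2, 2.0])
H = numerical_hessian(grad_f, x0)
print(np.allclose(H, H.T, atol=1e-5))   # the Hessian is (numerically) symmetric
```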

Second-order directional derivative

Recall that we have previously considered the restriction of a multivariate function $f$ to a line, $\varphi(\alpha) = f(\mathbf{x}_0 + \alpha \mathbf{r})$. This gave rise to the first-order directional derivative $f'_{\mathbf{r}}(\mathbf{x}_0) = \varphi'(0)$. In a similar way, we define the second-order directional derivative at $\mathbf{x}_0$ in the direction $\mathbf{r}$ as

$$f''_{\mathbf{rr}}(\mathbf{x}_0) = \varphi''(0) = \left. \frac{d^2}{d\alpha^2} f(\mathbf{x}_0 + \alpha \mathbf{r}) \right|_{\alpha=0} = \left. \frac{d}{d\alpha} f'_{\mathbf{r}}(\mathbf{x}_0 + \alpha \mathbf{r}) \right|_{\alpha=0}.$$

Considering $f'_{\mathbf{r}}(\mathbf{x}) = \mathbf{r}^\top \mathbf{g}(\mathbf{x})$ as a function of $\mathbf{x}$, we can write its differential as

$$df'_{\mathbf{r}} = \mathbf{r}^\top d\mathbf{g} = \mathbf{r}^\top \mathbf{H}(\mathbf{x}_0)\,d\mathbf{x} = \mathbf{r}^\top \mathbf{H}(\mathbf{x}_0)\,\mathbf{r}\,d\alpha,$$

from which

$$f''_{\mathbf{rr}} = \mathbf{r}^\top \mathbf{H}\,\mathbf{r}.$$

In other words, in order to get the second-order directional derivative in the direction $\mathbf{r}$, one has to evaluate the quadratic form $\mathbf{r}^\top \mathbf{H}\,\mathbf{r}$.
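
Another numerical sanity check (a sketch of mine, assuming NumPy and the same hypothetical test function as before): the second derivative $\varphi''(0)$ of the restriction along $\mathbf{r}$ should match the quadratic form $\mathbf{r}^\top \mathbf{H}(\mathbf{x}_0)\,\mathbf{r}$.

```python
# Sketch: phi''(0) for phi(alpha) = f(x0 + alpha*r) should match the quadratic
# form r^T H(x0) r, with the analytic Hessian of f(x) = sum(x^2) + sin(x_0).
import numpy as np

def f(x):
    return np.sum(x**2) + np.sin(x[0])

x0 = np.array([0.3, -1.2, 2.0])
r = np.array([1.0, 2.0, -0.5])
eps = 1e-4

# Second-order central difference for phi''(0).
phi_second = (f(x0 + eps * r) - 2 * f(x0) + f(x0 - eps * r)) / eps**2

# Analytic Hessian: 2*I everywhere, except H[0, 0] = 2 - sin(x_0).
H = 2 * np.eye(3)
H[0, 0] -= np.sin(x0[0])

print(phi_second, r @ H @ r)            # the two values agree up to O(eps^2)
```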

Derivatives of linear and quadratic functions

Let $\mathbf{y} = \mathbf{A}\mathbf{x}$ be a general linear operator defined by an $m \times n$ matrix $\mathbf{A}$. Its differential is given straightforwardly by

$$d\mathbf{y} = \mathbf{A}(\mathbf{x} + d\mathbf{x}) - \mathbf{A}\mathbf{x} = \mathbf{A}\,d\mathbf{x}.$$

Using this result, we will do a small exercise deriving the gradients and Hessians of linear and quadratic functions. As we will see, it is often convenient to start by evaluating the differential of a function.

Our first example is a linear function of the form $f(\mathbf{x}) = \mathbf{b}^\top \mathbf{x}$, where $\mathbf{b}$ is a constant vector. Note that this function is a particular case of the previous result (with $\mathbf{A} = \mathbf{b}^\top$), and we can write $df = \mathbf{b}^\top d\mathbf{x}$. Comparing this to the general definition of the gradient, $df = \mathbf{g}(\mathbf{x})^\top d\mathbf{x}$, we deduce that the gradient of $f$ is given by $\nabla f(\mathbf{x}) = \mathbf{b}$. Note that the gradient of a linear function is constant – this generalizes the case of a linear function of one variable, $f(x) = bx$, which has a constant derivative $f'(x) = b$.

Our second example is a quadratic function of the form $f(\mathbf{x}) = \mathbf{x}^\top \mathbf{A}\mathbf{x}$, where $\mathbf{A}$ is an $n \times n$ matrix. We again compute the differential by definition,

$$\begin{aligned} df &= f(\mathbf{x} + d\mathbf{x}) - f(\mathbf{x}) = (\mathbf{x} + d\mathbf{x})^\top \mathbf{A} (\mathbf{x} + d\mathbf{x}) - \mathbf{x}^\top \mathbf{A}\mathbf{x} \\ &= \mathbf{x}^\top \mathbf{A}\mathbf{x} + d\mathbf{x}^\top \mathbf{A}\mathbf{x} + \mathbf{x}^\top \mathbf{A}\,d\mathbf{x} + d\mathbf{x}^\top \mathbf{A}\,d\mathbf{x} - \mathbf{x}^\top \mathbf{A}\mathbf{x} \\ &= d\mathbf{x}^\top \mathbf{A}\mathbf{x} + \mathbf{x}^\top \mathbf{A}\,d\mathbf{x} + d\mathbf{x}^\top \mathbf{A}\,d\mathbf{x}. \end{aligned}$$

Note that in the limit $|d\mathbf{x}| \to 0$, the third term (quadratic in $d\mathbf{x}$) goes to zero much faster than the first two terms (linear in $d\mathbf{x}$), and can therefore be neglected,[3] leading to

$$df = d\mathbf{x}^\top \mathbf{A}\mathbf{x} + \mathbf{x}^\top \mathbf{A}\,d\mathbf{x} = d\mathbf{x}^\top \mathbf{A}\mathbf{x} + (\mathbf{x}^\top \mathbf{A}\,d\mathbf{x})^\top = d\mathbf{x}^\top (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}.$$

Again, recognizing in the latter expression an inner product with $d\mathbf{x}$, we conclude that $\nabla f(\mathbf{x}) = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$. For a symmetric $\mathbf{A}$, the latter simplifies to $\nabla f(\mathbf{x}) = 2\mathbf{A}\mathbf{x}$. Note that the gradient of a quadratic function is a linear function; furthermore, the latter expression generalizes the univariate quadratic function $f(x) = ax^2$, whose first-order derivative $f'(x) = 2ax$ is linear.

Since the gradient $\mathbf{g}(\mathbf{x}) = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ of the quadratic function is linear, its differential is immediately given by $d\mathbf{g} = (\mathbf{A} + \mathbf{A}^\top)\,d\mathbf{x}$, from which we conclude that the Hessian of $f$ is $\mathbf{H}(\mathbf{x}) = \mathbf{A} + \mathbf{A}^\top$ (or $2\mathbf{A}$ in the symmetric case). Note that the Hessian of a quadratic function is constant, which coincides with the univariate case $f''(x) = 2a$.
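
The formulas above are easy to verify numerically; here is a minimal sketch (assuming NumPy and a randomly generated, deliberately non-symmetric matrix of my choosing):

```python
# Sketch: for f(x) = x^T A x with a generic (non-symmetric) A, the gradient is
# (A + A^T) x; we verify this against a finite-difference gradient.
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x0 = rng.standard_normal(n)

f = lambda x: x @ A @ x
eps = 1e-6

num_grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                     for e in np.eye(n)])

print(np.allclose(num_grad, (A + A.T) @ x0, atol=1e-5))   # True
```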

In the sequel, we will see more complicated examples of gradients and Hessians. For a comprehensive reference on derivatives of matrix and vector expressions, the Matrix Cookbook[4] is highly recommended.

Multivariate Taylor expansion

We have seen the Taylor expansion of a function of one variable as a way to obtain a linear approximation. This construction can be generalized to the multivariate case, as we show here, limiting the expansion to second order.

Theorem: Second-order Taylor expansion. Let $f$ be a $\mathcal{C}^2$ function on $\mathbb{R}^n$, $\mathbf{x}$ some point, and $\mathbf{r}$ a sufficiently small vector. Then,

$$f(\mathbf{x} + \mathbf{r}) = f(\mathbf{x}) + \mathbf{g}(\mathbf{x})^\top \mathbf{r} + \frac{1}{2}\mathbf{r}^\top \mathbf{H}(\mathbf{x})\,\mathbf{r} + O(\|\mathbf{r}\|^3).$$

The theorem says that, up to a third-order error term, the function can be approximated around $\mathbf{x}$ by the quadratic function $q(\mathbf{r}) = f + \mathbf{g}^\top \mathbf{r} + \frac{1}{2}\mathbf{r}^\top \mathbf{H}\mathbf{r}$ (note that the function is quadratic in $\mathbf{r}$, as $\mathbf{x}$ is constant, and so are $f = f(\mathbf{x})$, $\mathbf{g}$, and $\mathbf{H}$). Out of all possible quadratic approximations of $f$, the approximation described by $q(\mathbf{r}) \approx f(\mathbf{x} + \mathbf{r})$ is such that its value, slope, and curvature at $\mathbf{x}$ (equivalently, at $\mathbf{r} = \mathbf{0}$) match those of $f$. The latter geometric quantities are captured, respectively, by the value of the function, its gradient, and its Hessian; in order to match the value, slope, and curvature of $f$, $q$ has to satisfy $q(\mathbf{0}) = f(\mathbf{x})$, $\nabla q(\mathbf{0}) = \nabla f(\mathbf{x})$, and $\nabla^2 q(\mathbf{0}) = \nabla^2 f(\mathbf{x})$ (note that the gradient and the Hessian of $q$ are w.r.t. $\mathbf{r}$, whereas the derivatives of $f$ are w.r.t. $\mathbf{x}$). To see that the latter equalities hold, we first observe that $q(\mathbf{0}) = f(\mathbf{x})$. Next, using the fact that $q(\mathbf{r})$ is quadratic, its gradient and Hessian (w.r.t. $\mathbf{r}$) are given by $\nabla q(\mathbf{r}) = \mathbf{g} + \mathbf{H}\mathbf{r}$ and $\nabla^2 q(\mathbf{r}) = \mathbf{H}$. Substituting $\mathbf{r} = \mathbf{0}$ yields $\nabla q(\mathbf{0}) = \mathbf{g}$ and $\nabla^2 q(\mathbf{0}) = \mathbf{H}$.
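
As a quick illustration (a sketch of mine, assuming NumPy and a hypothetical smooth function with hand-computed gradient and Hessian), the error of the quadratic model indeed shrinks roughly cubically with the step size:

```python
# Sketch: the quadratic model q(r) = f(x) + g^T r + 0.5 r^T H r matches
# f(x + r) up to a third-order error for a hypothetical smooth function.
import numpy as np

def f(x):
    return np.exp(x[0]) + x[0] * x[1] + x[1]**2

def grad_f(x):
    return np.array([np.exp(x[0]) + x[1], x[0] + 2 * x[1]])

def hess_f(x):
    return np.array([[np.exp(x[0]), 1.0], [1.0, 2.0]])

x = np.array([0.2, -0.3])
for t in [1e-1, 1e-2, 1e-3]:
    r = t * np.array([1.0, 1.0])
    q = f(x) + grad_f(x) @ r + 0.5 * r @ hess_f(x) @ r
    print(t, abs(f(x + r) - q))         # error shrinks roughly like t^3
```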

Gradient of a function of a matrix

The notion of gradient can be generalized to functions of matrices. Let $f:\mathbb{R}^{m \times n} \to \mathbb{R}$ be such a function evaluated at some $\mathbf{X}$. We can think of an equivalent function on $\mathbb{R}^{mn}$ evaluated at $\mathbf{x} = \mathrm{vec}(\mathbf{X})$, for which the gradient is defined simply as the $mn$-dimensional vector of all partial derivatives. We can therefore think of the gradient of $f(\mathbf{X})$ at $\mathbf{X}$ as the $m \times n$ matrix

$$\mathbf{G}(\mathbf{X}) = \begin{pmatrix} \frac{\partial f}{\partial x_{11}} & \cdots & \frac{\partial f}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial x_{m1}} & \cdots & \frac{\partial f}{\partial x_{mn}} \end{pmatrix}.$$

Previously, we have seen that an "external" definition of the gradient through an inner product is often more useful. Such a definition is also valid for matrix arguments. Recall our definition of the standard inner product on the space of $m \times n$ matrices as $\langle \mathbf{A}, \mathbf{B} \rangle = \sum_{ij} a_{ij} b_{ij} = \mathrm{tr}(\mathbf{A}^\top \mathbf{B})$, for $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$. Using the total differential formula yields

$$df = \sum_{ij} \frac{\partial f}{\partial x_{ij}}\,dx_{ij} = \langle \mathbf{G}, d\mathbf{X} \rangle,$$

where $d\mathbf{X}$ is now an $m \times n$ matrix. The matrix $\mathbf{G}$ appearing in the above identity can be defined as the gradient of $f$.
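
As a hedged illustration (my own example, not from the text, assuming NumPy): for the simple matrix function $f(\mathbf{X}) = \mathbf{a}^\top \mathbf{X} \mathbf{b}$, whose gradient is $\mathbf{G} = \mathbf{a}\mathbf{b}^\top$, one can check numerically that $df \approx \langle \mathbf{G}, d\mathbf{X} \rangle$ for a small perturbation $d\mathbf{X}$:

```python
# Sketch: for f(X) = a^T X b the gradient is the m-by-n matrix G = a b^T;
# the differential df is then well approximated by the inner product <G, dX>.
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = rng.standard_normal((m, n))

f = lambda X: a @ X @ b
G = np.outer(a, b)                       # gradient of a^T X b w.r.t. X

dX = 1e-6 * rng.standard_normal((m, n))  # a small perturbation of X
print(f(X + dX) - f(X), np.sum(G * dX))  # the two numbers nearly coincide
```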

Gradient of a nonlinear function

We finish this brief introduction by deriving the gradient of a more complicated function of the form

$$f(\mathbf{X}) = \mathbf{c}^\top \varphi(\mathbf{X}\mathbf{a} + \mathbf{b}),$$

where $\mathbf{X} \in \mathbb{R}^{m \times n}$, $\mathbf{a} \in \mathbb{R}^n$, $\mathbf{b}, \mathbf{c} \in \mathbb{R}^m$, and $\varphi$ is a $\mathcal{C}^1$ function applied element-wise. We will encounter such functions during the course when dealing with nonlinear regression and classification applications. In machine learning, functions of this form constitute the building blocks of more complicated functions called artificial neural networks. As before, we proceed by computing differentials and using the chain rule. Denoting $\mathbf{u} = \mathbf{X}\mathbf{a} + \mathbf{b}$, we have

$$\varphi(\mathbf{u}) = \begin{pmatrix} \varphi(u_1) \\ \vdots \\ \varphi(u_m) \end{pmatrix}.$$

Since $\varphi$ is applied element-wise to $\mathbf{u}$, the differential of $\varphi = \varphi(\mathbf{u})$ is given by

$$d\varphi = \begin{pmatrix} \varphi'(u_1)\,du_1 \\ \vdots \\ \varphi'(u_m)\,du_m \end{pmatrix} = \underbrace{\begin{pmatrix} \varphi'(u_1) & & \\ & \ddots & \\ & & \varphi'(u_m) \end{pmatrix}}_{\boldsymbol{\Phi}'}\,d\mathbf{u} = \boldsymbol{\Phi}'\,d\mathbf{u}.$$

Next, we consider the function $\mathbf{u}(\mathbf{X}) = \mathbf{X}\mathbf{a} + \mathbf{b}$; since it is linear in $\mathbf{X}$, its differential is given by $d\mathbf{u} = d\mathbf{X}\,\mathbf{a}$. Finally, we consider the function $f(\varphi) = \mathbf{c}^\top \varphi$, which is linear in $\varphi$ and has the differential $df = \mathbf{c}^\top d\varphi$.

Combining these results and using simple properties of the matrix trace yields

$$df = \mathbf{c}^\top d\varphi = \mathbf{c}^\top \boldsymbol{\Phi}'\,d\mathbf{u} = \mathbf{c}^\top \boldsymbol{\Phi}'\,d\mathbf{X}\,\mathbf{a} = \mathrm{tr}(\mathbf{c}^\top \boldsymbol{\Phi}'\,d\mathbf{X}\,\mathbf{a}) = \mathrm{tr}(\mathbf{a}\,\mathbf{c}^\top \boldsymbol{\Phi}'\,d\mathbf{X}) = \langle d\mathbf{X}, \boldsymbol{\Phi}'\,\mathbf{c}\,\mathbf{a}^\top \rangle.$$

In the latter expression, we recognize in the second argument of the inner product the gradient of $f$ w.r.t. $\mathbf{X}$,

$$\nabla f(\mathbf{X}) = \boldsymbol{\Phi}'\,\mathbf{c}\,\mathbf{a}^\top.$$
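
As a final sanity check (a sketch of mine, assuming NumPy and taking $\varphi = \tanh$ as a concrete element-wise nonlinearity), the analytic gradient $\boldsymbol{\Phi}'\,\mathbf{c}\,\mathbf{a}^\top$ can be compared entry-wise against a finite-difference approximation:

```python
# Sketch: gradient of f(X) = c^T tanh(X a + b) versus a finite-difference
# check of a single entry; here Phi' = diag(1 - tanh(u)^2) with u = X a + b.
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 5
X = rng.standard_normal((m, n))
a = rng.standard_normal(n)
b, c = rng.standard_normal(m), rng.standard_normal(m)

f = lambda X: c @ np.tanh(X @ a + b)

u = X @ a + b
G = np.outer((1 - np.tanh(u)**2) * c, a)   # analytic gradient Phi' c a^T

eps = 1e-6
E = np.zeros((m, n))
E[1, 2] = eps                              # perturb only the (1, 2) entry
print(G[1, 2], (f(X + E) - f(X - E)) / (2 * eps))   # the two values agree
```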
  1. The little-o notation means that there exists some function of $dx$, $o(dx)$, going to zero faster than $dx$ (i.e., $\frac{o(dx)}{dx} \to 0$ as $dx \to 0$), but the exact form of this function is unimportant. On the other hand, the big-O notation, as in $O(dx^2)$, stands for some function that decays at the same rate as $dx^2$ (i.e., $\lim_{|dx| \to 0} \frac{O(dx^2)}{dx^2} < \infty$).

  2. Some people find the following abuse of notation helpful: Thinking of the gradient of $f$ as a differential operator of the form $\nabla = \left( \frac{\partial}{\partial x_1}, \dots, \frac{\partial}{\partial x_n} \right)^\top$ applied to $f$, the Hessian can be expressed by applying the operator $\nabla^2 = \nabla \nabla^\top = \left( \frac{\partial}{\partial x_1}, \dots, \frac{\partial}{\partial x_n} \right)^\top \left( \frac{\partial}{\partial x_1}, \dots, \frac{\partial}{\partial x_n} \right) = \begin{pmatrix} \frac{\partial^2}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2}{\partial x_n \partial x_n} \end{pmatrix}$.

  3. This "explanation" can be written rigorously using limits. Another way of getting the same result is the well-known rule of the "differential of a product", $d(fg) = df\,g + f\,dg$, which can be generalized to the multivariate case as follows: Let $h$ be a scalar function given as the inner product of two vector-valued functions, $h(\mathbf{x}) = \mathbf{f}(\mathbf{x})^\top \mathbf{g}(\mathbf{x})$. Then, $dh = d\mathbf{f}^\top \mathbf{g} + \mathbf{f}^\top d\mathbf{g}$.

  4. http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf