Linear Algebra Crash Course

60 minute read

Prof. Alex Bronstein

Introduction

The purpose of this document is to quickly refresh (presumably) known notions in linear algebra. It contains a collection of facts related to vectors, matrices, and the geometric entities they represent that we will use heavily in our course. Even though this quick reminder may seem redundant or trivial to most of you (I hope), I still suggest at least skimming through it, as it may present less common interpretations of very familiar definitions and properties. And even if you discover nothing new in this document, it will at least serve to introduce notation.

Notation

In our course, we will deal almost exclusively with the field of real numbers, which we denote by $\mathbb{R}$. An $n$-dimensional Euclidean space will be denoted as $\mathbb{R}^n$, and the space of $m \times n$ matrices as $\mathbb{R}^{m \times n}$. We will try to stick to a consistent typesetting, denoting a scalar $a$ in lowercase italic, a vector $\mathbf{a}$ in lowercase bold, and a matrix $\mathbf{A}$ in uppercase bold. Elements of a vector or a matrix will be denoted using subindices as $a_i$ and $a_{ij}$, respectively. Unless stated otherwise, a vector is a column vector, and we will write $\mathbf{a} = (a_1, \dots, a_n)^\top$ to save space (here, $^\top$ denotes the transpose of the row vector $(a_1, \dots, a_n)$). In the same way, $\mathbf{A} = (\mathbf{a}_1, \dots, \mathbf{a}_n)$ will refer to a matrix constructed from $n$ column vectors $\mathbf{a}_i$. We will denote the zero vector by $\mathbf{0}$, and the identity matrix by $\mathbf{I}$.

Linear and affine spaces

Given a collection of vectors $\mathbf{v}_1, \dots, \mathbf{v}_m \in \mathbb{R}^n$, a new vector $\mathbf{b} = a_1 \mathbf{v}_1 + \dots + a_m \mathbf{v}_m$, with $a_1, \dots, a_m \in \mathbb{R}$ some scalars, is called a linear combination of the $\mathbf{v}_i$'s. The collection of all linear combinations is called a linear subspace of $\mathbb{R}^n$, denoted by

$$\mathcal{L} = \mathrm{span}\{\mathbf{v}_1, \dots, \mathbf{v}_m\} = \{a_1 \mathbf{v}_1 + \dots + a_m \mathbf{v}_m : a_1, \dots, a_m \in \mathbb{R}\}.$$

We will say that the $\mathbf{v}_i$'s span the linear subspace $\mathcal{L}$.

A vector $\mathbf{u}$ that cannot be described as a linear combination of $\mathbf{v}_1, \dots, \mathbf{v}_m$ (i.e., $\mathbf{u} \notin \mathcal{L}$) is said to be linearly independent of the latter vectors. The collection of vectors $\mathbf{v}_1, \dots, \mathbf{v}_m$ is linearly independent if no $\mathbf{v}_i$ is a linear combination of the rest of the vectors. In such a case, a vector in $\mathcal{L}$ can be unambiguously defined by the set of $m$ scalars $a_1, \dots, a_m$; omitting any of these scalars would make the description incomplete. Formally, we say that $m$ is the dimension of the subspace, $\dim \mathcal{L} = m$.
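
To make this concrete, here is a minimal NumPy sketch (the vectors are illustrative) that estimates the dimension of a span via the matrix rank; a dependent vector does not increase it:

```python
# A minimal sketch: the dimension of span{v1, v2, v3} is the rank of the matrix
# whose columns are the vectors; a dependent column adds nothing to the span.
import numpy as np

v1 = np.array([1.0, 0.0, 2.0])
v2 = np.array([0.0, 1.0, -1.0])
v3 = v1 + 2 * v2                     # deliberately a linear combination of v1 and v2

V = np.column_stack([v1, v2, v3])    # vectors as columns of a 3x3 matrix
dim = np.linalg.matrix_rank(V)       # dimension of the spanned subspace

print(dim)                           # 2: v3 adds nothing new, so dim L = 2
print(dim == V.shape[1])             # False: the collection is not independent
```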

Geometrically, an $m$-dimensional subspace is an $m$-dimensional plane passing through the origin (a one-dimensional plane is a line, a two-dimensional plane is the regular plane, and so on). It passes through the origin because setting all the $a_i$'s to zero in the definition of $\mathcal{L}$ yields the zero vector. Figure 1 (left) visualizes this fact.

An affine subspace can be defined as a linear subspace shifted by a vector $\mathbf{b} \in \mathbb{R}^n$ away from the origin:

$$\mathcal{A} = \mathcal{L} + \mathbf{b} = \{\mathbf{u} + \mathbf{b} : \mathbf{u} \in \mathcal{L}\}.$$

Exercise. Show that any affine subspace can be defined as

$$\mathcal{A} = \{a_1 \mathbf{v}_1 + \dots + a_m \mathbf{v}_m : a_1 + \dots + a_m = 1\}.$$

The latter linear combination with the scalars restricted to unit sum is called an affine combination.

Figure 1: One-dimensional linear (left) and affine (right) subspaces of $\mathbb{R}^2$.

Vector inner product and norms

Given two vectors $\mathbf{u}$ and $\mathbf{v}$ in $\mathbb{R}^n$, we define their inner (a.k.a. scalar or dot) product as

$$\langle \mathbf{u}, \mathbf{v} \rangle = \sum_{i=1}^n u_i v_i = \mathbf{u}^\top \mathbf{v}.$$

Though the notion of an inner product is more general than this, we will limit our attention exclusively to this particular case (and its matrix equivalent defined in the sequel). This is sometimes called the standard or the Euclidean inner product. The inner product defines or induces a vector norm

$$\|\mathbf{u}\| = \sqrt{\langle \mathbf{u}, \mathbf{u} \rangle}$$

(it is also convenient to write $\|\mathbf{u}\|^2 = \mathbf{u}^\top \mathbf{u}$), which is called the Euclidean or the induced norm.
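
As a quick numerical illustration, a minimal NumPy sketch (with made-up vectors) confirming that the inner product equals $\mathbf{u}^\top \mathbf{v}$ and that the induced norm equals $\sqrt{\langle \mathbf{u}, \mathbf{u} \rangle}$:

```python
# A small sketch verifying <u, v> = u^T v and ||u|| = sqrt(<u, u>).
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([1.0, -2.0])

inner = np.dot(u, v)                  # <u, v> = sum_i u_i v_i = u^T v
norm_u = np.sqrt(np.dot(u, u))        # induced (Euclidean) norm

print(inner)                          # -5.0
print(norm_u, np.linalg.norm(u))      # both 5.0
```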

A more general notion of a norm can be introduced axiomatically. We will say that a non-negative scalar function $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}_+$ is a norm if it satisfies the following axioms for any scalar $a \in \mathbb{R}$ and any vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$:

  1. $\|a\mathbf{u}\| = |a| \, \|\mathbf{u}\|$ (this property is called absolute homogeneity);

  2. $\|\mathbf{u} + \mathbf{v}\| \le \|\mathbf{u}\| + \|\mathbf{v}\|$ (this property is called subadditivity, and since a norm induces a metric (distance function), the geometric name for it is the triangle inequality);

  3. $\|\mathbf{u}\| = 0$ iff¹ $\mathbf{u} = \mathbf{0}$.

In this course, we will almost exclusively restrict our attention to the following family of norms called the $\ell_p$ norms:

For $1 \le p < \infty$, the $\ell_p$ norm of a vector $\mathbf{u} \in \mathbb{R}^n$ is defined as $\|\mathbf{u}\|_p = \left( \sum_{i=1}^n |u_i|^p \right)^{1/p}$.

Note the sub-index $p$. The Euclidean norm corresponds to $p = 2$ (hence, the alternative name, the $\ell_2$ norm) and, whenever confusion may arise, we will denote it by $\|\cdot\|_2$.

Another important particular case of the $\ell_p$ family of norms is the $\ell_1$ norm $\|\mathbf{u}\|_1 = \sum_{i=1}^n |u_i|$. As we will see, when used to quantify errors, for example in a regression or model fitting task, the $\ell_1$ norm (representing mean absolute error) is more robust (i.e., less sensitive to noisy data) than the Euclidean norm (representing mean squared error). The choice $p = 1$ is also the smallest value of $p$ for which $\|\cdot\|_p$ is convex (for $p < 1$ it is no longer a norm). In this course, we will dedicate significant attention to this important notion, as convexity will have a profound impact on the solvability of optimization problems.

Yet another important particular case is the $\ell_\infty$ norm $\|\mathbf{u}\|_\infty = \max_{i=1,\dots,n} |u_i|$. The latter can be obtained as the limit $p \to \infty$.

Exercise. Show that $\lim_{p \to \infty} \|\mathbf{u}\|_p = \max_{i=1,\dots,n} |u_i|$.
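
Numerically, the whole $\ell_p$ family is available through np.linalg.norm; the following sketch (illustrative values, assuming NumPy) also hints at the limit in the exercise:

```python
# A sketch of the lp family on a concrete vector: the l1, l2 and l_infinity
# norms, the ordering ||u||_q <= ||u||_p for p < q, and the limit p -> infinity.
import numpy as np

u = np.array([1.0, -2.0, 3.0])

l1   = np.linalg.norm(u, 1)          # |1| + |-2| + |3| = 6
l2   = np.linalg.norm(u, 2)          # sqrt(1 + 4 + 9) ~ 3.742
linf = np.linalg.norm(u, np.inf)     # max |u_i| = 3

print(l1 >= l2 >= linf)              # True: the lp norm decreases as p grows
print(np.linalg.norm(u, 100))        # ~ 3.0, approaching the l_infinity norm
```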

Geometrically, a norm measures the length of a vector; a vector of length $\|\mathbf{u}\| = 1$ is said to be a unit vector (with respect to² that norm). The collection $\{\mathbf{u} : \|\mathbf{u}\| = 1\}$ of all unit vectors is called the unit circle (see Figure 2). Note that the unit circle is indeed a “circle” only for the Euclidean norm. Similarly, the collection $B_r = \{\mathbf{u} : \|\mathbf{u}\| \le r\}$ of all vectors with length no bigger than $r$ is called the ball of radius $r$ (w.r.t. a given norm). Again, norm balls are round only for the Euclidean norm. From Figure 2 we can deduce that for $p < q$, the $\ell_p$ unit circle is fully contained in the $\ell_q$ unit circle. This means that the $\ell_p$ norm is bigger than the $\ell_q$ norm in the sense that $\|\mathbf{u}\|_q \le \|\mathbf{u}\|_p$.

Exercise. Show that $\|\mathbf{u}\|_q \le \|\mathbf{u}\|_p$ for every $p < q$.

Figure 2: Unit circles with respect to different $\ell_p$ norms in $\mathbb{R}^2$.

Continuing the geometric interpretation, it is worthwhile mentioning several relations between the inner product and the $\ell_2$ norm it induces. The inner product of two ($\ell_2$-) unit vectors measures the cosine of the angle between them or, more generally, $\langle \mathbf{u}, \mathbf{v} \rangle = \|\mathbf{u}\|_2 \|\mathbf{v}\|_2 \cos\theta$, where $\theta$ is the angle between $\mathbf{u}$ and $\mathbf{v}$. Two vectors satisfying $\langle \mathbf{u}, \mathbf{v} \rangle = 0$ are said to be orthogonal (if the vectors are unit, they are also said to be orthonormal) – the algebraic way of saying “perpendicular”.

The following result is doubtlessly the most important inequality in linear algebra (and, perhaps, in mathematics in general):

Theorem: Cauchy-Schwarz inequality. Let $\|\cdot\|$ be the norm induced by an inner product $\langle \cdot, \cdot \rangle$ on $\mathbb{R}^n$. Then, for any $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$,

$$|\langle \mathbf{u}, \mathbf{v} \rangle| \le \|\mathbf{u}\| \, \|\mathbf{v}\|$$

with equality holding iff $\mathbf{u}$ and $\mathbf{v}$ are linearly dependent.
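
A quick numerical sanity check of the inequality and of the equality case, sketched with random NumPy vectors:

```python
# A sketch checking the Cauchy-Schwarz inequality on random vectors, and the
# equality case for linearly dependent vectors.
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(5)
v = rng.standard_normal(5)

lhs = abs(np.dot(u, v))
rhs = np.linalg.norm(u) * np.linalg.norm(v)
print(lhs <= rhs)                     # True

w = -3.0 * u                          # linearly dependent with u
print(np.isclose(abs(np.dot(u, w)),
                 np.linalg.norm(u) * np.linalg.norm(w)))   # True: equality
```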

Matrices

A linear map from the $n$-dimensional linear space $\mathbb{R}^n$ to the $m$-dimensional linear space $\mathbb{R}^m$ is a function mapping $\mathbf{u} \in \mathbb{R}^n$ to $\mathbf{v} \in \mathbb{R}^m$ according to

$$v_i = \sum_{j=1}^n a_{ij} u_j, \quad i = 1, \dots, m.$$

The latter can be expressed compactly using the matrix-vector product notation $\mathbf{v} = \mathbf{A}\mathbf{u}$, where $\mathbf{A}$ is the matrix with the elements $a_{ij}$. In other words, a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is a compact way of expressing a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$. A matrix is said to be square if $m = n$; such a matrix defines an operator mapping $\mathbb{R}^n$ to itself. A symmetric matrix is a square matrix $\mathbf{A}$ such that $\mathbf{A}^\top = \mathbf{A}$.
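
In NumPy, this map is just the @ product; a small sketch (illustrative matrix and vector) checking it against the component-wise formula:

```python
# A sketch: the matrix-vector product v = A u computes v_i = sum_j a_ij u_j.
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])           # a 2x3 matrix: maps R^3 to R^2
u = np.array([1.0, 2.0, 3.0])

v = A @ u                                  # compact matrix-vector product
v_manual = np.array([sum(A[i, j] * u[j] for j in range(3)) for i in range(2)])

print(v)                                   # [ 5. -1.]
print(np.allclose(v, v_manual))            # True
```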

Recollecting our notion of linear combinations and linear subspaces, observe that the vector $\mathbf{v} = \mathbf{A}\mathbf{u}$ is the linear combination of the columns $\mathbf{a}_1, \dots, \mathbf{a}_n$ of $\mathbf{A}$ with the weights $u_1, \dots, u_n$. The linear subspace $\mathbf{A}\mathbb{R}^n = \{\mathbf{A}\mathbf{u} : \mathbf{u} \in \mathbb{R}^n\}$ is called the column space of $\mathbf{A}$. The space is $n$-dimensional if the columns are linearly independent; otherwise, if $k$ of the columns are linear combinations of the remaining ones, the space is $(n-k)$-dimensional. The latter dimension is called the column rank of the matrix. By transposing the matrix, the row rank can be defined in the same way. The following result is fundamental in linear algebra:

Theorem. The column rank and the row rank of a matrix are always equal.

Exercise. Prove the above theorem.
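
While not a proof, the following NumPy sketch illustrates the theorem on a random low-rank matrix:

```python
# A sketch: the rank of A (column rank) and the rank of A^T (row rank) agree.
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 7))   # rank at most 3

print(np.linalg.matrix_rank(A))       # 3  (column rank)
print(np.linalg.matrix_rank(A.T))     # 3  (row rank)
```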

Since the row and the column ranks of a matrix are equal, we will simply refer to both as the rank, denoted $\mathrm{rank}\,\mathbf{A}$. A square $n \times n$ matrix is full rank if its rank is $n$, and is rank deficient otherwise. Full rank is a necessary and sufficient condition for a square matrix to possess an inverse (Reminder: the inverse of a matrix $\mathbf{A}$ is a matrix $\mathbf{B}$ such that $\mathbf{A}\mathbf{B} = \mathbf{B}\mathbf{A} = \mathbf{I}$; when the inverse exists, the matrix is called invertible and its inverse is denoted by $\mathbf{A}^{-1}$).

In this course, we will often encounter the trace of a square matrix, which is defined as the sum of its diagonal entries, $\mathrm{tr}\,\mathbf{A} = \sum_{i=1}^n a_{ii}$. The following property of the trace will be particularly useful:

Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times m}$. Then $\mathrm{tr}(\mathbf{A}\mathbf{B}) = \mathrm{tr}(\mathbf{B}\mathbf{A})$.

This is in sharp contrast to the matrix product itself, which is generally not commutative.

In particular, the squared norm $\|\mathbf{u}\|^2 = \mathbf{u}^\top\mathbf{u}$ can be written as $\mathrm{tr}(\mathbf{u}^\top\mathbf{u}) = \mathrm{tr}(\mathbf{u}\mathbf{u}^\top)$. We will see many cases where such a seemingly odd way of writing things is very useful. The above property can be generalized to the product of $k$ matrices $\mathbf{A}_1 \cdots \mathbf{A}_k$ by saying that $\mathrm{tr}(\mathbf{A}_1 \cdots \mathbf{A}_k)$ is invariant under a cyclic permutation of the factors, as long as their product is defined. For example, $\mathrm{tr}(\mathbf{A}\mathbf{B}\mathbf{C}) = \mathrm{tr}(\mathbf{C}\mathbf{A}\mathbf{B}) = \mathrm{tr}(\mathbf{B}\mathbf{C}\mathbf{A})$ (again, as long as the matrix dimensions are such that the products are defined).
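
A short NumPy sketch (random matrices of compatible sizes) verifying the trace identities, even though the products themselves do not commute:

```python
# A sketch checking tr(AB) = tr(BA) and the cyclic invariance tr(ABC) = tr(CAB)
# on random matrices of compatible (non-square) sizes.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 3))
C = rng.standard_normal((3, 3))

print(np.isclose(np.trace(A @ B), np.trace(B @ A)))            # True
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))    # True
```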

Matrix inner product and norms

The notion of an inner product can be extended to matrices by thinking of an $m \times n$ matrix as a long $mn$-dimensional vector containing the matrix elements, for example, in column-stack order. We will denote such a vector as $\mathrm{vec}(\mathbf{A}) = (a_{11}, \dots, a_{m1}, a_{12}, \dots, a_{m2}, \dots, a_{1n}, \dots, a_{mn})^\top$. With such an interpretation, we can define the inner product of two matrices as $\langle \mathbf{A}, \mathbf{B} \rangle = \langle \mathrm{vec}(\mathbf{A}), \mathrm{vec}(\mathbf{B}) \rangle = \sum_{i,j} a_{ij} b_{ij}$.

Exercise. Show that $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{tr}(\mathbf{A}^\top\mathbf{B})$.

The inner product induces the standard Euclidean norm on the space of column-stack representations of matrices,

$$\|\mathbf{A}\| = \sqrt{\langle \mathbf{A}, \mathbf{A} \rangle} = \sqrt{\langle \mathrm{vec}(\mathbf{A}), \mathrm{vec}(\mathbf{A}) \rangle} = \sqrt{\sum_{i,j} a_{ij}^2},$$

known as the Frobenius norm. Using the result of the exercise above, we can write

$$\|\mathbf{A}\|_F^2 = \mathrm{tr}(\mathbf{A}^\top\mathbf{A}).$$

Note the qualifier $F$ used to distinguish this norm from other matrix norms that we will define in the sequel.
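
The following NumPy sketch (random matrices) checks the matrix inner product and Frobenius norm identities; note that np.linalg.norm defaults to the Frobenius norm for 2-D arrays:

```python
# A sketch verifying <A, B> = sum_ij a_ij b_ij = tr(A^T B) and
# ||A||_F^2 = tr(A^T A).
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))

inner = np.sum(A * B)                                      # element-wise product, summed
print(np.isclose(inner, np.trace(A.T @ B)))                # True
print(np.isclose(np.linalg.norm(A)**2, np.trace(A.T @ A))) # True
```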

There exists another “standard” way of defining matrix norms by thinking of an $m \times n$ matrix $\mathbf{A}$ as a linear operator mapping between two normed spaces, say, $\mathbf{A} : (\mathbb{R}^n, \ell_p) \to (\mathbb{R}^m, \ell_q)$. Then, we can define the operator norm measuring the maximum change of length (in the $\ell_q$ sense) of a unit (in the $\ell_p$ sense) vector in the operator's domain:

$$\|\mathbf{A}\|_{p,q} = \max_{\|\mathbf{u}\|_p = 1} \|\mathbf{A}\mathbf{u}\|_q.$$

Exercise. Use the axioms of a norm to show that $\|\mathbf{A}\|_{p,q}$ is a norm.
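
For $p = q = 2$, the operator norm coincides with the largest singular value of $\mathbf{A}$, which NumPy returns as np.linalg.norm(A, 2); a crude sketch (random sampling of unit vectors, illustrative only) gives a lower bound that approaches it:

```python
# A sketch of the operator norm for p = q = 2: a crude lower bound obtained by
# maximizing ||Au||_2 over random l2-unit vectors, compared with the exact value.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 6))

best = 0.0
for _ in range(20000):
    u = rng.standard_normal(6)
    u /= np.linalg.norm(u)              # a random unit vector in the domain R^6
    best = max(best, np.linalg.norm(A @ u))

print(best)                             # a lower bound, close to ...
print(np.linalg.norm(A, 2))             # ... the exact 2,2 operator norm
```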

Eigendecomposition

Eigendecomposition (a.k.a. eigenvalue decomposition or, in some contexts, spectral decomposition, from the German eigen for “self”) is doubtlessly the most important and useful form of matrix factorization. The following discussion will be valid only for square matrices.

Recall that an $n \times n$ matrix $\mathbf{A}$ represents a linear map on $\mathbb{R}^n$. In general, the effect of $\mathbf{A}$ on a vector $\mathbf{u}$ is a new vector $\mathbf{v} = \mathbf{A}\mathbf{u}$, rotated and elongated or shrunk (and, potentially, reflected). However, there exist vectors which are only elongated or shrunk by $\mathbf{A}$. Such vectors are called eigenvectors. Formally, an eigenvector of $\mathbf{A}$ is a non-zero vector $\mathbf{u}$ satisfying $\mathbf{A}\mathbf{u} = \lambda\mathbf{u}$, with the scalar $\lambda$ (called eigenvalue) measuring the amount of elongation or shrinkage of $\mathbf{u}$ (if $\lambda < 0$, the vector is reflected). Note that the scale of an eigenvector has no meaning, as it appears on both sides of the equation; for this reason, eigenvectors are always normalized (in the $\ell_2$ sense). For reasons not so relevant to our course, the collection of the eigenvalues is called the spectrum of a matrix.

For an $n \times n$ matrix $\mathbf{A}$ with $n$ linearly independent eigenvectors, we can write the following system:

$$\begin{cases} \mathbf{A}\mathbf{u}_1 = \lambda_1 \mathbf{u}_1 \\ \quad\vdots \\ \mathbf{A}\mathbf{u}_n = \lambda_n \mathbf{u}_n \end{cases}$$

Stacking the eigenvectors into the columns of the $n \times n$ matrix $\mathbf{U} = (\mathbf{u}_1, \dots, \mathbf{u}_n)$, and defining the diagonal matrix

$$\mathbf{\Lambda} = \mathrm{diag}\{\lambda_1, \dots, \lambda_n\} = \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix},$$

we can rewrite the system more compactly as $\mathbf{A}\mathbf{U} = \mathbf{U}\mathbf{\Lambda}$. Independence of the eigenvectors means that $\mathbf{U}$ is invertible, which leads to $\mathbf{A} = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^{-1}$. Geometrically, the eigendecomposition of a matrix can be interpreted as a change of coordinates into a basis in which the action of the matrix can be described as elongation or shrinkage only (represented by the diagonal matrix $\mathbf{\Lambda}$).
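
In NumPy, np.linalg.eig returns the eigenvalues and the matrix $\mathbf{U}$ of eigenvectors (as columns); a sketch verifying $\mathbf{A}\mathbf{U} = \mathbf{U}\mathbf{\Lambda}$ and the reconstruction $\mathbf{A} = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^{-1}$:

```python
# A sketch of eigendecomposition with a random matrix; the eigenvalues may be
# complex even for a real matrix, but the identities below still hold.
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))

lam, U = np.linalg.eig(A)                  # eigenvalues and eigenvectors (columns of U)
Lam = np.diag(lam)

print(np.allclose(A @ U, U @ Lam))                    # A U = U Lam
print(np.allclose(A, U @ Lam @ np.linalg.inv(U)))     # A = U Lam U^-1
```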

If the matrix $\mathbf{A}$ is symmetric, it can be shown that its eigenvectors are orthonormal, i.e., $\langle \mathbf{u}_i, \mathbf{u}_j \rangle = \mathbf{u}_i^\top \mathbf{u}_j = 0$ for every $i \ne j$ and, since the eigenvectors have unit length, $\mathbf{u}_i^\top \mathbf{u}_i = 1$. This can be compactly written as $\mathbf{U}^\top \mathbf{U} = \mathbf{I}$ or, in other words, $\mathbf{U}^{-1} = \mathbf{U}^\top$. Matrices satisfying this property are called orthonormal or unitary, and we will say that symmetric matrices admit a unitary eigendecomposition $\mathbf{A} = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^\top$.

Exercise. Show that a symmetric matrix admits a unitary eigendecomposition $\mathbf{A} = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^\top$.
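
For symmetric matrices, np.linalg.eigh returns real eigenvalues and orthonormal eigenvectors; a sketch checking $\mathbf{U}^\top\mathbf{U} = \mathbf{I}$ and the unitary eigendecomposition:

```python
# A sketch of the symmetric case: real eigenvalues, orthonormal U, A = U Lam U^T.
import numpy as np

rng = np.random.default_rng(5)
S = rng.standard_normal((4, 4))
A = (S + S.T) / 2                          # a symmetric matrix

lam, U = np.linalg.eigh(A)
Lam = np.diag(lam)

print(np.allclose(U.T @ U, np.eye(4)))     # orthonormal eigenvectors: U^T U = I
print(np.allclose(A, U @ Lam @ U.T))       # unitary eigendecomposition
```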

Finally, we note two very simple but useful facts about eigendecomposition:

  • $\mathrm{tr}\,\mathbf{A} = \mathrm{tr}(\mathbf{U}\mathbf{\Lambda}\mathbf{U}^{-1}) = \mathrm{tr}(\mathbf{\Lambda}\mathbf{U}^{-1}\mathbf{U}) = \mathrm{tr}\,\mathbf{\Lambda} = \sum_{i=1}^n \lambda_i$.

  • $\det\mathbf{A} = \det\mathbf{U} \, \det\mathbf{\Lambda} \, \det\mathbf{U}^{-1} = \det\mathbf{U} \, \det\mathbf{\Lambda} \, \frac{1}{\det\mathbf{U}} = \det\mathbf{\Lambda} = \prod_{i=1}^n \lambda_i$.

In other words, the trace and the determinant of a matrix are given by the sum and the product of its eigenvalues, respectively.
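
A quick check of these two facts with NumPy (random matrix; the eigenvalues may be complex, but their sum and product are real):

```python
# A sketch checking tr A = sum of eigenvalues and det A = product of eigenvalues.
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 5))
lam = np.linalg.eigvals(A)                 # possibly complex for a general real A

print(np.isclose(np.trace(A), np.sum(lam).real))         # True
print(np.isclose(np.linalg.det(A), np.prod(lam).real))   # True
```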

Matrix functions

Eigendecomposition is a very convenient way of performing various matrix operations. For example, if we are given the eigendecomposition $\mathbf{A} = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^{-1}$, the inverse can be expressed as $\mathbf{A}^{-1} = (\mathbf{U}\mathbf{\Lambda}\mathbf{U}^{-1})^{-1} = \mathbf{U}\mathbf{\Lambda}^{-1}\mathbf{U}^{-1}$; moreover, since $\mathbf{\Lambda}$ is diagonal, $\mathbf{\Lambda}^{-1} = \mathrm{diag}\{1/\lambda_1, \dots, 1/\lambda_n\}$. (This does not suggest, of course, that this is the computationally preferred way to invert matrices, as the eigendecomposition itself is a costly operation.)

A similar idea can be applied to the square of a matrix: $\mathbf{A}^2 = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^{-1}\mathbf{U}\mathbf{\Lambda}\mathbf{U}^{-1} = \mathbf{U}\mathbf{\Lambda}^2\mathbf{U}^{-1}$ and, again, we note that $\mathbf{\Lambda}^2 = \mathrm{diag}\{\lambda_1^2, \dots, \lambda_n^2\}$. By induction, we can generalize this result to any integer power $p \ge 0$: $\mathbf{A}^p = \mathbf{U}\mathbf{\Lambda}^p\mathbf{U}^{-1}$. (Here, if, say, $p = 1000$, the computational advantage of using eigendecomposition might be well justified.)
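
A NumPy sketch of both operations through the eigendecomposition (a symmetric, invertible matrix is used here so that everything stays real and $\mathbf{U}^{-1} = \mathbf{U}^\top$):

```python
# A sketch of A^-1 = U Lam^-1 U^T and A^3 = U Lam^3 U^T for a symmetric matrix.
import numpy as np

rng = np.random.default_rng(7)
S = rng.standard_normal((4, 4))
A = S @ S.T + np.eye(4)                    # symmetric and invertible by construction

lam, U = np.linalg.eigh(A)

A_inv = U @ np.diag(1.0 / lam) @ U.T       # inverse through the eigenvalues
A_cubed = U @ np.diag(lam**3) @ U.T        # third power through the eigenvalues

print(np.allclose(A_inv, np.linalg.inv(A)))    # True
print(np.allclose(A_cubed, A @ A @ A))         # True
```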

Going one step further, let

$$\varphi(t) = \sum_{i \ge 0} c_i t^i$$

be a polynomial (either of a finite degree or an infinite series). We can apply this function to a square matrix A as follows:

$$\varphi(\mathbf{A}) = \sum_{i \ge 0} c_i \mathbf{A}^i = \sum_{i \ge 0} c_i \mathbf{U}\mathbf{\Lambda}^i\mathbf{U}^{-1} = \mathbf{U}\left(\sum_{i \ge 0} c_i \mathbf{\Lambda}^i\right)\mathbf{U}^{-1} = \mathbf{U}\,\mathrm{diag}\left\{\sum_{i \ge 0} c_i \lambda_1^i, \dots, \sum_{i \ge 0} c_i \lambda_n^i\right\}\mathbf{U}^{-1} = \mathbf{U}\,\mathrm{diag}\{\varphi(\lambda_1), \dots, \varphi(\lambda_n)\}\mathbf{U}^{-1}.$$

Denoting by $\varphi(\mathbf{\Lambda}) = \mathrm{diag}\{\varphi(\lambda_1), \dots, \varphi(\lambda_n)\}$ the diagonal matrix formed by the element-wise application of the scalar function $\varphi$ to the eigenvalues of $\mathbf{A}$, we can write compactly

$$\varphi(\mathbf{A}) = \mathbf{U}\varphi(\mathbf{\Lambda})\mathbf{U}^{-1}.$$

Finally, since many functions can be described by polynomial series, we can generalize the latter definition to a (more or less) arbitrary scalar function $\varphi$.

The above procedure is a standard way of constructing a matrix function (this term is admittedly confusing, as we will see it assume another meaning); for example, the matrix exponential and logarithm are constructed exactly like this. Note that this construction is sharply different from applying the function $\varphi$ element-wise!
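
As an example, a NumPy sketch of the matrix exponential built through the eigendecomposition, compared with a truncated power series and contrasted with element-wise exponentiation:

```python
# A sketch of exp(A) = U diag{exp(lambda_i)} U^-1 for a symmetric matrix,
# checked against the truncated Taylor series sum_i A^i / i!.
import numpy as np

rng = np.random.default_rng(8)
S = rng.standard_normal((3, 3))
A = (S + S.T) / 2                          # symmetric, so the decomposition stays real

lam, U = np.linalg.eigh(A)
expA = U @ np.diag(np.exp(lam)) @ U.T      # phi applied to the eigenvalues only

series = np.zeros_like(A)                  # truncated power series as a reference
term = np.eye(3)
for i in range(1, 30):
    series += term                         # adds A^(i-1) / (i-1)!
    term = term @ A / i

print(np.allclose(expA, series))           # True (up to truncation error)
print(np.allclose(expA, np.exp(A)))        # False: element-wise exp is not exp(A)
```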

Positive definite matrices

Symmetric square matrices define an important family of functions called quadratic forms that we will encounter very often in this course. Formally, a quadratic form is a scalar function on $\mathbb{R}^n$ given by $\mathbf{x}^\top\mathbf{A}\mathbf{x} = \sum_{i,j=1}^n a_{ij} x_i x_j$, where $\mathbf{A}$ is a symmetric $n \times n$ matrix, and $\mathbf{x} \in \mathbb{R}^n$.

A symmetric square matrix $\mathbf{A}$ is called positive definite (denoted as $\mathbf{A} \succ 0$) iff $\mathbf{x}^\top\mathbf{A}\mathbf{x} > 0$ for every $\mathbf{x} \ne \mathbf{0}$. The matrix is called positive semi-definite (denoted as $\mathbf{A} \succeq 0$) if the inequality is weak.

Positive (semi-) definite matrices can be equivalently defined through their eigendecomposition:

Let $\mathbf{A}$ be a symmetric matrix admitting the eigendecomposition $\mathbf{A} = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^\top$. Then $\mathbf{A} \succ 0$ iff $\lambda_i > 0$ for $i = 1, \dots, n$. Similarly, $\mathbf{A} \succeq 0$ iff $\lambda_i \ge 0$.

In other words, the matrix is positive (semi-) definite if it has a positive (non-negative) spectrum. To get a hint why this is true, consider an arbitrary vector $\mathbf{x} \ne \mathbf{0}$, and write

$$\mathbf{x}^\top\mathbf{A}\mathbf{x} = \mathbf{x}^\top\mathbf{U}\mathbf{\Lambda}\mathbf{U}^\top\mathbf{x} = (\mathbf{U}^\top\mathbf{x})^\top\mathbf{\Lambda}(\mathbf{U}^\top\mathbf{x}).$$

Denoting $\mathbf{y} = \mathbf{U}^\top\mathbf{x}$, we have

$$\mathbf{x}^\top\mathbf{A}\mathbf{x} = \mathbf{y}^\top\mathbf{\Lambda}\mathbf{y} = \sum_{i=1}^n \lambda_i y_i^2.$$

Since $\mathbf{U}$ is full rank, the vector $\mathbf{y}$ is also an arbitrary non-zero vector in $\mathbb{R}^n$, and the only way to make the latter sum always positive is by ensuring that all the $\lambda_i$ are positive. The very same reasoning is also true in the opposite direction.
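
A NumPy sketch illustrating the equivalence on a matrix that is positive definite by construction:

```python
# A sketch: A = S S^T + I is symmetric positive definite; its spectrum is
# strictly positive and the quadratic form x^T A x is positive on random x != 0.
import numpy as np

rng = np.random.default_rng(9)
S = rng.standard_normal((4, 4))
A = S @ S.T + np.eye(4)                    # symmetric positive definite by construction

print(np.all(np.linalg.eigvalsh(A) > 0))   # True: positive spectrum

X = rng.standard_normal((1000, 4))         # many random (almost surely non-zero) vectors
quad = np.einsum('bi,ij,bj->b', X, A, X)   # x^T A x for each row x of X
print(np.all(quad > 0))                    # True
```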

Geometrically, a quadratic form describes a second-order (hence the name quadratic) surface in $\mathbb{R}^n$, and the eigenvalues of the matrix $\mathbf{A}$ can be interpreted as the surface curvature. Very informally, if a certain eigenvalue $\lambda_i$ is positive, a small step in the direction of the corresponding eigenvector $\mathbf{u}_i$ rotates the normal to the surface in the same direction. The surface is said to have positive curvature in that direction. Similarly, a negative eigenvalue corresponds to the normal rotating in the direction opposite to the step (negative curvature). Finally, if $\lambda_i = 0$, a step in the direction of $\mathbf{u}_i$ leaves the normal unchanged (the surface is said to be flat in that direction). A quadratic form created by a positive definite matrix represents a surface that is positively curved in all directions. Such a surface is cup-shaped (if you can imagine an $n$-dimensional cup) or, formally, convex; in the sequel, we will see the important consequences this property has for optimization problems.

  1. We will henceforth abbreviate “if and only if” as “iff”. Two statements related by “iff” are equivalent; for example, if one of the statements is a definition of some object, and the other is its property, the latter property can be used as an alternative definition. We will see many such examples. 

  2. We will often abbreviate “with respect to” as “w.r.t.”