Lecture 7: Sparsity-based priors
We continue our quest for a better model for the patch prior probability $P_F(\pi_x F)$ to be used in MAP and Bayesian estimators. This time we consider the family of priors based on the assertion that a patch admits a sparse representation in some dictionary.
Kurtosis and sparsity
Let us start with a little background in probability theory. Consider the Gaussian distribution of a random variable, given by the density function
$$ f_X(x) \propto e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$
and fully characterized by the mean $\mu$ and variance $\sigma^2$. Since the density function has a negative exponential of the square of $x$, its tails decay very fast. However, there might be other distributions with much "heavier" tails, that is, containing more probability in the tails.
A common measure for the “tailedness” of a distribution is the notion of kurtosis, defined as the normalized central fourth moment,
$$ \mathrm{Kurt}\, X = \mathbb{E}\left(\frac{X-\mu}{\sigma}\right)^4 = \frac{\mathbb{E}(X-\mu)^4}{\sigma^4}. $$
All Gaussian distributions happen to have kurtosis 3. It is customary to define the kurtosis excess as $\mathrm{Kurt}\, X - 3$. Distributions with positive kurtosis excess are called super-Gaussian or leptokurtic or, colloquially, "heavy-tailed". Positive kurtosis excess arises in two circumstances: when the probability mass is concentrated around the mean and the data-generating process produces occasional values far from the mean, and when the probability mass is concentrated in the tails of the distribution. On the contrary, when the kurtosis excess is negative, the distribution is called sub-Gaussian or platykurtic or, informally, "light-tailed".
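As a quick worked check of the definition (an added example, not from the original notes): for a Gaussian variable the fourth central moment is $\mathbb{E}(X-\mu)^4 = 3\sigma^4$, so
$$ \mathrm{Kurt}\, X = \frac{3\sigma^4}{\sigma^4} = 3, $$
giving zero kurtosis excess. For a Laplace variable with scale $b$, the variance is $2b^2$ and the fourth central moment is $24 b^4$, so $\mathrm{Kurt}\, X = \frac{24 b^4}{(2b^2)^2} = 6$, i.e., a kurtosis excess of 3, a super-Gaussian distribution.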
The exponential power distribution given by the density function
$$ f_X(x) \propto e^{-(\alpha|x-\mu|)^p} $$
can be thought of as a generalized Gaussian distribution with the mean and variance controlled by the parameters $\mu$ and $\alpha$, respectively, and the shape of the distribution controlled by the power $p$. Note that $p=2$ yields exactly the Gaussian distribution with $\mathrm{Kurt}\, X - 3 = 0$. For $p<2$, a super-Gaussian family of distributions is obtained, with the notable case of the Laplace (a.k.a. double-exponential) distribution corresponding to $p=1$. For $p>2$, the family is sub-Gaussian, converging pointwise to the uniform distribution on $[\mu - \frac{1}{\alpha}, \mu + \frac{1}{\alpha}]$ as $p\to\infty$.
Let x be an n-dimensional vector sampled i.i.d. from a zero-mean super-Gaussian random variable. Since the variable will often realize values around zero, and occasionally large non-zero values, most of the elements of the vector will be nearly zero, with some elements (at uniformly random indices) having a large magnitude. This can be thought of as a probabilistic formalization of the statement “vector x is sparse”. In this sense, super-Gaussianity leads to sparsity.
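To see this numerically, here is a minimal sketch (not part of the original notes, assuming NumPy is available): we draw a long vector from a zero-mean Gaussian and from a zero-mean Laplace distribution with matched unit variance, and compare the empirical kurtosis excess and the fraction of nearly-zero entries.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

gauss = rng.normal(0.0, 1.0, n)                  # Gaussian, kurtosis excess ~ 0
laplace = rng.laplace(0.0, 1.0 / np.sqrt(2), n)  # Laplace with unit variance, excess ~ 3

def kurtosis_excess(x):
    """Normalized central fourth moment minus 3."""
    xc = x - x.mean()
    return (xc**4).mean() / (xc**2).mean()**2 - 3.0

for name, x in [("Gaussian", gauss), ("Laplace", laplace)]:
    near_zero = np.mean(np.abs(x) < 0.1)  # fraction of nearly-zero entries
    print(f"{name:8s} kurtosis excess = {kurtosis_excess(x):5.2f}, "
          f"fraction |x| < 0.1 = {near_zero:.2f}")
```

The Laplace vector concentrates noticeably more of its entries near zero while still producing occasional large values, which is exactly the probabilistic notion of sparsity described above.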
Bases, frames and dictionaries
We will now need a few important notions from linear algebra and functional analysis. Recall that we defined a patch of a signal $f$ centered at $x$ as the function $f$ translated by $x$ and restricted to the domain $\square = [-\frac{T}{2}, \frac{T}{2}]^d$. For notational convenience, in the following treatment we refer to the patch by the name $f$ itself. Recall that any continuous function on $\square$ can be decomposed into the Fourier series
$$ f = \sum_{n\in\mathbb{Z}^d} c_n \phi_n, $$
with
$$ \phi_n(x) = e^{\frac{i\, x^\top n}{T}} $$
and the coefficients $c_n = \langle f, \phi_n\rangle_{L^2(\square)}$. An important observation here is that the functions $\{\phi_n\}_{n\in\mathbb{Z}^d}$ are linearly independent and span the space of functions $L^2(\square)$. Such a set is called a basis of the said space. The crucial feature of a basis is the uniqueness of the decomposition. In other words, there exists only one set of coefficients $c_n$ representing $f$ in $\{\phi_n\}_n$; in the particular case discussed above, the basis is furthermore orthonormal, so we have a simple formula for computing the coefficients using orthogonal projections on the basis functions1.
Let us now be given a collection of linearly dependent functions $\{\phi_n\}_n$ spanning $L^2(\square)$. Such functions no longer form a basis, and therefore the benefit of unique representation is lost. However, as we will see in the sequel, the loss of uniqueness can actually be a very powerful tool. A formal generalization of a basis is the notion of a frame, defined through the following condition: if there exist two constants $0 < A \le B < \infty$ such that for every $f$ in the space
$$ A\|f\|^2 \le \sum_{n\in\mathbb{Z}^d} |\langle f, \phi_n\rangle|^2 \le B\|f\|^2, $$
the set $\{\phi_n\}_n$ is called a frame. A proper frame (that is, one which is not a basis) is called overcomplete or redundant. Intuitively, it contains "more" functions than a basis would need. Formally, there exists an infinite set of coefficients $c = \{c_n\}$ representing $f$ w.r.t. $\{\phi_n\}$.
We will associate with the frame the synthesis operator $\Phi : \ell^2 \to L^2(\square)$ producing
$$ \Phi c = \sum_{n\in\mathbb{Z}^d} c_n \phi_n $$
from the sequence of coefficients $\{c_n\}$. The adjoint analysis operator $\Phi^\ast : L^2(\square) \to \ell^2$ maps $f$ to $c$.
In signal processing jargon, it is customary to speak of a frame as a dictionary and of the constituent functions as atoms. A function $f$ can then be represented as a superposition of atoms. If the dictionary is overcomplete, such a representation is not unique.
For example, the set of functions
$$ \phi_n(x) = e^{\frac{i\, x^\top n}{aT}} $$
with $a > 1$ forms a frame on $\square$. We can think of it as an overcomplete Fourier transform dictionary. Taking the real part yields the overcomplete cosine dictionary,
$$ \phi_n(x) = \cos\left(\frac{x^\top n}{aT}\right). $$
There exists a plethora of other useful dictionaries, such as wavelets, ridgelets, curvelets, and contourlets (all of which have a beautiful theory), and it is also possible to learn a dictionary (optimal in some sense) from examples. The latter approach has been shown to be advantageous in many applications.
Discrete dictionaries
When the patch domain is sampled, say, on the lattice $\frac{T}{N}\mathbb{Z}^d$, $f$ is given on a finite set of $(2N+1)^d$ points
$$ f[n] = f\left(\frac{T}{N}\, n\right) $$
with $n \in \{-N,\dots,N\}^d$. Reordering the samples $f[n]$ into a long $n = (2N+1)^d$-dimensional vector $f$, we can think of the dictionary representation as
$$ f = \Phi c, $$
where $\Phi$ is an $n \times k$ matrix whose columns are the sampled frame functions $\phi_k$, reordered into $n$-dimensional vectors and arranged in some fixed order, with $k = an > n$ of them. The overcomplete cosine frame readily becomes the overcomplete discrete cosine transform (DCT) dictionary.
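As an illustration, here is a minimal sketch (not from the original notes; the sizes and the exact frequency grid are arbitrary choices) of how a one-dimensional overcomplete DCT dictionary can be assembled:

```python
import numpy as np

def overcomplete_dct(n, a=2):
    """Build an n x k dictionary of sampled cosine atoms with k = a*n > n.

    The columns are cosines of k different frequencies sampled at n points;
    for a = 1 this is essentially the (orthogonal) DCT basis, while for a > 1
    the frequencies are packed more densely and the dictionary is redundant.
    """
    k = a * n
    t = np.arange(n)                   # sample positions inside the patch
    freqs = np.pi * np.arange(k) / k   # k frequencies on [0, pi)
    Phi = np.cos(np.outer(t + 0.5, freqs))
    return Phi / np.linalg.norm(Phi, axis=0)  # normalize atoms to unit norm

Phi = overcomplete_dct(n=8, a=2)  # 8 x 16 dictionary for 1D patches of 8 samples
print(Phi.shape)                  # (8, 16)
```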
Sparse patch priors
A crucial empirical observation is that patches of natural images can be approximated by a small number of atoms in many overcomplete dictionaries such as the overcomplete cosine dictionary. We will formalize this by letting
$$ \pi_x F = \Phi c + E = \sum_{n\in\mathbb{Z}^d} c_n \phi_n + E $$
and asserting that the coefficients $c_n$ are i.i.d. with a super-Gaussian distribution, while the residual $E$ is white Gaussian with variance $\sigma_E^2$.
This leads to the following negative log likelihood:
$$ -\log f_{\pi_x F \mid c(x)}\big(\pi_x F = f \mid c(x) = \{c_n(x)\}\big) = -\log f_E(f - \Phi c(x)) = \frac{1}{\sigma_E^2}\,\|f - \Phi c(x)\|^2_{L^2(\square)} + \mathrm{const}, $$
with the negative log prior density of the coefficients,
$$ -\log f_c(c(x)) = \sum_{n\in\mathbb{Z}^d} -\log f_c(c_n(x)). $$
For example, using the exponential power family for $f_c$ leads to
$$ -\log f_c(c(x)) = \alpha^p \sum_{n\in\mathbb{Z}^d} |c_n(x)|^p + \mathrm{const}. $$
Invoking Bayes' theorem leads to the following negative log posterior density:
$$ -\log f_{c(x)\mid \pi_x F}(c(x)\mid f) = \frac{1}{\sigma_E^2}\,\|f - \Phi c(x)\|^2_{L^2(\square)} + \alpha^p \sum_{n\in\mathbb{Z}^d} |c_n(x)|^p + \mathrm{const}. $$
Note that in order to promote sparsity of the coefficients, $p$ has to be small ($p<2$). However, for $p<1$, the negative log density is non-convex, which is a major complication, as we eventually would like to add it as a prior term to our optimization problem in a Bayesian or MAP estimator. A pragmatic choice is $p=1$, the smallest value of $p$ keeping the term convex (yet making it non-smooth at zero). This choice corresponds to the Laplace distribution, leading to the following aggregate of the $L^2$ norm with the $\ell^1$ norm:
$$ -\log f_{c(x)\mid \pi_x F}(c(x)\mid f) = \frac{1}{\sigma_E^2}\,\|f - \Phi c(x)\|^2_{L^2(\square)} + \alpha \sum_{n\in\mathbb{Z}^d} |c_n(x)| + \mathrm{const}. $$
Rescaling by $\sigma_E^2$ (which does not change the minimizer), this becomes
$$ \|f - \Phi c(x)\|^2_{L^2(\square)} + \lambda \sum_{n\in\mathbb{Z}^d} |c_n(x)| + \mathrm{const}, $$
where $\lambda = \alpha\sigma_E^2$.
Some flavors of sparse priors use dictionaries with non-negative functions and further assert non-negative coefficients, $c_n(x) \ge 0$. This incorporates the prior information that the signal is non-negative, and is often employed for modelling images and spectral magnitudes of audio signals.
MAP estimation with a sparse patch prior
At this point, we have two possibilities. We can rephrase our model as
$$ \pi_x Y = \Phi c + E $$
and estimate the contents of a patch directly as
$$ \pi_x \hat f = \Phi \hat c, $$
where
$$ \hat c = \arg\min_c\, -\log f_{c \mid \pi_x Y}(c \mid \pi_x Y = \pi_x y) = \arg\min_c\, \|\pi_x y - \Phi c\|^2_{L^2(\square)} + \lambda \sum_{n\in\mathbb{Z}^d} |c_n|. $$
This can be done for every $x$; afterwards, the overlapping estimated patches are aggregated by simple averaging (a sketch of this patch-wise procedure is given below). However, such averaging steers us farther away from the assumption of sparse representation, as a sum of sparsely represented functions is less sparsely represented. A cure can be a smarter way to aggregate the patches.
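A rough sketch of this patch-wise procedure follows (not from the original notes; a 1D signal is used for simplicity, and `solve_lasso` is a placeholder for any per-patch solver of the above problem, e.g., the iterative shrinkage derived below):

```python
import numpy as np

def denoise_patchwise(y, Phi, lam, solve_lasso, patch=8, stride=1):
    """Patch-wise MAP estimate of a 1D signal with averaging aggregation.

    y           : noisy signal (1D array)
    Phi         : patch x k dictionary matrix
    lam         : sparsity weight lambda
    solve_lasso : hypothetical routine returning
                  argmin_c 0.5*||Phi c - p||^2 + lam*||c||_1 for a patch p
    """
    f_hat = np.zeros_like(y, dtype=float)
    weight = np.zeros_like(y, dtype=float)
    for start in range(0, len(y) - patch + 1, stride):
        p = y[start:start + patch]              # extract an overlapping patch
        c = solve_lasso(Phi, p, lam)            # sparse code of the patch
        f_hat[start:start + patch] += Phi @ c   # accumulate the patch estimate
        weight[start:start + patch] += 1.0      # count overlaps for averaging
    return f_hat / np.maximum(weight, 1.0)      # simple averaging aggregation
```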
An alternative approach is to estimate the entire signal f by solving
$$ \hat f = \arg\min_{f, c}\, \mu\|f - y\|^2_{L^2(\mathbb{R}^d)} + \int_{\mathbb{R}^d} \left( \|\pi_x f - \Phi c(x)\|^2_{L^2(\square)} + \lambda\sum_{n\in\mathbb{Z}^d}|c_n(x)| \right) dx $$
simultaneously for $f$ and $c_n(x)$ at all patches.
Iterative shrinkage
Whichever flavor of the above MAP estimators we choose, it requires the minimization of an aggregate of the $L^2$ and $\ell^1$ norms. We will rewrite this problem in the form
$$ \begin{aligned} \arg\min_c\, \tfrac{1}{2}\|\Phi c - y\|^2_{L^2(\square)} + \lambda\sum_{n\in\mathbb{Z}^d}|c_n| &= \arg\min_c\, \tfrac{1}{2}\langle \Phi c - y, \Phi c - y\rangle + \lambda\|c\|_1 \\ &= \arg\min_c\, \tfrac{1}{2}\langle \Phi c, \Phi c\rangle - \langle \Phi c, y\rangle + \tfrac{1}{2}\langle y, y\rangle + \lambda\|c\|_1 \\ &= \arg\min_c\, \tfrac{1}{2}\langle \Phi^\ast\Phi c, c\rangle - \langle \Phi^\ast y, c\rangle + \lambda\|c\|_1, \end{aligned} $$
where the inner products are on $L^2(\square)$ and the $\ell^1$ norm is on the space of sequences. This problem bears the name of Lasso (short for least absolute shrinkage and selection operator) in statistics.
Note that the objective is a function of the sequence $c$, with the first two terms differentiable, which we will denote as
$$ g(c) = \tfrac{1}{2}\langle \Phi^\ast\Phi c, c\rangle - \langle \Phi^\ast y, c\rangle, $$
and a non-differentiable third term, $h(c) = \lambda\|c\|_1$. Let us fix some $c$ and approximate $g$ around $c$ as
$$ \begin{aligned} g(u) \approx g(c) + \langle \nabla g(c),\, u - c\rangle_{\ell^2} + \frac{1}{2\eta}\|u-c\|^2_{\ell^2} &= \frac{1}{2\eta}\|u - c + \eta\nabla g(c)\|^2_{\ell^2} - \frac{\eta}{2}\|\nabla g(c)\|^2_{\ell^2} + g(c) \\ &= \frac{1}{2\eta}\|u - (c - \eta\nabla g(c))\|^2_{\ell^2} + \mathrm{const}. \end{aligned} $$
Here $\eta$ controls the curvature of the second-order term in the approximation. Plugging this approximation into the minimization problem yields
$$ \begin{aligned} \arg\min_u\, g(u) + h(u) &\approx \arg\min_u\, \frac{1}{2\eta}\|u - (c - \eta\nabla g(c))\|^2_{\ell^2} + \lambda\|u\|_1 \\ &= \arg\min_u\, \tfrac{1}{2}\|u - z\|^2_{\ell^2} + \eta\lambda\|u\|_1 \\ &= \arg\min_u\, \sum_{n\in\mathbb{Z}^d} \tfrac{1}{2}(u_n - z_n)^2 + \eta\lambda|u_n| \\ &= \Big\{ \arg\min_{u}\, \tfrac{1}{2}(u - z_n)^2 + \eta\lambda|u| \Big\}_{n\in\mathbb{Z}^d}, \end{aligned} $$
where
$$ z = c - \eta\nabla g(c) = c - \eta\,\Phi^\ast(\Phi c - y). $$
Note that the latter problem is coordinate-separable, so we can consider the following one-dimensional minimization problem:
$$ \arg\min_u\, \tfrac{1}{2}(u-z)^2 + \lambda|u|. $$
Since the problem is non-smooth at $u=0$, we cannot readily take the derivative w.r.t. $u$ and equate it to zero; instead, we have to use the subdifferential set
$$ \partial|u| = \begin{cases} \operatorname{sign}(u) & : u \ne 0 \\ [-1,1] & : u = 0. \end{cases} $$
The first-order optimality condition yields $u = z - \lambda\alpha$ for some $\alpha\in\partial|u|$. Whenever $z\in[-\lambda,\lambda]$, we can set $u=0$ with $\alpha = \frac{z}{\lambda}\in[-1,1] = \partial|u|\,\big|_{u=0}$. Otherwise, $u\ne 0$ and we have $\alpha = \operatorname{sign} u$, yielding $u = z - \lambda\operatorname{sign} u$. If $z > \lambda$, the right-hand side is positive, hence $u>0$ and $\operatorname{sign} u = 1$; this yields $u = z-\lambda$. Similarly, for $z < -\lambda$, we have $u<0$ and hence $u = z+\lambda$. We can therefore summarize the solution to the problem as
$$ \arg\min_u\, \tfrac{1}{2}(u-z)^2 + \lambda|u| = \begin{cases} 0 & : -\lambda \le z \le \lambda \\ z - \lambda & : z > \lambda \\ z + \lambda & : z < -\lambda. \end{cases} $$
This operation on $z$ is called soft thresholding or shrinkage2 and will be denoted by $u = S_\lambda(z)$.
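In code, the shrinkage operator is a one-liner; a minimal sketch assuming NumPy:

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft thresholding S_lam(z): shrink z towards zero by lam, element-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```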
Plugging this result back into our original multi-dimensional problem yields
$$ \arg\min_u\, g(u) + h(u) \approx \{ S_{\eta\lambda}(z_n) \}_{n\in\mathbb{Z}^d} = S_{\eta\lambda}\big(c - \eta\,\Phi^\ast(\Phi c - y)\big), $$
where in the last passage the shrinkage operator $S_{\eta\lambda}$ is applied element-wise to the sequence. We can repeat the process iteratively, starting at some initial $c^0$ (superscript indices denote the iteration number),
$$ c^{k+1} = S_{\eta\lambda}\big(c^k - \eta\,\Phi^\ast(\Phi c^k - y)\big), $$
yielding a process known as iterative shrinkage (a.k.a. the iterative shrinkage and thresholding algorithm, or ISTA for short)3. Note that the step $c^k - \eta\Phi^\ast(\Phi c^k - y)$ inside the shrinkage operator is a gradient descent step at the point $c^k$ with step size $\eta$; it would be exactly the update if the objective lacked the term $h(c)$. The shrinkage operator accounts for this additional term.
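A minimal sketch of the resulting iteration for a finite dictionary matrix (not from the original notes; it reuses the soft_threshold helper above, and the step size $\eta$ is set to the reciprocal of the largest eigenvalue of $\Phi^\ast\Phi$, a standard choice that keeps the quadratic approximation an upper bound):

```python
import numpy as np

def ista(Phi, y, lam, n_iter=200):
    """Iterative shrinkage for min_c 0.5*||Phi c - y||^2 + lam*||c||_1."""
    eta = 1.0 / np.linalg.norm(Phi.T @ Phi, 2)  # step size 1/L, L = largest eigenvalue
    c = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ c - y)                   # gradient of the smooth part g
        c = soft_threshold(c - eta * grad, eta * lam)  # gradient step, then shrinkage
    return c
```

Combined with the overcomplete_dct dictionary and the denoise_patchwise routine sketched earlier, ista can serve as the per-patch solve_lasso solver.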
Non-negativity
In the case of non-negative dictionaries with non-negative coefficients, we can modify our one-dimensional Lasso problem by adding a non-negativity constraint on u,
$$ \arg\min_{u\ge 0}\, \tfrac{1}{2}(u-z)^2 + \lambda|u|. $$
The solution is obtained in the same manner, except that now $u$ cannot attain negative values. This leads to the one-sided shrinkage operator
$$ \arg\min_{u\ge 0}\, \tfrac{1}{2}(u-z)^2 + \lambda|u| = \begin{cases} 0 & : z \le \lambda \\ z - \lambda & : z > \lambda \end{cases} = R_\lambda(z). $$
This operator is known under the name of rectified linear unit (or ReLU for short) in the deep learning literature. We will explore this surprising connection more in the sequel.
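In code, the one-sided shrinkage is simply a ReLU applied to the shifted input (a minimal sketch assuming NumPy):

```python
import numpy as np

def one_sided_shrinkage(z, lam):
    """R_lam(z) = max(z - lam, 0), i.e., a ReLU applied to z - lam."""
    return np.maximum(z - lam, 0.0)
```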