Lecture 6: Patch-based priors
We continue our quest for a better model of the prior probability $P_F$ to be used in MAP and Bayesian estimators. While it is almost hopeless to formulate such a prior on the entire image, it is a much more practical task to do it on a small region of an image (known in image processing jargon as a patch). Let us first fix some size $T$ and define a square domain $\square = \left[-\frac{T}{2}, \frac{T}{2}\right]^d$. A patch around some point $x$ in an image $f$ can be thought of as $\tau_x f|_\square$. We can therefore define a patch operator $\pi_x : \mathcal{F}(\mathbb{R}^d, \mathbb{R}) \to \mathcal{F}(\square, \mathbb{R})$ taking $f$ as the input and producing $\tau_x f|_\square$ as the output. In these terms, we redefine the problem of modeling the prior probability of a natural image $F$ as the problem of modeling the prior probability of $\pi_x F$ for every location $x$ in the image domain.

To that end, let us assume we have access to a (potentially very large) collection of clean patches $p_1, \dots, p_K$; such a collection can be obtained by taking a very large representative set of images, decomposing them into a collection of overlapping patches, and clustering the latter into $K$ samples. We will further assume that a patch in a natural image, $\pi_x F$, is uniformly distributed over that collection, i.e., $\pi_x F = p_k$ with probability $\frac{1}{K}$ for $k = 1, \dots, K$. Note that this simple assumption of a uniform distribution over the collection of example patches leads to a potentially very intricate prior distribution of $\pi_x F$ itself, since the collection captures the intricate structure of representative image patches!
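As a rough illustration, the following sketch assembles such a patch collection from a set of grayscale images stored as 2-D numpy arrays; the use of scikit-learn's MiniBatchKMeans, and all parameter values, are illustrative choices rather than part of the lecture's recipe.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans  # one possible clustering choice

def build_patch_collection(images, T=8, K=1024, stride=4):
    """Decompose 2-D grayscale images into overlapping T-by-T patches
    and cluster them into K exemplars (the p_k of the text)."""
    patches = []
    for img in images:
        H, W = img.shape
        for i in range(0, H - T + 1, stride):
            for j in range(0, W - T + 1, stride):
                patches.append(img[i:i + T, j:j + T].ravel())
    patches = np.stack(patches)
    # reduce the (typically huge) patch set to K representatives
    km = MiniBatchKMeans(n_clusters=K, batch_size=4096).fit(patches)
    return km.cluster_centers_  # shape (K, T*T)
```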
Patch-based denoising
Denoising using multiple images
Let us simplify our archetypal inverse problem by assuming that the latent signal $F$ is only degraded by additive white noise, $Y = F + N$, admitting a Gaussian distribution $N(x) \sim \mathcal{N}(0, \sigma_N^2)$. This problem is known as Gaussian denoising. The likelihood of $\pi_x Y$ given that $\pi_x F$ was formed using the patch $p_k$ is simply given by the density of $\pi_x N$,
$$P_{\pi_x Y|\pi_x F}(\pi_x Y = y \mid \pi_x F = p_k) = f_N(y - p_k) \propto e^{-\frac{\|y - p_k\|^2}{2\sigma_N^2}},$$

where the norm is the $L^2$ norm on $\square$. Since we asserted a uniform distribution of $k$ over $1, \dots, K$, the Bayes formula readily yields
$$P_{\pi_x F|\pi_x Y}(\pi_x F = p_k \mid \pi_x Y = y) = \frac{e^{-\frac{\|y - p_k\|^2}{2\sigma_N^2}}}{\sum_{i=1}^K e^{-\frac{\|y - p_i\|^2}{2\sigma_N^2}}}.$$

Using this posterior, we can define the MSE of the estimator $\hat{f}$ at point $x$ as
$$\epsilon(\hat{f}(x)) = \mathbb{E}\left(\hat{f}(x) - p_k(0)\right)^2 = \sum_{k=1}^K \left(\hat{f}(x) - p_k(0)\right)^2 P_{\pi_x F|\pi_x Y}(\pi_x F = p_k \mid \pi_x Y = \pi_x y);$$

the MMSE estimator is therefore defined as
$$\hat{f}(x) = \arg\min_{\hat{f}(x)} \epsilon(\hat{f}(x)) = \arg\min_{\hat{f}(x)} \sum_{k=1}^K \left(\hat{f}(x) - p_k(0)\right)^2 h_k(x),$$

where $h_k(x) = P_{\pi_x F|\pi_x Y}(\pi_x F = p_k \mid \pi_x Y = \pi_x y)$. Note that the sum of the latter weights over $k$ is always one. Also note that $p_k(0)$ gives the value at the center of the patch. The latter problem has a simple closed-form solution as the weighted average
$$\hat{f}(x) = \sum_{k=1}^K h_k(x)\, p_k(0) = \frac{\sum_{k=1}^K e^{-\frac{\|\pi_x y - p_k\|^2}{2\sigma_N^2}}\, p_k(0)}{\sum_{k=1}^K e^{-\frac{\|\pi_x y - p_k\|^2}{2\sigma_N^2}}}$$

of the central values of the exemplar patches $p_1, \dots, p_K$, with weights decaying exponentially in the distance of the said patches from the corresponding input patch centered at $x$. Since $K$ might be very large and the exponential decays very fast in $\|\pi_x y - p_k\|$ anyway, in practice the above sum is approximated by taking a fixed number of (approximate) nearest neighbors in the sense of the inter-patch distance $\|\pi_x y - p_k\|$.
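A minimal sketch of this estimator for a single pixel, assuming the exemplars are stored as rows of a flattened array (all names and defaults are illustrative); the log-domain stabilization is a standard numerical trick, not part of the derivation.

```python
import numpy as np

def mmse_denoise_pixel(noisy_patch, exemplars, sigma_n):
    """MMSE estimate of the central pixel from a flattened noisy patch
    pi_x y and a bank of clean exemplar patches p_1..p_K (rows of
    `exemplars`): f_hat(x) = sum_k h_k(x) p_k(0)."""
    d2 = np.sum((exemplars - noisy_patch) ** 2, axis=1)  # ||pi_x y - p_k||^2
    logw = -d2 / (2 * sigma_n ** 2)
    logw -= logw.max()                # stabilize the exponentials
    h = np.exp(logw)
    h /= h.sum()                      # posterior weights, summing to one
    center = exemplars.shape[1] // 2  # index of the central value p_k(0)
    return np.dot(h, exemplars[:, center])
```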
Non-local means
It is sometimes inconvenient to assume the availability of exemplar patches $p_k$ coming from an external collection of images; instead, it is often desirable to use patches from the image itself for the purpose of defining the prior on $\pi_x F$. The above MMSE estimator can be readily formulated in this setup as
$$\hat{f}(x) = \frac{\int_{\mathbb{R}^d} e^{-\frac{\|\pi_x y - \pi_{x'} y\|^2}{2\sigma_N^2}}\, y(x')\, dx'}{\int_{\mathbb{R}^d} e^{-\frac{\|\pi_x y - \pi_{x'} y\|^2}{2\sigma_N^2}}\, dx'}$$

or, more practically,
$$\hat{f}(x) = \frac{\sum_{k=1}^K e^{-\frac{\|\pi_x y - \pi_{x_k} y\|^2}{2\sigma_N^2}}\, y(x_k)}{\sum_{k=1}^K e^{-\frac{\|\pi_x y - \pi_{x_k} y\|^2}{2\sigma_N^2}}},$$

where $\pi_{x_1} y, \dots, \pi_{x_K} y$ are the $K$ nearest neighbors of $\pi_x y$ in the noisy image $y$ itself (variants of PatchMatch are very useful for computing the nearest neighbors). Such a filter is known as non-local means (NLM); it works like a regular local averaging that suppresses noise (by the law of large numbers), yet, in the case of NLM, the averaging is non-local. Oftentimes, an additional (very slowly decaying) spatial weight is added to down-weigh spatially distant pixels,
$$\hat{f}(x) = \frac{\sum_{k=1}^K e^{-\frac{\|\pi_x y - \pi_{x_k} y\|^2}{2\sigma_N^2} - \frac{\|x - x_k\|^2}{2\sigma_s^2}}\, y(x_k)}{\sum_{k=1}^K e^{-\frac{\|\pi_x y - \pi_{x_k} y\|^2}{2\sigma_N^2} - \frac{\|x - x_k\|^2}{2\sigma_s^2}}},$$

where $\sigma_s^2$ controls the width of the spatial weight.
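A minimal (and deliberately naive, quadratic-time) sketch of NLM with the spatial weight; a full search window around each pixel stands in for the $K$ nearest neighbors, and all parameter names and defaults are illustrative.

```python
import numpy as np

def nlm_denoise(y, T=7, search=10, sigma_n=0.1, sigma_s=10.0):
    """Naive non-local means on a 2-D float image y: each pixel is
    replaced by a weighted average of pixels whose surrounding T-by-T
    patches are similar, with an additional spatial Gaussian weight."""
    r = T // 2
    pad = np.pad(y, r, mode="reflect")
    H, W = y.shape
    out = np.zeros_like(y, dtype=float)
    for i in range(H):
        for j in range(W):
            p = pad[i:i + T, j:j + T]                       # pi_x y
            num = den = 0.0
            for ii in range(max(i - search, 0), min(i + search + 1, H)):
                for jj in range(max(j - search, 0), min(j + search + 1, W)):
                    q = pad[ii:ii + T, jj:jj + T]           # pi_{x_k} y
                    w = np.exp(-np.sum((p - q) ** 2) / (2 * sigma_n ** 2)
                               - ((i - ii) ** 2 + (j - jj) ** 2) / (2 * sigma_s ** 2))
                    num += w * y[ii, jj]
                    den += w
            out[i, j] = num / den
    return out
```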
Bilateral filter
A particularly renowned setting of NLM (that was actually devised prior to NLM) is the case of point-wise patches, i.e., $\pi_x : f \mapsto f(x)$. In this case, the NLM estimator simplifies to
$$\hat{f}(x) = \frac{\int_{\mathbb{R}^d} e^{-\frac{(y(x) - y(x'))^2}{2\sigma_N^2} - \frac{\|x - x'\|^2}{2\sigma_s^2}}\, y(x')\, dx'}{\int_{\mathbb{R}^d} e^{-\frac{(y(x) - y(x'))^2}{2\sigma_N^2} - \frac{\|x - x'\|^2}{2\sigma_s^2}}\, dx'},$$

which is known as the bilateral filter. The bilateral filter can be thought of as a non-shift-invariant linear filter (with an impulse response that itself depends on $y$),
$$\hat{f}(x) = \int_{\mathbb{R}^d} h(x, x')\, y(x')\, dx'$$

with the spatially-varying unit-DC impulse response
$$h(x, x') = \frac{e^{-\frac{(y(x) - y(x'))^2}{2\sigma_N^2} - \frac{\|x - x'\|^2}{2\sigma_s^2}}}{\int_{\mathbb{R}^d} e^{-\frac{(y(x) - y(x''))^2}{2\sigma_N^2} - \frac{\|x - x''\|^2}{2\sigma_s^2}}\, dx''}.$$

Another useful model is to consider the so-called lifted space $\mathbb{R}^d \times \mathbb{R}$, in which the graph of the image $f$ resides. The graph can be modeled as a delta distribution on $\mathbb{R}^d \times \mathbb{R}$ supported on $(x, f(x))$ (note that every function $f$ has such a representation, but not every function or distribution on $\mathbb{R}^d \times \mathbb{R}$ represents a valid function). Then, convolving the latter distribution with an LSI Gaussian kernel,
$$h((x, f)) \propto \exp\left(-\left\|\left(\frac{x}{\sqrt{2}\,\sigma_s}, \frac{f}{\sqrt{2}\,\sigma_N}\right)\right\|^2\right),$$

and marginalizing over $f$ yields precisely the bilateral filter.
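For concreteness, here is a minimal sketch of the bilateral filter in the same spirit; the truncation radius and default parameters are illustrative choices, not part of the derivation.

```python
import numpy as np

def bilateral_filter(y, sigma_n=0.1, sigma_s=3.0, radius=9):
    """Bilateral filter: NLM with point-wise patches pi_x f = f(x).
    The range Gaussian (sigma_n) acts on intensities, the spatial
    Gaussian (sigma_s) on pixel coordinates; `radius` truncates it."""
    H, W = y.shape
    di, dj = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(di ** 2 + dj ** 2) / (2 * sigma_s ** 2))  # fixed part
    pad = np.pad(y, radius, mode="reflect")
    out = np.zeros_like(y, dtype=float)
    for i in range(H):
        for j in range(W):
            window = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            w = spatial * np.exp(-(window - y[i, j]) ** 2 / (2 * sigma_n ** 2))
            out[i, j] = np.sum(w * window) / np.sum(w)
    return out
```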
NLM as MMSE
Note that in its original formulation, the NLM estimator is not an MMSE estimator. This stems from the fact that when the patches of the image itself are used as exemplars to define the prior on $\pi_x F$, the patches are contaminated with noise! Fixing a set of $K$ locations $x_k$ from which the exemplar patches are taken, we now do not have access to $p_k$ but rather to their noise-contaminated versions, $q_k = p_k + \pi_{x_k} N = p_k + N_k$.
Let us repeat the derivation of the posterior probability distribution. The likelihood of $\pi_x Y$ given that $\pi_x F$ was formed using the patch $p_k$ is again given by the density of $\pi_x N$,
$$P_{\pi_x Y|\pi_x F}(\pi_x Y = y \mid \pi_x F = p_k) \propto e^{-\frac{\|y - p_k\|^2}{2\sigma_N^2}};$$

however, we no longer have access to $p_k$. Instead, we can write
$$\|y - p_k\|^2 = \|y - q_k + N_k\|^2 \approx \mathbb{E}\|y - q_k + N_k\|^2 = \|y - q_k\|^2 + \mathbb{E}\|N_k\|^2 + 2\,\mathbb{E}\langle y - q_k, N_k\rangle.$$

By definition of random Gaussian noise, since the norm is computed on $\square$ of volume $T^d$, $\mathbb{E}\|N_k\|^2 = T^d \sigma_N^2$. Letting $y = p_k + \pi_x N$, the third term can be written as
$$\mathbb{E}\langle y - q_k, N_k\rangle = \mathbb{E}\langle p_k + \pi_x N - p_k - N_k,\, N_k\rangle = \mathbb{E}\langle \pi_x N, N_k\rangle - \mathbb{E}\|N_k\|^2.$$

Even though the denoised and the reference patches may overlap, we will assume that $\pi_x N$ and $N_k$ are statistically independent and, since they are zero-mean, orthogonal in expectation. This yields $\mathbb{E}\langle y - q_k, N_k\rangle = -T^d \sigma_N^2$; combining the previous results, we have
$$P_{\pi_x Y|\pi_x F}(\pi_x Y = y \mid \pi_x F = p_k) \propto e^{-\frac{\|y - q_k\|^2 - T^d \sigma_N^2}{2\sigma_N^2}}.$$

Similarly to the noiseless exemplars case, the Bayes formula yields
$$P_{\pi_x F|\pi_x Y}(\pi_x F = p_k \mid \pi_x Y = y) = \frac{e^{-\frac{\|y - q_k\|^2 - T^d \sigma_N^2}{2\sigma_N^2}}}{\sum_{i=1}^K e^{-\frac{\|y - q_i\|^2 - T^d \sigma_N^2}{2\sigma_N^2}}} = \frac{e^{-\frac{\|y - q_k\|^2}{2\sigma_N^2}}}{\sum_{i=1}^K e^{-\frac{\|y - q_i\|^2}{2\sigma_N^2}}} = h_k(x),$$

since the constant $T^d \sigma_N^2$ cancels between the numerator and the denominator. Using this posterior, we again define the MSE of the estimator $\hat{f}$ at point $x$ as
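The cancellation of the constant term is easy to confirm numerically; in this throwaway check (all values arbitrary), the posterior is the same with and without the $T^d\sigma_N^2$ shift.

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, sigma2, T_d = 16, 6, 0.25, 16          # D entries per patch, T_d = T^d
q = rng.standard_normal((K, D))              # noisy exemplars q_k
y = rng.standard_normal(D)                   # noisy input patch

d2 = np.sum((q - y) ** 2, axis=1)
logw_plain = -d2 / (2 * sigma2)
logw_shift = -(d2 - T_d * sigma2) / (2 * sigma2)  # with the constant term

def softmax(lw):
    w = np.exp(lw - lw.max())
    return w / w.sum()

# the constant T^d sigma_N^2 cancels in the posterior
assert np.allclose(softmax(logw_plain), softmax(logw_shift))
```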
$$\epsilon(\hat{f}(x)) = \mathbb{E}\left(\hat{f}(x) - p_k(0)\right)^2 = \sum_{k=1}^K \left(\hat{f}(x) - p_k(0)\right)^2 h_k(x).$$

In the previous case, this yielded an estimator in the form of the weighted sum
$$\pi_x \hat{f} = \sum_{k=1}^K c_k p_k,$$

with $c_k = h_k(x)$ summing to $1$. However, note that the $p_k$ are now inaccessible, so we have to write
$$\pi_x \hat{f} = \sum_{k=1}^K c_k q_k = \sum_{k=1}^K c_k p_k + \sum_{k=1}^K c_k N_k,$$

with some other set of weights $c$ also summing to $1$. This yields
$$\begin{aligned}
\epsilon(c) &= \sum_{k=1}^K \left\|\pi_x\hat{f} - p_k\right\|^2 h_k(x) = \sum_{k=1}^K \left\|\sum_{i=1}^K c_i p_i + \sum_{i=1}^K c_i N_i - p_k\right\|^2 h_k(x) \\
&\approx \sum_{k=1}^K \mathbb{E}\left\|\sum_{i=1}^K c_i p_i + \sum_{i=1}^K c_i N_i - p_k\right\|^2 h_k(x) \approx \sum_{k=1}^K \left(\left\|\sum_{i=1}^K c_i p_i - p_k\right\|^2 + \sum_{i=1}^K c_i^2\, \mathbb{E}\|N_i\|^2\right) h_k(x) \\
&= \sum_{k=1}^K \left\|\sum_{i=1}^K c_i p_i - p_k\right\|^2 h_k(x) + T^d\sigma_N^2\, c^\top c = c^\top\left(P + T^d\sigma_N^2 I\right)c - 2\,c^\top P h + w^\top h,
\end{aligned}$$

where $P$ is a $K \times K$ matrix with the elements $(P)_{ij} = \langle p_i, p_j\rangle$, $h$ is the $K$-dimensional vector with the elements $h_k(x)$, and $w$ is the $K$-dimensional vector with the elements $(w)_i = \|p_i\|^2$. Our MMSE estimator is therefore given as the solution to the constrained minimization problem
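Since the matrix form of $\epsilon(c)$ is easy to get wrong, here is a small numerical sanity check of the identity above on random data (dimensions and values are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, noise = 5, 16, 0.3                 # K exemplars of dimension D = T^d
p = rng.standard_normal((K, D))          # clean exemplar patches p_k (rows)
h = rng.random(K); h /= h.sum()          # posterior weights h_k(x)
c = rng.random(K); c /= c.sum()          # candidate weights, summing to one

# direct form: sum_k (||sum_i c_i p_i - p_k||^2 + D * noise * c'c) h_k
mix = c @ p
direct = sum(hk * (np.sum((mix - pk) ** 2) + D * noise * (c @ c))
             for pk, hk in zip(p, h))

# matrix form: c'(P + D*noise*I)c - 2 c'P h + w'h
P = p @ p.T
w = np.diag(P)
matrix = c @ (P + D * noise * np.eye(K)) @ c - 2 * (c @ P @ h) + w @ h

assert np.allclose(direct, matrix)
```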
$$c^* = \arg\min_c \epsilon(c) \quad \text{s.t.} \quad c^\top \mathbf{1} = 1.$$

Defining the Lagrangian $\epsilon(c) + \lambda\, \mathbf{1}^\top c$, differentiating w.r.t. $c$, and setting the gradient to zero yields
$$c^* = \left(P + T^d\sigma_N^2 I\right)^{-1}\left(P h + \lambda \mathbf{1}\right)$$

with the constant
$$\lambda = \frac{1 - \mathbf{1}^\top \left(P + T^d\sigma_N^2 I\right)^{-1} P h}{\mathbf{1}^\top \left(P + T^d\sigma_N^2 I\right)^{-1} \mathbf{1}}$$

satisfying the unit-sum constraint.
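A direct transcription of this closed form (a sketch; for brevity it uses an explicit matrix inverse, which the next paragraph notes is impractical for large $K$):

```python
import numpy as np

def mmse_weights(P, h, noise_term):
    """Constrained MMSE weights c* = (P + T^d sigma_N^2 I)^{-1}(P h + lambda 1),
    with lambda enforcing the unit-sum constraint; `noise_term` stands for
    T^d * sigma_N^2."""
    K = len(h)
    A_inv = np.linalg.inv(P + noise_term * np.eye(K))
    ones = np.ones(K)
    lam = (1.0 - ones @ A_inv @ P @ h) / (ones @ A_inv @ ones)
    c = A_inv @ (P @ h + lam * ones)
    return c  # satisfies c.sum() == 1 up to rounding
```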
Note that the inversion of the potentially very large matrix $P + T^d\sigma_N^2 I$ makes this solution impractical. However, assuming that for every $i$, $\langle p_i, p_i\rangle = T^d s_F^2$ and for every $j \neq i$, $\langle p_i, p_j\rangle = \rho\, T^d s_F^2$, where $s_F^2 = \mathbb{E} F^2(x)$, the matrix $P$ assumes the simple form $P = \alpha I + \beta \mathbf{1}\mathbf{1}^\top$, where $\alpha = T^d(1-\rho)s_F^2$ and $\beta = T^d \rho\, s_F^2$. The matrix to invert therefore becomes
$$P + T^d\sigma_N^2 I = \gamma I + \beta \mathbf{1}\mathbf{1}^\top,$$

where $\gamma = \alpha + T^d\sigma_N^2$. Using the Sherman-Morrison matrix identity, the inverse can be expressed as
$$\left(\gamma I + \beta \mathbf{1}\mathbf{1}^\top\right)^{-1} = \frac{1}{\gamma}\left(I - \frac{\beta}{\gamma + \beta K}\, \mathbf{1}\mathbf{1}^\top\right).$$

Substituting the latter result into the formula for $c^*$, after tedious and boring algebra, we arrive at
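A quick numerical check of the Sherman-Morrison inverse above (test values arbitrary):

```python
import numpy as np

K, gamma, beta = 7, 2.0, 0.5                      # arbitrary test values
ones = np.ones((K, 1))
A = gamma * np.eye(K) + beta * (ones @ ones.T)
# closed-form inverse from the Sherman-Morrison identity
A_inv = (np.eye(K) - beta / (gamma + beta * K) * (ones @ ones.T)) / gamma
assert np.allclose(A_inv, np.linalg.inv(A))
```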
$$c^* = \frac{(1-\rho)s_F^2}{(1-\rho)s_F^2 + \sigma_N^2}\, h + \frac{\sigma_N^2}{(1-\rho)s_F^2 + \sigma_N^2}\, \frac{1}{K}\mathbf{1}.$$

Note that the weight vector is given by a weighted average (a convex combination) of the posterior probabilities $h$ and the uniform vector $\frac{1}{K}\mathbf{1}$. Interpreting the signal-to-noise ratio as
$$\mathrm{SNR} = \frac{(1-\rho)s_F^2}{\sigma_N^2},$$

we can rewrite
$$c_k = \frac{\mathrm{SNR}\cdot h_k + \frac{1}{K}}{\mathrm{SNR} + 1}.$$

For high SNR ($\sigma_N^2$ approaching $0$), we obtain the classical NLM with $c_k = h_k$; for low SNR, the weights tend to the uniform average $c_k = \frac{1}{K}$.
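The corrected weights are a one-liner; this sketch (names illustrative) makes explicit that they interpolate between the NLM posterior and uniform averaging.

```python
import numpy as np

def corrected_nlm_weights(h, snr):
    """Noise-corrected NLM weights c_k = (SNR * h_k + 1/K) / (SNR + 1):
    a convex combination of the NLM posterior h and uniform averaging.
    If h sums to one, so do the returned weights."""
    h = np.asarray(h)
    return (snr * h + 1.0 / h.size) / (snr + 1.0)
```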