The report has been published in http://arxiv.org/abs/0906.1487
The Physics of Compressive Sensing and the GradientBased Recovery Algorithms
Qi Dai and Wei Sha
Departmen
t of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong, China. Email: daiqi@hku.hk (Qi Dai); wsha@eee.hku.hk (Wei Sha) Research Report
Compiled June 4, 2009 The physics of compressive sensing (CS) and the gradientbased recovery algorithms are presented. First, the dierent forms for CS are summarized. Second, the physical meanings of coherence and measurement are given. Third, the gradientbased recovery algorithms and their geometry explanations are provided. Finally, we conclude the report and give some suggestion for future work. c Keywords: Compressive Sensing; Coherence; Measurement; GradientBased Recovery Algorithms. 2009 Optical Society of America
1.
Introduction
The wellknown Nyquist/Shannon sampling theorem that the sampling rate must be at least twice the maximum frequency of the signal is a golden rule used in visual and audio electronics, medical imaging devices, radio receivers and so on. However, can we simply recover a signal from a small number of linear measurements? Yes, we can, answered rmly by Emmanuel J. Cand`s, Justin e Romberg, and Terence Tao [1] [2] [3]. They brought us the tool called Compressive Sensing (CS) [4] [5] [6] several years ago which avoids large digital data set and enables us to build the data compression directly from the acquisition. The mathematical theory underlying CS is deep and beautiful and draws from diverse elds, but we don’t focus too much on the mathematical proofs. Here, we will give some physical explanations and discuss relevant recovery algorithms. 2. Exact Recovery of Sparse Signals
(d) If f is sparse in the transformdomain and the measurements are acquired in the transformdomain also, then the optimization problem can be given by min f
1
s.t. M0 f = y .
(4)
Given a timedomain signal f ∈ RN ×1 , there are four dierent forms for CS. (a) If f is sparse in the timedomain and the measurements are acquired in the timedomain also, then the optimization problem can be given by min f
1
s.t. M0 f = y
(1)
where M0 ∈ RM ×N is the observation matrix and y ∈ RM ×1 are the measurements. (b) If f is sparse in the timedomain and the measurements are acquired in the transformdomain (Fourier transform, discrete cosine transform, wavelet transform, Xlet transform, etc), then the optimization problem can be given by min Ψ f 1 s.t. M0 f = y (2) where Ψ is the inverse transform matrix and satises Ψ Ψ = ΨΨ = I. (c) If f is sparse in the transformdomain and the measurements are acquired in the timedomain, then the optimization problem can be given by min Ψf
1
From the above equations, the meanings of the sparsity can be generalized. If the number of the nonzero elements is very small compared with the length of the timedomain signal, the signal is sparse in the timedomain. If the most important K components in the transformdomain can represent signal accurately, we can say the signal is sparse in the transformdomain. Because we can set other unimportant components to be zero and implement the inverse transform, the timedomain signal can be reconstructed with very small numerical error. The sparsity property also makes the lossy data compression possible. For the image processing, the derivatives of the image (especially for the geometric image) along the horizontal and vertical directions are sparse. For the physical society, we can say the wave function is sparse with the specic basis representations. Before you go into the CS world, you must know what is sparse in what domain. The second question is what is the size limit for the measurements y in order to perfectly recover the K sparse signal. Usually, M K log2 (N ) or M ≈ 4K for the general signal or image. Further, if the signal f is sparse in the transformdomain Ψ and the measurements are acquired in the timedomain, then M χ (Ψ, M0 ) K log2 (N ), where χ (Ψ, M0 ) is the coherence index between the basis system Ψ and the measurement system M0 [3]. The incoherence leads to the small χ and therefore fewer measurements are required. The coherence index χ can be easily found, if we rewrite (3) as min f
1
s.t. M0 Ψ f = y.
(5)
Similarly, (2) can be rewritten as min f
1
s.t. M0 Ψf = y
(6)
s.t. M0 f = y.
(3) 1
Third, what are the inherent properties for the observation matrix? The observation matrix obeys what is
The report has been published in http://arxiv.org/abs/0906.1487
known as a uniform uncertainty principle (UUP).
1 1 0.5 0 0.5 0 (a) 1 1 1 1 0.5 0 0.5 0 (c) 1 1 1 0 (d) 1 0 (b) 1
M0 f 2 M M C1 ≤ ≤ C2 2 N N f 2
2
(7)
0.5 0 0.5 1 1 1 0.5 0 0.5 1 1
where C1 1 C2 . An alternative condition, which is called restricted isometry property (RIP), can be given by 1 δk ≤ M0 f 2 f 2
2 2
≤ 1 + δk
(8)
where δk is a constant and is not too close to 1. The properties show the three facts: (a) The measurements y can maintain the energy of the original timedomain signal f . In other words, the measurement process is stable. (b) If f is sparse, then M0 must be dense. This is the reason why the theorem is called UUP. (c) If we want to perfectly recover f from the measurements y, at least 2K measurements are required. According to the UUP and RIP theorems, it is convenient to set the observation matrix M0 to a random matrix (normal distribution, uniform distribution, or Bernoulli distribution). Four, why l1 norm is used in (1)(2)(3)(4)? For the real application, the size of measurement M N . As a result, one will face the problem how to solve an underdetermined matrix equation. In other words, there are a huge amount of dierent candidate signals that could all result in the given measurements. Thus, one must introduce some additional constraints to select the best candidate. The classical solution to such problems would be minimizing the l2 norm (the pseudoinverse solution), which minimizes the amount of energy in the system. However, this leads to poor results for most practical applications, as the recovered signal seldom has zero components. A more attractive solution would be minimizing the l0 norm, or equivalently maximize the number of zero components in the basis system. However, this is NPhard (it contains the subsetsum problem), and so is computationally infeasible for all but the tiniest data sets. Thus, the l1 norm, or the sum of the absolute values, is usually what is to be minimized. Finding the candidate with the smallest l1 norm can be expressed relatively easily as a linear convex optimization program, for which ecient solution methods already exist. This leads to comparable results as using the l0 norm, often yielding results with many components being zero. For simplicity, we take the 2D case for example. The boundaries of l0 , l1 , and l2 norms are cross, diamond, and circle, respectively. (See Fig. 1). The underdetermined matrix equation can be seen as a straight line. If the intersection between the straight line and the boundary is located at the xaxis or yaxis, the recovered result will be sparse. Obviously, the intersection will always be located at the axes if p ≤ 1 for the lp norm. Five, can CS have a good performance in a noisy environment? Yes, it can. Because the recovery algorithm can get the most important K components and force other components (including noise components) to be 2
Fig. 1. The geometries of lp norm: (a) p = 2; (b) p = 1; (c) p = 0.5; (d) p = 0. zero. For the image processing, the recovery algorithm will not smooth the image but yield the sharp edge. Finally, let us review how CS encode and decode the timedomain signal f . For the encoder, it gets the measurements y or y according to the observation ma trix M0 . For the decoder, it recovers f or f by solving the convex optimization problem (1)(2)(3)(4) with y (), y M0 , and Ψ1 . Hence, CS can be seen as a fast encoder with lower sampling rate (fewer data set). The sampling rate only depends on the sparsity of f in some domains and goes beyond the limit of the Nyquist/Shannon sampling theorem. However, CS will face a challenging problem: How to perfectly recover the signal with low computational complexity and memory? 3. Physics of Compressive Sensing
The most important concepts of the CS theory involve Coherence and Measurement. In physics, coherence is a property of waves, that enables stationary (i.e. temporally and spatially constant) interference. More generally, coherence describes all correlation properties between physical quantities of a wave. When interfering, two waves can add together to create a larger wave (constructive interference) or subtract from each other to create a smaller wave (destructive interference), depending on their relative phase. The coherence of two waves follows from how well correlated the waves are as quantied by the crosscorrelation function. The crosscorrelation quanties the ability to predict the value of the second wave by knowing the value of the rst. As an example, consider two waves perfectly correlated for all times. At any time, if the rst wave changes, the second will change in the same way. If combined they can exhibit complete constructive interference at all
1 For
(1)(4), Ψ is not necessary.
The report has been published in http://arxiv.org/abs/0906.1487
times. It follows that they are perfectly coherent. So, the second wave needs not be a separate entity. It could be the rst wave at a dierent time or position. In this case, sometimes called selfcoherence, the measure of correlation is the autocorrelation function. Take the Thomas Young’s doubleslit experiment for example, a coherent light source illuminates a thin plate with two parallel slits cut in it, and the light passing through the slits strikes a screen behind them. The wave nature of light causes the light waves passing through both slits to interfere, creating an interference pattern of bright and dark bands on the screen. In fact, the dark bands can relate to the zero components in the signal processing eld. It is well known that we select the basis functions coherent with the signal or image. If the signal is the square wave, the haar wavelet is a good choice. If the signal is the sine wave, the Fourier transform is a good choice. The coherence index between a signal and a basis system will decide the sparsity of the signal in the transformdomain. In other words, fewer basis functions will be used or more components in the transformdomain are to be zero if the signal and the basis functions are coherent. For the CS, however, the observation matrix M0 and the timedomain signal f should be incoherent. In addition, the observation matrix M0 and the basis system Ψ also should be incoherent. If it is not the case, the reconstruction matrix M0 Ψ in (5) will be sparse, which violates the UUP theorem. Then, let us talk about quantum mechanics and the measurement. In physics, a wave function or wavefunction is a mathematical tool used in quantum mechanics to describe any physical system. The values of the wave function are probability amplitudes (complex numbers). The squares of the absolute values of the wave functions f 2 give the probability distribution (the chance of nding the subject at a certain time and position) that the system will be in any of the possible quantum states. The modern usage of the term wave function refers to a complex vector or function, i.e. an element in a complex Hilbert space. An element of a vector space can be expressed in dierent bases; and so the same applies to wave functions. The components of a wave function describing the same physical state take dierent complex values depending on the basis being used; however the wave function itself is not dependent on the basis chosen. Similarly, in the signal processing eld, we use dierent basis functions to represent the signal or image. The quantum state of a system is a mathematical object that fully describes the quantum system. Once the quantum state has been prepared, some aspect of it is measured (for example, its position or energy). It is a postulate of quantum mechanics that all measurements have an associated operator2 (called an observable operator). The expected result of the measurement is in general described not by a single number, but by a probability distribution that species the likelihoods that the
2 For
various possible results will be obtained. The measurement process is often said to be random and indeterministic. Suppose we take a measurement corresponding to observable operator O, on a state whose quantum state is f . The mean value (expectation value) of the measure ment is f, Of and the variance of the measurement is 2 f ( f, Of )2 . Each descriptor (mean, variance, f, O etc) of the measurement involves a part of information of the quantum state. In the signal processing eld, the observation matrix M0 is a random matrix and each measurement yi captures a portion of information of the signal f . Due to the UUP and RIP theorems, all the measurements make the same contribution to recovering f . In other words, each measurement yi is equally important or unimportant3 . The unique property will make CS very powerful in the communication eld (channel coding). 4. GradientBased Recovery Algorithms
A fast, lowconsumed, and reliable recovery algorithm is the core of the CS theory. There are a lot of outstanding work on the topic [7] [8] [9] [10]. Based on their work, we developed the gradientbased recovery algorithms. In particular, we did not reshape the image (matrix) to the signal (vector), which will consume a large amount of memory. We treat each column of the image as a vector and the comparable results also can be obtained. For the sparse image in the timedomain, the l1 norm constraint is used. For the general image (especially for the geometric image), the total variation constraint is used. Considering the nondierentiability of the function fj,k  at the origin point, the subgradient or smooth approximation strategies [10] are employed. A. Gradient Algorithms and Their Geometries
Before solving the constrained convex optimization problems, the clear and deep understandings for the gradientbased algorithms are necessary. Given a linear matrix equation M0 f = y (9)
the solution f can be found by solving the following minimization problem min L(f ) ≡
f
1 M0 f y2 2 2
(10)
The gradientbased algorithms for solving (10) can be written as f i+1 = f i i L(f i ) where i is the iteration step size and L(f i ) = M (M0 f i y) 0 (12) (11)
the discrete system, an operator can be seen as a matrix.
3 For the traditional compression method, the perfect reconstruction is impossible if some important components are lost.
3
The report has been published in http://arxiv.org/abs/0906.1487
A variety of settings for i result in dierent algorithms involving the gradient method, the steepest descent method, and the Newton’s method. The gradient method sets i to a small constant. The steepest descent method sets i to i = L(f i ), L(f i ) L(f i ), M M0 L(f i ) + ε 0 (13)
1 ≤ k ≤ N ). The convex optimization problem (1) for the sparse image can be converted to min H(f ) ≡ L(f ) + λ f
f 1
(15)
which minimizes the residual Ri = M0 f i y in each iteration. Here, the small ε is used for avoiding a zero denominator. For the Newton’s method, i is taken as a constant matrix i = M M0 + εI 0
1
(14)
Here, the small ε is used for avoiding a nearly singular matrix. To understand the geometries for the three gradientbased algorithms, a simple case is taken for example. A 2D function f (x, y) = (x + y)2 + (x + 1)2 + (y + 3)2 has a local minimum f = 1 , 5 . The contour of the 3 3 function and the trajectory of f i are drawn in Fig. 2.
2 1 0 1 2 3 4 5 6 7 8 4 2 0 2 x 4 6 8 10 y Newton Steepest Descent Method Gradient
where f 1 = j,k fj,k . The above equation is a Lagrange multiplier formulation. The rst term relates to the underdetermined matrix equation (9) and the second l1 penalty term will assure a regularized sparse solution. The parameter λ balances the weight of the rst term and the second term. Because fj,k  is not dierentiable at the origin point, we can dene a new subgradient for each fj,k as follows j,k L(f ) + λ sign(fj,k ), fj,k  ≥ ε j,k L(f ) + λ, fj,k  < ε, j,k L(f ) < λ j,k H(f ) = j,k L(f ) λ, fj,k  < ε, j,k L(f ) > λ 0, fj,k  < ε,  j,k L(f ) ≤ λ (16) Then the gradientbased algorithm can be written as
i+1 i fj,k = fj,k i k j,k H(f i
)
(17)
where j and k are, respectively, the row index and the column index of the image f and i denotes the ith iteration step. The i has been given in (13) and (14). Bear k in mind, the image will be treated column by column when computing i and j,k L(f i ). k For the steepest descent method, the parameter λ can be taken as a small positive constant (λ = 0.001 0.01). But for the Newton’s method, the parameter λ must be gradually decreased as the iteration step increases (λi+1 = (0.99 0.999) × λi ). C. Total Variation Strategy
Fig. 2. The geometries for the gradientbased algorithms. The convergence of the gradient method is worst. The steepest descent method converges fast at the rst several steps but slowly as the iteration step increases. The Newton’s method is best and needs only one step for the 2D case4 . Next, we will apply the steepest descent method and the Newton’s method to recover the signal or image f by using (1). The treatments for other convex optimization problems (2)(3)(4) are similar. B. l1 Norm Strategy
For a general image, especially for a geometric image, it is not sparse in the timedomain. Hence, the l1 norm strategy developed in the previous subsection will break down. The convex optimization problem for the general image can be given by min H(f ) ≡ L(f ) + λ TV(f)
f
(18)
where TV(f) is the total variation of the image f . The derivatives of f along the vertical and horizontal directions can be dened as
v Dj,k f =
fj,k fj+1,k 0 fj,k fj,k+1 0
1≤j<N j=N 1≤k<N k=N
(19)
h Dj,k f =
(20)
We assume fj,k is the pixel of an N × N image f at the jth row and the kth column (1 ≤ j ≤ N and
the CS, the performance of the Newton’s method will decrease due to the nearly singular observation matrix and the l1 norm constraint.
4 For
The total variation of the image f is the summation for the magnitude of the gradient of each pixel [11] TV(f) =
j,k
Dv f j,k
2
+ Dh f j,k
2
=
j,k

j,k f
(21)
4
The report has been published in http://arxiv.org/abs/0906.1487
After some simple derivations, the gradient of the total variation with each pixel is given by
j,k
(a)
(b)
(TV(f)) =
v h Dj,k f Dj,k f +  j,k f   j,k f 
v h Dj1,k f Dj,k1 f  j1,k f   j,k1 f 
(22)
(c)
(d)
When treating (22), the smooth approximation strategy is used for avoiding a zero denominator, i.e. 
j,k f 
=
v Dj,k f
2
h + Dj,k f
2
+ε
(23)
The gradientbased algorithm has been given in (17). 5. A. Numerical Experiments and Results Cases of l1 Norm Strategy Fig. 4. The recovered diamond gures by the Newton’s method from the observation matrices with dierent sizes: (a) 10 × 64; (b) 12 × 64; (c) 15 × 64; (d) 20 × 64.
The rst image we’d like to recover is a 64 × 64 sparse diamond as shown in Fig. 3.
Fig. 5. The 64 × 64 sparse circle. one is more powerful and eective. (Actually, their performances are almost the same for the image.) What we want to show is the sparsity of an image aects the recovered results a lot. Fig. 6 shows the recovered results from the observation matrices with dierent sizes slightly larger than the previous case. For the subplots (a) and (b), the results are undesirable. The subplot (c) is better and almost perfect reconstruction is obtained for the subplot (d). Again, we notice that for the small observation matrix, the recovered results may vary drastically. If the size of the observation matrix is large enough, the reconstructed image is accurate or even exact in any repeated experiments. The subplots (a), (b), and (c) in Fig. 6 show that those columns which are less sparse are the hardest to recover when the small observation matrix is utilized. For the case, the total variation strategy may be a better choice. In a word, a large observation matrix can capture more information of the image and therefore the image can be recovered with higher probability. B. Cases of Total Variation Strategy
Fig. 3. The 64 × 64 sparse diamond. Notice that the image itself is sparse in the timedomain, we need not to transform the image into other domains, such as the wavelet domain or Fourier domain. The size of the observation matrix for the nearly perfect reconstruction should be at least larger than 12 × 64 where 12 is calculated by 2 log2 64 = 12. If our observation matrices are generated by the uniform distribution from 0 to 1, after 20000 iteration steps, Fig. 4 can be obtained. The subplot (a) shows the recovered image with a 10 × 64 observation matrix, while (b), (c), and (d) are recovered with 12 × 64, 15 × 64, and 20 × 64 random observation matrices respectively. When the size of the observation matrix is small, the poor reconstructed images are shown in (a) and (b). But when the observation matrix gets larger and larger, the better results can be obtained as shown in (c) and (d). Another 64 × 64 image we wish to recover is a circle as shown in Fig. 5. Although the circle is sparse on the whole, but it is not the case for some columns. These columns may require more measurements if we treat an n × n image as n vectors. Using the steepest descent method, we can recover the image after several iterations. Here, we don’t want to compare the Newton’s method with the steepest descent method to nd which 5
If the image is not sparse in the timedomain, it also can be recovered from the measurements without represented as the basis functions Ψ whose coecients may be sparse. For instance, the geometric gure composed of a solid circle and a solid square is shown in Fig. 7.
The report has been published in http://arxiv.org/abs/0906.1487
(a) (b)
(c)
(d)
Fig. 9. The recovered geometric image by the steepest descent method: 20 × 64 observation matrix is employed. Fig. 6. The recovered circle gures by the steepest descent method from the observation matrices with dierent sizes: (a) 15 × 64; (b) 20 × 64; (c) 25 × 64; (d) 30 × 64. It is easy to imagine the derivatives of the image are sparse. We apply the total variation strategy to recover the image. covery algorithm is implemented in the waveletdomain, and the PSNR for the subplots (a) and (b) in Fig. 11 are 29.4 and 30.9, respectively. In addition, the gradientbased total variation algorithm also has a good performance for the geometric image Peppers. (Showing too many results are not necessary).
(a) (b)
Fig. 10. The general image: (a) Cameraman; (b) Boats. Fig. 7. The geometric gure. The size of the object image again is 64 × 64 and the 20 × 64 observation matrix is employed. Fig. 8 is the typical measurements y and Fig. 9 is the reconstructed image from y. The peak signal to noise ratio (PSNR) calculated is always above 90 for the repeated experiments (dierent observation matrices with the same size), which suggests that the observation matrix is large enough to recover the original image with an extremely accurate result.
(a) (b)
Fig. 11. The recovered images by the Newton’s method: 100 × 256 observation matrix is employed. (a) Cameraman; (b) Boats.
Fig. 8. The measurement of the geometric gure. The next two images Cameraman and Boats in Fig. 10 are quite wellknown in the image processing eld. We still treat the 256 × 256 image as 256 vectors. Although the result may not be so good as that we obtain by treating the 256×256 image as a long vector of size 65536×1, we do save a great amount of memory and calculation time. The size of our observation matrix M0 is 100×256 rather than 25600×65536 (25600 ≈ 65536/2.56). The re6
6.
Discussions and Future Work
The above are just some simple experiments for demonstrating that the CS was able to recover an image accurately from a few of random projections. One should understand that the main advantage of CS is not how small size it can compress the image to. In fact, if a signal is K sparse in some domains, we indeed require 3K to 5K measurements to recover the signal. An obvious advantage of CS is that it can encode the signal or image fast. In particular, the prior knowledge about
The report has been published in http://arxiv.org/abs/0906.1487
the signal is not important. For example, it is not necessary for us to know the exact positions and values of the most important components beforehand. What we care is whether the image is sparse in some domains or not. A xed observation matrix can be applied to measure dierent signals, which makes the applications of CS for encoding and decoding possible. Meanwhile, the measurements play the same role in recovering the signal or image, which makes CS very powerful in military applications (radar imaging) where we cannot aord the risk caused by the loss of the most important K components. Since each random projection (measurement) is equally (un)important, the CS is not sensitive to the noises or measurement errors and can provide the robust and stable performances. Although many researchers have made great progresses in the convex optimization problems and demonstrated the accurate results on the scale we interest (hundreds of thousands of measurements and millions of pixels), the more ecient algorithms are still required. Actually, solving the l1 minimization problem is about 3050 times as expensive as solving the least squares problem. However, the unbalanced computational burden gives us a chance that the measurements are acquired by the sensors with lower power, and then the signal or image will be recovered on the central supercomputer. The algorithms, such as the conjugate gradient method and the generalized minimal residual method, will become our next candidates for accelerating the recovery algorithm. The physical understandings and applications for CS are under way, although a singlepixel camera has shocked the eld of optics. We are aware that the CS has penetrated many elds and become a hotspot. We expect more mathematicians, physicist, and engineers make contributions for the CS eld. 7. Acknowledgement
5. Emmanuel J. Cand`s and Michael B. Wakin, “An introe duction to compressive sampling,” IEEE Signal Processing Magazine, pp. 21230, Mar. 2008. 6. Justin K. Romberg, “Imaging via compressive sampling,” IEEE Signal Processing Magazine, pp. 1420, Mar. 2008. 7. Emmanuel J. Cand`s and Justin K. Romberg, “Sige nal recovery from random projections,” Proceedings of SPIEIS&T Electronic Imaging, vol. 5674, pp. 7686, Mar. 2005. 8. Emmanuel J. Cand`s and Terence Tao, “Decoding by e linear programming,” IEEE Transactions On Information Theory, vol. 51, pp. 42034215, Dec. 2005. 9. Joel A. Tropp and Anna C. Gilbert, “Signal recovery from partial information via orthogonal matching pursuit,” IEEE Transactions On Information Theory, vol. 53, pp. 46554666, Dec. 2007. 10. Mark Schmidt, Glenn Fung, and Rmer Rosales, Fast o optimization methods for L1 regularization: A comparative study and two new approaches, Machine Learning: ECML 2007, vol. 4701, Sep. 2007. 11. Leonid I. Rudin, Stanley Osher, and Emad Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D, vol. 60, pp. 259268, Nov. 1992.
The authors are not from signal or image processing society. Hence, some technical words may not be right. We hope the report can give some help for the researchers form the signal processing, image processing, and physical societies. References
1. Emmanuel J. Cand`s, Justin K. Romberg, and Terence e Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions On Information Theory, vol. 52, pp. 489509, Feb. 2006. 2. David L. Donoho, “Compressed sensing,” IEEE Transactions On Information Theory, vol. 52, pp. 12891306, Apr. 2006. 3. Emmanuel J. Cand`s and Michael B. Wakin, “People e hearing without listening:” An introduction to compressive sampling, Report. 4. Richard G. Baraniuk, “Compressive sensing,” IEEE Signal Processing Magazine, pp. 118124, Jul. 2007.
7