Fast dictionary learning from incomplete data

This paper extends the recently proposed and theoretically justified iterative thresholding and K residual means (ITKrM) algorithm to learning dictionaries from incomplete/masked training data (ITKrMM). It further adapts the algorithm to the presence of a low-rank component in the data and provides a strategy for recovering this low-rank component again from incomplete data. Several synthetic experiments show the advantages of incorporating information about the corruption into the algorithm. Further experiments on image data confirm the importance of considering a low-rank component in the data and show that the algorithm compares favourably to its closest dictionary learning counterparts, wKSVD and BPFA, either in terms of computational complexity or in terms of consistency between the dictionaries learned from corrupted and uncorrupted data. To further confirm the appropriateness of the learned dictionaries, we explore an application to sparsity-based image inpainting. There the ITKrMM dictionaries show a similar performance to other learned dictionaries like wKSVD and BPFA and a superior performance to other algorithms based on pre-defined/analytic dictionaries.


INTRODUCTION
Many notable advances in modern signal processing are based on the fact that even high-dimensional data follows a low complexity model.One such model, which has become an important prior for many signal processing tasks ranging from denoising and compressed sensing, to super-resolution, inpainting and classification is sparsity in a dictionary, [11], [6], [8], [10], [37], [12], [36], [26].In the sparse model each datum (signal) can be approximated by the linear combination of a small (sparse) number of elementary signals, called atoms, from a pre-specified basis or frame, called dictionary.In mathematical terms, if we represent each signal by a vector y n ∈ R d and collect the entire dataset in the matrix Y = (y 1 , . . ., y N ) ∈ R d×N , the sparse model can be formalised as Y = ΦX where X is columnwise sparse. ( Here the dictionary matrix Φ contains K normalised vectors (atoms) φ k , stored as columns in Φ = (φ 1 , . . . ,φ K ) ∈ R d×K , and each vector-column x n ∈ R K of the matrix X = (x 1 , . . ., x N ) ∈ R K×N contains only few nonzero entries.Since the model expressed in Eq. ( 1) has proven to be very useful in signal processing, the natural next question is, how to automatically learn a dictionary Φ, providing sparse representations for a given data class.This problem is also known as dictionary learning, sparse coding or sparse component analysis.By now there exist not only a multitude of dictionary learning algorithms to choose from, [2], [13], [14], [17], [18], [20], [33], [19], but also theoretical results have started to accumulate, [16], [34], [4], [1], [27], [29], [15], [5], [35], [3].As our reference list is necessarily incomplete we also point to the surveys [26], [30] as trailheads for algorithms and theory respectively.One common assumption on which all algorithms and associated theories are based is that large numbers of clean signals are available for learning the dictionary.However, this assumption is often not valid in actual applications.Motivated by a real-life prediction task in diabetes therapy management, in this paper we will consider the following problem: How do we learn a dictionary when there are only a few or no clean training signals available?Fig. 1: Examples of blood glucose profile of two patients (solid and dashed lines, respectively) during inpatient stay for three days.Each curve represents blood glucose profile for a 24 hour-period from 08:00 till 07:59 next day.Notice signal dropouts for several hours for at two out of six glucose traces.
Diabetes is currently considered one of the global healthcare challenges of the century, with more than 380 million people affected worldwide.The biggest challenge in diabetes management is the prediction of blood glucose levels from past and current measurements.The most recent advances in the field are based on the observation that purely data-driven algorithms lead to more clinically accurate results than the ones based on physiological models or a combination of both [23].Due to recent technological advances, blood glucose measurements can be provided on a close-to continuous basis, every 5 to 10 minutes, by a Continuous Glucose Monitoring (CGM) device, which is inserted under the skin.However, in addition to mandatory unpleasant calibration procedures of the device several times a day, CGM quite often returns obviously wrong, e.g.rapidly oscillating or negative, estimations of the blood glucose level and suffers from frequent signal dropouts, Figure 1, [32].The latter are especially common during the night, and can be particularly dangerous since there is no warning for low glucose levels, which in extreme cases can lead to coma or even death.One task is therefore prediction of glucose levels even with signal dropouts.This can be interpreted as inpainting (into the future) and a data-driven approach could be to learn a dictionary for the class of CGM signals and to use dictionary based inpainting.The obvious problem is that the CGM signals for learning are quite difficult to obtain and suffer from dropouts themselves.Therefore, any dictionary learning algorithm, or any other data-driven approach, needs to make use of all possible information and include the corrupted signals by properly modelling the corruption and accounting for it in the learning.
To solve the problem of learning from incomplete data, we propose an algorithm called Iterative Thresholding and K residual Means for Masked data (ITKrMM).As the name suggests, it is built upon the inclusion of a signal corruption model into the theoretically-justified and numerically efficient Iterative Thresholding and K residual Means (ITKrM) algorithm, [28].
In order to model the data corruption/loss process, we will adapt the concept of the binary erasure channel.In this model, the measurement device sends a value and the receiver either receives the value or it receives a message that the value was not received ('erased').The model is used frequently in information theory due to its simplicity and its abstraction towards modelling various types of data losses.At the same time, this setting provides information on the location of the erasures and, thus, we can employ the concept of a mask M to describe the corrupted data as M y.Without loss of generality, we will think of a mask M as orthogonal projection onto the linear span of vectors from the standard basis (e j ) j or simply as diagonal matrix with M (j, j) ∈ {0, 1}.We further extend the algorithms to account for the presence of low-rank component in the data, which appears in many real-life signals and, as we will illustrate below, should be treated cautiously in the considered context.To evaluate the accuracy and efficiency of the algorithm, we perform various numerical tests on synthetic data.We also demonstrate the practical usefulness of the algorithms to inverse problems, by considering an image inpainting task.The dictionary learning community does not directly address the problem under consideration.However, several recent works by Elad and co-authors [21], [22] introduced a weighted algorithm for dictionary learning, called weighted K-SVD (wK-SVD), for handling non-homogenous noise in signals.The proposed construction is also applicable in cases with missing values, such as in colour image demosaicing and inpainting.We will show that our algorithm not only performs on par with the weighted K-SVD algorithm but also requires much less computational resources.Contribution: This paper provides an efficient and simple algorithm for dictionary learning from incomplete data and the recovery of the low-rank component also from incomplete data.For the sake of brevity and different interests across communities, we here focus on the methodological description and extensive numerical justification and aim to provide a theoretical analysis in a follow-up paper.
Outline: The paper is organised as follows: Section 2 contains the complete problem set-up, explaining the combined low-rank and sparse model and as well as the corruption model.The ITKrMM algorithm for dictionary recovery is introduced in Section 3.An adaptation of this algorithm for recovery of the low-rank component from incomplete data together with a short discussion of related works in the field of matrix completion and dimensionality reduction is provided in Section 4. Section 5 contains extensive simulations on synthetic data, while Section 6 exemplifies the usefulness of the algorithm to problems in image processing by applying it to image inpainting.Finally Section 7 offers a snapshot of the main contributions and points out open questions and directions for future work.Notation: Before finally lifting the anchor, we provide a short reminder of the standard notations used in this paper.For a matrix A, we denote its (conjugate) transpose by A and its Moore-Penrose pseudo inverse by A † .By P (A) we denote the orthogonal projection onto the column span of A, i.e.P (A) = AA † and by Q(A) the orthogonal projection onto the orthogonal complement of the column span of A, that is , where I d is the identity operator (matrix) in R d .The restriction of the dictionary Φ to the atoms indexed by I is denoted by Φ I , i.e.Φ I = (φ i1 , . . ., φ iS ), i j ∈ I.The maximal absolute inner product between two different atoms is called the coherence µ of a dictionary, µ = max k =j | φ k , φ j | and encapsulates information about the local dictionary geometry.

PROBLEM SET-UP
Our goal is to learn a dictionary Φ from corrupted signals M n y n , under the assumption that the signals y n are sparse in the dictionary Φ.There are some notable differences in this problem setting compared to the uncorrupted situation.First, we cannot without loss of generality assume that the corrupted signals are normalised, since the action of the mask distorts the signal energy, M y 2 ≤ y 2 , which makes simple renormalisation impossible.
Another issue in modelling a natural phenomenon is that the signals might not be perfectly sparse but can only be modelled as the orthogonal sum of a low-rank and a sparse component.An example for such signals are images, where one usually subtracts the foreground or in other words the signal mean before learning the dictionary, which consequently will consist of atoms with zero mean [2].Without taking into account the existence of the low-rank component one would likely end up with a very ill-conditioned and coherent dictionary, where most atoms are distorted towards the low-rank component.Similarly, in our motivating example of the blood glucose data (see Figure 1), we can see at first glance that the signals vary around a baseline signal and that imposing a sparse structure in a dictionary makes sense only after subtracting this common component.As before, the atoms in this dictionary should then be orthogonal to the baseline signal.
In the case of uncorrupted signals one can simply determine the common low-rank component Γ = (γ 1 . . .γ L ) using one's preferred method such as a singular value decomposition and subtract its contribution from the signals via ỹn = Q(Γ)y n .Then in a second separate step one can run the dictionary learning algorithm on the modified signals ỹn and the resulting atoms will automatically be orthogonal to the low-rank component Γ.However, in the case of corrupted signals the action of the masks destroys the structure.So, while the dictionary is orthogonal to the low-rank component, Φ Γ = 0, this orthogonality is not preserved by the action of the mask, that is Φ M Γ = 0.As we will see later, the consequence of this effect is that we have to take the presence of the low-rank component into account when learning the sparsifying dictionary.Moreover, before even going to the dictionary learning phase, we have to find a strategy to recover the low-rank component from the corrupted signals.
The third difference is that we get additional constraints on the dictionaries in order for them to be recoverable.In the case of uncorrupted signals the main criterion for a dictionary to be recoverable is that its coherence scales well with the average sparsity level S of the signals (Sµ 2 1) and that all atoms are somewhat equally and independently used.In our scenario, where we want to learn a dictionary from corrupted data, we impose another criterion for the recoverability of the dictionary, which is the robustness of the dictionary to corruption.For instance, we will not have a chance to recover an atom φ k if its presence in a signal always triggers the same corruption pattern M 0 which distorts the atom, M 0 φ k = φ k .This means that we have to assume some sort of independence between the signals y n and the corruption, represented by the masks M n .Similarly, it will be very hard to recover a dictionary, whose incoherence is not robust towards corruption.To avoid this complication, we assume that the dictionary and the low-rank component consist of 'flat' atoms, where φ k ∞ 1 resp.γ ∞ 1.A more detailed discussion why this is a suitable assumption can be found in Section 3.For the moment we just want to point out that this is in line with the potential application of the learned dictionaries to signal reconstruction tasks such as inpainting.There the information in the corrupted part of an image needs to be encoded by the rest of the image, which is the case if the image is sparsely represented by flat atoms.Summarising these considerations, we arrive at the following signal model, which we will use as inspiration for the development of the algorithms, and as basis for the planned theoretical analysis.Signal model: Given a d × L low-rank component Γ with Γ Γ = I L and a d × K dictionary Φ, where Γ Φ = 0 and L K the signals are generated as, where v 2 2 + x 2 2 = 1 and |I| = S.The scaling parameter s is distributed between s min and s max and accounts for signals with different energy levels.The low-rank component is assumed to be present in every (most signals) and irreducible, meaning the coefficients v are dense and E(vv ) is a diagonal matrix.Also the average contribution of a low-rank atom should be larger than that of a sparse atom, E(|v( )|) E(|x(k)|).The sparse coefficients x should be distributed in a way that for every single signal only S entries in x are effectively non-zero.All atoms φ k should be irreducible and on average contribute equally to the signals y n .Specifically, no two atoms should always be used together, since in this case they could be replaced by any other two atoms with the same span.For a more detailed discussion of admissible coefficient models we refer to [28].For our derivations we will keep in mind the following simple model: with constant scale and without noise.The low-rank component is one-dimensional, L = 1, and the low-rank coefficient is equally Bernoulli distributed on ±c v .The sparse coefficients are constructed by choosing a support I of size S uniformly at random and setting x(k) = ±c, iid equally Bernoulli distributed, for k ∈ I and x(k) = 0 else.In other words the coefficients restricted to the support are a scaled Rademacher sequence.Following the above considerations concerning the scalings, we have c 2 v + S • c 2 = 1 and c v cS/K.Similar to the signal model we also summarise our considerations concerning the corruption in a model.

Corruption model:
As mentioned above, the corruption of a signal y is modelled by applying a mask M , where we assume that the distribution of the mask is independent of the signal distribution.By receiving a corrupted signal, we understand that we have access both to the corrupted signal M y, and the location of the corruption in form of the mask M , meaning we receive the pair (M y, M ).For the development of the algorithms we will keep two types of corruption in mind.The first type are random erasures, where the j-th coordinate is received with probability η j independently of the reception of the other coordinates, meaning M (j, j) ∼ B(η j ) are independent Bernoulli variables.The second type are burst errors or sensor malfunctions, as can be observed in the blood glucose example.We model them by choosing a burst-length τ and a burst-start t, according to a distribution ν τ,t .Based on τ and t we then set M (j, j) = 0 for t ≤ j < t + τ and M (j, j) = 1 else.One simple realisation of such a distribution would be to have no burst, τ = 0, with probability θ and a burst of fixed size, τ = T , which corresponds, for instance, to the time the sensor needs to be reset, with probability 1 − θ.The burst-start could be uniformly distributed, if the sensor is equally likely to malfunction throughout the measurement period, or for instance with a higher weight on part of the coordinates, if the sensor is more likely to malfunction during part of the measurement period, for instance, during the night.Having defined our problem set-up we are now ready to address the recovery of the dictionary from corrupted data.

DICTIONARY RECOVERY
We will use the iterative thresholding and K residual means algorithm (ITKrM), [28], as base for recovering the dictionary.It belongs to the class of alternating projection algorithms, which alternate between sparsely approximating the signals in the current version of the dictionary and updating the dictionary based on the sparse approximations.As the name suggests, ITKrM uses thresholding as sparse approximation algorithm and residual averages for the dictionary update and as such has the advantage of being computationally light and sequential.Further, there are theoretical results concerning its local convergence and good experimental results concerning its global convergence.Together with our expectation that it will be much easier to incorporate the information about corruption into a dictionary update scheme that uses averages than into one that uses higher order statistics such as singular vectors, this makes ITKrM a promising starting point.
To see how we have to modify the algorithm to deal with corrupted data, it will be helpful to understand how ITKrM works.ITKrM can be understood as fixed point iteration, meaning the generating dictionary Φ is a fixed point and locally, around the generating dictionary, one iteration of ITKrM is a contraction, φ k − ψk ψk 2 2 < κ φ k − ψ k 2 for all k and some κ < 1.We refer to [28] for details but for the sake of completeness we will provide some perhaps intuitive background for both the fixed point and the contraction property.Assume for a moment that the signals follow the simplest sparse model, that is, they are perfectly Ssparse in a generating dictionary Φ, meaning y n = Φ In x n (I n ) for some |I n | = S and x n (i) ≈ ±c for i ∈ I n , compare to the model presented in Section 2. In particular, they all have the same scaling and contain neither a low-rank component nor are they contaminated by noise.If we are given the generating dictionary as input dictionary, Ψ = Φ, then as long as the dictionary is not too coherent compared to the sparsity level, µ 2 S 1, thresholding will recover the generating support, meaning I t n = I n .Provided that the generating support was always recovered, we have P (Ψ I t n )y n = P (Φ In )y n = y n and before normalisation the updated atom takes the form ψk = n:k∈In This means that the output dictionary is again the generating dictionary Ψ = Φ or, in other words, that the generating dictionary is a fixed point of ITKrM.Note also that before normalisation the updated atom consists of roughly To provide insight why one iteration of ITKrM acts as contraction, assume again that we know all generating supports I n and that our current estimate for the dictionary consists of all generating atoms except for the first one, For the first atom we only have some (poor) approximation, which is, however, still incoherent with all other atoms, or, in other words, the current estimate ψ 1 contains more of the first than of any other generating atom.As before, one iteration of ITKrM will preserve all atoms ψ k = φ k for k ≥ 2 and on top of that contract ψ 1 towards φ 1 .To see this, observe that as long as the current estimate contains more of the first than of any other generating atoms, and, similarly, Combining the two estimates we get ψ1 = n:1∈In which shows that also a poor approximation ψ 1 is quickly contracted towards the generating atom φ 1 .In summary, for our modifications we have to ensure to preserve both the fixed point and the contraction property.For the start, we again assume that the corrupted signals have equal scale, contain no low-rank component, and are not contaminated by noise, but are perfectly S-sparse, that is First, observe that a corrupted signal M n y n is not sparse in the generating dictionary Φ but in its corrupted version M n Φ, Still, we can recover the support I n via thresholding using the corrupted dictionary M n Φ since we have access to the mask M n .However, we have to take into account that, strictly speaking, the corrupted dictionary is not actually a dictionary in the sense that its columns are not normalised.Depending on the shape of the atoms, flat or spiky, and the amount of corruption, M n 2 F , the norm of the corrupted atoms M n φ k 2 can vary between 0 and 1 corresponding to the extreme cases of being completely destroyed, M n φ k = 0, or perfectly preserved, M n φ k = φ k .Therefore the proper dictionary representation of the corrupted signal is and, in order to recover the support I n via thresholding, we have to look at the inner products between the corrupted signal and the renormalised non-vanishing corrupted atoms, Looking back at the representation of a corrupted signal in the properly scaled corrupted dictionary (10) we can also see why we assume flatness of the dictionary atoms, i.e. φ k ∞ 1 for all k.In the ideal case where for all atoms φ k we have φ k ∞ = 1/ √ d the energy of the corrupted atoms will be constant √ d so the dynamic range of the corrupted signal with respect to the corrupted normalised dictionary is the same as the original dynamic range, However, the less equally distributed over the coordinates the energy of the undamaged atoms is, the more the energy of the corrupted atoms varies.This leads to an increase of the dynamic range, which in turn makes it is harder for thresholding to recover the generating support.
The second reason for assuming flat atoms is the increase in coherence caused by the corruption.If the coherence of two flat atoms is small this means that their inner product is a sum of many small terms with different signs eventually almost cancelling each other out.Such a sum is quite robust under erasures, since both negative and positive terms are erased.On the other hand, if the energy of two atoms is less uniformly distributed, small coherence might be due to one larger entry in the sum being cancelled out by many small entries.Thus, the erasure of one large entry can cause a large increase in coherence, which again decreases the chances of thresholding recovering the generating support.Finally, to see that the flatness-assumption is not merely necessary due to the imperfection of the thresholding algorithm for sparse recovery, assume that the atoms of the generating dictionary are combinations of two diracs φ i = (δ i − δ (i+1) )/ √ 2, the coefficients follow our simple sparse model and that the corruption takes the form of random erasures, i.e.M n (j, j) are iid Bernoulli variables with P (M n (j, j) = 0) = η.For large erasure probabilities, η > 1/2, on average about half of the maximally 2S non zero entries of the signals will be erased and so the Dirac dictionary ψ i = δ i or rather its erased version will provide as plausible an S-sparse representation to the corrupted signals as the original dictionary Φ.To see how to best modify the atom update rule, we first consider the case, where the corruption occurs always in the same locations, meaning M n = M .Since we never observe the atoms on the coordinates where M (k, k) = 0, we can only expect to learn the corrupted dictionary M Φ = (M φ 1 . . .M φ k ) or rather its normalised version (M φ k / M φ k 2 ).On the other hand, the problem reduces to a simple dictionary learning problem for M Φ instead of Φ with update rule, where we have used the fact that the projection onto a subdictionary is equal to the projection onto its normalised version and that sign( M ψ k , M y n / M ψ k 2 ) = sign( ψ k , M y n ).Provided that thresholding always recovers the correct support I n , we can conclude directly from above that the normalised corrupted dictionary will be a fixed point and that the update rule will contract towards it.Indeed, for any corruption pattern M we know that before normalisation an updated atom M ψk will be contracted towards N k = {n : k ∈ I n } scaled copies of the corrupted generating atom M φ k , n:k∈In This suggests that for the case of different corruption patterns M n we can simply replace M by M n and the updated atom will be contracted towards the sum of scaled copies of the generating atom, corrupted with the different patterns, n:k∈In To then reconstruct the generating atom from the sum of its corrupted copies we just need to count how often we observe the atom on each coordinate.If each coordinate has been observed at least once, we can then obtain the generating atom simply by rescaling according to the number of observations, meaning we calculate set ψk = W † k ψk and output Ψ = ( ψ1 / ψ1 2 , . . ., ψK / ψK 2 ).The last detail we need to account for is the possible existence of a low-rank component Γ; other than noise or different signal scalings its contribution cannot be expected to average out once we have enough observations.Fortunately removing the low-rank component is quite straightforward, once we have a good estimate Γ with P ( Γ)Γ ≈ Γ.If a signal contains a low-rank component then the corrupted signal will contain the corrupted component, M y = M Γv +M Φ I x(I) and we can remove its contribution by a simple projection M ỹ = Q(M Γ)M y.However, since the mask destroys the orthogonality between the dictionary and the low-rank component, we do not get only the sparse contribution M Φ I x(I) but also a (small) contribution of the low-rank component, Q(M Γ)M Φ I x(I) = M Φ I x(I) − P (M Γ)M Φ I x(I).Thus, to stably estimate which part of an atom in the support has not been captured yet, we need to remove also the lowrank contribution and in our update rule replace the projection onto the current estimate of the corrupted atoms in the support with the projection onto these and the (estimated) corrupted low-rank component, . Further, to ensure that the output dictionary is again orthogonal to the low-rank component, we project the updated atoms onto the orthogonal complement of the (estimated) low-rank component.Putting it all together, we arrive at the following modified algorithm. and • Set ψk = Q( Γ)W † k ψk and output Ψ = ( ψ1 / ψ1 2 , . . ., ψK / ψK 2 ).Before we can start testing the modified algorithm, we still need to develop a method for actual recovery of the low-rank component from corrupted data, which we will do in the next section.In a follow up paper we hope to provide a theoretical analysis, which confirms that -as planned -the modified algorithm retains both the fixed point and the contraction property and thus is locally convergent.

RECOVERY OF THE LOW-RANK COMPONENT
As already mentioned, in the case of uncorrupted signals the low-rank component can be straightforwardly removed, since Γ will correspond to the L left singular vectors associated to the largest L singular values of the data matrix.In the case of corrupted signals this is no longer possible since the action of the corruption will distort the left singular vectors in the direction of the more frequently observed coordinates.To counter this effect, one would have to include the mask information in the singular value decomposition.This is, for instance, done by Robust PCA which was developed for the related problem of low-rank matrix completion [7].Unfortunately, one of the main assumptions therein is that the corruption is homogeneously spread among the coordinates, which might not be the case in our setup.To recover the low rank component, we will, therefore, pursue a different strategy.Let us assume for a moment that we are looking for only one low-rank atom, L = 1.One interpretation of all (masked) signals having a good part of their energy captured by the (masked) low-rank atom is to say that all (masked) signals are 1-sparse in a dictionary of one (masked) atom.Since we already have an algorithm to learn dictionaries from corrupted signals, we can also employ it to learn the low-rank atom.Moreover, since we have an algorithm to learn dictionaries from corrupted signals that contain a low-rank component, we can iteratively learn the low-rank component atom by atom.Adapting the algorithm also leads to some simplifications.After all, we do not need to find the sparse support, since (almost) all signals are expected to contain the one new atom.Summarising these considerations, we arrive at the following algorithm. and • Set γ = Q( Γ)W † γ and output γ / γ 2 .
Note that for the first low-rank atom in each iteration the update rule reduces to a summation of the signals aligned according to sign( γ , M n y n ).We also want to point out that our iterative approach offers the possibility to automatically choose the size of the low-rank component by comparing the signal energy captured by the last low-rank 2 to the signal energy expected to be captured by a dictionary atom 1 . A natural choice for determining L is to stop adding low rank atoms once the ratio between the two energies drops below a prescribed value.Since this ratio depends very much on the problem at hand, we will not go into more details here but instead turn to finally testing our algorithms on synthetic data.

NUMERICAL SIMULATIONS: SYNTHETIC DATA
We first explore the performance of the adapted algorithms in comparison to their original counterparts not using mask information, that is, singular value decomposition for low-rank recovery and for

Signal Model
Given the generating low-rank component Γ and dictionary Φ our signal model further depends on 6 coefficient parameters,

eΓ
-the energy of the low-rank coefficients, bΓ -defining the decay factor of the low-rank coefficients, S -the sparsity level, bS -defining the decay factor of the sparse coefficients, ρ -the noise level and sm -the maximal signal scale.
Given these parameters, we choose a low-rank decay factor cΓ uniformly at random in the interval [1 − bΓ, 1].We set v( ) = σ c Γ for 1 ≤ ≤ L, where σ are iid.uniform ±1 Bernoulli variables, and renormalise the sequence to have norm v 2 = eΓ.Similarly, we choose a decay factor cS for the sparse coefficients uniformly at random in the interval where σ are iid.uniform ±1 Bernoulli variables, and renormalise the sequence to have norm x 2 = 1 − eΓ.Finally, we choose a support set I = {i1 . . .iS} uniformly at random as well as a scaling factor s uniformly at random from the interval [0, sm] and according to our signal model in (2) set where r is a Gaussian noise-vector with variance ρ 2 if ρ > 0. As low-rank component we choose the first two DCT atoms, that is the constant atom and the atom corresponding to an equidistant sampling of the cosine on the interval [0, π), while the remaining basis elements form the dictionary.For the second pair, we construct the low-rank component by choosing two vectors uniformly at random on the sphere in R d for d = 256 and setting Γ the closest orthonormal basis as given by the singular value decomposition.To create the dictionary, we then choose another 1.5d random vectors uniformly on the sphere, project them onto the orthogonal complement of the span of Γ and renormalise them.These two representation pairs exhibit different complexities.The first forms an orthonormal basis, thus is maximally incoherent, and every element has γ ∞ = φ k ∞ = 2/d ≈ 0.088.The second dictionary is overcomplete with coherence 0.2788 and the supremum norm of both the low-rank and the dictionary atoms varies between 0.1529 and 0.2754 and averages at 0.1897.Signals: To create our signals, we use the signal model in (2) with a particular choice of distributions for the sparse and low-rank coefficients, the scaling factor and the noise, described in Table 1.For the first experiment, we set the parameters to e Γ = 1/3, b Γ = 0.15, S = 6, b S = 0.1, ρ = 1/(4 √ d) and s m = 4, resulting in 6-sparse signals with dynamic coefficient range between 1 and 0.9 −6 ≈ 1.88 and the low-rank component containing a third of the energy.The signal-to-noise ratio is 16 and the scaling is uniformly distributed on [0,4].Corruption: We consider two types of corruptions, whose distributions are described in Table 2.The random erasure patterns depend on 4 parameters determining (the difference in) the erasure probabilities of the first and second half of the coordinates (p 1 , p 2 ) and one half and the other half of the signals (q 1 , q 2 ).The expected average corruption corresponds to 1 − E( k M (k, k)) = 1 − (p 1 + p 2 )(q 1 + q 2 )/4 and in our experiments varies between 10% and 90%.The burst error patterns also depend on 4 parameters determining the burstlength T , the probability of no burst, a burst of size T or of size 2T occurring (p 0 , p T , p 2T where p 0 = 1 − p T − p 2T ), as well as the probability of the burst occurring among the first half of the coordinates (q).In our experiments, we consider burstlengths T = 64, 96 with varying burst location and occurrence probabilities, leading to an empirical average corruption varying between 10% and 60%.Experimental setup: We use two kinds of experimental setups.In the first, we learn the low-rank component and then the dictionary always using random initialisations.In particular, to learn the lowrank component with the adapted algorithm we use 10 iterations for every atom and 30,000 (new) signals per iteration.As initialisation, we use a vector drawn uniformly at random from the sphere in the orthogonal complement of the low-rank component recovered so far.For the unadapted low-rank recovery, we use a singular value decomposition, where the low-rank component corresponds to the first L left singular vectors of the last 30,000 signals generated for the adapted algorithm.As measure for the final recovery error, we use the operator norm of the difference between the generating low-rank component Γ and its projection onto the recovered component Γ, that is Γ − P ( Γ)Γ 2 .This corresponds to the worst case approximation error of a signal in the span of the generating low-rank component by the recovered one.We then learn the dictionary using 100 iterations of ITKrM(M) and 100,000 (new) signals per iteration from a random initialisation, where the initial atoms are drawn uniformly at random from the sphere in the orthogonal complement of the respective low-rank component.We measure the recovery success by the percentage of recovered or rather not recovered atoms, where we use the convention that a generating atom φ k is recovered if there exists an atom ψj in the output dictionary Ψ for which | φ k , ψj | ≥ t for t = 0.99.To provide a more complete picture, we also indicate the percentage of atoms, that are potentially recoverable using more training samples or iterations, that is | φ k , ψj | ≥ t for t = 0.90.In the second setup, we do not learn the low-rank component but provide the dictionary learning algorithms with the true low-rank component.We then learn the dictionary from a close-by initialisation,

Erasure Model
Our erasure model depends on 4 parameters, p1 -the relative signal corruption of the first half of coordinates, p2 -the relative signal corruption of the second half of coordinates, q1 -the corruption factor of one half of the signals, q2 -the corruption factor of the other half of the signals.
Based on these parameters we generate a random erasure mask as follows.First we choose q ∈ {q1, q2} uniformly at random and determine for every entry the probability of being non-zero as ηj = qp1 for j ≤ d/2 and ηj = qp2 for j > d/2.We then generate a mask as a realisation of the independent Bernoulli variables M (j, j) ∼ B(ηj), that is P (M (j, j) = 1) = ηj.

Burst Error Model
Our burst error model depends on 4 parameters,

pT
-the probability of a burst of length T , p2T -the probability of a burst of length 2T , T -the burst length, q -the probability of the burst starting in the first half of the coordinates.
Based on these parameters, we generate a burst error mask as follows.First we choose a burstlength τ ∈ {0, T, 2T } according to the probability distribution prescribed by {p0, pT , p2T }, where p0 = 1 − pT − p2T .We then decide according to the probability q whether the burst start t occurs among the first half of coordinates, t ≤ d/2, or the second half, t > d/2.Finally, we draw the burst start t uniformly at random from the chosen half of coordinates and in a cyclic fashion set M (j, j) = 0 whenever t ≤ j < t + τ or j < t + τ − d and M (j, j) = 1 else.Here, the initial dictionary atoms are 1:1 linear combinations of a generating atom and a random vector in the sphere orthogonal to it, which are then projected onto the orthogonal complement of the respective low-rank component.In case of the close-by initialisation, we measure the recovery success by the maximal error between a generating atom and its closest match in the output dictionary, Note that for an atom the recovery threshold t = 0.99 corresponds to an error of √ 0.02 ≈ 0.14.Again to provide a more complete picture, we also indicate the mean error between all generating atoms and their closest matches in the output dictionary, Figure 2 shows the recovery results for various corruption levels using the corruption adapted algorithms (ITKrMM) and their unadapted counterparts (ITKrM).We can see that for both representation pairs incorporating the corruption information into the learning algorithms clearly improves the performance.Another fact immediately visible is that for the adapted algorithms the success rates differ for the two erasure modalities and decrease with increasing corruption but do not depend much on the particular distribution of the erasures or bursts as long as they lead to the same average corruption level.In contrast, the success rates of the unmodified algorithms depend very much on the corruption distribution, and signals with similar average corruption can lead to very different error rates.Distinguishing between the different error modalities, we note that, for the low-rank recovery and the dictionary recovery from a close-by initialisation, the corruption adapted algorithms outperform the unadapted ones in any setting.For the dictionary recovery from a random initialisation, we again see that overall the modified algorithm outperforms the unmodified one.So, for both dictionaries ITKrMM recovers more than 95% of the atoms in all except two cases, while recovery, as well as potential recovery, via ITKrM completely breaks down at around 60% corruption.We also observe that corruption can improve the recovery rates of both the unmodified and the modified algorithms.A similar phenomenon has already been observed for ITKrM in connection with noise and a lower sparsity level, [28].So while one might expect the global recovery rates to decrease with increasing noise and with S, they actually increase.The reason for this is that a little bit of noise or lower sparsity, like a little bit of corruption, breaks symmetries and suppresses the following phenomenon.Two atoms converge to the same generating atom and, therefore, another atom has to do the job (is a 1:1 linear  combination) of two generating atoms.For uncorrupted signals there are ongoing efforts to counter this phenomenon with replacement strategies, which have a straightforward extension to corrupted signals, [31].With this extension, we expect all (potential) recovery rates around 90% to increase to 100%, as happens in the uncorrupted case.
To find out when we gain most from incorporating the mask information, let us have a more detailed Fig. 3: Detailed recovery performance of the corruption adapted versus the unadapted learning algorithms for the random representation pair for various types of random erasures (a,c) and burst errors (b,d).
look at the recovery rates for different types of parameter settings.In case of the random erasures, we distinguish 4 types.'type00' indicates that p 1 = p 2 with p 1 varying between 0.2 and 0.8 and q 1 = q 2 = 1, leading to a uniform erasure probability for all coordinates and all signals.'type20 (30)' indicate that p 2 = p 1 + 0.2(0.3)with p 1 varying between 0.1 and 0.7(0.6)and again q i = 1, leading to higher erasure probabilities for the first half of the coordinates, which are however uniform across signals.Finally, 'type22' indicates that p 2 = p 1 + 0.2 and q i = p i for p 1 varying between 0.4 and 0.8, leading to different erasure probabilities across coordinates and across signals.
For conciseness, we focus on the random low-rank component and dictionary and show only the error of the low-rank component and the maximal error of the dictionary from a close-by initialisation, which we take as indicator of global recovery rates with replacement strategies, Figure 3. Distinguishing between the different types, we can now see that incorporating the corruption information gives the highest benefits when the corruption is most unevenly distributed over the signal coordinates.So, for the evenly distributed random erasures and burst errors, 'type00' and 'type5', the low-rank component is still recovered by both the unadapted and the adapted algorithm, but as soon as there is intercoordinate variance in the corruption level, type20/22/30' and 'type7', the unadapted algorithm starts to lag behind.For the dictionary recovery, the adapted algorithm already shows advantages for the homogeneous corruption distributions, 'type00' and 'type5', which again become more and more pronounced with increasing intercoordinate variance of the corruption, 'type20/22/30' and 'type7'.
The second experiment explores the sensitivity of the algorithms to the flatness/spikyness of the representation pairs, measured by γ ∞ and φ k ∞ .This is done by looking at the recovery of representation pairs, which form orthonormal bases and whose atoms have their energy concentrated on supports of size m for m = 4, 8, 16, 32, 64, 128, 256.
Dictionaries & low-rank components: For a given support size m we choose d vectors z k from the unit sphere in R m and d supports I k = i 1 . . .i m of size m uniformly at random and set B(I k , k) = z k and zero else.We then calculate the closest orthonormal basis to B using the singular value decomposition.The first two elements of this orthonormal basis are chosen as the low-rank component, while the remaining elements form the dictionary.Signals, corruptions & setup: For the signal generation we use the same parameters as in the last experiment and for the corruption we use the random erasure masks of 'type22' with p 1 = q 1 = 0.7/0.5 and p 1 = q 1 = 0.7/0.9corresponding to 36% and 64% of corruption.The experimental set up for the recovery of each representation pair is again as in the last experiment.
Figure 4 shows the spikyness of the representation pairs for various support sizes as well as the corresponding recovery results for the two corruption types.Let us first point out that our construction based on decreasing atom support sizes indeed leads to representation pairs with increased spikyness (Figure 4c).As usual the recovery errors incurred by the modified algorithms are much lower than those of the unmodified ones.For the low-rank component (Figure 4a) the recovery error is very stable and only starts to deteriorate for m = 4, when the low-rank atom carrying less energy is indeed almost a spike, γ 2 ∞ = 0.8997, meaning 80% of its energy are concentrated on one coordinate.Also for the dictionary recovery the robustness to spikyness of the adapted algorithm is quite surprising.So for the 1:1 initialisation the maximal atom recovery error stays below the critical error until m = 8 for the low corruption level (36%) and until m = 16 for the higher corruption level (64%).Even more interesting are the recovery rates for the random initialisation, which only break down and go below 94% for m = 4 at the higher corruption level.As in the last experiment, we observe the effect that spikyness like corruption can lead to better global recovery rates.The effect is more pronounced for the higher corruption level (64%), where for m = 16 we even have 100% recovery.We briefly investigated the effect of the signal scaling on the recovery rates of the modified algorithms for the DCT representation pair and the 'type22' erasure mask with 36% corruption, with a same setup as in the first experiment, but found that there was no strong influence.That is, for s m varying between 2 and 128, the low-rank recovery error varies between 0.031 and 0.036, the atom recovery error from the close-by initialisation varies between 0.007 and 0.012 and the recovery rates from the random initialisation stay between 95% and 96%.Similarly, exploring the effect of the sparsity level S, we do not gain much more insights over the experiments already conducted in the uncorrupted case, [28].So, fixing all mask and signal parameters except for the sparsity parameter S, which increases from 4 to 16, the low-rank recovery error stays constant, the dictionary recovery error from the close-by initialisation increases while the number of not recovered dictionary atoms from a random initialisation decreases.In order not to overload the paper, we do not detail these experiments here but refer the interested reader to the ITKrMM MATLAB toolbox 1 , which can be used to reproduce all the presented experiments and many more.Instead we now turn to the application of dictionary learning from corrupted data to image  inpainting.

APPLICATION: IMAGE INPAINTING
To demonstrate the practical value of the ITKrMM algorithm, we here conduct an image inpainting experiment.Inpainting is the process of filling in missing information or holes in damaged signals, and our motivating task, the prediction of blood glucose levels, can be cast as inpainting problem.Image inpainting, in particular, is used for restoration of old analog paintings, denoising of digital photos, and for removal of objects like text or date stamps from images and has become an active field of research in the mathematical and engineering communities, with a variety of specifically developed methods and approaches.Since our goal is to evaluate the ITKrMM algorithm as a robust and simple tool for dictionary learning from incomplete data, we do not compare our inpainting scheme with state-of-the-art methods for image inpainting in general, but instead compare the performance of various learned dictionaries for dictionary based inpainting.Apart from comparing to the dictionaries and low-rank components of different size learned by ITKrMM, we also compare to dictionaries learned by weighted KSVD (wKSVD), a modification of KSVD, [2], for signals affected by non-homogeneous noise or erasures.wKSVD is the most likely algorithm to fit into our setup and has already been successfully used for image inpainting, [21], [22].Let us now describe our setup for dictionary based inpainting.Data: For our experiments we consider grayscale images of size 256×256 from a standard image database.
In the first set of experiments, the images are corrupted by erasing each pixel iid with probability 0.3, 0.5 or 0.7, resulting in 30, 50 or 70% erased pixels on average.In the second set of experiments, part of the pixels from the images are removed using a structured mask, see Figure 6b.We then extract all possible patches of size p × p pixels from the corrupted image as well as the corresponding mask.We set p = 8 in case of the randomly erased pixels and p = 12 in case of the structured mask.The vectorised corrupted patch/mask pairs are then given to the dictionary learning algorithm.
Dictionary & low-rank component: Via ITKrMM we learn dictionaries of size K = 2d − L atoms, where d = p 2 is the dimension of the patches and L denotes the size of the low-rank component.We consider L = 1 and L = 3. Via wKSVD we learn a dictionary of size K = 2d with the option of keeping the first atom always equal to the constant atom φ 1 ≡ c.This corresponds to ITKrMM with K = 2d − 1 and L = 1.The ratio 2 between patch dimension and the size of the dictionary together with the low-rank component makes the dictionary redundant and allows to capture different features, while still allowing for reasonable computational complexity.
Initialisation & sparsity level: We use the same initialisation strategies as for the synthetic experiments, ie.random vectors that are orthogonal to the low-rank component resp.atoms that have already been learned.This means that in case L = 1, before subtracting the low-rank component, the initial dictionaries for ITKrMM and wKSVD are the same.As sparsity level for the learning step of ITKrMM, we use S = p − L to keep the comparison between different sizes of the low-rank component fair.Since within wKSVD the contribution of the constant low-rank atom counts in the sparse approximation step, we use as sparsity level S = p.For learning the low-rank components we use 10 iterations per low-rank atom and all available patch/mask pairs and for learning the dictionary we use 40 iterations on all available patch/mask pairs for both algorithms.Note that the wKSVD version adapted to erasures was originally designed to adapt an existing patch dictionary, pre-learned on some image data base, to the image patches at hand rather than to calculate the dictionary from scratch.Still, abusing it to do just that, it works very well.
Dictionary based inpainting: Dictionary based inpainting relies on the concept that every image patch y is sparse in a flat patch-dictionary Φ and therefore every damaged/masked patch M y is sparse in the damaged/masked patch dictionary M Φ, that is for if for To reconstruct the original patch we simply need to recover coefficients xI ≈ x I by sparsely approximating M y in M Φ and to set ỹ = Φx I .The flatter the dictionary is, the more stable the atoms are to erasures and the easier it is to recover the correct coefficients.We first reconstruct every damaged image patch via sparse approximation and then reconstruct the full image by averaging every pixel over all reconstructed patches in which it is contained.As sparse approximation algorithm we use Orthogonal Matching Pursuit with a sparsity level of S = 20 in case of the random erasures and S = 30 in case of the structured mask, [9], [25].Note that since the damaged dictionary is not normalised we need to account for this in the OMP selection step and rescale by 1/ M φ k 2 , simarly to thresholding in the ITKrMM algorithm.Without this renormalisation less damaged atoms take precedence over better fitting ones.Comparison/Error: We measure the recovery success of the different pairs of dictionaries and low-rank components by the peak signal-to-noise ratio (PSNR) between the original image Y and the recovered 2. In the original version of wKSVD, which uses OMP as sparse approximation algorithm, the OMP selection step seems not to have included normalisation.However, we find that including the normalisation step both for learning and inpainting improves the final performance of the wKSVD dictionary, especially for the 70% corruption level, where it increases the PSNR by up to 2.9dB.
version Ỹ .For Y, Ỹ both images of size d 1 × d 2 the PSNR is defined as In case of the random erasure mask we average over 5 runs, each with a different mask and initialisation, to account for the variability of the PSNR for different mask realisations.The results of the experiment are presented in Table 3 and examples of two images, one corrupted with the 50% mask and the other with the structured mask, together with their reconstructions and the original images can be found in Figure 5 and Figure 6.Table 3 shows that when randomly erasing around 30% of the pixels, inpainting with the ITKrMM dictionaries always outperforms inpainting with the wKSVD dictionaries.In case of 50% and 70% random erasures as well as for the structured mask, the wKSVD and ITKrMM dictionaries perform about equally, where for the more textured images (Barbara, Mandrill, Pirate) ITKrMM tends to have a slight advantage, while for the smooth images (Cameraman, House, Peppers) wKSVD is slightly better.Another trend we can observe is that for the lower corruption levels ITKrMM with a low-rank component of size L = 3 improves over a low-rank component of size L = 1, while for the higher corruption levels it is the other way round.We also conducted experiments with ITKrMM for L = 2.
The performance was for all settings inbetween the case L = 1 and the case L = 3, so we do not include them here.
To understand why the ITKrMM dictionaries perform better for the small erasure probabilities and textured images, we have a look at the dictionaries learned with ITKrMM (L=1) and wKSVD on patches with 30% and 70% corruption for Barbara (Figure 7) and Peppers (Figure 8).For 30% random erasures we can see (especially on Barbara) that the dictionaries are very similar but that ITKrMM produces more textured (high frequency) atoms, while wKSVD prefers smooth (low frequency) atoms.Therefore the ITKrMM dictionary can inpaint finer details and outperforms wKSVD.Looking at the dictionaries learned on patches from 70% corrupted imagegs, we now see different trends for the two dictionary types.For ITKrMM the 70% erasure dictionaries are simply noisy versions of their 30% counterparts.For wKSVD on the other hand the 70% erasure dictionaries contain only noisy versions of the smooth atoms from their 30% counterparts but not many recognisable noisy versions of the high frequency atoms.In the case of highly corrupted textured images this puts wKSVD at a disadvantage, while in the case of highly corrupted smooth images it prevents inpainting with noisy details.This difference in number of textured vs smooth atoms also explains the different performance of the two dictionary types for the structured  After verifying that ITKrMM indeed learns meaningful and applicable dictionaries also on real data we now turn to a discussion of our results.

CONCLUSION AND FUTURE DIRECTIONS
Motivated by a real-life problem with corrupted data in diabetes therapy management, we here extended the iterative thresholding and K residual means (ITKrM) algorithm for dictionary learning to learning dictionaries from incomplete/masked data (ITKrMM).To account for the presence of a low-rank component in the data we further introduced a modified version of the ITKrMM algorithm to recover also the low-rank component and adapted the ITKrMM algorithm to the potential presence of such a low-rank component.In extensive tests on synthetic data, we demonstrated that incorporating information about the corruption (missing coordinates) dramatically improves the dictionary learning performance and that ITKrMM is able to recover dictionaries from data with up to 80% corruption.We further showed that the algorithm also learns meaningful dictionaries on corrupted image data and provides noticeable improvements in terms of required computational resources compared to wKSVD, a state-of-the-art algorithm for dictionary learning/refinement in the presence of erasures.Moreover, when used for inpainting, the ITKrMM dictionaries perform on a par with their wKSVD counterparts and in case of textured images ITKrMM leads to notable improvements, since the high frequency atoms in the learned dictionaries are better preserved.All of the experiments reported in this paper can be reproduced with the ITKrMM Matlab toolbox, which is freely available at http://homepage.uibk.ac.at/ ∼ c7021041/code/ITKrMM.zip.One slight disappointment is that in synthetic experiments with a random initialisation ITKrMM does not recover the full dictionary.Instead, it recovers some atoms twice and some atoms are 1:1 linear combinations of two other ground truth atoms.This phenomenon has already been observed in the case of ITKrM and there are ongoing efforts to counter it with replacement strategies, which also open up the road to adaptively choosing the sparsity level, the dictionary size and the size of the low-rank component, [31].Once these strategies for ITKrM are finalised they will have straighforward extensions to the case of corrupted data, that is ITKrMM.Most of our active research efforts are currently devoted to extending the local convergence results for ITKrM to the case of ITKrMM with random erasures, which will hopefully be available in the near future, [24].Another interesting future direction consists of designing faster and more efficient algorithms, for instance, by adaptive patch selection for dictionary learning from incomplete data as it was suggested in seismic data recovery [38].More generally, we are interested in extending the concept of learning dictionaries from data with erasures to more general types of corruption such as for instance blurring, where the resulting dictionaries can then be used for deblurring.

Algorithm 3 . 2 ( 2 .•
ITKrM for corrupted data -one iteration).Given an estimate of the low-rank component Γ, an input dictionary Ψ with Ψ Γ = 0, a sparsity level S and N corrupted training signals y M n = (M n y n , M n ) do: • For all n set M n ỹn = Q(M n Γ)M n y n .• For all n find I t n = arg max I:|I|=S i∈I:Mnφi =0 | Mnφi,Mn ỹn | Mnφi For all k calculate ψk = n:k∈I t n

Fig. 2 :
Fig. 2: Recovery performance of the corruption adapted versus the unadapted learning algorithms for the DCT (a,c,e) and the random (b,d,f) representation pair in terms of low-rank recovery error (a,b), percentage of recovered dictionary atoms from a random initialisation (c,d) and dictionary recovery error from a close-by initialisation (e,f).

!Fig. 4 :
Fig. 4: Recovery performance of the corruption adapted versus the unadapted learning algorithms for random representation pairs with varying atom support sizes and two corruption levels.

Algorithm 6 . 1 (
OMP masked).Given a damaged signal M y together with the mask M , a dictionary Φ and a sparsity level S, initialise r = M y, I = ∅ and while |I| < S do• Atom selection: find j = arg max M φk =0 | r, M φ k |/ M φ k 2 .•Approximation: Set I = I ∪ {j}, x I = (M Φ I ) † M y and r = M y − M Φ I x I .Output x I .

TABLE 1 :
Signal Model dictionary learning ITKrM on the signals projected onto the orthogonal complement of the low-rank component.To do this, we look at two representation pairs, consisting of a low-rank component and a dictionary, and test the recovery using 6-sparse signals with corruptions of two types, random erasures and burst errors.Dictionary & low-rank component: The first representation pair corresponds to the Discrete Cosine Transform (DCT) basis in R d for d = 256.

TABLE 2 :
Mask Models using 10 iterations with again 100,000 (new) signals per iteration.

TABLE 3 :
Comparison of the PSNR (in dB) for inpainting of images with various corruption levels based on dictionaries learned with wKSVD and ITKrMM on all available corrupted image patches.