arXiv:2312.04327v1 [cs.LG] 7 Dec 2023

Learning to sample in Cartesian MRI

Thesis No. 9981, presented on 3 June 2022
Faculté des sciences et techniques de l'ingénieur
Laboratoire de systèmes d'information et d'inférence
Programme doctoral en informatique et communications
For the degree of Docteur ès Sciences by Thomas Sanchez

Accepted on the proposal of the jury:
Prof. Alexandre Alahi, jury president
Prof. Volkan Cevher, thesis director
Prof. Dimitri van de Ville, examiner
Dr. Ruud van Heeswijk, examiner
Dr. Philippe Ciuciu, examiner

On God rests my salvation and my glory; my mighty rock, my refuge is God. — The Bible, Psalm 62:8

Acknowledgements

This thesis was a journey that would have been impossible without the help and support of many people. First, I would like to thank my advisor Volkan Cevher. He never gave up on me despite some tumultuous steps, and fought for me to reach the end of my thesis. Few would have done this. I am also thankful that he let me pursue my research interests freely, and I am grateful to him for having built an amazing group with colleagues who are both brilliant researchers and great friends.

Second, I would like to express my gratitude to Prof. Alexandre Alahi, Prof. Dimitri van de Ville, Dr. Ruud van Heeswijk and Dr. Philippe Ciuciu for being part of my thesis committee and for the interest that they took in reading and evaluating my thesis, as well as for their useful suggestions. I would like to thank in particular Dr. Ruud van Heeswijk and Dr. Philippe Ciuciu for all the insight that they gave me about MRI, and Dr. Philippe Ciuciu for hosting me at Neurospin, despite the visit being shortened by the start of a pandemic. I have very good memories of my short time there.

My thanks then naturally go to my friends and colleagues, who supported me during these four years and made them unforgettable. I can never repay them enough for their kindness. Thank you, Baran, Paul, Ya-Ping, Ahmet, Fatih, Ali K., Kamal, Thomas, Fabian, Leello, Pedro, Igor, Chaehwan, Maria, Arda, Armin, Yura, Nadav, Grigoris, Ali R., Andrej, Luca, Stratis, Kimon, Fanghui, and all the others for your friendship and the moments that we have shared. Of course, a couple of special mentions are in order. I want to thank Baran for introducing me to the world of MRI, and teaching me much about research and life. I want to thank Igor for the close collaboration that we had throughout my PhD, for putting up with me through manifold argumentative and extended meetings, pesky bugs and seemingly incomprehensible experimental results, and for being so patient and supportive. I am also thankful for his healthy German honesty and for the good friend he is. I also want to thank Leello for being one of the most fun people on this planet – I don't think that anyone made me laugh more than him – for his optimistic outlook on life, for his practical wisdom and overall for being such an amazing friend. Thank you for bringing joy and friendship to everyone around you.

Of course, I ought to thank Gosia, the secretary of our lab, for her amazing work and efficiency in every practical matter. I suspect that she is the reason that the lab has not yet collapsed into chaos. I am also grateful to all my friends for their support, in particular to Clément and Nathan Dupertuis, Peter Lecomte, Marion Conus and Christian Cuendet, as well as everyone from church, from the house group and from GBU.
They are too numerous to be listed individually, and I feel very fortunate to be surrounded by so many kind and generous people.

Of course, without the support from my family, this thesis would not have been possible. I have been blessed by the unconditional love and encouragement from my parents, Esther and David, from my sister Myriam and her fiancé Sam, and from my brother Nicolas. I also cannot thank enough my parents-in-law, Karin and Jean-Paul.

Finally, my words turn to my best friend and amazing wife, Aline. Thank you for who you are. Thank you for your love, your support and unfailing encouragement. Without you, I could not have completed this thesis. You make every day brighter and push me to become a better man.

Lausanne, May 12, 2022
T. S.

Abstract

Magnetic Resonance Imaging (MRI) is a non-invasive, non-ionizing imaging modality with unmatched soft tissue contrast. However, compared to imaging methods like X-ray radiography, MRI suffers from long scanning times, due to its inherently sequential acquisition procedure. Shortening scanning times is crucial in clinical settings, as it increases patient comfort, decreases examination costs and improves throughput. Recent developments thanks to compressed sensing (CS) and, lately, deep learning make it possible to reconstruct high-quality images from undersampled images, and have the potential to greatly accelerate MRI. Many algorithms have been proposed in the context of reconstruction, but comparatively little work has been done to find acquisition trajectories that optimize the quality of the reconstructed image downstream. Although this problem has gained attention in recent years, it is still unclear what the best approach to designing acquisition trajectories in Cartesian MRI is.

In this thesis, we aim at contributing to this problem along two complementary directions. First, we provide novel methodological contributions to this problem. We first propose two algorithms that drastically improve the greedy learning-based compressed sensing (LBCS) approach of Gözcü et al. (2018). These two algorithms, called lazy LBCS and stochastic LBCS, scale to large, clinically relevant problems such as multi-coil 3D and dynamic MRI that were inaccessible to LBCS. We also show that generative adversarial networks (GANs), used to model the posterior distribution in inverse problems, provide a natural criterion for adaptive sampling by leveraging their variance in the measurement domain to guide the acquisition procedure.

Secondly, we aim at deepening the understanding of the kind of structures or assumptions that enable mask design algorithms to perform well in practice. In particular, our experiments show that state-of-the-art approaches based on deep reinforcement learning (RL), which have the ability to adapt trajectories on the fly to the patient and to perform long-horizon planning, bring at best a marginal improvement over stochastic LBCS, which is neither adaptive nor performs long-term planning. Overall, our results suggest that methods like stochastic LBCS offer promising alternatives to deep RL. They shine in particular by their scalability and computational efficiency and could be key in the deployment of optimized acquisition trajectories in Cartesian MRI.
Keywords: magnetic resonance imaging, experiment design, inverse problems, compressed sensing, reinforcement learning, deep learning, generative adversarial networks iv Résumé L’imagerie par résonance magnétique (IRM) est une modalité d’imagerie non invasive, non ionisante et offrant un contraste inégalé des tissus mous. Cependant, en raison de procédures d’acquisition inhéremment séquentielles, la vitesse d’acquisition en IRM est lente comparé à des méthodes telles que la radiographie par rayons X. La réduction des temps d’acquisition est cruciale en milieu clinique, car elle profite au confort du patient, diminue les coûts d’examen et améliore le rendement. Les innovations récentes grâce à l’acquisition comprimée (CS) et dernièrement l’apprentissage profond permettent de reconstruire des images de haute qualité à partir d’images sous-échantillonnées et ont le potentiel de grandement accélérer l’IRM. De nombreux algorithmes ont été proposés dans le contexte de la reconstruction, mais comparativement peu de travaux ont été réalisés afin de trouver des trajectoires d’acquisition qui optimisent la qualité de l’image reconstruite en aval. Bien que ces dernières années, ce problème ait gagné en attention, il n’est toujours pas clair quelle est la meilleure approche pour concevoir des trajectoires d’acquisition en IRM cartésienne. Dans cette thèse, nous visons à contribuer à ce problème selon deux directions complémentaires. Premièrement, nous proposons de nouvelles contributions méthodologiques à ce problème. Nous proposons tout d’abord deux algorithmes qui améliorent considérablement l’approche gloutonne d’acquisition comprimée basée sur l’apprentissage (LBCS) de (Gözcü et al., 2018). Ces deux algorithmes, appelés LBCS paresseux et LBCS stochastique, s’étendent à des problèmes importants et cliniquement pertinents, tels que l’IRM parallèle 3D et dynamique, qui étaient inaccessibles à LBCS. Nous montrons également que les réseaux antagonistes génératifs (GAN), utilisés pour modéliser la distribution a posteriori dans les problèmes inverses, fournissent un critère naturel pour l’acquisition adaptative en utilisant leur variance dans le domaine de la mesure pour guider la procédure d’acquisition. Deuxièmement, nous cherchons à approfondir la compréhension du type de structures ou d’hypothèses qui permettent aux algorithmes de conception de masques d’être performants en pratique. En particulier, nos expériences montrent que les approches de pointe basées sur l’apprentissage par renforcement (RL) profond, qui ont la capacité v Résumé d’adapter leurs trajectoires au patient à la volée et d’effectuer une planification à long terme, apportent au mieux une amélioration marginale par rapport à LBCS stochastique, qui n’est ni adaptatif ni ne fait pas de planification à long terme. Dans l’ensemble, nos résultats suggèrent que des méthodes comme LBCS stochastique offrent des alternatives prometteuses au RL profond. Elles brillent notamment par leur extensibilité et leur efficacité de calcul et pourraient être déterminantes dans le déploiement de trajectoires d’acquisition optimisées en IRM cartésienne. Mots-clé : imagerie par résonance magnétique, planification d’expériences, problèmes inverses, acquisition comprimée, apprentissage par renforcement, apprentissage profond, réseaux antagonistes génératifs. vi Contents Acknowledgements i Abstract (English/Français) iii 1 Introduction 1 2 MRI Background 2.1 MRI physics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.1 Magnetization . . 
2.1.2 Excitation
2.1.3 Relaxation
2.1.4 Image acquisition
2.1.5 Parallel imaging
2.1.6 Dynamic MRI
2.2 Mathematical description
2.2.1 Accelerated MRI
3 Reconstruction methods
3.1 Model-driven approaches
3.1.1 Compressed Sensing
3.2 Data-driven approaches
3.2.1 Dictionary learning
3.2.2 Unrolled neural networks
3.2.3 Deep image prior
3.2.4 A Bayesian interpretation of reconstruction
3.2.5 Generative approaches
3.3 Trade-offs
4 Optimizing sampling patterns
4.1 A taxonomy of mask design methods
4.2 Compressed sensing and MRI: Model-based, model-driven approaches
4.2.1 The contribution of Lustig et al. (2007)
4.2.2 Model-based, model-driven sampling in the sequel
4.3 Towards learning-based, data-driven approaches
4.4 A matter of performance: how do these methods compare to each other?
4.5 The advent of deep learning: Learning-based, data-driven approaches
4.6 Formalizing learning-based, data-driven mask design
4.6.1 Non-adaptive sampling masks
4.6.2 Adaptive sampling masks
4.6.3 On the optimality of the discrete mask optimization problem
5 Scalable learning-based sampling optimization
5.1 Learning-based Compressive Sensing (LBCS)
5.2 LBCS motivation: Submodularity
5.3 Scaling up LBCS
5.3.1 Limitations of LBCS
5.3.2 Stochastic LBCS
5.3.3 Lazy LBCS
5.4 Experiments - Stochastic LBCS
5.4.1 Experimental setting
5.4.2 Comparison of greedy algorithms
5.4.3 Single coil results
5.4.4 Multicoil experiment
5.4.5 Large scale static results
5.5 Experiments - Lazy LBCS
5.5.1 2D multicoil setting
5.5.2 3D multicoil setting
5.6 Discussion and Conclusion
6 On the benefits of deep RL in accelerated MRI sampling
6.1 Reinforcement Learning to optimize sampling trajectories
6.1.1 A short RL primer
6.1.2 The special case of MRI
6.1.3 Double Deep Q-Networks
6.1.4 Policy Gradient
6.1.5 LBCS as a simple RL policy
6.1.6 The questions at hand
6.2 The MRI data processing pipeline
6.3 Re-examining deep RL for MRI sampling
6.3.1 Experimental setting
6.3.2 Main results on Bakker et al. (2020)
6.3.3 Main results on Pineda et al. (2020)
6.4 Discussion
6.5 Relation to other works
6.6 Conclusion
7 Uncertainty driven adaptive sampling via GANs
7.1 Generative adversarial networks (GANs) for posterior modeling
7.1.1 Generative adversarial networks (GANs)
7.1.2 Conditional GANs
7.1.3 Conditional GANs for posterior modeling
7.1.4 Learning the posterior or learning moments
7.2 GAN based adaptive sampling
7.3 Comparison with Zhang et al. (2019b)
7.4 Experiments
7.4.1 MRI adaptive sampling
7.4.2 Beyond Fourier sampling: image-domain conditional sampling
7.5 Related works
7.6 Discussion
7.7 Conclusion
8 Conclusion and future works
A Appendix for Chapter 6
A.1 Implementation details
A.2 Further experiments and results on Bakker's setting
A.2.1 Data processing differences
A.2.2 Exact reproduction of Bakker et al.
A.2.3 Further results in the Bakker experiment
A.3 LBCS vs long-range adaptivity
A.4 LBCS complexity and parameters
A.5 Visual comparisons
A.6 Detail on the practical recommendations
B Appendix for Chapter 7
B.1 Implementation details
B.1.1 Knee experiments
B.2 Extended results
B.2.1 Additional MRI knee results
B.2.2 Brain experiment
B.2.3 Additional MNIST results
B.2.4 CIFAR10 experiment
Bibliography

1 Introduction

Since its inception in the 1970s, Magnetic Resonance Imaging (MRI) has deeply impacted radiology and medicine. It allows for non-invasive, non-ionizing imaging of the body and makes it possible, among other things, to visualize anatomical structures, physiological functions and metabolic information with unmatched precision (Wright, 1997; Feng et al., 2016b).

Magnetic Resonance Imaging, as its name indicates, constructs an image of an object of interest by exploiting magnetic resonance. Namely, when subjected to a strong static magnetic field B0, the protons in the object, mostly in water molecules, will resonate when excited through a radio frequency pulse. In order to encode spatial information in the observed signal instead of receiving a global response, additional time- and spatially-varying magnetic fields, referred to as gradient fields, are superimposed on the original magnetic field B0. However, this process does not give direct access to an image, but rather informs us about the different frequencies that constitute it. By varying the gradient fields, it is possible to obtain frequency information relating to different locations and to represent it in what is known as the Fourier space, or k-space. Figure 1.1 shows the difference between an image and its Fourier space representation. The object of interest is excited several times using different varying gradient fields, which defines a k-space trajectory along which data are acquired. When sufficient information has been acquired, an image can be readily obtained by computing the inverse Fourier transform of the signal.

There is considerable flexibility in the design of k-space trajectories, but a particularly common approach is to cover the k-space in a grid-like fashion, known as Cartesian sampling. A grid-like covering of k-space implies that the image can be directly obtained by applying a fast Fourier transform (FFT). Other approaches, known as non-Cartesian, do not cover k-space in a grid-like fashion, but can consist of radial spokes (Lauterbur, 1973), spirals (Meyer et al., 1992) or space-filling curves (Lazarus et al., 2019). They can provide a faster acquisition than Cartesian sampling and benefits such as resistance to motion artifacts, but require a more involved and slower reconstruction through techniques such as gridding (O'sullivan, 1985; Jackson et al., 1991) or the non-uniform fast Fourier transform (NUFFT) (Fessler and Sutton, 2003), as well as density compensation (Pipe and Menon, 1999; Pauly, 2007). In addition to this, non-Cartesian trajectories can be sensitive to errors in the magnetic field gradients, due to field imperfections of various sources (Vannesjo et al., 2013). As a result, Cartesian trajectories have been the most popular in practice (Lustig et al., 2008; Feng et al., 2016b).

Figure 1.1: Fourier domain and image domain representation of a knee MRI scan.

However, MRI examinations suffer from being relatively slow compared to other imaging modalities, due to the sequential nature of the acquisition. For instance, MRI acquisition can easily exceed 30 minutes (Zbontar et al., 2019).
Improving imaging speed has been a major concern for researchers in the community and has led to many innovations over the years, such as rapid imaging sequences (Frahm et al., 1986; Hennig et al., 1986) or parallel/multi-coil imaging (Sodickson and Manning, 1997; Pruessmann et al., 1999; Griswold et al., 2002a). But there is a need to further accelerate imaging. As scanning time is directly related to the number of acquisitions, collecting fewer of them directly enables faster imaging. In Cartesian imaging, this can be achieved by undersampling the Fourier space, which is implemented by not acquiring some lines in Fourier space. However, as we see in Figure 1.2, such undersampling degrades the quality of the image by introducing artifacts. This is called aliasing. As a result, an additional reconstruction step is required in order to obtain a cleaner estimation of the original image.

The problem of reconstructing data from undersampled measurements is a typical instance of an ill-posed inverse problem. It is called an inverse problem because we aim at constructing an image of an object of interest from indirect measurements, in this case frequency information about the image. It is ill-posed because there is not sufficient information to directly retrieve the image of interest from the undersampled measurements. More precisely, there are infinitely many signals that could produce these partial measurements.

Figure 1.2: Accelerated MRI by undersampling. Undersampling is governed by a sampling mask that tells what locations are acquired. The undersampled Fourier spectrum is then mapped back to the image domain through an inverse Fourier transform, which gives an aliased image that needs to go through a reconstruction method in order to obtain a de-aliased estimation of the original, unobserved image. Here ifft stands for inverse fast Fourier transform.

The typical way of solving this problem is by exploiting prior knowledge about the data, by leveraging some structure that is present in them. A particularly successful approach relies on sparsity-based priors, where it is assumed that if the signal can be represented sparsely, or compactly, in some transform domain, then it can be reconstructed from undersampled measurements. This approach is known as Compressed Sensing (CS) (Donoho, 2006; Candès et al., 2006), and has been successfully applied to MRI (Lustig et al., 2008), for instance by representing the signal sparsely in the wavelet domain. The signal is retrieved by solving an optimization problem with two components: a data-fidelity term that ensures that the reconstructed image is close to the measurements, and a regularization term that encourages the reconstructed image to be sparse in some domain.

In the latter half of the 2010s, deep learning-based approaches pushed the boundaries of the state of the art in MRI reconstruction even further (Sun et al., 2016; Schlemper et al., 2017; Mardani et al., 2018). By leveraging increasingly large amounts of data, these methods are able to learn the structure directly from data, rather than imposing it through a mathematical model such as sparsity. These methods not only improved the quality of reconstruction, but also enabled fast reconstruction times, whereas CS methods often relied on slow iterative procedures.
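To make the pipeline of Figure 1.2 concrete, the following NumPy sketch retrospectively undersamples a fully sampled image by masking phase-encoding lines in k-space and zero-filling the rest before the inverse FFT. It is only an illustration of the undersampling step described above, not code from this thesis; the function and variable names (undersample, img, acceleration, center_fraction) are placeholders.

    import numpy as np

    def undersample(img, acceleration=4, center_fraction=0.08, seed=0):
        # Keep a random subset of phase-encoding (ky) lines plus the low-frequency
        # center, zero-fill the rest, and return the aliased zero-filled image.
        rng = np.random.default_rng(seed)
        ny, nx = img.shape
        kspace = np.fft.fftshift(np.fft.fft2(img))        # fully sampled Cartesian k-space
        mask = rng.random(ny) < 1.0 / acceleration        # random lines at the target rate
        n_center = int(center_fraction * ny)
        mask[ny // 2 - n_center // 2: ny // 2 + n_center // 2] = True  # keep the center lines
        zero_filled = np.fft.ifft2(np.fft.ifftshift(kspace * mask[:, None]))
        return mask, np.abs(zero_filled)                  # aliased image, to be de-aliased

The zero-filled output corresponds to the "undersampled image" of Figure 1.2; it is the input that a reconstruction method would then try to de-alias.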
However, in parallel to these developments, only a few works tackled the problem of how the k-space should be undersampled. Although CS prescribed random sampling, practitioners quickly turned to heuristics favoring low-frequency sampling, where a lot of energy and information about the structure of the signal is located (Lustig et al., 2008). Nevertheless, there were only a few works in the early 2010s aiming at directly optimizing the sampling mask¹, which can be attributed to the difficulty of the problem. Optimizing a sampling mask is a combinatorial problem, for which commonly used gradient-based optimization techniques are not readily available. As a result, two main trends emerged: approximate solutions to the combinatorial problem (Ravishankar and Bresler, 2011a; Liu et al., 2012), or the optimization of an underlying probability distribution from which a random mask is then sampled (Knoll et al., 2011b; Chauffert et al., 2013).

In more recent years, there has been a renewed interest in this problem, and a flurry of approaches have been proposed (Gözcü et al., 2018; Jin et al., 2019; Sherry et al., 2019; Bahadir et al., 2019; Huijben et al., 2020; Pineda et al., 2020; Bakker et al., 2020; Yin et al., 2021). This is mainly due to the wide availability of training data, the fast reconstruction times of deep learning-based methods and the wider use of auto-differentiation frameworks such as PyTorch and TensorFlow, which facilitate the routine computation of derivatives through complex models. While these methods are very diverse, they fall under two distinctive categories. Whereas previous approaches leveraged an underlying model from which candidate sampling masks were drawn, modern approaches are learning-based in the sense that they learn this model directly from data. Furthermore, earlier methods were model-driven, designing masks to minimize a mathematical criterion such as coherence (Lustig et al., 2008). In contrast, recent works are data-driven, as they learn policies that minimize the reconstruction error against a ground truth.

In this thesis, we aim at contributing to the problem of optimizing sampling masks along two complementary directions. First, we will provide novel methodological contributions to the problem of sampling optimization. More specifically, we aim at developing mask optimization algorithms that do not depend on a specific reconstruction algorithm, as this will allow the method to remain usable in a field where the state of the art evolves very quickly. In addition, we are concerned with designing scalable methods. Most clinically relevant settings involve high-resolution, multi-coil images, and it is imperative for sampling optimization methods to scale up to such challenging settings in order to provide practical value. Secondly, we aim at deepening the understanding of the kind of structures or assumptions that enable mask design algorithms to perform well in practice. It is not clear from the current literature what components are fundamental in a mask design algorithm. Should the mask design algorithm plan several steps ahead? Should it adapt to each different patient? What kind of feedback should be available to the mask design algorithm in order to achieve the best performance? Amidst the race to develop ever more complex models, we aim at identifying the key components that drive the performance of modern mask design algorithms, in order to efficiently reach state-of-the-art performance.
¹ We will refer to this problem interchangeably as the problem of mask design, sampling optimization, mask optimization, sampling mask optimization, sampling pattern optimization, or optimizing the acquisition trajectory.

Figure 1.3: Outline of the different chapters in this thesis. Grey boxes refer to chapters dealing with background material. Red boxes denote chapters with methodological contributions, and blue boxes denote chapters that contribute to a better understanding of the problem of optimizing sampling. Arrows show particular, direct dependencies between chapters.

Organization and contributions

Figure 1.3 illustrates the organization of this thesis, and each chapter is briefly described below, along with their respective contributions.

Chapter 2 – MRI Background
In this chapter, we provide a brief introduction to the physics underlying image acquisition in MRI, focusing on Cartesian sampling. We introduce multi-coil acquisition as well as dynamic imaging. We then provide the mathematical model that will be used to describe the acquisition in accelerated MRI.

Chapter 3 – Reconstruction methods
In this chapter, we review the algorithms used to reconstruct images from undersampled measurements. We describe the shift that has taken place from model-driven approaches based on Compressed Sensing (Lustig et al., 2007) towards data-driven approaches that construct a model from training data. We then examine the different trends within data-driven approaches, focusing on trained reconstruction models and generative approaches. Finally, we conclude by highlighting the trade-offs between the speed of reconstruction and the dependence on large training datasets.

Chapter 4 – Optimizing sampling patterns
In this last introductory chapter, we review the different approaches that have been taken to optimize sampling masks.
We argue that, in a similar way to how reconstruction methods moved from model-driven to data-driven approaches, a double shift has taken place in mask design. We show that mask design methods gradually moved from model-based, model-driven to learning-based, data-driven approaches. We discuss how this trend can be seen in the literature, and how it ties in with the empirical performance of these methods. Then, we proceed to formalizing the problem of learning-based, data-driven mask design and show how this problem changes when one aims at designing fixed policies – where a mask is trained and re-used as is at test time – or adaptive policies – which adapt to the current patient by integrating information on the fly as it is acquired. We conclude by proving the optimality of the discrete mask optimization problem, compared to the problem of finding the optimal sampling distribution.

Related publication. Sanchez, T., Gözcü, B., van Heeswijk, R. B., Eftekhari, A., Ilıcak, E., Çukur, T., and Cevher, V. (2020a). Scalable learning-based sampling optimization for compressive dynamic MRI. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8584–8588.

Chapter 5 – Scalable learning-based sampling optimization
In this chapter, we extend the Learning-Based Compressive Sensing (LBCS) approach of Gözcü et al. (2018). We primarily aim at drastically improving the scalability of their method. We first discuss how their approach is rooted in submodular optimization (Krause and Golovin, 2014) and how algorithms proposed in this field could provide solutions to the limitations of LBCS. Their method scales as O(P²), where P is the resolution of the image; such a complexity is prohibitive in high-dimensional settings such as 3D MRI or dynamic MRI, and is not convenient when reconstruction becomes slow, such as in multi-coil MRI. To address this, we first propose lazy LBCS (lLBCS) for the multi-coil and 3D settings. This algorithm leverages lazy evaluations (Minoux, 1978), where a list of upper bounds is precomputed and then traversed sequentially, drastically reducing the amount of computation over LBCS. Secondly, in the context of dynamic (multi-coil) MRI and large datasets, we propose stochastic LBCS (sLBCS) and show that it can achieve a very significant computational gain compared to LBCS, while retaining its performance.

Related publications. Gözcü, B., Sanchez, T., and Cevher, V. (2019). Rethinking sampling in parallel MRI: A data-driven approach. In 27th European Signal Processing Conference (EUSIPCO).
Sanchez, T., Gözcü, B., van Heeswijk, R. B., Eftekhari, A., Ilıcak, E., Çukur, T., and Cevher, V. (2020a). Scalable learning-based sampling optimization for compressive dynamic MRI. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8584–8588.

Chapter 6 – On the benefits of deep RL in accelerated MRI sampling
In this chapter, we assess the performance of state-of-the-art (SotA) deep reinforcement learning (RL) methods for mask design. These methods improve in principle over LBCS by designing policies that adapt to the patient at hand and leverage long-horizon planning in order to make better decisions. We mainly aim at making sense of seemingly contradictory conclusions. The work of Pineda et al.
(2020) seems to indicate that long-term planning could be the most important component in deep RL, as their results show that a non-adaptive, long-term planning policy model trained on the dataset can perform as well as an adaptive, long-term planning policy. On the contrary, the contribution of Bakker et al. (2020) highlights the importance of adaptivity, as a greedy policy that does not do long-term planning is found to closely match policies that do. Our results, surprisingly, show that sLBCS, a non-adaptive method that does not attempt long-term planning, can perform as well as the state-of-the-art approaches of Pineda et al. (2020) and Bakker et al. (2020). We further show that the current SotA RL methods bring at best a limited added value, and that other factors such as the architecture of the reconstruction algorithm or the masks used to train it have a much larger impact on the overall performance. This work highlights the need for further discussions in the community about standardized metrics, strong baselines, and careful design of experimental pipelines to evaluate MRI sampling policies fairly.

Related preprint. Sanchez, T.∗, Krawczuk, I.∗ and Cevher, V. (2021). On the benefits of deep RL in accelerated MRI sampling. Available at https://openreview.net/pdf?id=fRb9LBWUo56. ∗ denotes equal contribution.

Chapter 7 – Uncertainty driven adaptive sampling via GANs
In this chapter, we take a step back from the question of designing the best sampling policy, and primarily aim at exploring how conditional Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) can be used to model inverse problems in a Bayesian fashion. GANs have been used mainly as reconstruction models in the context of inverse problems (Yang et al., 2018; Chen et al., 2022), but a few works took an additional step to show the value of learning the full posterior distribution in order to be able to provide uncertainty quantification as well (Adler and Öktem, 2018a). We argue that they do more, and that conditional GANs for MRI reconstruction can naturally provide a criterion for adaptive sampling: by sequentially observing the location with the largest posterior variance in the measurement domain, the GAN provides an adaptive sampling policy without ever being trained for it. Such a criterion is not restricted to acquisition in the Fourier domain, as in MRI, but can also be readily used for image-domain sampling. In this context, we show that our GAN-based policy can strongly outperform the non-adaptive greedy policy from sLBCS. However, in the Fourier domain, the GAN-based policy does not match the performance of sLBCS. We provide an explanation for this phenomenon rooted in the concept of the information horizon used to make a decision at each step. We show that our model uses less information than LBCS to design its policy, and argue that this is the reason leading to its inferior performance. However, when compared with models that use the same amount of information to inform their policy, our model largely outperforms the competition.

Related preprint and workshop paper. Sanchez, T., Krawczuk, I., Sun, Z. and Cevher, V. (2019). Closed loop deep Bayesian inversion: Uncertainty driven acquisition for fast MRI. Preprint available at https://openreview.net/pdf?id=BJlPOlBKDB.
Sanchez, T., Krawczuk, I., Sun, Z. and Cevher, V. (2020). Uncertainty-Driven Adaptive Sampling via GANs. In NeurIPS 2020 Workshop on Deep Learning and Inverse Problems.
Available at https://openreview.net/pdf?id=lWLYCQmtvW.

Chapter 8 – Conclusion and future works
In this last chapter, we review the contributions of the different chapters and summarize the insights that our contributions bring to the design of masks for Cartesian MRI sampling. We then provide an outlook on the opportunities for future work.

Remark (Bibliographic note). At the end of Chapters 4, 5, 6 and 7, we include a "bibliographic note" section, where we distinguish the contributions of the co-authors of the abovementioned publications.

Additional publication not included in this thesis. Sun, Z., Latorre, F., Sanchez, T., and Cevher, V. (2021). A plug-and-play deep image prior. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8103-8107.

2 MRI Background

Before diving in greater detail into the problems of reconstruction from undersampled measurements and optimization of the sampling masks, we first present the physical underpinnings of MRI. Although this survey will be very short, it hints at the remarkable depth and flexibility of MRI. This introduction is based upon the well-known textbook of Brown et al. (2014); part of its content was presented in Sanchez (2018).

2.1 MRI physics

2.1.1 Magnetization

Let us consider a proton under a strong external magnetic field B0. In practice, such a field is obtained by running a large electric current through a coil around the scanner. A high current is required because typical magnetic fields in MRI are very large, around 3 T, and the strength of the magnetic field is directly proportional to that of the electric current (for comparison, the strength of the magnetic field of the earth is typically between 40 and 70 µT). The spin of a proton, which is positively charged, can be represented as a gyroscope precessing about the field direction, as illustrated in Figure 2.1. The precession angular frequency can be quantified as the Larmor frequency ω0, with

ω0 = γ B0,  (2.1)

where γ is known as the gyromagnetic ratio, and is a particle-dependent constant. For instance, in water, the hydrogen proton has γ̄ ≜ γ/(2π) ≈ 42.6 MHz T⁻¹. Note that deriving this formula is involved and requires tools from relativistic quantum mechanics.

Figure 2.1: Left: Illustration of the precession of a particle when subject to an external field B0 (large green arrow). The small red arrow corresponds to the spin of the particle, and its temporal evolution is illustrated by the black arrow. Right: Illustration of the transverse magnetization appearing when a body is subject to an RF pulse, due to the protons precessing synchronously and reaching the higher energy level, which results in arrows pointing downwards. This figure was first used in Sanchez (2018).

This description considers a single hydrogen nucleus in a static magnetic field, but in practice, we are dealing with an entire population. Depending on their energy levels, some protons of a population will align parallel to the external field B0, while others will take an antiparallel orientation. Although the number of spins parallel to the magnetic field only slightly exceeds the number of spins antiparallel to it, this spin excess is sufficient to yield a noticeable net magnetization M0 at the population level.
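(As a brief aside, added here to fix orders of magnitude for Equation (2.1) using the values given above: at a clinical field strength of B0 = 3 T, hydrogen precesses at

f0 = γ̄ B0 ≈ 42.6 MHz T⁻¹ · 3 T ≈ 128 MHz,

i.e. at radio frequencies, which is why the excitation discussed in Section 2.1.2 is performed with radiofrequency pulses.)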
M0 will be parallel to B0 because the different protons precess with incoherent phases, so that their transverse components cancel each other out on average, leaving a net magnetization parallel to the static field. We call this resulting state the equilibrium of the system, which is illustrated in the middle of Figure 2.1, where in this case M0 = Mz. However, this net magnetization cannot be detected at equilibrium: we need to excite the protons in order to induce a change in the magnetization, which in turn will induce an electrical current in a receive coil. This coil allows us to record a signal corresponding to the response of the protons to the excitation. The different components of the MRI scanner that will be discussed throughout this chapter are illustrated in Figure 2.2.

Figure 2.2: Illustration of an MRI scanner and the different coils and components required to produce an image.

2.1.2 Excitation

Let us denote by ẑ the direction of the external field B0, and let us consider a body that we wish to image, for instance the knee of a patient. It is clear from the previous section that the hydrogen nuclei of the body will produce a net magnetization M0 = M0 ẑ, where M0 denotes the intensity of magnetization. The atoms will precess around ẑ at the Larmor frequency ω0.

In the excitation step, the system is destabilized by sending a radiofrequency (RF) pulse. It is easier to describe it and consider its effect by placing ourselves in a frame rotating at an angular frequency ω0 around the axis ẑ. This gives the following rotating unit vectors x̂′, ŷ′:

x̂′ = cos(ω0 t) x̂ − sin(ω0 t) ŷ
ŷ′ = sin(ω0 t) x̂ + cos(ω0 t) ŷ

and ẑ remains unchanged. We define the RF pulse as a circularly rotating pulse of intensity B1, namely B1 = B1 x̂′. Applying this RF pulse will cause the nuclei to resonate and flip the equilibrium magnetization M0 towards the x̂′ − ŷ′ plane, resulting in a new net magnetization M. M will now have both a transverse component in the x̂′ − ŷ′ plane, and a longitudinal component along the ẑ axis. We refer to the angle between M and the ẑ-axis as the flip angle (FA); it quantifies how strongly the magnetization M has been moved away from its equilibrium state M0. This process is illustrated on the right of Figure 2.1. In particular, we see that the transverse component arises due to the synchronization of the precessing phases of the hydrogen nuclei, which cancelled each other out at equilibrium.

2.1.3 Relaxation

After applying the RF pulse, the system will gradually return to equilibrium, where M0 is aligned with B0. This will require the nuclei to re-emit the energy that they absorbed during the excitation phase. The relaxation is governed by two main components, namely a longitudinal relaxation, characterized by a time constant T1, and a transverse relaxation, characterized by T2.

T1 Relaxation. The T1 relaxation describes the regrowth of the longitudinal component of the magnetization, i.e. how the ẑ component of M grows back over the relaxation. This is also known as spin-lattice relaxation, as this describes a local process due to the interaction of the spins with their surroundings.
The process is formally described as

Mz(t) = M0 (1 − exp(−t/T1)) + Mz(0) exp(−t/T1),  (2.2)

where Mz(t) is the longitudinal magnetization, Mz(0) describes the amount of longitudinal magnetization left after the RF pulse, and T1 is a constant that depends on the type of tissue considered. We see that the longitudinal magnetization will be recovered at an exponential speed, and the speed will be determined by T1.

T2 Relaxation. The T2 relaxation process describes how clusters of spins gradually lose phase with each other after excitation. This dephasing will cause the transverse magnetization Mx̂′,ŷ′, or M⊥, to progressively disappear as the equilibrium state is reached (cf. middle of Figure 2.1). We have

M⊥(t) = M0 exp(−t/T2),  (2.3)

where T2 is a tissue-dependent time constant that describes how fast the transverse magnetization vanishes. In practice, T2 ≪ T1, making the T2 relaxation much faster. Also, in practice, the signal is likely to suffer additional suppression due to dephasing from inhomogeneities in the external field B0.

As T1 and T2 are tissue-dependent constants, constructing images by performing measurements at different steps of the relaxation makes it possible to highlight different characteristics of tissues. They are one of the key sources of the flexibility of MR imaging.

2.1.4 Image acquisition

Until now, we have discussed the global response of a sample to an RF pulse, but we have not addressed how to encode spatial information into the signal, in order to be able to construct an image. As previously discussed, the signal will be recorded by a receive coil, where an electric current is induced depending on the relaxation of the spins. But the question of localizing the signal remains entirely open so far. To solve this issue, we will try to give each location in the body a unique combination of phase and frequency, in order to be able to retrieve its contribution from the signal in the receive coil. This is achieved by introducing three additional coils to the scanner, known as gradient coils, whose role is to produce spatially varying magnetic fields in the x̂, ŷ and ẑ directions respectively. As a result, the tissues of specific areas of the object will be excited, and this will allow us to observe their local responses. Then, by repeating a measurement with different gradients, we will be able to sequentially construct a map of how the tissue responds to the RF pulses depending on its location. This is why an acquisition is often referred to as a pulse sequence, as we gather a sequence of measurements in order to obtain a full image. There exist several ways to construct an image with MRI; we will illustrate how it works by considering a 2D gradient echo sequence and explain its different components. Sequences are frequently represented as diagrams, such as the one in Figure 2.3, where each line represents how a different component varies in intensity through time.

If we simply excite a body with an RF pulse, neglecting the relaxation effects, we can write the signal observed in the receive coil as

S(t) = ∫ ρ(x) e^{i(ω0 t + ϕ(x,t))} dx,  (2.4)

where ρ(x) is the effective spin density, the quantity that we wish to measure, and ϕ(x, t) is the accumulated phase

ϕ(x, t) = − ∫_0^t ω(x, t′) dt′.  (2.5)

We see that in Equation 2.4, without additional considerations, the signal that we observe is integrated over the whole volume being imaged.
In Equation 2.5, the main reason that ω is now a function of space and time is that we will add spatially varying gradient fields that in turn make ω dependent on space and time. One can also see from Equation 2.4 that measurements will not be provided directly in image space, but rather in Fourier space, often referred to as k-space: MRI allows us to obtain frequency information about our object of interest, and then an image can be obtained by mapping these measurements back into image space.

Slice Selecting (SS) Gradient. As we see in Figure 2.3, during the RF pulse, a constant z-gradient is applied during a time τRF with intensity GSS, and reversed afterwards. We will not detail here why this gradient is reversed, but this has the effect of making all components in the slice in phase, with a common accumulated phase ϕ = 0. This makes it possible to select the slice of interest and make sure that the signal will be strong for the acquisition. As a result, for a slice at location z0 with thickness Δz, the signal will be integrated over the slice only, i.e. we will have

S(t1) = ∫∫ [ ∫_{z0 − Δz/2}^{z0 + Δz/2} ρ(x, y, z) dz ] dy dx ≈ ∫∫ ρ(x, y, z0) dy dx.

Phase Encoding (PE) Gradient. Let us now add the phase-encoding y-gradient to the picture. Assume that it is applied during a time τPE at intensity Gmin ≤ Gy ≤ Gmax. This will result in the spins acquiring a y-dependent phase, and as a result

S(t2) = ∫∫ ρ(x, y, z0) e^{−i 2π γ̄ Gy τPE y} dy dx.

Figure 2.3: A 2D gradient echo sequence diagram. Each line describes how different components interact during the time of a single excitation. On the first line, we see the RF pulse enabled at t = 0. On the second line, we have the slice selecting gradient Gz,SS, on the third line we have the phase encoding gradient Gy,PE, on the fourth we have the readout gradient Gx,R, and finally the analog-to-digital converter (ADC).

Readout (R) Gradient. Finally, the x-gradient is first turned on negatively, dephasing the spins, followed by a rephasing stage where the observation through the ADC is carried out. It is enabled at the strength Gx and for a duration TS. Given the shifted variable t′ = t − TE, we have as a result

S(t′) = ∫ [ ∫ ρ(x, y, z0) e^{−i 2π γ̄ Gy τPE y} dy ] e^{−i 2π γ̄ Gx t′ x} dx,  −TS/2 ≤ t′ ≤ TS/2.  (2.6)

Equation 2.6 is very interesting, as one can rewrite the (x, y)-directions as spatial frequencies depending on t′ and τPE, namely (kx, ky) = (γ̄ Gx t′, γ̄ Gy τPE). This in turn enables us to rewrite Equation 2.6 as

S(kx, ky) = ∫∫ ρ(x, y, z0) e^{−i 2π (kx x + ky y)} dx dy,  (2.7)

which turns out to be the Fourier transform of the spin density. More precisely, the signal obtained in the presence of a gradient echo structure in the readout direction, after its encoding by a fixed Gy, describes a line in the k-space of the Fourier transform of ρ(x). This implies that by acquiring several lines at different ky by varying Gy, we can effectively cover the k-space, as illustrated in Figure 2.4. As a result, when the k-space has been sufficiently densely covered, the spin density can be recovered by taking the inverse Fourier transform of Equation 2.7, i.e.

ρ(x, y, z0) = ∫∫ S(kx, ky) e^{+i 2π (kx x + ky y)} dkx dky.  (2.8)

This approach is referred to as Cartesian sampling because it covers k-space in a grid-like fashion, as illustrated in Figure 2.4.
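As a concrete, discrete illustration of Equations (2.7) and (2.8), the following toy NumPy sketch (added for illustration; the array rho is a stand-in for the spin density of one slice, and the sizes are arbitrary) builds k-space one phase-encoding line at a time and recovers the spin density with a single inverse FFT once the Cartesian grid is fully covered:

    import numpy as np

    Ny, Nx = 64, 64
    rho = np.random.rand(Ny, Nx)                 # stand-in spin density of one slice
    y = np.arange(Ny)

    kspace = np.zeros((Ny, Nx), dtype=complex)
    for ky in range(Ny):
        # One excitation with a fixed phase-encoding gradient: every y-position
        # carries the phase -2*pi*ky*y/Ny, and the readout then samples kx along x.
        phase_encoded = (rho * np.exp(-2j * np.pi * ky * y / Ny)[:, None]).sum(axis=0)
        kspace[ky, :] = np.fft.fft(phase_encoded)          # one line of k-space

    assert np.allclose(kspace, np.fft.fft2(rho))           # line-by-line acquisition = 2D DFT
    rho_rec = np.fft.ifft2(kspace)                         # discrete analogue of Equation (2.8)
    assert np.allclose(rho_rec.real, rho)                  # exact recovery from full coverage

This exactness holds only when every line is acquired; it is precisely what is lost once lines are skipped, which is the starting point of the accelerated MRI setting considered in this thesis.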
Note that there exist different sequences that can perform Cartesian sampling, and even more sequences that are not Cartesian. We refer to such approaches as non-Cartesian sampling approaches; they can include radial sampling (Lauterbur, 1973), spiral sampling (Meyer et al., 1992) or even some more flexible coverages of k-space (Lazarus et al., 2019). Cartesian MRI has the advantage over these methods of enabling the use of the fast Fourier transform to compute the inversion of Equation 2.8, which is not the case for non-Cartesian methods. These approaches require a more sophisticated inversion, involving gridding (O'sullivan, 1985) and non-uniform Fourier transforms (Fessler and Sutton, 2003). The differences are however deeper than merely the inversion technique: radial trajectories are known to be more robust to motion artifacts, while spiral acquisitions enable fast imaging but are prone to other artifacts; Lustig et al. (2008) discuss these issues in greater depth.

Figure 2.4: Traversal of k-space for a gradient echo pulse sequence. The combined application of the phase encoding gradient Gy,PE and readout gradient Gx,R makes the signal move in k-space to the desired location in a diagonal trajectory. The phase encoding gradient is then stopped, and the readout gradient is reversed, bringing the spins back through a complete set of kx values while keeping ky constant. The experiment is then repeated by returning to a different ky value until the whole of k-space has been covered in a grid-like fashion.

2.1.5 Parallel imaging

The need to cover k-space through sequential readouts makes MR imaging a slow imaging technique, and the quality of the images is greatly dependent on the absence of motion during the acquisition. This also implies that for body MRI, the patient should not breathe during the whole acquisition process, as this motion would yield a reconstructed image with many undesired artifacts (Glockner et al., 2005). This has motivated further research into ways to accelerate the acquisition.

A first step in this direction is the use of parallel imaging. This technique relies on a multicoil MRI acquisition, which was introduced at the beginning of the 1990s by Roemer et al. (1990). The idea was to use several receive coils positioned at different locations instead of a single large one. The coils record different signal intensities depending on their spatial location, which makes it possible to achieve a larger signal-to-noise ratio (SNR) compared to a single-coil acquisition. It was subsequently observed that, instead of increasing the SNR compared to a single-coil acquisition, the geometry of the coil array could be exploited to help accelerate MRI by resolving the aliasing due to spacing out successive k-space phase encodes by a factor R, known as the acceleration factor (Sodickson and Manning, 1997). This is what is known as parallel imaging. However, this acceleration introduces a tradeoff between acceleration and SNR, as increasing R will reduce the SNR by a factor √R. Reconstructing the image also becomes non-trivial, as the images from each coil have a decreased field of view (FoV), resulting in aliasing when extended to the full FoV, as illustrated in Figure 2.5.

Figure 2.5: Example of parallel imaging on a cardiac image, simulated with three coils, whose sensitivities are three superimposed rectangles, each covering one third of the image. First column: combination via the sum of squares (SoS) technique, where the images from each coil are simply summed weighted by their modulus.
Second column: Reconstruction taking into account the sensitivity maps, allowing for precise reconstruction for an acceleration rate R up to 3. Third column: Image reconstructed for the second coil when taking into account its sensitivity. Fourth column: k-space of the second channel at different undersampling rates.

Various approaches have tackled this challenge, but they generally rely on some patient-dependent calibration. For instance, SENSE (sensitivity encoding) (Pruessmann et al., 1999) considered coil sensitivity maps, which describe how each coil responds to a signal at each location of the FoV. They are typically estimated from data from a prescan (Blaimer et al., 2004). GRAPPA (generalized autocalibrating partially parallel acquisitions) (Griswold et al., 2002a) uses a small part of the signal near the center of k-space, called the autocalibration signal, in order to compute how the data from different channels (coils) should be combined. Several reviews (Blaimer et al., 2004; Glockner et al., 2005; Deshmane et al., 2012) examine the differences and similarities of these techniques in greater detail and describe various clinical applications. Calibrationless methods have also been explored in more recent years (Trzasko and Manduca, 2011; Majumdar and Ward, 2012; El Gueddari et al., 2019).

An example of parallel imaging with coil sensitivities. We illustrate in Figure 2.5 the role of sensitivity maps in providing additional information to resolve the aliasing due to the acceleration. This example assumes three coils with box-like sensitivities, each covering a third of the image. We see that in the case R = 1, there is no aliasing, and reconstruction is simply achieved by an inverse Fourier transform. In the case R = 3, adding the information of each coil still makes it possible to resolve the aliasing, but this is not sufficient for R = 4, where the three coils do not provide enough information to resolve the aliasing. We can also see in the last column how increasing R results in acquisition lines getting more spread out in Fourier space. Note that this is an idealized example, as in this case the coil sensitivities do not overlap, which means that summing up the individual images multiplied by their respective sensitivities suffices to reconstruct the image. The SENSE algorithm is more involved and can resolve images with overlapping coil sensitivities.

2.1.6 Dynamic MRI

Before terminating this chapter, we will also cover the case of dynamic MRI, which will be relevant to Chapter 5 of this thesis. Dynamic MRI, as its name suggests, aims at imaging dynamic, moving objects. It can be used for instance for cardiac imaging (Tsao et al., 2003) or vocal tract imaging (Echternach et al., 2010). The data are acquired in k-space at different times, resulting in k-t raw data, and the direct reconstruction is done with a frame-by-frame inverse Fourier transform. In this setting, imaging speed is most critical for several reasons: if one wants to image a fast-moving object precisely, then a high temporal resolution will be required, which presents a challenge for MRI (Tsao et al., 2003). Accelerating the acquisition in this case will not only improve the comfort of the patient, but might also provide higher temporal resolution.
However, the motion between two frames should not be too large, which in practice would require breath-held examinations for Cartesian cardiac imaging (Gamper et al., 2008; Jung et al., 2009), something that is often impractical for patients. In this case, techniques such as radial sampling can be greatly beneficial, as radial spokes continually sample the center of k-space, and motion can then be retrospectively corrected in free-breathing acquisitions (Winkelmann et al., 2007; Uecker et al., 2010; Feng et al., 2014, 2016b). If one wishes to perform Cartesian sampling in this case, higher acceleration rates would be required, and parallel imaging might not be sufficient to resolve aliasing. As a result, one might use a priori structure in the images to constrain the reconstruction and obtain a good quality outcome. This is known as Compressed Sensing (Candès et al., 2006; Donoho, 2006), and will be discussed in Chapter 3.

2.2 Mathematical description

We close this chapter by introducing the mathematical framework that will be used in the sequel. So far, we have described our problem in a continuous fashion, where the signal is a function of a continuous time t or k-space location (kx, ky). However, in a Cartesian acquisition, the data can be represented on a finite grid as a discrete set of values, depending on the total number of readouts. Within each readout, samples are recorded at a time interval ∆t, and so the resolution of k-space can be described as

\[ \Delta k_x = \gamma\, G_x\, \Delta t, \qquad \Delta k_y = \gamma\, \Delta G_{\mathrm{PE}}\, \tau_{\mathrm{PE}}. \]

Note that in order to guarantee an image without aliasing, ∆t and ∆G_PE must be sufficiently small, and their precise values are dictated by the famous Nyquist–Shannon sampling theorem. Assuming a total of Ny phase encoding lines, each containing Nx samples, we can represent the signal S(kx, ky) as a matrix S ∈ C^{Nx×Ny}, which can be written in vectorized form s ∈ C^P, where P = Nx · Ny. In the Cartesian case, the spin density ρ(x, y) is obtained by computing an inverse Fourier transform of s. It is common for the resulting image to be described as ρ ∈ C^P, with the same dimensions as s. The inversion is performed by the inverse Fourier transform F^{-1} applied to s, namely

\[ \rho = F^{-1}(s). \tag{2.9} \]

This operation is often represented as a matrix-vector multiplication, where the operator F^{-1} is made explicit as the inverse FFT matrix F^{-1}; in addition, one can model the imperfections in the measurement as additive Gaussian noise ϵ ∈ C^P, which yields

\[ \rho = F^{-1} s + \epsilon. \tag{2.10} \]

Note that in practice, Equation 2.10 is not evaluated explicitly; rather, the inverse Fourier transform is computed directly using the Fast Fourier Transform (FFT) algorithm.

2.2.1 Accelerated MRI

In the case of accelerated MRI, the full k-space is not observed: in practice, entire phase encoding lines are skipped at once, resulting in an acceleration of the scanning time proportional to the number of lines removed. We represent this operation as masking the unobserved coefficients, which is achieved by defining a diagonal masking matrix Pω that retains the coefficients indexed by the set ω, namely Pω,ii = 1 if i ∈ ω and 0 otherwise. ω is called the sampling mask and has cardinality |ω| = N. ω is typically structured so that entire lines are skipped at once; a formal presentation of this structure will be given in Chapter 5. We refer to the ratio N/P as the sampling rate.
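To make this concrete, the following NumPy sketch builds a line-skipping Cartesian mask and applies it to the spectrum of a toy image; the function name and the chosen line indices are illustrative and not part of the thesis.

```python
import numpy as np

def cartesian_mask(nx, ny, acquired_lines):
    """Boolean mask over an nx-by-ny k-space grid that keeps whole
    phase-encoding lines (columns indexed by ky)."""
    mask = np.zeros((nx, ny), dtype=bool)
    mask[:, list(acquired_lines)] = True
    return mask

# Toy example: 8 out of 32 phase-encoding lines, i.e. a sampling rate N/P = 0.25.
nx, ny = 32, 32
omega = {0, 1, 2, 15, 16, 17, 30, 31}          # a few low-frequency and outer lines
mask = cartesian_mask(nx, ny, omega)

x = np.random.randn(nx, ny)                    # stand-in for an image
s = np.fft.fft2(x, norm="ortho")               # full k-space
y = mask * s                                   # P_omega applied to s (unobserved entries set to 0)
zero_filled = np.fft.ifft2(y, norm="ortho")    # naive reconstruction from the partial data
print("sampling rate:", mask.mean())           # N / P
```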
This masking results in observing the partial image

\[ \rho = F^{-1} P_\omega s + \epsilon. \tag{2.11} \]

Remark (Sampling masks and trajectories). There is a difference between a sampling mask and a sampling trajectory. The former only provides a fixed view of all locations acquired, while the latter comprises a dynamic component: the sampling trajectory tells not only what locations were acquired, but also the temporal ordering of the acquisition, detailing the order in which the samples are acquired throughout the multiple acquisition steps.

In the sequel, we will consider problems where we aim at learning how to reconstruct data from partial Fourier measurements, and how to optimize sampling masks to obtain the best reconstruction quality. In order to be aligned with the commonly used terminology, we will need to redefine some of the notation used until here. We assume that there exists some underlying ground truth image x ∈ C^P from which we acquire partial Fourier measurements (or observations) yω that can be noisy, i.e.

\[ y_\omega = A_\omega x + \epsilon = P_\omega F x + \epsilon. \tag{2.12} \]

This will be our basic acquisition model for the whole of this thesis. Aω is defined as a shorthand for PωF. Note that Equations 2.11 and 2.12 can be related by setting s = Fx and ρ = F^{-1} yω. This reformulation describes the image x as the fundamental unknown, with the partial measurements expressed in the Fourier domain.

Remark. In the case of parallel or multicoil acquisition, Equation 2.12 is slightly adapted to account for acquisition on j = 1, . . . , C coils with different spatial sensitivities,

\[ y_{\omega,j} = A_\omega S_j x + \epsilon_j = P_\omega F S_j x + \epsilon_j, \tag{2.13} \]

where the coil sensitivities Sj can be represented as P × P diagonal matrices in which Sj,ii describes the spatial sensitivity of coil j at the i-th pixel. We will also sometimes use the compact notation

\[ y_\omega = A_\omega S x + \epsilon \tag{2.14} \]

for multicoil MRI; in this case, yω ∈ C^{CP} represents the stacked observations, Aω the coil-wise Fourier transform, and S ∈ R^{CP×P} the stacked sensitivities.

Recovering the ground truth image x from the partial measurements yω is challenging, as the problem is fundamentally ill-posed: there are mathematically infinitely many solutions consistent with yω, but most of them are not physically meaningful. Such problems are often referred to as ill-posed inverse problems, where we aim at recovering a reference image from partial observations that entail a loss of information. In this context, Equation 2.12 is referred to as a forward model. In the next chapter, we will discuss the approaches that have been studied in the literature to construct an estimate x̂ of x from the partial measurements yω.

3 Reconstruction methods

In this chapter, we consider the solutions that have been explored in the literature for the problem of accelerated MRI. We defined this problem formally in Section 2.2.1: we consider partial measurements yω obtained through the forward model of Equation 2.12, and aim at constructing an estimate x̂ of the underlying, unknown ground truth x. The difficulty of this problem resides in its ill-posedness: there exist infinitely many solutions consistent with the observations that are not physically meaningful. To counter this, it is necessary to impart additional information to the problem in order to reach a sensible solution. This was the approach taken in Compressed Sensing (CS) applied to MRI (Lustig et al., 2008).
The ill-posedness of the problem is alleviated by imposing additional structure on it. All subsequent approaches rely on imposing some form of structure to estimate x from yω, but they have relied on increasingly complex, data-driven models. We will trace the evolution from CS approaches, which used simple mathematical models to obtain structure, to recent deep learning-based approaches that construct their models from training data: the field moved from model-driven approaches to data-driven approaches. A more exhaustive presentation of these approaches can be found in Ravishankar et al. (2019); Doneva (2020).

3.1 Model-driven approaches

3.1.1 Compressed Sensing

It has long been known that a band-limited signal can be exactly reconstructed from a set of uniformly sampled frequencies, provided that their density is at least twice the highest frequency of the original signal. This result is famously known as the Nyquist–Shannon sampling theorem. However, sampling at the Nyquist rate can remain expensive, and there has been extensive research to achieve sub-Nyquist rate sampling and reconstruction.

In their seminal works, Donoho (2006); Candès et al. (2006) introduced the formalism of Compressed Sensing (CS), where they proved that by leveraging the structure present in natural images, such as sparsity or low-rankness, one can perfectly reconstruct a signal sampled much below the Nyquist rate. Achieving this requires a non-linear reconstruction method, framed as the following prototypical problem

\[ \hat{x} = \arg\min_{x} \ \frac{1}{2}\|A_\omega x - y_\omega\|_2^2 + \lambda R(x) \tag{3.1} \]

where the first term, known as data-fidelity, enforces consistency with the measurements, and the second term, called regularization, defines a statistical model capturing the desired structure of the signal. In the literature, R(x) has taken various forms, but typical models include sparsity in the wavelet domain, R(x) = ∥Wx∥₁ (Candès et al., 2006; Donoho, 2006; Lustig et al., 2007), or sparsity with Total Variation (TV) type constraints (Block et al., 2007; Knoll et al., 2011a). An alternative approach imposes a low-rank structure on the resulting image, typically using R(x) = ∥x∥∗ (Lingala et al., 2011), or more sophisticated structures such as the Hankel matrix (Jin et al., 2016). The common trend in all of these methods is the idea that the structure of the signal implies a redundancy that can be exploited to represent the signal compactly when it is transformed appropriately. As expressing this structure mathematically is not straightforward, these methods are the expression of various attempts to capture this idea.

Note that all these approaches typically introduce a trade-off between acquisition speed (by lowering the number of k-space lines acquired) and reconstruction speed. Solving Equation 3.1 requires an iterative method, and while the computation is carried out offline, the reconstruction can last up to hours for specific settings with large accelerations (Feng et al., 2016a).

We also note for completeness that CS relies not only on leveraging additional structure in the image to be reconstructed, but also on incoherent sampling (Lustig et al., 2008). In practice, this often involves random-like sampling strategies, which cause the aliasing to be incoherent, i.e. to feature a noise-like structure, as illustrated in Figure 3.1. We defer a more complete discussion of sampling to the next chapter.
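The incoherence argument can be checked numerically. The sketch below (a toy illustration, not the experiment behind Figure 3.1) compares the point spread functions of a random and a regular Cartesian mask at the same 25% sampling rate.

```python
import numpy as np

def psf(mask):
    """Point spread function of an undersampling mask: the image-domain
    response to a single point, i.e. the inverse FFT of the mask."""
    return np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(mask), norm="ortho"))

n = 128
rng = np.random.default_rng(0)

# Random mask: keep 25% of the phase-encoding lines, drawn uniformly at random.
random_lines = rng.choice(n, size=n // 4, replace=False)
random_mask = np.zeros((n, n)); random_mask[:, random_lines] = 1

# Regular mask: keep every 4th line (grid-like undersampling).
regular_mask = np.zeros((n, n)); regular_mask[:, ::4] = 1

for name, m in [("random", random_mask), ("regular", regular_mask)]:
    p = np.abs(psf(m))
    sidelobe = np.sort(p.ravel())[-2] / p.max()   # largest secondary peak vs. main peak
    print(name, "relative sidelobe level:", round(float(sidelobe), 3))
# The regular mask produces strong coherent replicas (fold-over artifacts), while the
# random mask spreads the aliasing energy into many small, noise-like sidelobes.
```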
Overall, it is clear that such approaches rely on statistical models constructed to implement the designer's assumptions, rather than being learned from data. While such methods enabled great progress in the last decade, the statistical models restrict the expressivity of the approach: they express abstract mathematical structures encoding our assumptions rather than the real data distribution expressed through a training set. This is why we will refer to such approaches as model-driven.

Figure 3.1: Random (a) and regular (d) sampling masks, acquiring 25% of the total locations, and their corresponding point spread functions (PSF) in (b) and (e). The convolution of the PSF with the Shepp–Logan phantom is shown respectively in figures (c) and (f). We see that the random undersampling produces incoherent or noise-like aliasing, which still allows us to distinguish the original image. On the other hand, the aliasing produced by regular, grid-based sampling (known as fold-over artifacts) yields a low-intensity result where the original image cannot be distinguished.

3.2 Data-driven approaches

Many approaches have tried to learn their model directly from data, and in this section, we survey a few of the most significant representatives of data-driven methods.

3.2.1 Dictionary learning

Dictionary learning approaches rely on the assumption that an image can be represented as a sparse combination of atoms from a dictionary, i.e. x = Dz, where D is a dictionary and z the sparse coefficient vector. This is sometimes referred to as a synthesis formulation, as opposed to Equation 3.1, which is an analysis formulation. Synthesis models are often applied patch-wise to images, and they aim at simultaneously learning a reconstruction x, a dictionary D and an encoding Z (Ravishankar and Bresler, 2011b; Caballero et al., 2014):

\[ \min_{x,D,Z} \ \frac{1}{2}\|A_\omega x - y_\omega\|_2^2 + \beta \sum_{j=1}^{N} \|P_j x - D z_j\|_2^2 \quad \text{s.t.} \quad \|z_j\|_0 \le s, \ \|d_i\|_2 = 1, \ \forall i, j \tag{3.2} \]

3.2.2 Unrolled neural networks

Unrolled neural networks are a data-driven, deep learning-based approach inspired by the iterative algorithms used to solve Equation 3.1. This problem is known as a composite optimization problem, and can be solved by various approaches. A famous one, the Iterative Soft-Thresholding Algorithm (ISTA) (Daubechies et al., 2004), or Proximal Gradient method (Combettes and Pesquet, 2011), solves the problem by iteratively computing

\[ x_{t+1} = \mathrm{prox}_{\lambda R}\big( x_t - \alpha A_\omega^\dagger (A_\omega x_t - y_\omega) \big). \tag{3.3} \]

The proximal operator is defined as prox_g(x) ≜ arg min_v g(v) + ½∥v − x∥₂², and for some functions it has a closed-form solution. This is typically the case when R(x) = ∥Wx∥₁, where the proximal operator takes the form of a soft-thresholding operator, i.e. prox_{λ∥W·∥₁}(x) = sign(Wx) ⊙ max(|Wx| − λ, 0).

Motivated by the formulation of Equation 3.3, unrolled neural networks parameterize the proximal operator with a neural network f(yω, ω; θ), and then train a model of the form

\[ x_{t+1} = f\big( x_t - \alpha A_\omega^\dagger (A_\omega x_t - y_\omega);\ \theta^{(t)} \big) \tag{3.4} \]

for a fixed, small number of iterations t = 1, . . . , T. The output of the model is thus f(yω, ω; θ) = x_T, where θ = {θ^{(1)}, . . . , θ^{(T)}}. As the proximal operator captures the statistical properties of the signal that we wish to recover, parameterizing it with a neural network enables learning a model directly from data. The model being more flexible, it allows for improved reconstruction performance over methods using statistical models.
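As a concrete reference point for Equations 3.3 and 3.4, here is a minimal NumPy sketch of ISTA for the single-coil model Aω = PωF, assuming for simplicity that sparsity is imposed directly on a real-valued image rather than in a wavelet domain W; all names and parameter values are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam * ||z||_1 (element-wise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ista(y, mask, lam=0.01, alpha=1.0, n_iter=50):
    """ISTA for problem (3.1) with A_w = P_w F and R(x) = ||x||_1.
    `mask` is a boolean k-space mask and `y` the masked k-space data."""
    x = np.zeros_like(y)
    for _ in range(n_iter):
        # Gradient of the data-fidelity term: F^H P_w (P_w F x - y).
        grad = np.fft.ifft2(mask * (np.fft.fft2(x, norm="ortho") * mask - y), norm="ortho")
        # Proximal step; the real part is kept, assuming a real-valued image.
        x = soft_threshold(np.real(x - alpha * grad), lam)
    return x

mask = np.random.rand(64, 64) < 0.3
y = mask * np.fft.fft2(np.random.randn(64, 64), norm="ortho")
x_hat = ista(y, mask)
```

In an actual CS reconstruction, the thresholding would be applied to wavelet coefficients Wx rather than directly to the image.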
A flurry of approaches has been motivated by unrolling iterative algorithms, and we highlight a few examples hereafter; a more complete discussion of unrolled neural networks can be found in Liang et al. (2020). The parameters of the proximal mapping can be shared across iterations (Mardani et al., 2018) or different at each iteration (Schlemper et al., 2017). The unrolling can also leverage the structure of a different optimization algorithm, like the Alternating Direction Method of Multipliers (ADMM) in Sun et al. (2016), or the Primal-Dual Hybrid Gradient in Adler and Öktem (2018b). In the noiseless case, it is also common to replace the inner step x_t − αA†(Aω x_t − yω) with a step called Data Consistency (DC) (Schlemper et al., 2017; Zhang et al., 2019b), which replaces the reconstruction with the exact values at the observed locations. Essentially, the inner step becomes

\[ \mathrm{DC}(x_t, y_\omega) = F^\dagger (P_{\omega^C} F x_t + y_\omega) = F^\dagger (P_{\omega^C} F x_t + P_\omega F x). \tag{3.5} \]

We see that at the observed locations ω, we simply keep the observation, while at the non-observed locations ω^C, the values of the current reconstruction x_t are kept. Note that here, P_{ω^C} = I − P_ω.

These unrolled approaches require training the model fθ on a dataset of reference images and observations {x_i, y_i}_{i=1}^m by minimizing a loss ℓ(·, ·):

\[ \theta^* = \arg\min_{\theta} \ \frac{1}{m}\sum_{i=1}^{m} \ell\big(x_i, f(y_i; \theta)\big). \tag{3.6} \]

Various losses have been used in the literature. Initial works used the ℓ2 loss (Schlemper et al., 2017; Hammernik et al., 2018), but practitioners have lately turned to the ℓ1 loss, SSIM, or compounds of these (Muckley et al., 2021). ℓ1 was indeed found to outperform ℓ2 on vision tasks (Zhao et al., 2016). The structural similarity (SSIM) (Wang et al., 2004) is a metric developed to match human perception more closely, and recent works have proposed a weighted sum of the ℓ1 loss and the multiscale SSIM (MS-SSIM) as a compromise (Zhao et al., 2016).

Deep learning-based reconstruction methods further shift the burden of computation compared to compressed sensing approaches. While the latter enabled faster acquisition at the cost of slower, iterative reconstruction procedures, deep learning-based methods retain the benefit of sub-Nyquist sampling while also enabling fast reconstruction, on the order of 10^{-1} seconds per slice (Jin et al., 2017; Hammernik et al., 2018). However, the computational burden is only shifted one step further: these methods need to be trained on large, fully-sampled datasets (Zbontar et al., 2019). (In their recent work, Yaman et al. (2020) proposed approaches to train using only undersampled data, but having access to a dataset of undersampled images is still required.)

3.2.3 Deep image prior

In a spirit closer to dictionary learning and classical CS, Ulyanov et al. (2018) have shown that an untrained CNN can provide a sufficiently strong prior to enable image reconstruction, even without training data. Given an untrained CNN fθ and a single measurement instance yω, the idea in deep image prior (DIP) is simply to minimize

\[ \min_{\theta} \ \|y_\omega - A_\omega f_\theta(z)\|_2^2 \tag{3.7} \]

where z is a fixed, randomly sampled vector, and where the parameters θ are randomly initialized. Then, using an iterative optimization algorithm, fitting the weights to the observations enables the network to output a high-quality reconstructed image without supervision or training data. This method was successfully applied to MRI in (Darestani and Heckel, 2021; Yoo et al., 2021).
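A minimal PyTorch sketch of the DIP objective of Equation 3.7 follows; the tiny convolutional generator, the synthetic data and all hyperparameters are placeholders rather than the architectures used in the cited works.

```python
import torch

def A(x, mask):
    """Single-coil forward operator A_w = P_w F acting on a real 2D image."""
    return mask * torch.fft.fft2(x, norm="ortho")

net = torch.nn.Sequential(            # tiny stand-in for the CNN f_theta
    torch.nn.Conv2d(1, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 1, 3, padding=1),
)

n = 64
mask = (torch.rand(n, n) < 0.3).float()               # random k-space mask
x_true = torch.randn(n, n)                            # stand-in for the unknown image
y = A(x_true, mask)                                   # observed k-space data
z = torch.randn(1, 1, n, n)                           # fixed random input of DIP

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for it in range(200):                                 # early stopping acts as regularization
    opt.zero_grad()
    x_hat = net(z).squeeze()
    loss = (torch.abs(A(x_hat, mask) - y) ** 2).sum() # || y - A f_theta(z) ||^2
    loss.backward()
    opt.step()
```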
A common pitfall of DIP is that this flexibility comes at the price of increased inference time: the network weights must be retrained for each individual image. In addition, DIP can suffer from overfitting, and heuristics such as early stopping are used to counteract this (Ulyanov et al., 2018; Sun et al., 2021).

3.2.4 A Bayesian interpretation of reconstruction

So far, the discussion has been framed as moving from mathematically-defined to data-driven regularization in order to improve the recovery of Problem 3.1. However, this approach can also be framed from a Bayesian perspective, where the data consistency term ½∥Aω x − yω∥₂² is viewed as a likelihood and the regularization λR(x) as a prior. This was initially discussed by Ji et al. (2008) for compressed sensing. The idea revolves around Bayes' rule, where the posterior p(x|y) can be expressed as

\[ p(x|y) = \frac{p(y|x)\, p(x)}{p(y)} \propto p(y|x)\, p(x), \tag{3.8} \]

where p(y|x) is the likelihood, p(x) the prior, and p(y) the marginal. Under Gaussian noise, we can derive an explicit likelihood model for Equation 2.12: assuming ϵ ∼ N(0, σ²I), i.e. additive Gaussian noise, we get

\[ p(y|x) = \frac{1}{(2\pi\sigma^2)^{P/2}} \exp\!\left( -\frac{1}{2\sigma^2}\|y - A_\omega x\|_2^2 \right) \sim \mathcal{N}(y; P_\omega F x, \sigma^2 I). \]

The regularization term can then be viewed as a prior from which the ground truth x is sampled; in the case of classical CS, the prior can be specified as the Laplace density function p(x) = (λ/2)^P exp(−λ∥Wx∥₁) (Ji et al., 2008). Such an approach enables viewing Problem 3.1 as maximum a posteriori (MAP) estimation, where one aims to find the reconstruction x̂ that maximizes the posterior probability:

\[
\begin{aligned}
\hat{x} = \arg\max_x p(x|y) &= \arg\max_x \frac{p(y|x)p(x)}{p(y)} && \text{by definition} \\
&= \arg\max_x p(y|x)p(x) && \text{as the marginal } p(y) \text{ does not depend on } x \\
&= \arg\min_x -\log\big(p(y|x)p(x)\big) && \text{taking the negative logarithm} \\
&= \arg\min_x \frac{1}{2\sigma^2}\|y - A_\omega x\|_2^2 + \lambda\|W x\|_1 && \text{making the densities explicit, discarding constants}
\end{aligned}
\]

The Bayesian perspective gives a different interpretation of reconstruction and of the role of regularization as a prior, and lends itself naturally to an extension to data-driven models, which act as learned priors. The Bayesian framing of Equation 3.1 also has the advantage of naturally defining a criterion for acquisition: acquire the observations that have the largest uncertainty according to the model (Ji et al., 2008). We will discuss these implications in greater depth in the next chapter as well as in Chapter 7. For now, we turn to our last category of deep learning-based approaches, which make use of this Bayesian insight and attempt to explicitly learn meaningful, data-driven priors.

3.2.5 Generative approaches

The reconstruction methods discussed so far can be seen as learning a data-driven prior, which has proven to enable higher reconstruction quality than mathematically motivated models. While these methods embed the full iterative resolution of Problem 3.1 into a deep architecture, other approaches, relying on deep generative models, propose to explicitly parametrize the prior with a neural network (Bora et al., 2017; Jalal et al., 2021; Patel and Oberai, 2021). This has the advantage of keeping the optimization procedure more transparent, as it explicitly finds a sample from a deep generative prior that is consistent with the observations. The approach, proposed initially by Bora et al. (2017), relies on training a generative adversarial network (GAN) to capture the prior distribution. We first briefly introduce GANs.
Generative adversarial networks. GANs (Goodfellow et al., 2014) are a specific type of deep learning model in which a generator gθ(z) and a discriminator dϕ(x) are trained in a competitive fashion: starting from random noise z ∼ p(z), the generator tries to construct samples that imitate a distribution of reference images p(x), from which we are given a large set of samples {x_i}_{i=1}^m. The discriminator, on the other hand, tries to classify a sample based on how likely it is to originate from the reference distribution. The problem can formally be described as a game between the generator and the discriminator, where we solve

\[ \min_{\theta} \max_{\phi} \ \mathbb{E}_{p(x)}[d_\phi(x)] - \mathbb{E}_{p(z)}[d_\phi(g_\theta(z))]. \tag{3.9} \]

Generative priors. Given a trained generator gθ*(z) approximating the image distribution of interest p(x), the approach of Bora et al. (2017), also applied to MRI in Patel and Oberai (2021), aims at finding a sample from the generator consistent with the observations by solving

\[ \min_{z} \ \|A_\omega g_{\theta^*}(z) - y_\omega\|_2. \tag{3.10} \]

While this approach does not enable directly constructing a maximum a posteriori (MAP) estimate, since the GAN only yields samples and no probabilities, the low-dimensionality of the latent space z makes it possible to efficiently carry out Markov Chain Monte Carlo (MCMC) computations (Patel and Oberai, 2021) to construct posterior samples.
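The following PyTorch sketch illustrates the latent optimization of Equation 3.10; the "generator" is an untrained stand-in for a pretrained GAN gθ*, and the data are synthetic, so the example only shows the mechanics of the optimization.

```python
import torch

# Stand-in for a pretrained generator g_theta*: latent vector -> 64x64 image.
# In practice this would be a GAN trained on the image distribution of interest.
generator = torch.nn.Sequential(
    torch.nn.Linear(32, 64 * 64), torch.nn.Unflatten(1, (64, 64)),
)
for p in generator.parameters():
    p.requires_grad_(False)   # the generator is fixed; only z is optimized

mask = (torch.rand(64, 64) < 0.3).float()
y = mask * torch.fft.fft2(torch.randn(64, 64), norm="ortho")   # observed k-space (toy data)

z = torch.randn(1, 32, requires_grad=True)      # latent variable of Equation 3.10
opt = torch.optim.Adam([z], lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    x_hat = generator(z).squeeze()
    # Squared data-fidelity || A_w g(z) - y_w ||^2 driving the latent search.
    loss = (torch.abs(mask * torch.fft.fft2(x_hat, norm="ortho") - y) ** 2).sum()
    loss.backward()
    opt.step()
```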
3.3 Trade-offs

While all the methods presented in this chapter enable reconstruction from sub-Nyquist sampling rates, they feature different advantages and drawbacks. The state of the art in terms of reconstruction performance is occupied by deep learning-based reconstruction methods, such as unrolled algorithms. They allow for the best quality and fastest inference among all methods presented here, but require large training datasets and are not robust to distribution shifts (Jalal et al., 2021; Darestani et al., 2021). Generative priors also require large amounts of data to train the generative model, and generally perform less well than trained reconstruction networks, while also having a slower inference time (Darestani et al., 2021). They have however the advantage of being robust to some distribution shifts, such as changes in the distribution of masks ω, and their inference is more interpretable than that of end-to-end deep learning methods, as the conditioning on the observations is explicitly computed from the prior rather than through a black box. Compressed sensing (CS), dictionary learning (DIL) and deep image prior (DIP) are generally slower at inference, as they all rely on iterative reconstruction and require the optimization of increasingly large sets of variables: from the image in CS, to the image and dictionary in DIL, to the parameters of a deep network in DIP. However, DIP is generally found to outperform CS-based approaches. The trade-offs between DIL, DIP and generative priors are however less clear, as these methods have not been extensively compared in the literature.

Overall, this survey highlights a few of the general trends observed in the literature and is not exhaustive. It does show a clear trend where different methods try to exploit the redundancy due to the structure in the data. Earlier works attempted to explicitly embed this structure by compactly representing the signal in a variety of different bases, such as the wavelet domain (Lustig et al., 2007), or by representing the signal as low-rank (Lingala et al., 2011). Later approaches aimed rather at learning how to leverage this structure from data, using the powerful tools of deep learning to represent the signal in complex ways, following a data-driven paradigm. All these methods rely however on the idea that the underlying structure of the signal allows alleviating the ill-posedness of the inverse problem and retrieving a signal from undersampled observations. In the next chapter, we will detail the evolution of the mask designs ω used to generate the observations, and how they have been optimized in order to provide the most informative observations and enable the best reconstruction downstream.

4 Optimizing sampling patterns

In the previous chapter, we discussed the evolution of reconstruction methods for accelerated MRI, where a shift from model-driven approaches to data-driven approaches was observed. A similar shift can be observed in the design of sampling patterns.

4.1 A taxonomy of mask design methods

While the evolution of reconstruction methods can generally be described along the axis of moving from model-based approaches towards learning-based ones, the picture is less clear for mask design methods. It is nonetheless possible to consider two axes of discussion that run in parallel to the evolution witnessed in reconstruction methods. These two axes pertain to the two main components needed to optimize sampling patterns (or masks, or trajectories). One needs i) a prior distribution from which candidate patterns are drawn, and ii) a criterion, i.e. a metric to quantify their performance. We will see that at the onset of compressed sensing, sampling patterns were drawn from heuristically designed parametric distributions, a setting widely known as variable-density sampling, and gradually moved towards distributions tailored to a specific dataset. Similarly, the criteria used to establish the quality of a mask were initially mathematical quantities such as coherence, but they became increasingly related to the reconstruction performance enabled by subsampling data using a given mask. One can clearly see a shift from i) model-based approaches towards learning-based ones, and from ii) model-driven criteria towards data-driven ones.

Remark. We consider the distribution from which candidate sampling patterns are drawn to be the base of the sampling optimization procedure, and the criterion to be what drives it.

In the rest of this chapter, we will survey the literature from the first application of compressed sensing to MRI onwards, to show how mask design evolved from model-based, model-driven approaches towards learning-based, data-driven approaches. We will conclude by formalizing the problems linked with optimizing mask design in this context.

4.2 Compressed sensing and MRI: Model-based, model-driven approaches

4.2.1 The contribution of Lustig et al. (2007)

With the onset of Compressed Sensing (CS), random sampling patterns were used, as they allow obtaining noise-like, incoherent aliasing, as illustrated in Figure 3.1. This was indeed theoretically motivated: random sampling easily achieves incoherent sampling, which in turn means that a small number of measurements is sufficient to exactly solve Problem 3.1 for R(x) = ∥Wx∥₁ with high probability.

Definition. Let (A, W) be two orthonormal bases of C^p, where A is the observation domain, and W is the sparsifying (representation) basis.
Then, the coherence between these bases is defined as (Candès and Wakin, 2008)

\[ \mu(A, W) = \sqrt{p} \cdot \max_{1 \le i, j \le p} |\langle a_i, w_j \rangle|. \tag{4.1} \]

The coherence measures the maximal correlation between any element of A and W. µ(A, W) ∈ [1, √p], where a coherence of 1 implies maximal incoherence between the bases, and √p maximal coherence. Coherence then allows bounding the minimal number of measurements N required for exactly solving 3.1 with high probability. Indeed, for a given signal x ∈ C^p observed in domain A, assuming that x is S-sparse in the transformed domain W, a typical compressive sensing result (Candes and Romberg, 2007) dictates that, given N random measurements, exact reconstruction will occur with high probability when

\[ N \ge C \cdot S \cdot \mu^2(A, W) \cdot \log p \tag{4.2} \]

for some positive constant C. While the theory of CS obtains results for random measurements, practitioners have quickly sought to optimize the measurements, and a natural initial criterion has been to minimize the coherence between measurements, i.e. to select a sampling pattern ω with |ω| = N that minimizes µ(Aω, W) (Lustig et al., 2007).

However, when applying compressed sensing to MRI, it was quickly observed that purely random undersampling of the Fourier space is not practical, for several reasons (Lustig et al., 2007, 2008): i) it does not satisfy hardware and physical constraints, which require smooth trajectories in Fourier space and trajectories robust to system imperfections; ii) it does not consider the energy distribution of the signal in k-space, which is very concentrated around the center.

For this reason, the seminal work of Lustig et al. (2007) proposed to use a variable-density sampling approach, where the k-space is less undersampled near the origin (low frequency, high energy region) and more towards the periphery (high frequency but low energy). (The name variable-density sampling originates from the very fact that k-space is not uniformly subsampled, but rather with a variable density; Tsai and Nishimura, 2000.) They used a distribution centered at the origin of k-space, where the probability of acquiring a location decays at a polynomial rate as a function of the distance to the origin, namely

\[ P(k_x, k_y) = \left(1 - \frac{\sqrt{k_x^2 + k_y^2}}{p}\right)^{d} \tag{4.3} \]

with decay d > 1, and −p/2 ≤ kx, ky ≤ p/2. This heuristic was proposed to imitate the decay in the spectrum observed on real data. The mask to be used is then estimated by a Monte-Carlo procedure: a set of masks is randomly sampled from Equation 4.3, and the mask with minimal coherence is retained.

The idea proposed by Lustig et al. (2007) of using variable-density sampling for MRI was not totally novel, as variable-density sampling had previously been used in the MRI community to design undersampling trajectories with incoherent aliasing when using linear reconstructions (Marseille et al., 1996; Tsai and Nishimura, 2000). However, Lustig et al. (2007) were the first to integrate this heuristic within the framework of CS. It should be clear from this presentation that the approach of Lustig et al. (2007) is a model-based approach, as the masks are sampled from a heuristic probability distribution (eq. 4.3), and also a model-driven approach, as the criterion used to select the masks is coherence, a mathematical structure prescribed by the theory of compressed sensing.

4.2.2 Model-based, model-driven sampling in the sequel

A similarly model-based, model-driven approach was followed by Gamper et al. (2008), which considers dynamic MRI, although starting from a different initial mask: they attempted to randomize a deterministic mask design proposed by Tsao et al. (2003) to achieve incoherent sampling. The prior in this case is given by a structured mask.
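A small NumPy sketch of the Monte-Carlo procedure of Section 4.2.1: candidate masks are drawn from the density of Equation 4.3 and ranked with a simple PSF-sidelobe surrogate standing in for the coherence criterion; the grid size, decay and number of candidates are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 64  # grid size; k-space coordinates range over [-p/2, p/2)

# Variable-density distribution of Equation 4.3 with polynomial decay d.
kx, ky = np.meshgrid(np.arange(p) - p // 2, np.arange(p) - p // 2, indexing="ij")
d = 4
density = (1 - np.sqrt(kx**2 + ky**2) / p) ** d
density /= density.sum()

def sample_mask(density, n_samples):
    """Draw a mask of n_samples k-space locations (without replacement) from the density."""
    idx = rng.choice(density.size, size=n_samples, replace=False, p=density.ravel())
    mask = np.zeros(density.size); mask[idx] = 1
    return mask.reshape(density.shape)

def max_sidelobe(mask):
    """Incoherence surrogate: largest PSF sidelobe relative to its main lobe,
    a stand-in for the coherence criterion used to rank candidate masks."""
    psf = np.abs(np.fft.ifft2(mask))
    return np.sort(psf.ravel())[-2] / psf.max()

# Monte-Carlo selection: draw candidate masks and keep the most incoherent one.
candidates = [sample_mask(density, n_samples=p * p // 4) for _ in range(20)]
best = min(candidates, key=max_sidelobe)
print("best relative sidelobe:", round(float(max_sidelobe(best)), 3))
```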
Some refinements to variable-density sampling (VDS) were developed for instance in (Wang et al., 2009), but in the sequel, two broad approaches were followed to improve the design of sampling patterns. The first leveraged learning-based approaches, and will be discussed in the next section, while the second attempted to develop theoretical models that more closely captured the structure of real-world signals.

The work of Wang and Arce (2009) is a first step in this direction, where improved VDS functions are proposed by exploiting a priori statistical information of real-world images in the wavelet domain. One can notice that this approach chose to tie the sampling pattern specifically to one type of regularization, namely sparsity in the wavelet domain, and obtains a better sampling density for this problem but loses in generality over competing approaches. The work of Puy et al. (2011) proposes a coherence-driven optimization of the variable-density sampling procedure. Instead of relying on a heuristic distribution such as Equation 4.3, the procedure proposes an objective that yields a VDS distribution minimizing the coherence between the measurement and representation bases. A large body of subsequent works (Krahmer and Ward, 2013; Roman et al., 2014; Adcock et al., 2015, 2017) relies on additional theoretical tools such as local coherence, asymptotic sparsity, and multilevel sampling to give solid ground to heuristics such as VDS that show good performance but lack theoretical justification.

4.3 Towards learning-based, data-driven approaches

In parallel to these works, there was a motivation to move towards learning-based approaches, due to some weaknesses in the VDS approach of Lustig et al. (2007). One of them is that the VDS distribution of Equation 4.3 is a heuristic that requires tuning to work well on different types of data, and heuristically tuned variable-density distributions generally performed better than the ones solely prescribed by CS theory (Chauffert et al., 2013).

Learning-based, model-driven sampling. A first work by Knoll et al. (2011b) aimed at solving this issue by constructing the VDS distribution from data rather than from a tuned heuristic. They simply averaged the spectra of several similar images, normalized the result, and sampled their mask from the resulting distribution, minimizing coherence to select the final mask, as in Lustig et al. (2007). This approach was then improved by Zhang et al. (2014); Vellagoundar and Machireddy (2015).

Early learning-based, data-driven approaches. Ravishankar and Bresler (2011a); Liu et al. (2012) proposed similar methods where they aimed at designing a sampling pattern that minimized the reconstruction error in Fourier space, rather than coherence. Both algorithms partition k-space into blocks, and iteratively design a mask by removing samples from low-error regions in order to re-assign them to high-error regions. Liu et al. (2012) made the important point that the design of a mask should be reconstruction-based and not observation-based.
Until then, approaches did not include the reconstruction algorithm into the design of the mask, but relied on model-based criteria relying on the observations, such as coherence. This insight proved to be impactful in the performance 34 4.4. A MATTER OF PERFORMANCE: HOW DO THESE METHODS COMPARE TO EACH OTHER? of the methods, and while no quantitative comparison was available in the paper, further works confirmed the impact of including the reconstruction algorithm in the mask design (Zijlstra et al., 2016). While these methods capture the paradigm that will be overwhelmingly dominant in the deep learning era, there was little follow-up work until much later, around 2018. The case of Seeger et al. (2010). The work of Seeger et al. (2010) lies in its own category. This approach builds on a sequential Bayesian experiment design perspective for sparse linear models (Seeger et al., 2007; Seeger and Nickisch, 2008). First a model of the posterior distribution p(x|yω ) is built, and then, a sampling pattern is iteratively built by successively acquiring the readouts that have the highest entropy. This approach is able to sequentially tailor the sampling pattern to the need of an unseen image, given the previously acquired measurements. However, the approach features a quite high computational cost for a single image. Liu et al. (2012) raise that is not practical, as the matrix inversion required to compute the next measurement that must be acquired is too slow to match the readout time of a scanner. While methods to optimize sequentially the sampling mask are common in the deep learning era, the approach of Seeger et al. (2010) is unique in its time, as all other works focused on optimizing the sampling pattern for a fixed sampling rate, and not in a sequential fashion. While more principled than other approaches, and using maximum entropy as the criterion to optimize the acquisition trajectory, this method remains model-based (using a Laplace prior) and model-driven (maximum entropy). 4.4 A matter of performance: how do these methods compare to each other? Very few of the works discussed until then feature an extensive comparison of performance of their proposed approaches. Most works rely on visual comparisons, or state the weaknesses of previous methods, but do not compare against them. In this regard, the contributions that bring a comparison of various approaches to mask design are particularly valuable, and in this section, we will discuss two of them, namely Chauffert et al. (2013) and Zijlstra et al. (2016). The work of Chauffert et al. (2013) aims at designing a better VDS density, while also providing a comparison between theoretically informed sampling patterns, and the ones obtained using the polynomial of Equation 4.3 tuned at various degrees d ∈ {1, 2, . . . , 6}, and a hybrid distribution mixing some theoretically prescribed components and heuristics. It is interesting to observe that in their data, the tuned hybrid distribution performs best, followed by the tuned polynomial and the optimal theoretically predicted distribution. This theoretically optimal distribution is based on the arguments from Rauhut (2010) uses 35 CHAPTER 4. OPTIMIZING SAMPLING PATTERNS rejection of already acquired locations (Puy et al., 2011) and minimizes coherence, while the polynomial was tuned on the data using directly the final evaluation metric, namely PSNR. 
The authors conclude that the theory would need to capture additional structure of real-world images for sampling patterns obtaining better practical performance. The article of Zijlstra et al. (2016) compares the performance of coherence-based VDS using Equation 4.3, of Knoll et al. (2011b), of Liu et al. (2012) and of VDS optimized to minimize normalized root mean squared error (NRMSE). The results show that coherence-based VDS is generally outperformed by Knoll et al. (2011b), which is matched by the optimized VDS. Both are outperformed by the method of Liu et al. (2012). The conclusion is however careful: When putting the presented results in perspective, it is important, to remember that data-driven optimization of an undersampling pattern for CS is just one aspect of the CS implementation. Data-driven approaches should be taken to optimize other aspects of the CS implementation, such as the sparse domain and the reconstruction algorithm. Overall, while these works do not exhaustively compare the existing methods in the literature, they give us reasons to view learning-based, data-driven methods as the most direct avenue towards improvement in mask design optimization. 4.5 The advent of deep learning: Learning-based, datadriven approaches Interest for learning-based, data-driven approaches to mask optimization was renewed around 2018, with the advent of the first deep learning-based reconstruction methods applied to MRI (Sun et al., 2016; Schlemper et al., 2017; Mardani et al., 2018). This enabled the possibility to scale up the complexity of mask design methods, as deep learning methods enable fast reconstruction times, which was not the case for CS methods. The first work that we discuss in this section comes slightly before this time and is the one of Baldassarre et al. (2016), where the authors propose a first principled, learning-based approach to optimizing a deterministic sampling pattern given a set of training signals. Note that in our taxonomy, their approach is both learning-based and data-driven. While the optimization only considers the energy of the signal in the case of Fourier sampling, and while the paper on linear reconstruction, the authors show improvement about the sampling patterns prescribed in Lustig et al. (2007); Roman et al. (2014). However, as they optimize only the sampling pattern to capture the energy of the signal and do not include the reconstruction algorithm in their optimization procedure, they fail to obtain significant improvement by turning to non-linear reconstruction methods. 36 4.5. THE ADVENT OF DEEP LEARNING: LEARNING-BASED, DATA-DRIVEN APPROACHES However, the authors published a sequel to this work with Gözcü et al. (2018), that introduced a sequential optimization of the sampling pattern in a fully data-driven fashion. The principle consists, at each step, to evaluate the improvement for a given performance metric of including a candidate location in the final sampling pattern. Once all candidate locations have been evaluated, the algorithms picks the one that brings the largest improvement and moves to the next step. This method was shown to outperform existing approaches, such as the one of Lustig et al. (2007) and Knoll et al. (2011b) by a large margin, due to its approach being both learning-based and data-driven. Part of the improvement is due also to not solely adding sampling locations based on the current error, but computing how adding a location would actually decrease the error in the reconstruction. 
The distinction is important as the method of Gözcü et al. (2018) is able to take into account how the newly acquired location affects the behavior of the non-linear reconstruction algorithm, rather than acquiring locations based on the current reconstruction error. We will describe the setting of Gözcü et al. (2018) in greater depth in Chapter 5 as this is the basis upon which this thesis was built. At roughly the same time as Gözcü et al. (2018), Haldar and Kim (2019) published another work aiming at optimizing sampling, based on an experiment design approach to minimize the Cramér-Rao bound (CRB). The CRB provides a lower-bound to the covariance matrix that is based on the measurement matrix Aω . This approach is independent on the reconstruction algorithm, and provides a different take on a learningbased, model-driven approach, aiming at minimizing the covariance of an estimator. In the context of MRI sampling, this amounts to finding the measurement matrix that will encode the most information about the original signal x. The last non-fully deep learning-based work that we will discuss is the one of Sherry et al. (2019), where a bilevel optimization procedure is proposed: the first level solves an empirical risk minimization problem for the mask, while the second level consists in an iterative algorithm to solve a Problem similar to 3.1. The work takes an original approach whereby relaxing the mask optimization problem into a continuous optimization problem, they are able to tackle the mask optimization and the reconstruction under a unified framework. While the reconstruction remains model-based, the mask stage is clearly learning-based and data-driven, as training data are used, and aim at minimizing some loss between the original image and the reconstruction. Due to the overwhelming performance of deep learning methods, most subsequent works focus exclusively on deep learning-based reconstruction methods, and all follow a learningbased, data-driven paradigm. All methods initialize the prior from which masks are drawn randomly, and aim at refining it throughout training in order to maximize some performance metric. There are however some trends that can be highlighted within these methods. Some of them aim at jointly train a reconstructor and a sampling pattern for some fixed 37 CHAPTER 4. OPTIMIZING SAMPLING PATTERNS sampling budget. Others tackle a problem of designing a mask sequentially, by adding new sampling locations based on the currently acquired information. Finally, some methods aim at training a model that is able to propose a patient-adaptive mask design. These methods often leverage tools from reinforcement learning (RL) and this is why we refer to such models as policy models. The works of Bahadir et al. (2019); Aggarwal and Jacob (2020); Weiss et al. (2020); Huijben et al. (2020) relax the problem of optimizing the mask to a continuous optimization problem, and train the sampling pattern along with the neural network using backpropagation. After training, the mask is then mapped back as a boolean vector. Other approaches use a pre-trained reconstruction model and train a policy model that performs patient-adaptive sequential mask optimization using reinforcement learning (Pineda et al., 2020; Bakker et al., 2020). The approach of Zhang et al. (2019b) is unique, as they aim at training an evaluator that tries to estimate the current reconstruction error for each location in k-space, while using this evaluator to inform the training of the reconstruction model. 
Finally, a fourth category of approaches tackles the challenging problem jointly learning a reconstruction model and a policy model for patient-adaptive sampling (Jin et al., 2019; Van Gorp et al., 2021; Yin et al., 2021). Remark (A short discussion of non-Cartesian sampling methods). The problem of optimizing sampling trajectories becomes more complex when moving to the non-Cartesian realm. The problem is inherently continuous, and there is a lot of freedom in designing non-Cartesian trajectories as long as hardware constraints such as bounded first and second derivatives of the trajectory (Boyer et al., 2016). As a result, there is great diversity in the methods proposed. Traditional approaches such as radial spokes (Lauterbur, 1973) and spirals (Meyer et al., 1992) are typically model-based, model-driven, but more recent space filling trajectories like SPARKLING (Lazarus et al., 2019) fall under this category. Weiss et al. (2019); Chaithya et al. (2022) attempted to train non-Cartesian sampling and reconstruction jointly, in an end-to-end fashion, yielding learning-based, data-driven approaches, while Wang et al. (2021) proposed to parametrize the trajectories with B-spline to facilitate learning, resulting in a model-based data-driven method. Finally, (Chaithya et al., 2021) also proposed to refine the SPARKLING algorithm by learning from data the underlying variable-density distribution, yielding a learning-based, model-driven approach. These approaches were compared in Chaithya and Ciuciu (2022), where additional challenges due to non-Cartesian sampling are highlighted. In this setting, it appears that the algorithm used to optimize the sampling trajectory greatly matters, as it is possible for trajectories to remain stuck close to initialization. As a result, merely being a learning-based, data-driven approach does not guarantee an improved performance over model-based model-driven alternatives. Nonetheless, the best 38 4.6. FORMALIZING LEARNING-BASED, DATA-DRIVEN MASK DESIGN performing method was found to be the hybrid learning approach of Chaithya et al. (2022). 4.6 Formalizing learning-based, data-driven mask design Recall the observation setting described in Section 2.2.1. We consider the problem of accelerated MRI, where we obtain partial information yω ∈ CP about a ground truth signal x ∈ CP yω = A ω x + ϵ = P ω F x + ϵ where ϵ a complex Gaussian noise vector. F is the Fourier transform operator, and Pω is our masking operator, implemented as a diagonal matrix with diagonal entries Pω,ii = 1 if i ∈ ω and 0 otherwise. ω denotes the sampling mask and |ω| = N denotes is cardinality. As described in Chapter 3, we construct an estimate x̂ of the ground truth x using a reconstruction algorithm fθ such that x̂ = fθ (y, ω), where θ denote the parameters of the reconstruction algorithm. This can correspond to the regularization parameter of a compressed sensing method, or denote all the parameters of a neural network. A learning-based approach requires access to a data distribution p(x) that can give us access to reference images, while a data-driven approach requires a way to quantify how close an estimation x̂ is from the ground truth x. This is done by using a performance metric η(·, ·) : CP × CP → R, which we will aim to maximize2 . The process is illustrated on Figure 4.1. 
An ideal sampling algorithm would tailor the mask to each instance of x ∼ p(x), solving

\[ \max_{\omega : |\omega| \le N} \ \eta\big(x, \hat{x}_\theta(y_\omega)\big), \qquad y_\omega = P_\omega F x. \tag{4.4} \]

However, such an approach is evidently infeasible in practice, because solving Equation 4.4 would require access to the ground truth image x at the time of evaluation. This is not the case in reality, where fully sampled signals are only available for training. We now turn to the two main approaches to this problem that have been discussed before. They rely on either i) using non-adaptive masks, designed offline on a training set and deployed at testing time, or ii) using adaptive masks, where a sampling policy or heuristic πϕ is trained offline and then takes patient-adaptive decisions during the acquisition at evaluation time. (A similar reasoning would hold if one considered a loss ℓ(·, ·) : C^P × C^P → R instead of a performance metric; in this case, one would need to find the sampling mask that minimizes it.)

Figure 4.1: Overview of sampling in accelerated MRI, using a knee image. Acquisition physically happens sequentially in Fourier space (full readout lines acquired at once), and a sampling policy can decide ahead of time what sampling mask to use (non-adaptive sampling), or integrate information from the current reconstruction to decide on the next location to acquire (adaptive sampling). Blue dashed lines indicate optional relations: not all policies and reconstruction methods rely on training data, and not all policies are patient-adaptive. The light grayed area indicates that we do not have access to the full ground truth and spectrum at evaluation time, but rather directly acquire undersampled data.

4.6.1 Non-adaptive sampling masks

In this first setting, given access to a set of training data {x₁, . . . , x_m} ∼ p(x), we aim at finding the mask that exhibits the best average performance, namely

\[ \max_{\omega : |\omega| \le N} \ \frac{1}{m}\sum_{i=1}^{m} \eta\big(x_i, f_\theta(y_{\omega,i}; \omega)\big) \quad \text{s.t.} \quad y_{\omega,i} = A_\omega x_i + \epsilon. \tag{4.5} \]

However, solving Equation 4.5 remains challenging, due to the combinatorial nature of the problem. It also raises questions of generalization, namely whether a mask ω that shows good performance on the training set will also show good performance on a testing set of unseen samples from the same distribution p(x). While standard learning-theoretic arguments can answer the generalization question (Gözcü et al., 2018), the combinatorial nature of the problem dooms any naive approach to failure, as the total number of different masks grows exponentially, with a complexity O(2^N). Two main types of approaches have been proposed for Problem 4.5, relying either on tackling the combinatorial problem in an approximate fashion (Gözcü et al., 2018; Zibetti et al., 2020), or on relaxing the mask to a continuous variable that is then optimized using gradient-based methods (Bahadir et al., 2019; Aggarwal and Jacob, 2020; Weiss et al., 2020; Huijben et al., 2020; Sherry et al., 2019). Combinatorial approaches typically tackle Problem 4.5 as a sequential problem, starting from some initial mask ω₀ and gradually growing the mask until the sampling budget N is exhausted. On the contrary, continuous approaches choose to directly optimize the mask at the maximal sampling budget N, a setting which we refer to as fixed sampling.
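For concreteness, the sketch below evaluates the empirical objective of Equation 4.5 for one candidate Cartesian mask, using a zero-filled inverse FFT as a stand-in reconstructor fθ and PSNR as the metric η; all names and values are illustrative.

```python
import numpy as np

def psnr(ref, rec):
    """Peak signal-to-noise ratio, one possible choice for the metric eta."""
    mse = np.mean(np.abs(ref - rec) ** 2)
    return 10 * np.log10(np.max(np.abs(ref)) ** 2 / mse)

def zero_filled_recon(y):
    """Stand-in for the reconstructor f_theta: zero-filled inverse FFT."""
    return np.fft.ifft2(y, norm="ortho")

def average_performance(mask, train_images):
    """Empirical objective of Equation 4.5 for a fixed mask (noiseless observations)."""
    scores = []
    for x in train_images:
        y = mask * np.fft.fft2(x, norm="ortho")      # y_omega = A_omega x
        scores.append(psnr(x, zero_filled_recon(y)))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
train_images = [rng.standard_normal((32, 32)) for _ in range(8)]   # toy training set
mask = np.zeros((32, 32)); mask[:, ::4] = 1                        # candidate Cartesian mask
print("average PSNR of this mask:", average_performance(mask, train_images))
```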
4.6.2 Adaptive sampling masks

In contrast to Equation 4.5, adaptive sampling generally relies on a two-step procedure, where one must first train a policy model πϕ and subsequently evaluate it on testing data. The problem of training the policy reads

\[ \max_{\phi} \ \frac{1}{m}\sum_{i=1}^{m} \eta\big(x_i, f_\theta(y_{\omega_T,i})\big), \quad \text{where } \begin{cases} \omega_t = \omega_{t-1} \cup v_t \\ v_t \sim \pi_\phi\big(f_\theta(y_{\omega_{t-1}})\big) \\ |\omega_t| \le N \end{cases} \quad t = 1, \ldots, T. \tag{4.6} \]

Here, T is the number of acquisition rounds that the policy performs. While this problem may look similar to Problem 4.5, there are two fundamental differences. First, the optimization is over the parameters ϕ of the policy model, rather than over the mask itself. Secondly, the problem is inherently sequential, as the mask is gradually built from atoms v_t, t = 1, . . . , T, which are all given by the policy πϕ at different stages of the acquisition. Solving this problem remains difficult, as gradient-based optimization of ϕ is hindered by the sampling operation v_t ∼ πϕ(fθ(y_{ω_{t−1}})), which prevents a direct computation of the gradients. As a result, two main approaches can be found to tackle this problem, relying either, again, on relaxing the mask to a continuous variable (Van Gorp et al., 2021; Yin et al., 2021), or on leveraging reinforcement learning (Bakker et al., 2020; Pineda et al., 2020; Jin et al., 2019), which is naturally suited to searching for the policy that yields the best performance in a changing environment. In our case, the environment is described by the unknown underlying ground truth.

The second stage of adaptive sampling is the inference step, where the trained policy πϕ is used to guide the acquisition of an unseen image x_test:

\[ v_{t+1} \in \arg\max_{v \in [P]} \ \big[\pi_\phi\big(f_\theta(y_{\omega_t,\mathrm{test}})\big)\big]_v \tag{4.7} \]

where [·]_v denotes the v-th entry of the vector, and ω_t = ω_{t−1} ∪ v_t for t = 1, . . . , T. At inference time, the test image is gradually sampled at the locations v_t provided by the policy, based on the reconstruction at the previous step.

Remark (Using heuristics). The inference of Equation 4.7 is not restricted to policies trained following Equation 4.6, but can be paired with different heuristics. For instance, Zhang et al. (2019b) trained an evaluator that estimates the error made by the reconstruction model at the different locations to be acquired. They then use this heuristic for sampling, by acquiring at each step the location that is estimated to have the highest reconstruction error. We refer to this approach as a heuristic rather than a policy because it is not specifically trained to estimate what would be the best location to acquire next; it only trains a model that estimates the current reconstruction error, and then uses it in a greedy fashion. We will expand upon the use of heuristics for adaptive sampling in Chapter 7, where we propose to use the variance of a generative adversarial network (Goodfellow et al., 2014) to perform adaptive sampling.

Remark (Non-sequential adaptive sampling). Although a sequential acquisition is a natural choice for adaptive sampling, it is possible to perform patient-adaptive sampling with a fixed (non-sequential) mask. This was done for instance by Bakker et al. (2021), where, in the case of multi-coil MRI, the authors used the auto-calibration signal (ACS), a small set of measurements that is systematically acquired, to design a patient-adaptive fixed mask in a one-shot fashion. They fed the measurements into a policy model that outputs a single mask for the rest of the acquisition. However, their results surprisingly show that their best performing sampling policies explicitly learn to be non-adaptive, and they thus provide a state-of-the-art non-adaptive sampling mask.
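The loop below sketches the sequential acquisition of Equation 4.7 with a simple heuristic in place of a trained policy πϕ: lines are scored by their average training-set energy, whereas a learned policy would score them from the current reconstruction fθ(y_{ω_t}), making the choice patient-adaptive. Everything here is a toy stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
n, budget = 64, 16

# Heuristic "policy": score each phase-encoding line by the average k-space energy
# observed on a training set. A learned policy pi_phi would instead condition on the
# current reconstruction of the patient being scanned.
train = [rng.standard_normal((n, n)) for _ in range(8)]
line_energy = np.mean(
    [np.abs(np.fft.fft2(x, norm="ortho")).sum(axis=0) for x in train], axis=0)

s_test = np.fft.fft2(rng.standard_normal((n, n)), norm="ortho")   # k-space of the test image

acquired = set()
for t in range(budget):                                   # sequential acquisition loop
    scores = line_energy.copy()
    scores[list(acquired)] = -np.inf                      # never re-acquire a line
    v = int(np.argmax(scores))                            # v_{t+1}: argmax of the policy scores
    acquired.add(v)                                       # omega_{t+1} = omega_t U {v}

mask = np.zeros((n, n)); mask[:, sorted(acquired)] = 1
x_hat = np.fft.ifft2(mask * s_test, norm="ortho")         # reconstruction from acquired lines
print(sorted(acquired))
```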
4.6.3 On the optimality of the discrete mask optimization problem

In the previous sections, we defined Problems 4.4 and 4.5 as optimization problems over masks. However, compressed sensing approaches (Puy et al., 2011; Chauffert et al., 2013) rather optimized the probability distribution from which masks are subsequently drawn. These arguments were also constructed in the context of model-based, model-driven methods, aiming at minimizing coherence. Can we relate the optimization of a probability distribution to the optimization of a discrete mask, as in Problem 4.5? How does the problem of finding an optimal probability distribution change when moving from a model-based, model-driven context to a learning-based, data-driven context? These are the questions that we answer in this section.

While in Chauffert et al. (2013) the optimization was done over the distribution π from which masks are subsequently sampled, in our case the performance metric is optimized directly for a fixed mask rather than a distribution. We will formally show below that in a data-driven context, we can construct an optimal sampling distribution π from an optimal mask: the optimal mask describes the support of an optimal probability distribution.

We model the mask design process as finding a probability mass function (PMF) π ∈ S^{P−1}, where S^{P−1} := {π ∈ [0, 1]^P : Σ_{i=1}^P π_i = 1} is the standard simplex in R^P. π assigns to each location i in k-space a probability π_i of being acquired. The mask is then constructed by drawing without replacement from π until the cardinality constraint |ω| = N is met. We denote this dependency as ω(π, N). The problem of finding the optimal sampling distribution is subsequently formulated as

\[ \max_{\pi \in S^{P-1}} \eta(\pi), \qquad \eta(\pi) := \mathbb{E}_{\substack{\omega(\pi,N) \\ x \sim p(x)}} \big[ \eta\big(x, f_\theta(y, \omega)\big) \big], \tag{4.8} \]

where the index set ω ⊂ [P] is generated from π and [P] := {1, . . . , P}. This problem corresponds to finding the probability distribution π that maximizes the expected performance metric with respect to the data p(x) and the masks drawn from this distribution. To ease the notation, we will use η(x, fθ(y, ω)) ≡ η(x; ω). In practice, we do not have access to E_{p(x)}[η(x; ω)] and instead have at hand the training images {x_i}_{i=1}^m drawn independently from p(x). We therefore maximize the empirical performance by solving

\[ \max_{\pi \in S^{P-1}} \eta_m(\pi), \qquad \eta_m(\pi) := \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{\omega(\pi,N)} \big[ \eta(x_i; \omega) \big]. \tag{4.9} \]

Note that Problem 4.9, while being an empirical maximization problem, is distinct from Problem 4.5 above, as it seeks to optimize the distribution from which masks are drawn, and as a result takes an expectation over all the masks drawn from it. Given that Problem 4.9 looks for masks that are constructed by sampling N times without replacement from π, the following holds.

Proposition 1. There exists a maximizer of Problem 4.9 that is supported on an index set of size at most N.
Proof. Let the distribution π̂_N be a maximizer of Problem 4.9. We are interested in finding the support of π̂_N. Because Σ_{|ω|=N} Pr[ω] = 1, note that

max_{π∈S^{P−1}} η_m(π) := max_{π∈S^{P−1}} Σ_{|ω|=N} (1/m) Σ_{i=1}^m η(x_i; ω) · Pr[ω|π]
≤ max_{π∈S^{P−1}} max_{|ω|=N} (1/m) Σ_{i=1}^m η(x_i; ω)
= max_{|ω|=N} (1/m) Σ_{i=1}^m η(x_i; ω).

Let ω̂_N be an index set of size N that maximizes the last line above. The above holds with equality when Pr[ω̂_N] = 1 and Pr[ω] = 0 for ω ≠ ω̂_N, and π = π̂_N. This in turn happens when π̂_N is supported on ω̂_N. That is, there exists a maximizer of Problem 4.9 that is supported on an index set of size N. □

While this observation does not indicate how to find this maximizer, it nonetheless allows us to simplify Problem 4.9. More specifically, the observation that a distribution π̂_N has a compact support of size N implies the following:

Proposition 2.
Problem 4.9 ≡ max_{|ω|=N} (1/m) Σ_{i=1}^m η(x_i; ω). (4.10)

Proof. Proposition 1 tells us that a solution of Problem 4.9 is supported on a set of size at most N, which implies

Problem 4.9 ≡ max_{π∈S^{P−1} : |supp(π)|=N} η_m(π). (4.11)

That is, we only need to search over compactly supported distributions π. Let S_Γ denote the standard simplex on a support Γ ⊂ [P]. It holds that

Problem 4.11 ≡ max_{|Γ|=N} max_{π∈S_Γ} η_m(π)
= max_{|Γ|=N} max_{π∈S_Γ} (1/m) Σ_{i=1}^m η(x_i; Γ) · Pr[Γ|π]
= max_{|Γ|=N} max_{π∈S_Γ} (1/m) Σ_{i=1}^m η(x_i; Γ)
= max_{|Γ|=N} (1/m) Σ_{i=1}^m η(x_i; Γ). (4.12)

To obtain the second and third equalities, one observes that all masks have a common support Γ with N elements, i.e. π ∈ S_Γ allows only for a single mask ω with N elements, namely ω = Γ. □

The framework of Problem 4.9 captures most variable-density based approaches of the literature that are defined in a learning-based fashion (Knoll et al., 2011b; Ravishankar and Bresler, 2011a; Vellagoundar and Machireddy, 2015; Haldar and Kim, 2019; Bahadir et al., 2019; Sherry et al., 2019), and Proposition 2 shows that Problem 4.12, which we tackled in Gözcü et al. (2018); Gözcü et al. (2019); Sanchez et al. (2020a) and develop in Chapter 5, also aims at solving the same problem as these probabilistic approaches. Note that while the present argument considers sampling points in the Fourier space, it is readily applicable to the Cartesian case, where full lines are added to the mask at once.

Several additional observations and remarks can be made. The argument presented here is mainly a justification of the objective of Equation 4.5 that we use for a greedy mask optimization: when trying to find the best performing mask, we do not aim at solving a fundamentally different problem than the one of finding the optimal sampling density in a data-driven approach. Indeed, Proposition 2 shows that we do not need to find a density, but that finding the support of cardinality N of an optimal distribution is sufficient. And Proposition 1 shows the existence of an optimal distribution with such a support.

The argument does not formally prove the suboptimality of variable-density sampling, although one can argue for it from the intuition that these results bring. Indeed, any distribution that is not compactly supported on a set of N optimal locations leaves open the possibility of a suboptimal mask being drawn when sampling without replacement. The heuristic of picking sampling locations at random without replacement could be a cause of the worse practical performance of VDS compared to LBCS, as we will see in Section 5.4 of the next chapter.
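The mechanics behind Propositions 1 and 2 can be illustrated with a toy numerical check: once a PMF π is supported on exactly N locations, drawing N samples without replacement always returns the same mask, so the expectation over ω(π, N) in Problem 4.9 collapses to the purely combinatorial objective of Problem 4.10. The snippet below is only an illustration of this collapse; the probabilities and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 8, 3
pi = np.zeros(P)
pi[[1, 4, 6]] = [0.2, 0.5, 0.3]      # PMF supported on exactly N locations

def draw_mask(pi, N, rng):
    # omega(pi, N): draw N locations without replacement from pi
    return frozenset(rng.choice(len(pi), size=N, replace=False, p=pi))

masks = {draw_mask(pi, N, rng) for _ in range(1000)}
print(masks)   # a single mask, {1, 4, 6}: the expectation in (4.9) is degenerate
```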
Finally, these propositions do not prescribe any particular algorithm that should be used to solve Problem 4.5. Rather, they are limited to establishing an equivalence between problems, and do not speak either about the expected generalization of these solutions. Generalization is however addressed by standard learning theoretic arguments (Gözcü et al., 2018).

Bibliographic note. Propositions 1 and 2 are due to Armin Eftekhari.

5 Scalable learning-based sampling optimization

In this chapter, we present our extensions to the framework of Gözcü et al. (2018). We primarily aim at drastically improving the scalability of their method. We begin by presenting in detail their learning-based CS (LBCS) framework in Section 5.1. In Section 5.2, we take a short detour through the field of submodular function maximization, from which Gözcü et al. (2018) took inspiration, and discuss how algorithms that have been proposed in this context could help improve the LBCS method. In Section 5.3, we present our two improvements over LBCS, namely lazy LBCS and stochastic LBCS. We then carry out extensive validation of these methods in Sections 5.4 and 5.5, and discuss the results in Section 5.6.¹

¹ The work in this chapter is based on the following publications: Gözcü, B., Sanchez, T., and Cevher, V. (2019). Rethinking sampling in parallel MRI: A data-driven approach. In 27th European Signal Processing Conference (EUSIPCO). Sanchez, T., Gözcü, B., van Heeswijk, R. B., Eftekhari, A., Ilıcak, E., Çukur, T., and Cevher, V. (2020a). Scalable learning-based sampling optimization for compressive dynamic MRI. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8584–8588.

5.1 Learning-based Compressive Sensing (LBCS)

LBCS aims at designing a non-adaptive mask in a sequential fashion, and tackles Problem 4.5. Solving this problem in a naive, brute-force manner would only be possible in very simple cases, as the total number of masks grows exponentially, at a rate O(2^P). Therefore, one must use approximate solutions to this problem. Gözcü et al. (2018) proposed a greedy, parameter-free approach, where a mask is grown sequentially by adding elements to it in an iterative fashion. This approach, called Learning-Based Compressed Sensing (LBCS), reduces the number of configurations evaluated to a complexity of O(P²). The simplified procedure is illustrated on Figure 5.1.

Figure 5.1: Illustration of the LBCS greedy approach. On the right, we start from an empty mask, and select the line that yields the best performance. Once it is acquired, we keep this mask, try adding to it the line that yields the best performance given what has been acquired, and continue until the final sampling budget is reached. On the left, we see how the performance is assessed: we use a set D of training data, which are subsampled according to the current mask, then reconstructed and evaluated using the performance metric η. The computation is repeated for each different mask.

A couple of elements must be introduced in order to formally define their algorithm. To implement the constraints of sampling in MRI, typically acquiring full lines in the Fourier space at once, we define a set of subsets S of {1, . . . , P} and assume that the final mask will be constructed as

ω = ∪_{j=1}^t S_j, s.t. S_j ∈ S, (5.1)

for some t > 0.
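For a 2D Cartesian acquisition, each element of S is simply the set of k-space indices covered by one phase-encoding line. The following sketch makes this concrete under the assumption of a row-major vectorization of an H × W k-space grid; the indexing convention is only illustrative and not tied to a specific implementation.

```python
def phase_encode_subset(row, W):
    """Indices in {0, ..., P-1} covered by one Cartesian phase-encoding line,
    assuming a row-major vectorization of the H x W k-space grid."""
    return {row * W + col for col in range(W)}

H, W = 4, 6
S = [phase_encode_subset(r, W) for r in range(H)]   # candidate subsets of Eq. (5.1)
omega = S[0] | S[2]                                 # a mask built from t = 2 lines
```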
Gözcü et al. (2018) also introduce a cost function c(ω) ≥ 0 and a maximal cost Γ that the final mask ω must not exceed. In the case of Problem 4.5, we have c(ω) = |ω| and Γ = N. With these two additions, the LBCS method (Gözcü et al., 2018) is displayed in Algorithm 1. The procedure is very simple: while budget is available (l.2), the algorithm tests all feasible candidates S ∈ S (ll.3-7) and permanently adds to the sampling mask the candidate that yields the largest performance gain (l.8).

Algorithm 1 LBCS
Input: Training data x_1, . . . , x_m, decoder f_θ, sampling subsets S, cost function c, maximum cost Γ
Output: Sampling pattern ω
1: ω ← ∅
2: while c(ω) ≤ Γ do
3:   for S ∈ S such that c(ω ∪ S) ≤ Γ do
4:     ω′ = ω ∪ S
5:     For each j, set y_j ← P_{ω′} F x_j, x̂_j ← f_θ(y_j, ω′)
6:     η(ω′) ← (1/m) Σ_{j=1}^m η(x_j, x̂_j)
7:   end for
8:   ω ← ω ∪ S*, S* = arg max_{S : c(ω∪S)≤Γ} η(ω ∪ S) − η(ω)
9: end while
10: return ω

The greedy approach of LBCS brings multiple advantages over previous heuristics such as variable-density sampling. First, the mask ω has a nested structure: by recording the order in which elements were added to the mask, it is possible to immediately adapt to different costs Γ. This feature is not available in many approaches that directly optimize the sampling mask for a given sampling rate. In addition, this approach does not have any parameter that requires tuning, making it particularly easy to implement in practice. It also does not rely on any probability distribution from which candidate masks are drawn, which makes it fully learning-based. Finally, the algorithm is not tied to a specific reconstruction method or performance metric, which makes it easy to integrate in novel approaches.

5.2 LBCS motivation: Submodularity

The idea of applying a greedy approach to the problem of mask design was initially motivated by techniques used in the field of submodular function maximization (Baldassarre et al., 2016; Krause and Golovin, 2014). We discuss these concepts in greater depth here, as our approaches for scaling up LBCS are also motivated by algorithms proposed in the context of submodular function maximization.

Submodular functions have been studied for a long time (Nemhauser et al., 1978; Minoux, 1978), and formalize among other things the idea of diminishing returns, i.e. the idea that adding an element to a smaller set increases the objective more than when it is added to a larger set. Formally:

Definition (Submodularity). A set function η(ω) mapping subsets ω ⊆ [P] to real numbers is said to be submodular if, for all ω_1, ω_2 ⊂ [P] with ω_1 ⊆ ω_2 and all i ∈ [P], it holds that
η(ω_1 ∪ {i}) − η(ω_1) ≥ η(ω_2 ∪ {i}) − η(ω_2). (5.2)
The function is said to be modular if the same holds true with equality in place of inequality.

An example of this is sensor placement. As one places more sensors to cover an area more fully, the areas covered by the sensors will often overlap, and so adding a new sensor will bring less information if a large set of sensors is already available. This is illustrated on Figure 5.2. This typically translates into problems of submodular function maximization, where a common constraint lies on the cardinality of the solution:

max_{ω⊆[P]} η(ω) subject to |ω| ≤ N. (5.3)
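To make the greedy selection of Algorithm 1 and the notion of diminishing returns concrete, the toy sketch below greedily maximizes a coverage objective, in the spirit of the sensor example of Figure 5.2, under the cardinality constraint of Problem 5.3. The coverage function merely stands in for η; in LBCS, evaluating a candidate instead requires reconstructing the training data.

```python
def coverage(omega, regions):
    # Toy submodular objective: number of distinct points covered by the
    # selected sensors (cf. Figure 5.2); a stand-in for eta(omega).
    return len(set().union(*(regions[i] for i in omega))) if omega else 0

def greedy(regions, N):
    """Greedy maximization under a cardinality constraint, mirroring Algorithm 1."""
    omega = set()
    while len(omega) < N:
        gains = {i: coverage(omega | {i}, regions) - coverage(omega, regions)
                 for i in range(len(regions)) if i not in omega}
        omega.add(max(gains, key=gains.get))     # largest marginal gain
    return omega

regions = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}]      # toy "sensor footprints"
print(greedy(regions, 2))                        # {0, 2}: covers 7 points
```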
In the case of linear reconstruction in MRI, it can be shown that the problem of finding the sampling mask that gives minimal mean squared error is a modular optimization problem: there are no diminishing returns, and each component can be optimized individually (Baldassarre et al., 2016). In the case of non-linear reconstruction in MRI, there is no formal proof that submodularity exactly holds, but empirical results show a diminishing return throughout the acquisition. This is what motivated Gözcü et al. (2018) to leverage tools from submodular optimization for mask design in MRI.

Figure 5.2: Illustration of the diminishing return property of submodular functions. If η quantifies the total coverage of the sensors, we see that adding sensor 3 to sensor 1 increases the total coverage more than adding it to the set {1, 2}, as there is more overlap in coverage in the latter case.

Indeed, greedy algorithms have a special place in submodular optimization because of the celebrated result by Nemhauser et al. (1978), who proved that the greedy algorithm provides a good approximation to the optimal solution of NP-hard instances of submodular function maximization (Krause and Golovin, 2014).

Theorem 1 (Nemhauser et al. (1978)). Let η : 2^{[P]} → R_+ be a nonnegative monotone submodular function. Let ω_t, t ≥ 0 be the sequence obtained by the greedy algorithm. Then, for all positive t, N,
η(ω_t) ≥ (1 − e^{−t/N}) max_{ω:|ω|≤N} η(ω). (5.4)
In particular, for t = N, η(ω_N) ≥ (1 − 1/e) max_{ω:|ω|≤N} η(ω).

For many classes of submodular functions, this result is the best that can be achieved with any efficient algorithm (Krause and Golovin, 2014). In particular, Nemhauser and Wolsey (1978) proved that any algorithm that can only evaluate η on a polynomial number of sets will not be able to obtain an approximation guarantee better than (1 − 1/e). This property could be key to explaining the strong empirical performance of the LBCS algorithm of Gözcü et al. (2018) in the context of MRI.

Although the approximation guarantee cannot be improved in general, many works have proposed more efficient approaches than the simple greedy algorithm. Two works are of particular interest to us, as they serve as the motivation for the algorithms that we will describe in the next section. These algorithms are the lazy greedy (Minoux, 1978) and the stochastic greedy, called lazier than lazy greedy (Mirzasoleiman et al., 2015).

The lazy greedy algorithm relies on the fact that if, at the i-th iteration of the greedy algorithm, an element S is expected to bring a marginal benefit ∆(S|ω_i) = η(ω_i ∪ S) − η(ω_i), then it holds that ∆(S|ω_j) ≤ ∆(S|ω_i) for all j ≥ i for a submodular function η. Exploiting this fact, one can keep a list of upper bounds on the marginal benefits of each element S, denoted ρ(S) and initialized at +∞. Then, at each iteration, the element S with the largest value ρ(S) is picked and updated as ρ(S) = ∆(S|ω). If ρ(S) ≥ ρ(S′) ∀S′ ∈ S, the marginal benefit of this element S is at least as large as the upper bounds of every other marginal contribution, and it is consequently added permanently to ω. This can speed up the algorithm by several orders of magnitude, since ρ(S) ≥ ρ(S′) can be satisfied after a few trials, instead of trying each available S at every iteration (Krause and Golovin, 2014).
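The bookkeeping described above, and applied to mask design in Algorithm 3 later on, can be sketched with a max-heap of upper bounds. In the sketch, `marginal_gain(S, omega)` is a placeholder for the expensive oracle ∆(S|ω) (in LBCS, one reconstruction pass over the training data); the function names are illustrative only.

```python
import heapq

def lazy_greedy(candidates, marginal_gain, N):
    """Lazy greedy selection (Minoux, 1978), sketched with a max-heap of bounds."""
    omega = []
    heap = [(-float("inf"), idx, S) for idx, S in enumerate(candidates)]
    heapq.heapify(heap)                                # rho(S) = +inf initially
    while len(omega) < N and heap:
        neg_rho, idx, S = heapq.heappop(heap)          # candidate with largest bound
        gain = marginal_gain(S, omega)                 # refresh rho(S)
        if not heap or -gain <= heap[0][0]:            # rho(S) >= every other bound
            omega.append(S)                            # accept S permanently
        else:
            heapq.heappush(heap, (-gain, idx, S))      # re-insert, bound tightened
    return omega
```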
The stochastic greedy algorithm (Mirzasoleiman et al., 2015) samples, at each step of the greedy algorithm, a subset of candidates S_iter ⊆ S with |S_iter| = k at line 3 of Algorithm 1. As a result, given a nonnegative monotone submodular function η : 2^{[P]} → R_+ and taking k = (P/N) log(1/ϵ), the stochastic greedy algorithm achieves a (1 − 1/e − ϵ) approximation guarantee in expectation to the optimum, using only O(P log(1/ϵ)) evaluations.

Both of these algorithms inspired the methods that we proposed to scale up LBCS, and their implementation in the context of MRI will be covered in Sections 5.3.3 and 5.3.2 respectively.

Remark (Submodular optimization under noise). Although the discussion so far mostly considered optimizing submodular functions without measurement noise, such a setting is not realistic for prospective MRI acquisitions, where repeated acquisitions of a line will have different noise realizations, and resampling can improve the average signal. Nonetheless, submodular optimization under noise has been studied, and algorithms that adapt to such settings have been proposed (Singla et al., 2016; Hassidim and Singer, 2017).

5.3 Scaling up LBCS

5.3.1 Limitations of LBCS

As we have hinted above, in many practical scenarios, evaluating O(P²) configurations remains very expensive. This is also due to the fact that each configuration requires reconstructing m data points. In particular, the LBCS approach becomes prohibitively expensive when any of the following occurs:
1. The number of candidate locations grows significantly. This is particularly the case in 3D MRI, where one needs to undersample entire volumes, or in dynamic MRI, where one undersamples multiple frames for a single 2D image.
2. The size of the dataset grows, making it impractical to evaluate LBCS on the whole training set at once.
3. The running time of the reconstruction method becomes large. This is particularly the case for multi-coil MRI, which is the setting most commonly used in practice. This issue also occurs when dealing with larger data, from 3D or dynamic MRI.

As a result, we wish to develop variants of LBCS that can tackle these different settings, while retaining the performance of LBCS as much as possible. We will discuss two variants of LBCS that were respectively proposed in Gözcü et al. (2019) and Sanchez et al. (2020a), and that mostly aim at tackling the issues of multi-coil and 3D MRI as well as dynamic MRI. We start by explaining how they differ from LBCS.

5.3.2 Stochastic LBCS

With stochastic LBCS (sLBCS), we aim at addressing two fundamental drawbacks of the approach of Gözcü et al. (2018), namely that 1. its scaling is quadratic with respect to the total number of readout lines, and 2. its scaling is linear with respect to the size of the dataset. With this approach, we also aim at integrating a specific constraint of dynamic MRI (dMRI), namely that the sampling mask should have a fixed number of readouts per temporal frame.
Scalability with respect to the size of the dataset is essential for the applicability of sLBCS to modern datasets, which can contain several thousands of images, making it impractical to go through the entire dataset at each inner loop of the algorithm.

Before describing the algorithm, let us briefly recall the dynamic MRI setting. The main difference with static MRI is that the signal is a vectorized video, i.e. x ∈ C^P, where P = H × W × T instead of H × W. As a result, the operator A_ω will be constructed differently, and in particular, the Fourier transform will be applied frame-wise. In addition, the mask ω will be constructed as a union of elements from a set S that represents lines at different time frames, described as S = S_1 ∪ . . . ∪ S_T. We also define the set L ⊆ {1, . . . , m} as a set of indices used to subsample the training data {x_i}_{i=1}^m.

Let us now detail how our proposed stochastic greedy method solves limitations 1. and 2. of LBCS (Algorithm 1). Issue 1. is solved by picking uniformly at random, at each iteration, a batch of possible readout lines S_iter of constant size k from a given frame S_t, instead of considering the full set of possible lines S (line 3 in Algorithm 2); this is inspired by the stochastic greedy of Mirzasoleiman et al. (2015). The number of candidate locations is further reduced by iterating through the lines to be added from each frame S_t sequentially (lines 1, 3 and 11 in Algorithm 2). Issue 2. is addressed by considering a fixed batch of training data L of size l instead of the whole training set of size m at each iteration (line 4 in Algorithm 2).

Remark. We will discuss the impact of introducing the batch size k for the readout lines and the subset L subsampling the training data in the results. However, we want to mention here that one has to be careful to pick the subset L before the for loop of line 5 in Algorithm 2. Failing to do so would result in the different candidates S being evaluated on different subsets of the data. As there is some variation within the data, for instance in terms of SNR, such an approach would yield performances that are not comparable among different candidates. In turn, this would skew the results towards picking the candidate location S with an associated subset L_S with high signal-to-noise ratio, at the expense of selecting a readout location for the performance improvement that it actually brings. This would be especially the case if |L| were small. Picking L before this for loop ensures that the readout line selected will indeed be the one that brings the largest performance gain among the candidates for the subset of data L.

The improvements of sLBCS reduce the computational complexity from Θ(mr(HT)²) to Θ(lrkHT), effectively speeding up the computation by a factor Θ((m/l) · (HT/k)), where r denotes the cost of a single reconstruction. Our results show that this is achieved without sacrificing any reconstruction quality. We will see that the improvement is particularly significant for large datasets, as the batch size l typically remains small, up to 100 samples, whereas MRI datasets can contain more than 15 000 data points (Zbontar et al., 2019).
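As an illustration of this factor, consider the cardiac setting used below (H = 152 phase-encoding lines per frame, T = 17 frames, m = 3 training volumes) with the batch sizes adopted later (k = 38, l = 1); the resulting theoretical speedups are the ones reported in parentheses in Table 5.1:

(m/l) · (HT/k) = 3 · (152 · 17 / 38) = 3 · 68 = 204 for SG-v2, and HT/k = 68 for SG-v1, where l = m.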
Algorithm 2 sLBCS: Greedy mask optimization algorithm for (d)MRI
Input: Training data {x_i}_{i=1}^m, decoder f_θ, sampling set S, max. cardinality n, sampling batch size k, training batch size l
Output: Sampling pattern ω
1: (dMRI) Initialize t = 1
2: while |ω| ≤ n do
3:   Pick S_iter at random, with |S_iter| = k, where S_iter ⊆ S (MRI) or S_iter ⊆ S_t (dMRI)
4:   Pick L ⊆ {1, . . . , m}, with |L| = l
5:   for S ∈ S_iter such that |ω ∪ S| ≤ n do
6:     ω′ = ω ∪ S
7:     For each ℓ ∈ L, set x̂_ℓ ← f_θ(P_{ω′} F x_ℓ, ω′)
8:     η(ω′) ← (1/|L|) Σ_{ℓ∈L} η(x_ℓ, x̂_ℓ)
9:   end for
10:  ω ← ω ∪ S*, where S* = arg max_{S∈S_iter : |ω∪S|≤n} η(ω ∪ S)
11:  (dMRI) t = (t mod T) + 1
12: end while
13: return ω

5.3.3 Lazy LBCS

In cases where the cardinality of the candidate set S grows extremely large, even Algorithm 2 can become impractical. In 2D MRI, LBCS deals with images of size H × W while having a candidate set of size H, with H different lines. In the case of 3D MRI, however, the number of candidate locations explodes, as we move to images of size P = H × W × S, where S is the number of slices, with a candidate set of size H × W, i.e., one is allowed to sample in both phase and frequency encoding directions. While in 2D dynamic MRI the candidate set is of size H × T, with T typically much smaller than H, in the case of 3D MRI we often have H ≈ W. In addition, we have less structure than in dMRI, as there is no option to cycle through frames. This implies that the batch of randomly sampled locations must be much larger to enable a good performance, which results in the cost remaining prohibitive overall. An additional reason for this increased cost is the reconstruction time of 3D methods. The number of slices can be much larger than the number of frames in dMRI, slowing down reconstruction considerably. Moving from single coil to multi-coil data further increases this cost, although this change does not increase the complexity of the sampling optimization problem.

Lazy LBCS proposes a solution to this limitation by leveraging another approach to speed up LBCS. It uses lazy evaluations (Minoux, 1978), which are equivalent to, but faster than, the greedy algorithm for submodular functions, and possess optimization guarantees for such functions (Nemhauser et al., 1978). While in our present setting verifying submodularity is difficult, because writing the performance of reconstructions in a closed form is generally not possible, we were motivated by the fact that Algorithm 3 works well in practice even if the objective function is not exactly submodular. Indeed, as we will observe in the experiments, lazy evaluations also perform well for mask optimization.

The implementation of lazy evaluations for LBCS is described in Algorithm 3. A list of upper bounds is initially computed (l.2), and we sequentially traverse this list until we find a candidate sample S that brings a larger improvement than any of the upper bounds in ρ (ll.8-9).

Algorithm 3 Lazy Learning-based Compressive Sensing (lLBCS)
Input: Training data x_1, . . . , x_m, decoder f_θ, sampling subsets S, cost function c, maximum cost Γ
Output: Sampling pattern ω
1: ω ← ∅
2: ρ(S) ← +∞ ∀S ∈ S s.t. c(ω ∪ S) ≤ Γ
3: while c(ω) ≤ Γ do
4:   ω′ ← ω ∪ S, where S = arg max_{S′∈S : c(ω∪S′)≤Γ} ρ(S′)
5:   For each j, set y_j ← P_{ω′} F x_j, x̂_j ← f_θ(y_j, ω′)
6:   η(ω′) ← (1/m) Σ_{j=1}^m η(x_j, x̂_j)
7:   ρ(S) ← η(ω′) − η(ω)
8:   if ρ(S) ≥ ρ(S′) ∀S′ ∈ S s.t. c(ω ∪ S′) ≤ Γ then
9:     ω = ω ∪ S
10:  end if
11: end while
12: return ω
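To summarize this section, the sketch below condenses the selection loop of the stochastic variant (Algorithm 2, in its static-MRI form without frame cycling); the lazy variant (Algorithm 3) instead replaces the random candidate batch with the upper-bound bookkeeping sketched earlier. Here, `reconstruct` and `metric` are placeholders for the decoder f_θ and the performance metric η, and the data handling is deliberately simplified.

```python
import numpy as np

def slbcs(train, candidates, reconstruct, metric, n, k, l, rng=None):
    """Condensed sketch of the sLBCS loop (Algorithm 2, static-MRI variant)."""
    rng = rng or np.random.default_rng()
    omega, remaining = set(), list(candidates)
    while len(omega) < n and remaining:
        # S_iter: a random batch of k candidate lines (line 3 of Algorithm 2)
        batch = rng.choice(len(remaining), size=min(k, len(remaining)), replace=False)
        # L: a random batch of l training images, drawn once per iteration (line 4)
        data = rng.choice(len(train), size=min(l, len(train)), replace=False)

        def score(line):  # empirical performance with the candidate line added (ll. 5-9)
            trial = omega | {line}
            return np.mean([metric(train[i], reconstruct(train[i], trial))
                            for i in data])

        best = max((remaining[j] for j in batch), key=score)   # line 10
        omega.add(best)
        remaining.remove(best)
    return omega
```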
5.4 Experiments - Stochastic LBCS

In this section, we validate the performance of sLBCS; lLBCS will be discussed in Section 5.5. We carry out extensive experiments to validate our stochastic LBCS method. We first compare sLBCS against the vanilla LBCS method of Gözcü et al. (2018), paying particular attention to how it compares to the original method in terms of performance and computational complexity. We study the influence of the batch sizes k and l on performance and on the design of the mask. Then, we carry out experiments on both single coil and multi-coil data, as well as noise-free and noisy data. Finally, we show the benefit of our method on large scale datasets, for which LBCS is prohibitively expensive.

5.4.1 Experimental setting

Datasets. Our experiments were carried out on three different datasets.

• Cardiac dataset. The dataset was acquired in seven healthy adult volunteers with a balanced steady-state free precession (bSSFP) pulse sequence on a whole-body Siemens 3T scanner using a 34-element matrix coil array. Several short-axis cine images were acquired during a breath-hold scan. Fully sampled Cartesian data were acquired using a 256 × 256 grid, with relevant imaging parameters including 320 × 320 mm field of view (FoV), 6 mm slice thickness, 1.37 × 1.37 mm spatial resolution, 42.38 ms temporal resolution, 1.63/3.26 ms TE/TR, 36° flip angle, and 1395 Hz/px readout bandwidth. There were 13 phase encodes acquired for a frame during one heartbeat, for a total of 25 frames after the scan. The Cartesian cardiac scans were then combined into single-coil data from the initial 256 × 256 × 25 × 34 size, using adaptive coil combination (Walsh et al., 2000; Griswold et al., 2002b). This single-coil image was then cropped to a 152 × 152 × 17 image. This was done because a large portion of the periphery of the images is static or void, and also to enable greater computational efficiency. In the experiments, we used three volumes for training and four for testing.

• Vocal dataset. The vocal dataset that we used in the experiments comprised 4 vocal-tract scans with a 2D HASTE sequence (T2-weighted single-shot turbo spin-echo) on a 3T Siemens Tim Trio using a 4-channel body matrix coil array. The study was approved by the local institutional review board, and informed consent was obtained from all subjects prior to imaging. Fully sampled Cartesian data were acquired using a 256 × 256 grid, with 256 × 256 mm field of view (FoV), 5 mm slice thickness, 1 × 1 mm spatial resolution, 98/1000 ms TE/TR, 150° flip angle, 391 Hz/px readout bandwidth, and 5.44 ms echo spacing (256 turbo factor). There was a total of 10 frames acquired, which were also recombined into single-coil data using adaptive coil combination (Walsh et al., 2000; Griswold et al., 2002b).

• fastMRI. The fastMRI dataset was obtained from the NYU fastMRI initiative (Zbontar et al., 2019). The anonymized dataset comprises raw k-space data from more than 1,500 fully sampled knee MRIs obtained on 3 and 1.5 Tesla magnets. The dataset includes coronal proton density-weighted images with and without fat suppression.

Reconstruction algorithms. We consider two reconstruction algorithms, namely k-t FOCUSS (KTF) (Jung et al., 2009) and ALOHA (Jin et al., 2016). Their parameters were selected to maintain a good empirical performance across all sampling rates considered. KTF aims at reconstructing a dynamic sequence by enforcing its sparsity in the x-f domain.
Formally, it aims at solving

min_z ∥z∥_1 subject to ∥y_ω − P_ω F F_t^{−1} z∥_2 ≤ ϵ,

where F_t^{−1} denotes the inverse temporal Fourier transform. It tackles the problem using a reweighted quadratic approach (Jung et al., 2007). ALOHA (Jin et al., 2016) is an annihilating filter based low-rank Hankel matrix approach. ALOHA exploits the duality between sparse signals and their transform domain representations, cast as a low-rank Hankel matrix, and effectively transforms CS reconstruction into a low-rank matrix completion problem. We refer the reader to the original work for a detailed derivation.

Mask selection baselines:
• Coherence-VD (Lustig et al., 2007): We consider a random variable-density sampling mask with Gaussian density and optimize its parameters to minimize coherence. This is an instance of a model-based, model-driven approach.
• Golden (Li et al., 2018): This approach designs a Cartesian mask by progressively adding lines according to the golden ratio. This is a model-based, model-driven approach.
• LB-VD (Gözcü et al., 2018; Gözcü et al., 2019): Instead of minimizing the coherence as in Coherence-VD, we perform a grid search on the parameters using the training set to optimize reconstruction according to the same performance metric as our method. This makes this approach model-based, data-driven.

We also consider several variants of LBCS and sLBCS that we will use throughout the experiments. We define abbreviations for conciseness:
• G-v1 is the original LBCS algorithm.
• G-v2 is a version of sLBCS that uses a batch of training data l and cycles through the frames.
• SG-v1 is a version of the stochastic LBCS algorithm that considers a batch of sampling locations k but uses all training data, i.e. l = m.
• SG-v2 is sLBCS, using a batch of sampling locations k, a batch of training data l and cycling through the frames.

Remark. In this work, we mainly focused on small scale datasets with a large set of candidate sampling locations. Our experiments are therefore carried out using SG-v1 and SG-v2. We did not observe any significant performance difference between these variants of sLBCS. However, we note that G-v2 might be of interest in deep learning based settings, where one deals with large datasets, such as fastMRI (Zbontar et al., 2019). This will be further discussed in Section 5.4.5.

5.4.2 Comparison of greedy algorithms

We first evaluate SG-v1 and SG-v2 against G-v1 to establish how sensitive our proposed method is to its parameters k and l. We carry out this experiment on the cardiac data, and report only the results on KTF, although similar trends can be observed with other reconstruction algorithms.

Figure 5.3: PSNR as a function of the sampling rate for KTF, comparing the effect of the batch size on the quality of the reconstruction for SG-v1. The result is averaged on 4 testing images of size 152 × 152 × 17.

Influence of the batch size k on the mask design. We first discuss the tuning of the batch size used in SG-v1, to specifically study the effect of different batch sizes. Figure 5.3 provides quantitative results of the performance, while Figure 5.4 shows the shapes of the masks obtained with various batch sizes. Unsurprisingly, small batch sizes yield poor results, as the algorithm has to choose from a very small set of randomly sampled candidates, and is likely to miss out on important frequencies, such as the center ones.
It is interesting to see, however, that the PSNR reaches the result of G-v1 with as few as 38 samples (out of the 152 candidates of a frame, and 152 × 17 = 2584 candidates overall). In some cases, such as k = 38, SG-v1 can even yield an improved performance over G-v1, with 60 times less computation than G-v1. We hypothesize that this is due to the noise introduced by sLBCS allowing it to avoid suboptimal configurations encountered by G-v1. Note that this might not be expected if the problem were exactly submodular, but this is not the case in practice.

Focusing on Figure 5.4, one can make several observations. First of all, as expected, taking a batch size of 1 yields a totally random mask, and taking a batch size of 5 yields a mask that is more centered towards low frequencies than the one with k = 1, but it still has a large variance. Then, as the batch size increases, the resulting masks seem to converge to very similar designs, although these are slightly different from the ones obtained with G-v1. Overall, our initial experiments suggest, as a rule of thumb, that sampling roughly 25% of a frame enables sLBCS to match the performance of G-v1.

Figure 5.4: Learning-based masks obtained with SG-v1 for different batch sizes k using KTF as a reconstruction algorithm, shown in the title of each column, for 15% and 25% sampling rate. The optimization used data of size 152 × 152 × 17, with a total of 2584 possible phase encoding lines for the masks to pick from.

Influence of the batch size l. We then include the effect of using a smaller data batch size l into the equation, comparing the performances of G-v1 with SG-v1 and SG-v2. The results are presented on Figure 5.5. Both SG-v1 and SG-v2 behave similarly, and for k = 38, both methods end up outperforming G-v1, with respectively 60 times and 180 times less computation. As seen before, using a batch size k that is too small (e.g. 10) yields a drop in performance. Using a batch of training images l (SG-v2) does not seem to significantly reduce the performance compared to SG-v1, while substantially reducing computations. As a result, in the sequel, we use k = 38 and l = 1 for SG-v2.

Figure 5.5: PSNR as a function of the sampling rate for KTF, comparing the different reconstruction methods as well as the effect of the batch size on the quality of the reconstruction for SG.

Computational costs. We now discuss the computational costs for the different variations of the greedy methods used in the single coil experiments. Table 5.1 provides the running times and empirically measured speedups for the greedy variations, and Table 5.2 provides the computational times required to obtain the learning-based variable-density (LB-VD) parameters through an extensive grid search. The empirical speedup is computed as

Speedup = (t_G-v1 · n_procs,G-v1) / (t_SG-v2 · n_procs,SG-v2), (5.5)

where t_G-v1, t_SG-v2 refer to the measured running times of the algorithms, and n_procs,G-v1, n_procs,SG-v2 to the number of CPU processes used to carry out the computations.
Algorithm | Setting | G-v1 (LBCS): Time | n_procs | SG-v1 (sLBCS-v1): Time | n_procs | Speedup | SG-v2 (sLBCS-v2): Time | n_procs | Speedup
KTF | 152×152×17×3 | 6d 23h | 152 | 11h 40 | 38 | 58 (68) | 3h 25 | 38 | 170 (204)
KTF | 256×256×10×2 | ∼ 7d 8h∗ | 256 | 12h 20 | 64 | 57 (68) | 5h 20 | 64 | 173 (204)
IST | 152×152×17×3 | 3d 11h | 152 | 5h 30 | 38 | 60 (68) | 1h 37 | 38 | 184 (204)
ALOHA | 152×152×17×3 | ∼ 25d 1h∗ | 152 | 1d 14h 25 | 38 | 62 (68) | 18h 13 | 38 | 133 (204)

Table 5.1: Running time of the greedy algorithms for different decoders and training data sizes. The setting corresponds to n_x × n_y × n_frames × n_train. The number n_procs denotes how many parallel processes are used by each simulation. ∗ means that the runtime was extrapolated from a few iterations. We used k = n_procs for SG-v1 and SG-v2, and l = 1 for SG-v2. The speedup columns contain the measured speedup and the theoretical speedup in parentheses.

Algo. | Setting | n_pars | n_procs | Time
KTF | 152×152×17×3 | 1200 | 38 | 6h 30
KTF | 256×256×10×2∗ | 2400 | 64 | 6h 45
IST | 152×152×17×3 | 1200 | 38 | 3h 20
ALOHA | 152×152×17×3 | 1200 | 38 | 1d 8h

Table 5.2: Comparison of the learning-based random variable-density Gaussian sampling optimization for different settings. n_pars denotes the size of the grid used to optimize the parameters. For each set of parameters, the results were averaged over 20 masks drawn at random from the distribution considered. The n_pars comprise a grid made of 12 sampling rates (uniformly spread in [0.025, 0.3]), 10 different numbers of low frequency phase encodes (from 2 to 18 lines), and different widths of the Gaussian density (uniformly spread in [0.05, 0.3]) – 10 for the images of size 152 × 152, 20 in the other case.

The main point of these tables is to show that the computational improvement is very significant in terms of resources, and that our approach greatly improves the efficiency of the method of Gözcü et al. (2018). The measured ratio might differ from the predicted speedup factor of Θ((m/l) · (HT/k)) due to computational considerations. Table 5.1 shows that there is roughly a factor 1.2 between the predicted and the measured speedup, mainly due to the communication between the multiple processes as well as I/O operations.

Moving to the grid search computations for LB-VD, displayed in Table 5.2, several remarks are in order. The number of parameters was chosen in order to ensure that the same order of magnitude of computations is carried out for both LB-VD and sLBCS. Note also that, in contrast to SG-v2, the grid search cannot be sped up by using a batch of training data: all evaluations must be done on the same data. In the case of SG-v2, however, one can draw a new batch of training data at each sampling round of the algorithm.

5.4.3 Single coil results

Main results. We see on Figures 5.6 and 5.7 that SG-v2 brings a consistent improvement over all baselines considered, even though some variable-density techniques are able to provide good results for some sampling rates and algorithms. Compared to Coherence-VD, there is always at least a 1 dB improvement at any sampling rate, and it can be as much as 6.7 dB at 5% sampling rate for ALOHA. For the rest, the improvement is over 0.5 dB. Figure 5.6 also clearly indicates that the benefits of our learning-based framework become more apparent towards higher sampling rates, where the performance improvement over LB-VD reaches up to 1 dB. Towards lower sampling rates, with much fewer degrees of freedom to achieve good mask designs, the greedy method and LB-VD yield similar performance, which is expected, since at low sampling rates learning-based approaches will prioritize low-frequency phase encoding lines, as those contain most of the energy at a given frame.
Figure 5.6: PSNR as a function of sampling rate, evaluating the different masks (Coherence-VD, LB-VD, SG-v2-KTF, SG-v2-ALOHA) on both reconstruction methods (KTF and ALOHA). The results are averaged over 4 images.

Focusing on Figure 5.7, we observe that comparing reconstruction algorithms using the model-based, model-driven approach Coherence-VD does not allow for a faithful comparison of their performance. In this case, the performance difference between KTF and ALOHA is marginal. However, when moving to the model-based, data-driven approach LB-VD, the difference becomes quantitatively clearer. Moving to the learning-based, data-driven approach SG-v2 makes the difference even more noticeable: ALOHA with its corresponding mask clearly outperforms KTF, and this conclusion could not be reached by looking solely at reconstructions with VD-based masks. This trend is also reflected on Figure 5.6.

This is an important observation, as most works have developed new reconstruction algorithms that were then tested using random masks designed with criteria such as minimal coherence. However, our results suggest that this approach does not reflect the true performance of a reconstruction method, and that one should look to optimize the mask and the reconstruction algorithm jointly.

Figure 5.7: Comparison of the different reconstruction masks and decoders, for a sampling rate of 15% on a single sample, with its PSNR/SSIM performances.

The results on Figure 5.6 also highlight an interesting asymmetry between the performance of the SG-v2 mask optimized on KTF (SG-v2-KTF) and on ALOHA (SG-v2-ALOHA): the mask obtained with KTF is more centered towards low frequencies than the one obtained with ALOHA, and it also performs well on ALOHA, better than LB-VD. However, the reverse is not true: the mask from ALOHA is very scattered throughout k-space and leads to a very poor performance when applied to KTF. This trend is also observed on multi-coil data, as highlighted on Table 5.3. This suggests that ALOHA, which does not rely explicitly on ℓ1 sparsity, might succeed in exploiting varied information: both the low-frequency content emphasized by masks obtained with sparsity-promoting models, and higher-frequency information.

Remark. This point does not hold directly for deep learning methods, which will be discussed in later chapters. Indeed, the reconstruction methods that we use in these experiments are all based on an underlying model, such as sparsity or low-rankness. They are not data-driven approaches to reconstruction. The question of what truly reflects the performance of a reconstruction method becomes even more complex in the case of deep learning-based reconstruction methods, where the model is trained on a distribution of different sampling masks, making reconstruction and sampling even more intertwined. We will come back to this question in Chapter 6.

Remark.
The results in the sequel are obtained with the SG-v1 algorithm, rather than SG-v2. Recall that SG-v1 already accelerated G-v1 by a factor of 60, and that we did not observe, in our experiments, a significant performance difference between SG-v1 and SG-v2.³

Cross-performances of performance measures. Up to here, we used PSNR as the performance measure, and we now compare it with the results of the greedy algorithm paired with SSIM, a metric that more closely reflects perceptual similarity (Wang et al., 2004). For brevity, we only consider ALOHA in this section. In the case where we optimized for SSIM, we noticed that unless a low-frequency initial mask is given, the reconstruction quality would mostly stagnate. This is why, in the SSIM case, we chose to start the greedy algorithm with 4 low-frequency phase encodes at each frame.

The reconstructions for PSNR and SSIM are shown on Figure 5.8, where we see that the learning-based masks outperform the baselines across all sampling rates, except at 2.5% in the SSIM case. The quality of the results is very close for both masks, but each tends to perform slightly better with the performance metric for which it was trained. The fact that the ALOHA-SSIM result at 2.5% has a very low SSIM is due to the fact that we impose 4 phase encodes across all frames, and the resulting sampling mask at 2.5% is a low-pass mask in this case. A visual reconstruction is provided in Figure 5.9, where we see that there is almost no difference in reconstruction quality, and that the masks remain very similar. Overall, we observe in this case that the performance metric selection does not have a dramatic effect on the

³ The main reason of having these experiments on SG-v1 is that they wer