Introduction
Recent research [1, 2] in brain image segmentation has made significant progress in coupling local, data-efficient convolution operations with global, expressive spatial-mixing layers such as the Transformer [3] and Mamba [4]. Enhanced by the latest general-domain backbone structures, these models overcome the locality of convolutions and effectively capture long-range spatial dependencies. However, they lack an inherent structure that attends to cross-scale visual information beyond what the backbone encodes, forfeiting powerful inductive biases that are particularly relevant to medical segmentation.
Theoretical Work
Theoretically, we explore and generalize diagonal plus low-rank (DPLR) linear maps [5, 6] to a larger class of parametrized transformations and interpret ResNet-like architectures [7] through this lens.
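As a concrete illustration of the object being generalized, the sketch below applies a DPLR map y = (diag(d) + U Vᵀ) x without ever materializing the dense n × n matrix, and notes how a ResNet-style residual block y = x + F(x) fits the same diagonal-plus-correction template. This is a minimal, hypothetical sketch: the class name DPLRLinear and the dimensions are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class DPLRLinear(nn.Module):
    """Hypothetical sketch of a diagonal-plus-low-rank (DPLR) map
    y = (diag(d) + U V^T) x, applied in O(n*r) rather than O(n^2)."""

    def __init__(self, n: int, rank: int):
        super().__init__()
        self.d = nn.Parameter(torch.ones(n))                      # diagonal part
        self.U = nn.Parameter(torch.randn(n, rank) / n ** 0.5)    # low-rank factor
        self.V = nn.Parameter(torch.randn(n, rank) / n ** 0.5)    # low-rank factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., n). Diagonal action plus a rank-r correction,
        # computed factor-by-factor so the full matrix never exists.
        return x * self.d + (x @ self.V) @ self.U.T

# A residual block y = x + F(x) matches the same template: the identity
# plays the diagonal role and F(x) a generalized low-rank correction.
layer = DPLRLinear(n=256, rank=8)
y = layer(torch.randn(4, 256))   # (batch, n) -> (batch, n)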
Empirical Observations
Empirically, we propose CS-UNet, a UNet-like [8] 3D medical segmentation model that transforms latent features with strong visual priors, and we evaluate its throughput and segmentation performance on BraTS 2023 [9].