Fifteen papers by ECE researchers to be presented at the Conference on Neural Information Processing Systems
ECE researchers will be presenting fifteen papers at the 38th Annual Conference on Neural Information Processing Systems (NeurIPS), held in Vancouver, Canada, on December 10–15, 2024.
The problems tackled by ECE faculty and students include MRI and CT reconstruction, reinforcement learning, fine-tuning language models, creating visually interpretable spectrograms, and more. The following papers were accepted at the conference, with the names of ECE researchers in bold.
CONTRAST: Continual Multi-source Adaptation to Dynamic Distributions
Sk Miraj Ahmed, Fahim Faisal Niloy, Xiangyu Chang, Dripta S. Raychaudhuri, Samet Oymak, Amit Roy-Chowdhury
Abstract: Adapting to dynamic data distributions is a practical yet challenging task. One effective strategy is to use a model ensemble, which leverages the diverse expertise of different models to transfer knowledge to evolving data distributions. However, this approach faces difficulties when the dynamic test distribution is available only in small batches and without access to the original source data. To address the challenge of adapting to dynamic distributions in such practical settings, we propose Continual Multi-source Adaptation to Dynamic Distributions (CONTRAST) that handles multiple source models and optimally combines them to adapt to the dynamic test data. CONTRAST has two distinguishing features. First, it efficiently computes the optimal combination weights to combine the source models to adapt to the test data distribution continuously as a function of time. Second, it identifies which of the source model parameters to update so that only the model which is most correlated to the target data is adapted, leaving the less correlated ones untouched; this mitigates the issue of “forgetting” the source model parameters by focusing only on the source model that exhibits the strongest correlation with the test batch distribution. Experiments on diverse datasets demonstrate that the combination of multiple source models does at least as well as the best source (with hindsight knowledge), and performance does not degrade as the test data distribution changes over time (robust to forgetting).
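Conceptually, the core loop combines frozen source-model predictions with learnable weights and then adapts only the most relevant source. Below is a minimal sketch under stated assumptions: the entropy-minimization objective, the optimizer setup, and the "update only the top-weighted source" rule are illustrative, not the paper's exact procedure.
```python
# Minimal sketch (not the authors' code): combining several pretrained source
# models on an unlabeled test batch. The entropy objective and the top-weight
# selection rule are illustrative assumptions.
import torch
import torch.nn.functional as F

def combine_and_adapt(models, x, weight_logits, opt_weights):
    """models: list of source nets; x: test batch; weight_logits: learnable (S,) tensor."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(m(x), dim=-1) for m in models])  # (S, B, C)
    w = F.softmax(weight_logits, dim=0)                                 # (S,) combination weights
    mixed = torch.einsum("s,sbc->bc", w, probs)                         # ensemble prediction
    # Entropy of the combined prediction as an unsupervised adaptation signal.
    loss = -(mixed * mixed.clamp_min(1e-8).log()).sum(dim=-1).mean()
    opt_weights.zero_grad(); loss.backward(); opt_weights.step()
    # Adapt only the source model with the largest combination weight,
    # leaving the others untouched to mitigate forgetting.
    return models[int(w.argmax())], mixed.detach()
```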
Fine-grained Analysis of In-context Linear Estimation
[read the full paper – Fine-grained Analysis of In-context Linear Estimation: Data, Architecture, and Beyond]
Yingcong Li, Ankit Rawat, Samet Oymak
Abstract: Recent research has shown that Transformers with linear attention are capable of in-context learning (ICL) by implementing a linear estimator through gradient descent steps. However, the existing results on the optimization landscape apply under stylized settings where task and feature vectors are assumed to be IID and the attention weights are fully parameterized. In this work, we develop a stronger characterization of the optimization and generalization landscape of ICL through contributions on architectures, low-rank parameterization, and correlated designs: (1) We study the landscape of 1-layer linear attention and 1-layer H3, a state-space model. Under a suitable correlated design assumption, we prove that both implement 1-step preconditioned gradient descent. We show that thanks to its native convolution filters, H3 also has the advantage of implementing sample weighting and outperforming linear attention in suitable settings. (2) By studying correlated designs, we provide new risk bounds for retrieval augmented generation which reveal how ICL sample complexity significantly benefits from distributional alignment. (3) We derive the optimal risk for low-rank parameterized attention weights in terms of covariance spectrum. Through this, we also shed light on how LoRA can adapt to a new distribution by capturing the shift between task covariances. Experimental results corroborate our theoretical findings. Overall, this work explores the optimization and risk landscape of ICL in practically meaningful settings and contributes to a more thorough understanding of its mechanics.
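For readers unfamiliar with the setup, the "1-step preconditioned gradient descent" estimator referenced above can be written compactly. The squared-loss objective and initialization at zero are standard conventions, and the preconditioner P stands in for whatever the learned attention (or H3) weights induce; this is a schematic, not the paper's exact statement.
```latex
% One step of preconditioned gradient descent on the in-context least-squares
% objective, started from \beta = 0, with preconditioner P induced by the
% learned weights:
L(\beta) = \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - x_i^\top \beta\bigr)^2,
\qquad
\hat{\beta} = 0 - P\,\nabla L(0) = \frac{1}{n}\,P \sum_{i=1}^{n} y_i x_i,
\qquad
\hat{y}_{n+1} = x_{n+1}^\top \hat{\beta}.
```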
Selective Attention: Enhancing Transformer through Principled Context Control
Xuechen Zhang, Xiangyu Chang, Mingchen Li, Amit Roy-Chowdhury, Jiasi Chen, Samet Oymak
Abstract: The attention mechanism is the central component of the transformer architecture as it enables the model to create learnable weighted combinations of the tokens that are relevant to the query. While self-attention has enjoyed major success, it notably treats all queries 𝒒 in the same way by applying the mapping 𝑽ᵀsoftmax(𝐊𝒒), where 𝑽 and 𝐊 are the value and key embeddings, respectively. In this work, we argue that this uniform treatment hinders the ability to control contextual sparsity and relevance. To overcome this, we introduce a Selective Self-Attention (SSA) layer that augments the softmax nonlinearity with a principled temperature scaling strategy. SSA utilizes a query-temperature to adapt the contextual sparsity of the softmax map to the specific query and its position in the context window. Through theory and experiments, we demonstrate that this alleviates attention dilution, aids the optimization process, and enhances the model’s ability to assign distinct sparsity levels across queries. To enhance relevance control, we also introduce a value-temperature and show that it boosts the model’s ability to suppress irrelevant/noisy tokens. Extensive empirical evaluations corroborate that SSA noticeably improves language modeling performance: SSA-equipped Pythia and Llama models achieve a respectable and consistent perplexity improvement on language modeling benchmarks while introducing only about 5% more parameters.
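As a rough illustration of the query-temperature idea, here is a minimal sketch of softmax attention whose temperature is predicted per query from the query vector and its position. The small temperature network, its inputs, and the softplus parameterization are assumptions, not the exact SSA layer.
```python
# Illustrative sketch of query-dependent temperature scaling in softmax
# attention (the temperature network and its inputs are assumptions).
import torch
import torch.nn.functional as F

def selective_attention(q, k, v, temp_net):
    """q, k, v: (B, T, d); temp_net maps [query, position] (d+1 dims) to a scalar."""
    B, T, d = q.shape
    pos = torch.arange(T, device=q.device, dtype=q.dtype).view(1, T, 1).expand(B, T, 1)
    tau = F.softplus(temp_net(torch.cat([q, pos], dim=-1))) + 1e-4   # (B, T, 1), per-query temperature
    scores = q @ k.transpose(-2, -1) / (d ** 0.5)                    # (B, T, T)
    attn = F.softmax(scores / tau, dim=-1)   # smaller tau -> sparser (more selective) softmax map
    return attn @ v
```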
Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning
[read the full paper – TREACLE: Thrifty Reasoning via Context-Aware LLM and Prompt Selection]
Xuechen Zhang, Zijian Huang, Ege Onur Taga, Carlee Joe-Wong, Samet Oymak, Jiasi Chen
Abstract: Recent successes in natural language processing have led to the proliferation of large language models (LLMs) by multiple providers. Each LLM offering has different inference accuracy, monetary cost, and latency, and their accuracy further depends on the exact wording of the question (i.e., the specific prompt). At the same time, users often have a limit on monetary budget and latency to answer all their questions, and they do not know which LLMs to choose for each question to meet their accuracy and long term budget requirements. To navigate this rich design space, we propose TREACLE (Thrifty Reasoning via Context-Aware LLM and Prompt Selection), a reinforcement learning policy that jointly selects the model and prompting scheme while respecting the user’s monetary cost and latency constraints. TREACLE uses the problem context, including question text embeddings (reflecting the type or difficulty of a query) and the response history (reflecting the consistency of previous responses) to make smart decisions. Our evaluations on standard reasoning datasets (GSM8K, CSQA, and LLC) with various LLMs and prompts show that TREACLE enables cost savings of up to 85% compared to baselines, while maintaining high accuracy. Importantly, it provides the user with the ability to gracefully trade off accuracy for cost.
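To make the selection problem concrete, the sketch below picks a (model, prompt) option under a remaining budget and latency limit using a simple accuracy-per-dollar heuristic. The Option fields and the greedy rule are placeholders standing in for TREACLE's learned reinforcement-learning policy.
```python
# Illustrative sketch of budget-aware model/prompt selection (not TREACLE's
# learned policy). All fields and the greedy rule are placeholder assumptions.
from dataclasses import dataclass

@dataclass
class Option:
    model: str
    prompt: str
    cost: float          # dollars per query (assumed)
    latency: float       # seconds per query (assumed)
    est_accuracy: float  # estimated accuracy for this context (assumed)

def select_option(options, budget_left, latency_limit):
    feasible = [o for o in options if o.cost <= budget_left and o.latency <= latency_limit]
    if not feasible:
        return None  # caller could fall back to the cheapest model or abstain
    # Greedy value heuristic: accuracy per dollar, a stand-in for a learned policy.
    return max(feasible, key=lambda o: o.est_accuracy / max(o.cost, 1e-9))
```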
Once Read is Enough: Finetuning-free Language Models with Cluster-guided Sparse Experts for Long-tail Domain Knowledge
Fang Dong, Mengyi Chen, Jixian Zhou, Yubin Shi, Yixuan Chen, Mingzhi Dong, Yujiang Wang, Dongsheng Li, Xiaochen Yang, Rui Zhu, Robert P. Dick, Qin Lv, Fan Yang, Tun Lu, Ning Gu
Abstract: Language models (LMs) pretrained only on a general and massive corpus usually cannot attain satisfactory performance on domain-specific downstream tasks, and hence, finetuning pretrained LMs is a common and indispensable practice. However, domain finetuning can be costly and time-consuming, hindering LMs’ deployment in real-world applications. In this work, we consider the incapability to memorize domain-specific knowledge embedded in the general corpus with rare occurrences and “long-tail” distributions as the leading cause of pretrained LMs’ inferior downstream performance. Analysis of Neural Tangent Kernels (NTKs) reveals that those long-tail data are commonly overlooked in the model’s gradient updates and, consequently, are not effectively memorized, leading to poor domain-specific downstream performance. Based on the intuition that data with similar semantic meaning are closer in the embedding space, we devise a Cluster-guided Sparse Expert (CSE) layer to actively learn long-tail domain knowledge typically neglected in previous pretrained LMs. During pretraining, a CSE layer efficiently clusters domain knowledge together and assigns long-tail knowledge to designated extra experts. CSE is also a lightweight structure that only needs to be incorporated in several deep layers. With our training strategy, we found that during pretraining, data of long-tail knowledge gradually form isolated, “outlier” clusters in an LM’s representation spaces, especially in deeper layers. Our experimental results show that pretraining CSE-based LMs alone is enough to achieve performance superior to that of regularly pretrained-and-finetuned LMs on various downstream tasks, implying the prospects of finetuning-free language models.
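The routing idea can be pictured as follows: tokens are assigned to the nearest learned centroid and processed by that cluster's expert. This is a simplified, hypothetical layer; the paper's CSE clustering updates and its handling of extra experts for long-tail clusters are not reproduced here.
```python
# Simplified sketch of cluster-guided sparse-expert routing (an illustrative
# stand-in, not the exact CSE layer from the paper).
import torch
import torch.nn as nn

class ClusterRoutedExperts(nn.Module):
    def __init__(self, d_model, n_clusters, d_ff):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_clusters, d_model))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_clusters)
        )

    def forward(self, h):                              # h: (B, T, d_model)
        flat = h.reshape(-1, h.shape[-1])              # (B*T, d)
        dists = torch.cdist(flat, self.centroids)      # distance to each centroid
        assign = dists.argmin(dim=-1)                  # hard cluster assignment
        out = torch.zeros_like(flat)
        for c, expert in enumerate(self.experts):      # each token visits exactly one expert
            mask = assign == c
            if mask.any():
                out[mask] = expert(flat[mask])
        return out.view_as(h)
```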
Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment
[read the full paper – Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment]
Zixian Yang, Xin Liu, Lei Ying
Abstract: The traditional multi-armed bandit (MAB) model for recommendation systems assumes the user stays in the system for the entire learning horizon. In new online education platforms such as ALEKS or new video recommendation systems such as TikTok, the amount of time a user spends on the app depends on how engaging the recommended contents are. Users may temporarily leave the system if the recommended items cannot engage them. To understand the exploration, exploitation, and engagement in these systems, we propose a new model, called MAB-A, where “A” stands for abandonment and the abandonment probability depends on the current recommended item and the user’s past experience (called state). We propose two algorithms, ULCB and KL-ULCB, both of which do more exploration (being optimistic) when the user likes the previous recommended item and less exploration (being pessimistic) when the user does not. We prove that both ULCB and KL-ULCB achieve logarithmic regret, O(log K), where K is the number of visits (or episodes). Furthermore, the regret bound under KL-ULCB is asymptotically sharp. We also extend the proposed algorithms to the general-state setting. Simulation results show that the proposed algorithms have significantly lower regret than the traditional UCB and KL-UCB, and Q-learning-based algorithms.
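The optimism-versus-pessimism idea can be sketched as a confidence-bound index whose sign flips with the user's last reaction. The exact bonus and the use of a lower confidence bound after a negative response are assumptions inferred from the abstract, not the ULCB algorithm itself.
```python
# Illustrative engagement-aware confidence-bound index in the spirit of ULCB
# (the specific bonus form is an assumption, not the paper's algorithm).
import math

def choose_arm(means, counts, t, liked_last):
    """means/counts: per-arm empirical means and pull counts; t: current round."""
    def index(a):
        if counts[a] == 0:
            return float("inf")                      # try each arm at least once
        bonus = math.sqrt(2.0 * math.log(t + 1) / counts[a])
        # Optimistic (upper bound) after a positive response, pessimistic
        # (lower bound) after a negative one.
        return means[a] + bonus if liked_last else means[a] - bonus
    return max(range(len(means)), key=index)
```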
DiffusionBlend: Learning 3D Image Prior through Position-aware Diffusion Score Blending for 3D Computed Tomography Reconstruction
[read the full paper – DiffusionBlend: Learning 3D Image Prior through Position-aware Diffusion Score Blending for 3D Computed Tomography Reconstruction]
Bowen Song, Jason Hu, Zhaoxu Luo, Jeffrey Fessler, Liyue Shen
Abstract: Diffusion models face significant challenges when employed for large-scale medical image reconstruction in real practice, such as 3D Computed Tomography (CT). Due to the demanding memory, time, and data requirements, it is difficult to train a diffusion model directly on the entire volume of high-dimensional data to obtain an efficient 3D diffusion prior. Existing works utilizing diffusion priors on single 2D image slices with hand-crafted cross-slice regularization sacrifice z-axis consistency, which results in severe artifacts along the z-axis. In this work, we propose a novel framework that enables learning the 3D image prior through position-aware 3D-patch diffusion score blending for reconstructing large-scale 3D medical images. To the best of our knowledge, we are the first to utilize a 3D-patch diffusion prior for 3D medical image reconstruction. Extensive experiments on sparse-view and limited-angle CT reconstruction show that our DiffusionBlend method significantly outperforms previous methods and achieves state-of-the-art performance on real-world CT reconstruction problems with high-dimensional 3D images (i.e., 256×256×500). Our algorithm also achieves better or comparable computational efficiency relative to previous state-of-the-art methods.
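The blending step can be pictured as follows: per-patch scores, computed with knowledge of each patch's z-position, are accumulated and averaged where patches overlap. The averaging rule, slab-shaped patches, and stride handling below are illustrative assumptions rather than the paper's exact blending scheme.
```python
# Minimal sketch of blending position-aware per-patch scores into a
# volume-level score by averaging overlapping contributions (illustrative).
import torch

def blended_score(volume, patch_score_fn, patch_depth, stride):
    """volume: (D, H, W) noisy CT volume; patch_score_fn(patch, z0) -> score of same shape."""
    D = volume.shape[0]
    score = torch.zeros_like(volume)
    weight = torch.zeros_like(volume)
    # Note: a real implementation would pad or shift the final slab so the tail is covered.
    for z0 in range(0, max(D - patch_depth, 0) + 1, stride):
        patch = volume[z0:z0 + patch_depth]
        score[z0:z0 + patch_depth] += patch_score_fn(patch, z0)   # position-aware patch score
        weight[z0:z0 + patch_depth] += 1.0
    return score / weight.clamp_min(1.0)
```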
Learning Image Priors Through Patch-Based Diffusion Models for Solving Inverse Problems
[read the full paper – Learning Image Priors through Patch-based Diffusion Models for Solving Inverse Problems]
[view the GitHub page for the PyTorch implementation]
Jason Hu, Bowen Song, Xiaojian Xu, Liyue Shen, Jeffrey Fessler
Abstract: Diffusion models can learn strong image priors from the underlying data distribution and use them to solve inverse problems, but the training process is computationally expensive and requires large amounts of data. Such bottlenecks prevent most existing works from being feasible for high-dimensional and high-resolution data such as 3D images. This paper proposes a method to learn an efficient data prior for the entire image by training diffusion models only on patches of images. Specifically, we propose a patch-based position-aware diffusion inverse solver, called PaDIS, where we obtain the score function of the whole image through scores of patches and their positional encoding and utilize this as the prior for solving inverse problems. First of all, we show that this diffusion model achieves improved memory efficiency and data efficiency while still maintaining the capability to generate entire images via positional encoding. Additionally, the proposed PaDIS model is highly flexible and can be plugged into different diffusion inverse solvers (DIS). We demonstrate that the proposed PaDIS approach enables solving various inverse problems in both natural and medical image domains, including CT reconstruction, deblurring, and superresolution, given only patch-based priors. Notably, PaDIS outperforms previous DIS methods trained on entire-image priors in the case of limited training data, demonstrating the data efficiency of our proposed approach through learning a patch-based prior.
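To illustrate the plug-in flexibility, here is a generic single step of a score-based inverse solver that accepts any score function, such as one composed from patch scores. The denoise-then-data-consistency update follows a common diffusion-posterior-sampling pattern and is an assumption, not the specific solver used in the paper.
```python
# Generic sketch of one step of a score-based inverse solver into which a
# patch-based prior could be plugged (illustrative, not the paper's solver).
import torch

def inverse_solver_step(x_t, sigma_t, score_fn, A, y, step_size):
    """x_t: current iterate; score_fn(x, sigma): approx. grad log p(x); A: differentiable forward op."""
    x_t = x_t.detach().requires_grad_(True)
    denoised = x_t + sigma_t ** 2 * score_fn(x_t, sigma_t)       # Tweedie-style denoising
    residual = (A(denoised) - y).pow(2).sum()                    # data-fidelity term
    grad = torch.autograd.grad(residual, x_t)[0]                 # backprop through the score network
    return (denoised - step_size * grad).detach()
```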
Images that Sound: Composing Images and Sounds on a Single Canvas
[read the full paper – Images that Sound: Composing Images and Sounds on a Single Canvas]
[view the webpage for the “Images that Sound” project]
Ziyang Chen, Daniel Geng, Andrew Owens
Abstract: Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these spectrograms “images that sound”. Our approach is simple and zero-shot, and it leverages pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During the reverse process, we denoise noisy latents with both the audio and image diffusion models in parallel, resulting in a sample that is likely under both models. Through quantitative evaluations and perceptual studies, we find that our method successfully generates spectrograms that align with a desired audio prompt while also taking the visual appearance of a desired image prompt.
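At each reverse step, the parallel-denoising idea amounts to querying both models on the same latent and combining their noise estimates. The sketch below assumes a diffusers-style scheduler interface and a simple average of the two estimates; the paper's exact combination and guidance details may differ.
```python
# Illustrative sketch of denoising a shared latent with two diffusion models in
# parallel (the simple average and scheduler interface are assumptions).
import torch

@torch.no_grad()
def joint_reverse_step(z_t, t, image_model, audio_model, scheduler, image_cond, audio_cond):
    eps_img = image_model(z_t, t, image_cond)       # noise estimate under the image prompt
    eps_aud = audio_model(z_t, t, audio_cond)       # noise estimate under the audio prompt
    eps = 0.5 * (eps_img + eps_aud)                 # push the sample to be likely under both
    return scheduler.step(eps, t, z_t).prev_sample  # one reverse-diffusion update
```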
Label Noise: Ignorance Is Bliss
Yilun Zhu, Jianxin Zhang, Aditya Gangrade, Clay Scott
Abstract: We establish a new theoretical framework for learning under multi-class, instance-dependent label noise. This framework casts learning with label noise as a form of domain adaptation, in particular, domain adaptation under posterior drift. We introduce the concept of relative signal strength (RSS), a pointwise measure that quantifies the transferability from the noisy posterior to the clean posterior. Using RSS, we establish nearly matching upper and lower bounds on the excess risk. Our theoretical findings support the minimax optimality of the simple Noise Ignorant Empirical Risk Minimization (NI-ERM) principle, which minimizes empirical risk while ignoring label noise. Finally, we translate this theoretical insight into practice: by using NI-ERM to fit a linear classifier on top of a self-supervised feature extractor, we achieve state-of-the-art performance on the CIFAR-N data challenge.
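The practical recipe in the last sentence is simple enough to sketch directly: extract frozen self-supervised features, then fit an ordinary linear classifier on the noisy labels with no noise correction at all. The encoder, data handling, and hyperparameters below are placeholders.
```python
# Minimal sketch of the NI-ERM recipe described in the abstract: a plain
# linear probe on frozen features, ignoring label noise entirely.
from sklearn.linear_model import LogisticRegression

def ni_erm_linear_probe(features_train, noisy_labels, features_test):
    """features_*: (N, d) arrays from a frozen self-supervised encoder."""
    clf = LogisticRegression(max_iter=1000)    # plain ERM; no noise-correction term
    clf.fit(features_train, noisy_labels)      # noisy labels used as-is ("noise ignorant")
    return clf.predict(features_test)
```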
The Implicit Bias of Gradient Descent on Separable Multiclass Data
Hrithik Ravi, Clay Scott, Daniel Soudry, Yutong Wang
Abstract: Implicit regularization describes the phenomenon where optimization-based training algorithms, without explicit regularization, show a preference for simple estimators even when more complex estimators have equal objective values. Multiple works have developed the theory of implicit regularization for binary classification under the assumption that the loss satisfies an exponential tail property. However, there is a noticeable gap in analysis for multiclass classification, with only a handful of results which themselves are restricted to the cross-entropy loss. In this work, we employ the framework of Permutation Equivariant and Relative Margin-based (PERM) losses [Wang and Scott, 2024] to introduce a multiclass extension of the exponential tail property. This class of losses includes not only cross-entropy but also other losses. Using this framework, we extend the implicit bias result of Soudry et al. [2018] to multiclass classification. Furthermore, our proof techniques closely mirror those of the binary case, thus illustrating the power of the PERM framework for bridging the binary-multiclass gap.
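For context, the binary result of Soudry et al. [2018] that this paper extends says that gradient descent on separable data with an exponentially tailed loss converges in direction to the hard-margin SVM solution; the multiclass analogue noted in the comment replaces binary margins with relative margins (the paper's precise statement may differ).
```latex
% Binary case (Soudry et al., 2018): for linearly separable data and an
% exponentially tailed loss, the gradient descent iterates w(t) satisfy
\lim_{t \to \infty} \frac{w(t)}{\lVert w(t) \rVert}
  = \frac{\hat{w}}{\lVert \hat{w} \rVert},
\qquad
\hat{w} = \arg\min_{w} \lVert w \rVert^2
  \ \text{s.t.}\ y_i\, w^\top x_i \ge 1 \ \ \forall i.
% Multiclass analogue (schematic): the constraints become relative margins
% (w_{y_i} - w_k)^\top x_i \ge 1 for all k \neq y_i.
```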
BLAST: Block-Level Adaptive Structured Matrices for Efficient Deep Neural Network Inference
[Read the full paper – BLAST: Block-Level Adaptive Structured Matrices for Efficient Deep Neural Network Inference]
Changwoo Lee, Soo Min Kwon, Qing Qu, Hun-Seok Kim
Abstract: Large-scale foundation models have demonstrated exceptional performance in language and vision tasks. However, the numerous dense matrix-vector operations involved in these large deep neural networks pose significant challenges for inference. To address these computational challenges, we introduce the Block-Level Adaptive STructured (BLAST) matrix, which aims to learn, identify, and exploit efficient structures prevalent in the weight matrices of deep learning models. The BLAST matrix is designed as a unique factorization technique to model the weights, employing a substantially reduced intrinsic dimension with fewer parameters, enabling lower complexity matrix multiplications. The components of the BLAST matrix can either be learned from data or estimated using an existing weight matrix via a preconditioned gradient descent method. We demonstrate that the BLAST matrices are applicable to any linear layer and can be employed during various stages of model deployment, including pre-training, fine-tuning, and post-training compression. Overall, our experimental results validate the efficiency of the BLAST matrix by exhibiting either minimal accuracy degradation or an increase in performance, both in language and vision tasks.
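As a rough picture of what block-level structure buys, the sketch below parameterizes each block of a weight matrix as a low-rank product and performs the corresponding matrix-vector computation. This generic block-low-rank layout is a simplified stand-in, not the actual BLAST factorization.
```python
# Generic block-wise low-rank linear layer (a simplified stand-in for
# structured, reduced-parameter weights; not the BLAST parameterization).
import torch
import torch.nn as nn

class BlockLowRankLinear(nn.Module):
    def __init__(self, d_in, d_out, n_blocks, rank):
        super().__init__()
        assert d_in % n_blocks == 0 and d_out % n_blocks == 0
        self.bi, self.bo, self.n = d_in // n_blocks, d_out // n_blocks, n_blocks
        # Each block stores rank*(bo+bi) parameters instead of bo*bi.
        self.U = nn.Parameter(torch.randn(n_blocks, n_blocks, self.bo, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(n_blocks, n_blocks, rank, self.bi) * 0.02)

    def forward(self, x):                                     # x: (B, d_in)
        xb = x.view(x.shape[0], self.n, self.bi)              # split input into column blocks
        # y_i = sum_j U_ij (V_ij x_j): each block multiply is a rank-r product
        tmp = torch.einsum("ijrc,bjc->bijr", self.V, xb)      # (B, n, n, rank)
        yb = torch.einsum("ijor,bijr->bio", self.U, tmp)      # (B, n, bo)
        return yb.reshape(x.shape[0], -1)
```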
Understanding Generalizability of Diffusion Models Requires Rethinking the Hidden Gaussian Structure
[read the full paper – Understanding Generalizability of Diffusion Models Requires Rethinking the Hidden Gaussian Structure]
Xiang Li, Yixiang Dai, Qing Qu
Abstract: Recently, diffusion models have emerged as a highly effective new class of deep generative models, demonstrating exceptional generation performance. In this work, we study the generalizability (i.e., the ability to generate new samples) of diffusion models by looking into the hidden properties of the learned score functions, which are essentially a series of deep denoisers trained on various noise levels. Notably, we observe that the nonlinear diffusion denoisers exhibit strong linearity when the diffusion model is able to generalize. This discovery leads us to approximate their function mappings with linear models, which serve as the first-order approximation of the nonlinear diffusion denoisers. Surprisingly, these linear denoisers are approximately the optimal denoisers for a multivariate Gaussian distribution characterized by the empirical mean and covariance of the training dataset. This finding implies that diffusion models have an inductive bias towards capturing and utilizing the Gaussian structure (covariance information) of the training dataset for data generation. Our experimental results show that this inductive bias becomes more pronounced when the model capacity is relatively small compared to the size of the training dataset. However, even when the model is highly overparameterized, this inductive bias emerges during the initial training phases, before the model fully memorizes its training data. Our study provides crucial insights into understanding the notably strong generalizability recently observed in real-world diffusion models.
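For concreteness, the "optimal denoiser for a multivariate Gaussian" mentioned above has a standard closed form. With training-set mean μ and covariance Σ, and observation y = x + σε, the minimum-mean-squared-error denoiser is the linear map below (a textbook fact, stated here for reference):
```latex
% Optimal (MMSE) denoiser when x ~ N(\mu, \Sigma) and y = x + \sigma\varepsilon,
% \varepsilon ~ N(0, I):
\mathbb{E}[x \mid y] \;=\; \mu + \Sigma\,(\Sigma + \sigma^2 I)^{-1}\,(y - \mu),
% i.e., a linear (affine) map of y, which is the structure the learned
% denoisers approximate in the generalization regime.
```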
Exploring Low-Dimensional Subspaces in Diffusion Models for Controllable Image Editing
[read the full paper: Exploring Low-Dimensional Subspaces in Diffusion Models for Controllable Image Editing]
[view the webpage for the LOCO-Edit project]
Siyi Chen, Huijie Zhang, Minzhe Guo, Yifu Lu, Peng Wang, Qing Qu
Abstract: Recently, diffusion models have emerged as a powerful class of generative models with impressive generative capabilities. Despite their success in generating images guided by class or text-to-image conditions, achieving precise and disentangled image generation without additional training remains a significant challenge. In this work, we take one step towards this problem by starting from an intriguing observation: among a certain range of noise levels, the learned posterior mean predictor (PMP) is locally linear, and the singular vectors of its Jacobian lie in low-dimensional semantic subspaces. Under mild data assumptions, we validate the low-rankness and linearity of the PMP, as well as the homogeneity, composability, and linearity of the identified semantic directions within the subspace. These properties are quite universal, appearing consistently across various network architectures (e.g., UNet and Transformers) and datasets. These insights motivate us to propose LOw-rank COntrollable image editing (LOCO Edit). Specifically, the local linearity in the Jacobian provides a single-step, training-free method for precise local editing of regions of interest, while the low-rank nature allows for the effective identification of semantic directions using subspace power methods. Our method is broadly applicable to both undirected and text-directed editing and works across various diffusion-based models. Finally, extensive empirical studies demonstrate the effectiveness and efficiency of our approach.
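The subspace-power-method ingredient can be sketched generically: power iteration on JᵀJ of the posterior-mean predictor, using forward- and reverse-mode Jacobian products so the Jacobian is never formed explicitly. The iteration count and single-direction version below are illustrative; the actual LOCO Edit procedure is richer than this.
```python
# Illustrative sketch: extract an approximate top singular direction of the
# posterior-mean predictor's Jacobian via power iteration on J^T J.
import torch
from torch.autograd.functional import jvp, vjp

def top_jacobian_direction(pmp, x_t, n_iters=20):
    """pmp: differentiable function x_t -> posterior-mean estimate (same shape as x_t)."""
    v = torch.randn_like(x_t)
    v = v / v.norm()
    for _ in range(n_iters):
        _, Jv = jvp(pmp, x_t, v)          # forward-mode product J v
        _, JtJv = vjp(pmp, x_t, Jv)       # reverse-mode product J^T (J v)
        v = JtJv / JtJv.norm()
    return v                              # approximate top right-singular vector (edit direction)
```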
Image Reconstruction Via Autoencoding Sequential Deep Image Prior
[read the full paper]
[view the GitHub page for the project]
Ismail Alkhouri, Shijun Liang, Evan Bell, Qing Qu, Rongrong Wang, Saiprasad Ravishankar
Abstract: Recently, Deep Image Prior (DIP) has emerged as an effective unsupervised one-shot learner, delivering competitive results across various image recovery problems. This method only requires the noisy measurements and a forward operator, relying solely on deep networks initialized with random noise to learn and restore the structure of the data. However, DIP is notorious for its vulnerability to overfitting due to the overparameterization of the network. Building upon insights into the impact of the DIP input and drawing inspiration from the gradual denoising process in cutting-edge diffusion models, we introduce the Autoencoding Sequential DIP (aSeqDIP) for image reconstruction by progressively denoising and reconstructing the image through a sequential optimization of multiple network architectures. This is achieved using an input-adaptive DIP objective, combined with an autoencoding regularization term. Our approach differs from the vanilla DIP by not relying on a single-step denoising process. Compared to diffusion models, our method does not require pre-training and outperforms the vanilla DIP in alleviating overfitting while maintaining the same number of parameter updates. Through extensive experiments, we validate the effectiveness of our method in various imaging reconstruction tasks, such as MRI and CT reconstruction, as well as in image restoration tasks like image denoising and inpainting.
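Reading the abstract, each stage looks like a small DIP-style optimization whose input is the previous stage's output and whose loss couples data fidelity with an autoencoding penalty. The sketch below is an inference from that description, not the authors' implementation; the loss weighting, step count, and optimizer are placeholders.
```python
# Minimal sketch of one stage of a sequential DIP-style reconstruction with an
# autoencoding regularizer (an inference from the abstract, not the authors' code).
import torch

def dip_stage(net, z_in, A, y, lam=0.1, n_steps=200, lr=1e-4):
    """net: randomly initialized network; z_in: output of the previous stage; A: differentiable forward op."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(n_steps):
        x = net(z_in)
        loss = (A(x) - y).pow(2).sum() + lam * (x - z_in).pow(2).sum()  # fidelity + autoencoding term
        opt.zero_grad(); loss.backward(); opt.step()
    return net(z_in).detach()              # becomes the input to the next stage's network
```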