Five papers by CSE researchers at ASPLOS 2025
Five papers authored by researchers affiliated with CSE are being presented at the 2025 ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). The premier venue for multidisciplinary computer systems research, ASPLOS brings together top experts from around the world specializing in areas spanning computer architecture, programming languages, operating systems, and related topics including networking and storage. This year’s conference is taking place in Rotterdam, The Netherlands, from March 30 to April 3.
CSE authors are presenting a range of innovations and findings at ASPLOS, from a new approach for error mitigation in quantum algorithms to novel caching systems for advanced network adapters. The papers being presented are as follows:
“Toleo: Scaling Freshness to Tera-scale Memory Using CXL and PIM”
Juechu Dong, Jonah Rosenblum, Satish Narayanasamy
Abstract: Trusted hardware’s freshness guarantee ensures that an adversary cannot replay an old value in response to a memory read request. Existing solutions maintain a version number for each cache block and ensure the versions’ integrity using a Merkle tree. However, these solutions protect only a small amount of main memory (a few MBs), as the extraneous memory accesses to the Merkle tree grow prohibitively with the protected memory size. We present Toleo, which uses trusted smart memory connected through a secure CXL IDE network to safely store version numbers. Toleo eliminates the need for an unscalable Merkle tree to protect the integrity of version numbers by instead using smart memory as the root of trust. Additionally, Toleo ensures version confidentiality, which enables stealth versions that cut the version storage overhead in half.
Furthermore, in the absence of constraints imposed by a Merkle tree, we effectively exploit version locality at page granularity to compress version numbers by a factor of 240. These space optimizations make it feasible for one 168 GB Toleo smart memory device to provide freshness to a 28 TB CXL-expanded main memory pool in a rack server with negligible performance overhead. We analyze the benefits of Toleo using several privacy-sensitive genomics, graph, generative AI, and database workloads.
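To make the page-granularity idea concrete, here is a minimal Python sketch (our illustration, not code from the paper) of how per-block version numbers can share a per-page base with small per-block deltas; the constants and rebase policy are simplifying assumptions.

```python
# Illustrative sketch: blocks within a page tend to be written a similar
# number of times, so a full 64-bit counter per 64 B block is wasteful.
# A shared per-page base plus a small per-block delta captures most of it.

PAGE_BLOCKS = 64   # hypothetical: 4 KB page of 64 B cache blocks
DELTA_BITS = 8     # hypothetical per-block delta width

class PageVersions:
    def __init__(self):
        self.base = 0                    # shared per-page version
        self.delta = [0] * PAGE_BLOCKS   # small per-block offsets

    def bump(self, block):
        """Advance a block's version on write."""
        self.delta[block] += 1
        if self.delta[block] >= (1 << DELTA_BITS):
            # Delta overflow: rebase the page. A real design would also
            # re-encrypt/re-MAC the page's blocks under their new versions.
            self.base += max(self.delta)
            self.delta = [0] * PAGE_BLOCKS

    def version(self, block):
        return self.base + self.delta[block]

# Storage in this toy encoding: 64 + 64*8 = 576 bits per page versus
# 64*64 = 4096 bits for uncompressed counters, roughly 7x smaller.
# The paper's scheme is far more aggressive, reaching the reported 240x.
```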

“PIM is All You Need: A CXL-Enabled GPU-Free System for LLM Inference”
Yufeng Gu, Alireza Khadem, Sumanth Umesh, Ning Liang, Xavier Servot, Onur Mutlu, Ravi Iyer, Reetuparna Das
Abstract: Large Language Model (LLM) inference generates one token at a time in an autoregressive manner, exhibiting notably lower operational intensity than earlier Machine Learning (ML) models such as encoder-only transformers and Convolutional Neural Networks. At the same time, LLMs possess large parameter sizes and use key-value caches to store context information. Modern LLMs support context windows of up to 1 million tokens to generate versatile text, audio, and video content. A large key-value cache unique to each prompt requires a large memory capacity, limiting the inference batch size. Both low operational intensity and limited batch size necessitate high memory bandwidth. However, contemporary hardware systems for ML model deployment, such as GPUs and TPUs, are primarily optimized for compute throughput. This mismatch challenges the efficient deployment of advanced LLMs and forces users to pay for expensive compute resources that are poorly utilized for memory-bound LLM inference tasks.
We propose CENT, a CXL-ENabled GPU-Free sysTem for LLM inference, which harnesses CXL memory expansion capabilities to accommodate substantial LLM sizes, and utilizes near-bank processing units to deliver high memory bandwidth, eliminating the need for expensive GPUs. CENT exploits a scalable CXL network to support peer-to-peer and collective communication primitives across CXL devices. We implement various parallelism strategies to distribute LLMs across these devices. Compared to GPU baselines with maximum supported batch sizes and similar average power, CENT achieves 2.3x higher throughput and consumes 2.3x less energy. CENT reduces the Total Cost of Ownership (TCO), generating 5.2x more tokens per dollar than GPUs.
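As a rough illustration of the operational-intensity argument (our back-of-the-envelope arithmetic, not taken from the paper), consider the matrix-vector products that dominate single-token decoding:

```python
# Why batch-1 LLM decoding is memory-bound: a matrix-vector product reads
# every weight once but performs only ~2 FLOPs (multiply + add) per weight.

def gemv_intensity(rows, cols, bytes_per_weight=2):
    flops = 2 * rows * cols                        # multiply + add per weight
    bytes_moved = rows * cols * bytes_per_weight   # weight traffic dominates
    return flops / bytes_moved

# Any fp16 weight matrix: ~1 FLOP per byte, regardless of size.
print(gemv_intensity(4096, 4096))  # -> 1.0

# A GPU with roughly 1000 TFLOP/s of compute but ~3 TB/s of HBM bandwidth
# needs on the order of 300+ FLOPs per byte to stay compute-bound, so its
# compute units sit largely idle during decoding; CENT instead attacks the
# bandwidth side with near-bank processing units behind a CXL network.
```
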
“Gigaflow: Pipeline-Aware Sub-Traversal Caching for Modern SmartNICs”
Annus Zulfiqar, Ali Imran, Venkat Kunaparaju, Ben Pfaff, Gianni Antichi, Muhammad Shahbaz
Abstract: The success of modern public/edge clouds hinges heavily on the performance of their end-host network stacks if they are to support emerging and diverse tenant workloads (e.g., from distributed training in the cloud to fast inference at the edge). Virtual switches (vSwitches) are vital components of this stack, providing a unified interface to enforce high-level policies on incoming packets and route them to physical interfaces, containers, or virtual machines. As performance demands escalate, there has been a shift toward offloading vSwitch processing to SmartNICs to alleviate CPU load and improve efficiency. However, existing solutions struggle to handle the growing flow rule space within the NIC, leading to high miss rates and poor scalability.
In this paper, we introduce Gigaflow, a novel caching system tailored for deployment on SmartNICs to accelerate vSwitch packet processing. Our core insight is that by harnessing the inherent pipeline-aware locality within programmable vSwitch pipelines, which define policies (e.g., L2, L3, and ACL) and their execution order (e.g., using P4 and OpenFlow), we can create cache rules for shared segments (sub-traversals) within the pipeline, rather than caching entire flows. These shared segments can be reused across multiple flows, resulting in higher cache efficiency and greater rule-space coverage. Our evaluations show that Gigaflow achieves up to a 51% improvement in cache hit rate (a 25% improvement on average) over traditional caching solutions (i.e., Megaflow), while capturing up to 450x more rule space within the limited memory of today’s SmartNICs, all while operating at line speed.
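To see the intuition behind sub-traversal caching, here is a deliberately simplified Python sketch (ours, not Gigaflow's actual data structures): each pipeline segment is cached on only the header fields it reads, so flows that diverge in later stages can still share entries for earlier segments.

```python
# Hypothetical 3-stage vSwitch pipeline: each stage declares the header
# fields it reads and a slow-path lookup function standing in for its tables.
def l2(pkt):  return f"out_vlan:{pkt['dst_mac']}"
def l3(pkt):  return f"next_hop:{pkt['dst_ip']}"
def acl(pkt): return "allow" if pkt["dport"] != 23 else "drop"

PIPELINE = [(("dst_mac",), l2), (("dst_ip",), l3), (("dport",), acl)]

segment_cache = {}  # (stage index, fields-read key) -> cached action

def process(pkt):
    actions = []
    for i, (fields, lookup) in enumerate(PIPELINE):
        # Key on only the fields this segment reads, not the whole header,
        # unlike a whole-traversal (Megaflow-style) cache entry.
        key = (i, tuple(pkt[f] for f in fields))
        if key not in segment_cache:
            segment_cache[key] = lookup(pkt)  # slow path, taken once
        actions.append(segment_cache[key])
    return actions

# Two flows with different IPs and ports still hit the shared L2 segment:
a = process({"dst_mac": "aa", "dst_ip": "10.0.0.1", "dport": 80})
b = process({"dst_mac": "aa", "dst_ip": "10.0.0.2", "dport": 443})
```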

“Clapton: Clifford-Assisted Problem Transformation for Error Mitigation in Variational Quantum Algorithms”
Lennart Maximilian Seifert, Siddharth Dangwal, Frederic T. Chong, Gokul Subramanian Ravi
Abstract: Variational quantum algorithms (VQAs) show potential for quantum advantage in the near term of quantum computing, but demand a level of accuracy that surpasses the current capabilities of NISQ devices. To systematically mitigate the impact of quantum device error on VQAs, we propose Clapton: Clifford-Assisted Problem Transformation for Error Mitigation in Variational Quantum Algorithms. Clapton leverages classically estimated good quantum states for a given VQA problem, classically simulable models of device noise, and the variational principle for VQAs. It applies transformations to the VQA problem’s Hamiltonian to lower the energy estimates of known good VQA states in the presence of the modeled device noise. The Clapton hypothesis is that as long as the known good states of the VQA problem are close to the problem’s ideal ground state and the device noise modeling is reasonably accurate (both of which are generally true), the Clapton transformation substantially decreases the impact of device noise on the ground state of the VQA problem, thereby increasing the accuracy of the VQA solution. Clapton is built as an end-to-end application-to-device framework and achieves mean VQA initialization improvements of 1.7x to 3.7x, and up to a maximum of 13.3x, over the state-of-the-art baseline when evaluated on a variety of scientific applications from physics and chemistry, on both noise models and real quantum devices.
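As a toy illustration of the kind of Hamiltonian transformation involved (our sketch, restricted to a single layer of one-qubit Cliffords; Clapton's actual search space and noise-aware objective are richer), Clifford conjugation maps each Pauli term to another Pauli term up to sign, so the transformed problem stays efficiently representable:

```python
# Single-qubit Clifford conjugation tables: gate -> {pauli: (sign, new_pauli)},
# e.g. H X H = Z, H Z H = X, H Y H = -Y; S X S† = Y, S Y S† = -X.
CLIFFORD = {
    "I": {"I": (1, "I"), "X": (1, "X"), "Y": (1, "Y"), "Z": (1, "Z")},
    "H": {"I": (1, "I"), "X": (1, "Z"), "Y": (-1, "Y"), "Z": (1, "X")},
    "S": {"I": (1, "I"), "X": (1, "Y"), "Y": (-1, "X"), "Z": (1, "Z")},
}

def transform(hamiltonian, layer):
    """Map sum_i c_i P_i -> sum_i c_i (C P_i C^dagger), one gate per qubit."""
    out = []
    for coeff, paulis in hamiltonian:
        new = []
        for gate, p in zip(layer, paulis):
            sign, q = CLIFFORD[gate][p]
            coeff *= sign
            new.append(q)
        out.append((coeff, "".join(new)))
    return out

# H = 0.5*ZZ + 0.3*XI, conjugated by (H on qubit 0, S on qubit 1):
print(transform([(0.5, "ZZ"), (0.3, "XI")], ("H", "S")))
# -> [(0.5, 'XZ'), (0.3, 'ZI')]
```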

“Proactive Runtime Detection of Aging-Related Silent Data Corruptions: A Bottom-Up Approach”
Jiacheng Ma, Majd Ganaiem, Madeline Burbage, Theo Gregersen, Rachel McAmis, Freddy Gabbay, Baris Kasikci
Abstract: Recent advancements in semiconductor process technologies have unveiled the susceptibility of hardware circuits to reliability issues, especially those related to transistor aging. Transistor aging gradually degrades gate performance, eventually causing hardware to behave incorrectly. Such misbehaving hardware can result in silent data corruptions (SDCs) in software, a type of failure that comes without logs or exceptions but causes miscomputed instructions, bitflips, and broken cache coherency. Alas, while design efforts can be made to mitigate transistor aging, complete elimination of this problem during design and fabrication cannot be guaranteed. This emerging challenge calls for a mechanism that not only detects potentially aged hardware in the field, but also triggers software mitigations at application runtime.
We propose Vega, a novel workflow that allows efficient detection of aging-related failures at software runtime. Vega leverages the well-studied gate-level modeling of aging effects to identify susceptible signal propagation paths that could fail due to transistor aging. It then utilizes formal verification techniques to generate short test cases that activate these paths and detect any failure within them. Vega integrates the test cases into a user application by directly fusing them together, or by packaging the test cases into a library that the application can invoke. We demonstrate our proposed techniques on the arithmetic logic unit and floating-point unit of a RISC-V CPU. We show that Vega generates effective test cases and integrates them into applications with an average performance overhead of 0.8%.
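To illustrate the library-style integration described above, here is a conceptual Python sketch (ours, not Vega's implementation; real test vectors would be formally derived for specific RISC-V ALU/FPU paths, and the golden outputs precomputed offline on known-good hardware):

```python
import threading

# Hypothetical test vectors targeting ALU/FPU paths: (fn, args, golden result).
TESTS = [
    (lambda a, b: a + b, (0x7FFFFFF0, 0x0F), 0x7FFFFFFF),  # carry chain
    (lambda a, b: a * b, (3, 2863311530), 8589934590),      # multiplier array
    (lambda a, b: a * b, (1.5, 4.0), 6.0),                  # FP datapath
]

def run_self_test():
    """Run the short test cases and flag a possible aging-related SDC."""
    for fn, args, golden in TESTS:
        if fn(*args) != golden:
            raise RuntimeError("possible aging-related SDC detected")

def start_periodic_check(interval_s=60.0):
    """Re-arm the self-test every interval_s seconds alongside the app."""
    run_self_test()
    t = threading.Timer(interval_s, start_periodic_check, args=(interval_s,))
    t.daemon = True
    t.start()

start_periodic_check()
```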