Get to know: Ryan Huang
Associate professor Ryan Huang joined the CSE faculty in Winter 2023 after working as an Assistant Professor at Johns Hopkins University and a researcher with Microsoft Azure. Huang’s research broadly covers systems, including operating and distributed systems, with a specific focus on building reliable, efficient, and defensible systems, from large-scale data centers to small mobile devices. He told us a little about his goals as a researcher and professor.
What are the key research problems that motivate your work?
My research is in computer system reliability, a broad area with many interesting problems to work on. My work targets the large-scale distributed systems that run in cloud computing infrastructure. The problem there is the complexity of those systems: they have a lot of dependencies, which can cause very weird behavior, leading to failures that cannot really be addressed by traditional methods.
Those failures are typically known as gray failures. Gray failures are very difficult to detect, and they also cause developers to spend a lot of time trying to diagnose, localize, and mitigate them. That’s the key problem that a lot of my work addresses.
Actually, when I looked up “gray failures” I found that the first result to come up was a discussion about one of your previous publications (HotOS’17).
Yeah, that was a position paper. The terminology was not invented by us, but the problem was not really well explored in the research community. So we basically made the case that this is an important problem to address, and then went on to discuss some of the characteristics of the problem. In the time since, we’ve published a bunch of projects that talk more about this problem.
What makes gray failures so hard to detect compared to standard failures?
Traditional failures, like crashes, are very simple to detect. They are basically black and white: either the system crashes, or it doesn’t.
In the case of a gray failure, the system is still working, but some very important functionality is broken. So how do you know what the important functionality is, and what criteria do you use to evaluate it? Whether the functionality is broken is really decided case by case, and sometimes it’s very difficult to judge whether a system is working as expected.
Is this problem unique to the cloud setting, because of the scale and complexity of those systems?
Other environments may also encounter this, but cloud computing systems have millions of lines of code, have many, many dependencies, and rely on all kinds of different hardware. These environments also handle many different kinds of workloads, and because of this they can encounter many different corner case scenarios that are not very easy to trigger with typical testing.
What is unique about your approach to tackling these problems?
Unfortunately, there is not really a silver bullet that can perfectly address all gray failures. So I’ve taken a divide-and-conquer approach. We analyze failures from existing outage reports, try to identify common patterns, and then divide those failures into categories and sub-categories. From there, we focus on addressing each category separately.
Regarding the solutions themselves, these problems require a combination of multiple techniques to tackle even just one category of failures. My work typically aims to provide a general solution that isn’t tailored to a particular system. Because of this, we often leverage program analysis techniques to reason carefully about our solutions in a way that can be applied to new systems easily. We also carefully combine these with other methods, such as runtime techniques and data-driven approaches.
Tell me a bit about your recent projects.
One of our recent projects looks at semantic violations in distributed systems. Distributed systems today provide many interfaces to users and applications, and each interface provides a certain kind of promise, what we can also call its semantics. For example, the semantics of an API might be: if it runs without error, a file with the specified name will be created and replicated to multiple nodes.
Users and applications typically rely on these semantics, these promises, to work correctly. But unfortunately, bugs in the system and hardware issues can cause a system to violate its semantics without producing explicit errors. The users and applications go on thinking that the system has done what they asked for, when it actually did not.
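To make the idea concrete, here is a minimal, self-contained sketch of what such a violation looks like from the caller’s side. All of the class and method names are invented for illustration (this is not any real file-system client): the create call returns without error, implicitly promising replication to three nodes, but a simulated bug leaves only one copy, and the application would never notice unless it explicitly checked.

```java
// A toy illustration of a "semantic violation": the API call reports success,
// but the promised behavior (replication to 3 nodes) did not happen.
// All names here are hypothetical, invented purely for this sketch.
import java.util.HashMap;
import java.util.Map;

public class SemanticViolationSketch {

    /** A toy stand-in for a distributed file service client. */
    static class ToyFileService {
        private final Map<String, Integer> replicas = new HashMap<>();

        // Promise (semantics): if create() returns normally, the file exists
        // and is stored on `replication` nodes.
        void create(String path, int replication) {
            // Simulated bug: replication silently stops after the first copy,
            // yet no error is reported back to the caller.
            replicas.put(path, 1);
        }

        int replicaCount(String path) {
            return replicas.getOrDefault(path, 0);
        }
    }

    public static void main(String[] args) {
        ToyFileService fs = new ToyFileService();
        fs.create("/logs/app.log", 3);   // returns "successfully" -- no exception

        // The application trusted the promise; only an explicit check reveals
        // that the system did not do what it said it did.
        int found = fs.replicaCount("/logs/app.log");
        if (found < 3) {
            System.err.println("Semantic violation: expected 3 replicas, found " + found);
        }
    }
}
```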
We observed that developers of cloud systems write a lot of tests, but the system can continue violating semantics in spite of this. The issue is that these tests only check for specific bugs and whether they’ve been fixed, without checking the underlying semantics.
To tackle this problem, we developed a tool called Oathkeeper. The high level idea of this tool is to use these developer tests to collect very useful information about what the semantics offered by the system actually are. We leverage those tests to try to infer a set of rules to represent the system’s semantics.
Once we infer those rules with the tool, we deploy them to the system at runtime and monitor the system for any events. Then we see whether these events violate the rules or not. These rules account for things like if-this-then-that scenarios, relationships between different events, or expected behaviors following different types of events.
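The sketch below illustrates the spirit of that runtime check, not Oathkeeper’s actual rule format or API: a hypothetical rule, of the kind that might be inferred from existing tests, says that whenever one event is observed, another must follow, and the checker flags any trace where that promise is not kept.

```java
// A toy sketch of checking an inferred "if this event, then that event" rule
// against a stream of runtime events. This illustrates the idea only;
// the rule and event names are hypothetical.
import java.util.List;

public class RuleCheckSketch {

    /** An inferred rule: whenever `trigger` is observed, `expected` must appear later. */
    record Rule(String trigger, String expected) {}

    static boolean violates(Rule rule, List<String> events) {
        for (int i = 0; i < events.size(); i++) {
            if (events.get(i).equals(rule.trigger)) {
                // Look for the expected follow-up event after the trigger.
                boolean satisfied = events.subList(i + 1, events.size()).contains(rule.expected);
                if (!satisfied) return true;   // rule violated: promise not kept
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical rule inferred from existing developer tests.
        Rule rule = new Rule("leader.takeSnapshot", "followers.ackSnapshot");

        // Runtime event trace: the snapshot was taken but never acknowledged.
        List<String> trace = List.of("client.write", "leader.takeSnapshot", "client.write");

        if (violates(rule, trace)) {
            System.err.println("Semantic rule violated: " + rule);
        }
    }
}
```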
Tell me about a project that you’re especially proud of, or a standout achievement in your work so far.
I’m particularly proud of a tool called OmegaGen. One of the factors that contributes to gray failures in distributed systems is that they only use very simplistic runtime detectors. Typically, they use something analogous to a heartbeat monitor. As you can imagine, these detectors just provide a periodic check that processes in the system are still running. They assume that as long as a process can send a heartbeat, it’s working fine. This is often not the case – even while a process is sending heartbeats, some of its crucial functionality can be broken.
To address this gap, we developed OmegaGen. It uses a technique that we developed called program reduction that aims to automatically generate watchdogs for a system based on its code. These are more than just heartbeat monitors – they are actually specifically devised for each different system to monitor whether it is functioning as intended.
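Here is a toy contrast between the two kinds of checks. The component and its check are made up for illustration; this is not OmegaGen’s actual output, just a sketch of why a functionality-specific watchdog can catch problems a heartbeat misses.

```java
// A toy contrast between a heartbeat check and a functionality-specific
// watchdog. The names and the checked operation are hypothetical.
import java.nio.file.Files;
import java.nio.file.Path;

public class WatchdogSketch {

    // Heartbeat-style check: only asks "is the process still responding?"
    static boolean heartbeatOk(Process proc) {
        return proc != null && proc.isAlive();
    }

    // Watchdog-style check: exercises a concrete operation the component is
    // supposed to perform (here, writing to its log directory), so a hang or
    // silent failure in that code path is caught even while the process stays alive.
    static boolean logWriterOk(Path logDir) {
        try {
            Path probe = Files.createTempFile(logDir, "watchdog-", ".probe");
            Files.write(probe, new byte[]{1});
            Files.delete(probe);
            return true;
        } catch (Exception e) {
            return false;   // the process may still heartbeat, but this functionality is broken
        }
    }

    public static void main(String[] args) {
        Path logDir = Path.of(System.getProperty("java.io.tmpdir"));
        System.out.println("log-writer watchdog passed: " + logWriterOk(logDir));
    }
}
```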
You can, of course, ask developers to write these watchdogs manually. But because these systems are so large, it’s hard to actually cover the functionality very well. It was a very ambitious goal to automatically generate these watchdogs without a lot of specifications, based purely on the system’s code. After overcoming many technical challenges, the work was eventually published at a top systems conference and received a Best Paper Award.
What are some of the most important things that you hope to give to graduate students in your lab?
The ultimate goal for students in my lab is to become independent researchers by the end of their PhD, regardless of what career path they choose. More specifically, I want them to learn research methodology and research taste. In their projects they’ll learn a variety of specific techniques, but research methodology is really crucial to knowing how to pick an important problem, how to formulate it, how to refine it, how to design solutions, and how to evaluate them. Those kinds of things will be important whether they go into industry or academia.
Research taste is also something that I hope I can give to my students. There are many ways to do research, and it varies in different areas, but by the time they graduate I hope they can judge what good research looks like.
What drew you to your work, both as a professor and as a researcher in systems?
At the time I graduated, I wasn’t actually sure whether I wanted to be a professor or to go into industry, so I spent some time exploring different options. In the end, academia was what I felt strongly about. When I started my PhD I was already interested in systems, though I didn’t know as much about reliability yet. I found that the area not only has a lot of social impact, it’s also very broad and has many hard, interesting problems. That’s why I continued to work in this area when I became a professor.