New computer vision model identifies images with a classification tree spanning broad and specific categories

A new hierarchical classification model uses segmentation to focus attention on different parts of the same image, surpassing previous models in accuracy and precision.
Figure: An input photo of a hummingbird hovering in front of a yellow flower. In prior works (left), separate coarse and fine classifiers attend to different regions, the coarse classifier highlighting the flower (“plant”) and the fine classifier the hummingbird (“green hermit”). In H-CAST (right), the image is segmented into object regions, and the fine label (green hermit) and coarse label (bird) each point to segments of the same image.
The new computer vision model, H-CAST, aligns coarse- and fine-grained classifiers using intra-image segmentation. Previous models treat fine and coarse levels as separate tasks, leading to mistakes where the fine classifier predicts a bird species while the coarse classifier predicts “plant.” Credit: Park et al., 2025.

A new AI model, H-CAST, groups fine details into object-level concepts as attention moves from lower to higher layers, outputting a classification tree (such as bird, eagle, bald eagle) rather than focusing only on fine-grained recognition.

The research was presented recently at the International Conference on Learning Representations in Singapore and builds upon the team’s prior model, CAST, its counterpart for visually grounded single-level classification.

While some argue that deep learning can reliably provide fine-grained classification and infer broader categories from it, this approach only works with clear images.

“Real-world applications involve plenty of imperfect images. If a model only focuses on fine-grained classification, it gives up before it even starts on images that don’t have enough information to support that level of detail,” said Stella Yu, a professor of computer science and engineering at U-M and contributing author of the study.

Hierarchical classification overcomes this issue, providing classification at multiple levels of detail for the same image. However, until now, hierarchical models have struggled with inconsistencies that come from treating each level as its own classification task.

For example, when identifying a bird, fine-grained classification often depends on local details like beak shape or feather color, while coarse labels require global features like overall shape. When these two levels are disconnected, the result can be a fine classifier predicting “green parakeet” while the coarse classifier predicts “plant.”

The new model instead focuses all levels on the same object at different levels of detail by aligning fine-to-coarse predictions through intra-image segmentation. 

Previous hierarchical models trained from coarse to fine, following the logic of semantic labeling, which flows from general to specific (e.g., bird, hummingbird, green hermit). H-CAST instead trains in the visual space, where recognition begins with fine details like beaks and wings that compose coarser structures, leading to better alignment and accuracy.
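In practice, that fine-to-coarse flow might look roughly like the sketch below, a simplification rather than the authors’ released code: token features are pooled within fine segments, and coarser predictions read from unions of those same segments, so every level is grounded in the same image regions. The dimensions, class counts, and the segment-merging step here are illustrative assumptions.

```python
# Minimal sketch of segment-grounded hierarchical heads (illustrative,
# not the authors' implementation). Fine and coarse classifiers share
# the same token features, pooled over the same image segments.
import torch
import torch.nn as nn


def segment_pool(tokens, ids):
    # Mean-pool token features within each segment id.
    # tokens: (B, N, D) backbone features; ids: (B, N) segment index per token.
    B, N, D = tokens.shape
    S = int(ids.max().item()) + 1
    one_hot = torch.zeros(B, N, S, device=tokens.device)
    one_hot.scatter_(2, ids.unsqueeze(-1), 1.0)
    counts = one_hot.sum(dim=1).clamp(min=1).unsqueeze(-1)  # (B, S, 1)
    return one_hot.transpose(1, 2) @ tokens / counts        # (B, S, D)


class HierarchicalHead(nn.Module):
    def __init__(self, dim=384, n_fine=200, n_coarse=13):
        super().__init__()
        self.fine_head = nn.Linear(dim, n_fine)
        self.coarse_head = nn.Linear(dim, n_coarse)

    def forward(self, tokens, fine_ids, coarse_ids):
        # coarse_ids groups tokens into unions of the fine segments,
        # so both levels attend to the same underlying regions.
        fine_feat = segment_pool(tokens, fine_ids).mean(dim=1)
        coarse_feat = segment_pool(tokens, coarse_ids).mean(dim=1)
        return self.fine_head(fine_feat), self.coarse_head(coarse_feat)


# Hypothetical usage: 2 images, 196 tokens, 8 fine segments merged into 3.
tokens = torch.randn(2, 196, 384)
fine_ids = torch.randint(0, 8, (2, 196))
coarse_ids = fine_ids % 3  # stand-in for merging fine segments
fine_logits, coarse_logits = HierarchicalHead()(tokens, fine_ids, coarse_ids)
```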

“Most prior work in hierarchical classification focused on semantics alone, but we found that consistent visual grounding across levels can make a huge difference. By encouraging models to ‘see’ the hierarchy in a visually coherent way, we hope this work inspires a shift toward more integrated and interpretable recognition systems,” said Seulki Park, a postdoctoral research fellow in computer science and engineering at the University of Michigan and lead author of the study.

Unlike prior methods, the research team leveraged unsupervised segmentation, typically used to identify structures within a larger image, to support hierarchical classification. They demonstrate that this visual grouping mechanism can be applied effectively to classification without requiring pixel-level labels, and that it improves segmentation quality as well.

To demonstrate the new model’s effectiveness, H-CAST was tested on four benchmark datasets and compared against hierarchical models (FGN, HRN, TransHP, Hier-ViT) and baseline models (ViT, CAST, HiE).

“Our model outperformed zero-shot CLIP and state-of-the-art baselines on hierarchical classification benchmarks, achieving both higher accuracy and more consistent predictions,” said Yu.

For instance, on the BREEDS dataset, H-CAST’s full-path accuracy was 6% higher than the previous state of the art and 11% higher than baseline models.
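Full-path accuracy is the strict version of hierarchical accuracy: a prediction counts only if it is correct at every level of the tree, so the right species paired with a wrong coarse label is still a miss. A minimal sketch of the metric, with hypothetical label arrays:

```python
import numpy as np


def full_path_accuracy(pred_levels, true_levels):
    # pred_levels / true_levels: one (n_samples,) array per hierarchy
    # level, e.g. [coarse, fine]. A sample counts as correct only if
    # the prediction matches at every level of the tree.
    correct = np.ones_like(true_levels[0], dtype=bool)
    for pred, true in zip(pred_levels, true_levels):
        correct &= pred == true
    return correct.mean()


# Hypothetical example: 3 of 4 samples are right at both levels.
coarse_true, coarse_pred = np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1])
fine_true, fine_pred = np.array([2, 3, 5, 6]), np.array([2, 3, 5, 7])
print(full_path_accuracy([coarse_pred, fine_pred],
                         [coarse_true, fine_true]))  # 0.75
```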

Feature-level nearest neighbor analysis also shows H-CAST retrieves semantically and visually consistent samples across hierarchy levels—unlike prior models that often retrieve visually similar but semantically incorrect samples.

This work could potentially be applied to any situation that requires understanding images at multiple levels of detail. It could particularly benefit wildlife monitoring, identifying species where possible but falling back on coarser predictions. H-CAST can also help autonomous vehicles interpret imperfect visual input, like occluded pedestrians or distant vehicles, helping the system make safe, approximate decisions at coarser levels of detail.

“Humans naturally fall back on coarser concepts. If I can’t tell if an image is of a Pembroke Corgi, I can still confidently say it’s a dog. But models often fail at that kind of flexible reasoning. We hope to eventually build a system that can adapt its prediction level just like we do,” said Park.
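That kind of fallback can be approximated with a simple confidence threshold; the sketch below illustrates the idea rather than a method from the paper, and the threshold value is an arbitrary assumption:

```python
import torch


def predict_with_fallback(fine_logits, coarse_logits, tau=0.6):
    # Report the fine (species-level) label only when its softmax
    # confidence clears tau; otherwise fall back to the coarse label.
    conf, fine_pred = fine_logits.softmax(dim=-1).max(dim=-1)
    coarse_pred = coarse_logits.argmax(dim=-1)
    return [("fine", f) if c >= tau else ("coarse", g)
            for c, f, g in zip(conf.tolist(),
                               fine_pred.tolist(),
                               coarse_pred.tolist())]


# Hypothetical usage: random logits for 4 images (200 fine, 13 coarse classes).
print(predict_with_fallback(torch.randn(4, 200), torch.randn(4, 13)))
```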

H-CAST was trained and tested using ARC High Performance Computing at U-M. 

UC Berkeley, MIT and Scaled Foundations also contributed to this research.

This research was supported by the National Science Foundation (2215542; 2313151), a Berkeley AI Research grant with Google, and Bosch gift funds to Stella Yu at UC Berkeley and the University of Michigan, with partial compute support from the National Artificial Intelligence Research Resource (NAIRR) Pilot (CIS240431, CIS250430).

Full citation: “Visually consistent hierarchical image classification,” Seulki Park, Youren Zhang, Stella X. Yu, Sara Beery, and Jonathan Huang, International Conference on Learning Representations (2025).