Abstract

Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands. This rich spectral content has the potential to facilitate robust robotic perception, particularly in environments with complex material compositions, varying illumination, or other visually challenging conditions. However, current HSI semantic segmentation methods underperform due to their reliance on architectures and learning frameworks optimized for RGB inputs. In this work, we propose a novel hyperspectral adapter that leverages pretrained vision foundation models to effectively learn from hyperspectral data. Our architecture incorporates a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features. Additionally, we introduce a modality-aware interaction block that facilitates effective integration of hyperspectral representations and frozen vision Transformer features through dedicated extraction and injection mechanisms. Extensive evaluations on three benchmark autonomous driving datasets demonstrate that our architecture achieves state-of-the-art semantic segmentation performance while directly using HSI inputs, outperforming both vision-based and hyperspectral segmentation methods.

Technical Approach

Figure: Illustration of our proposed HSI-Adapter architecture. Our network takes hyperspectral images as input through the Spectral Transformer and a Spectral-Enhanced Spatial Prior Module (SPM), which extracts spectral and spatial features. These features interact with the frozen ViT backbone through a Modality-Aware Interaction Block, which utilizes gated bidirectional cross-attention. Finally, a semantic decoder yields pixel-level semantic class predictions from the fused representation.

We introduce HSI-Adapter, a modular framework that adapts pretrained Vision Transformers (ViTs) for hyperspectral semantic segmentation. The framework keeps the ViT backbone frozen to preserve its powerful representations learned from large-scale datasets, while adding specialized trainable modules to address the unique properties of hyperspectral data. It consists of three key components: a Spectral Transformer that captures long-range dependencies across spectral bands, a Spectral-Enhanced Spatial Prior Module (SPM) that integrates spectral and spatial information, and Modality-Aware Interaction Blocks that enable bidirectional fusion between hyperspectral features and ViT tokens. Together, these modules inject hyperspectral priors into the backbone, allowing the model to leverage pretrained knowledge while effectively adapting to hyperspectral inputs.
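To make the spectral pathway concrete, the sketch below shows one plausible form of the Spectral Transformer: each band of a pixel's spectrum is treated as a token, so self-attention can capture long-range dependencies along the wavelength axis. The class name, dimensions, and layer counts are our illustrative assumptions and do not reflect the released implementation.

```python
import torch
import torch.nn as nn

class SpectralTransformer(nn.Module):
    """Illustrative sketch: self-attention across spectral bands.

    Each scalar band value of a pixel's spectrum becomes a token, so the
    encoder can model dependencies across the full spectrum. Dimensions
    and depth are assumptions, not the paper's settings.
    """

    def __init__(self, num_bands: int, dim: int = 256, depth: int = 2, heads: int = 4):
        super().__init__()
        self.band_embed = nn.Linear(1, dim)  # lift each band value to a token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_bands, dim))  # band positions
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, spectra: torch.Tensor) -> torch.Tensor:
        # spectra: (N, num_bands) per-pixel spectral vectors
        tokens = self.band_embed(spectra.unsqueeze(-1)) + self.pos_embed
        return self.encoder(tokens)  # (N, num_bands, dim) band-wise features


# Example: 16 pixel spectra with 128 bands each.
features = SpectralTransformer(num_bands=128)(torch.randn(16, 128))
```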


Building on these components, the core strength of HSI-Adapter lies in its dynamic, multi-stage interaction between hyperspectral features and the ViT backbone. At selected stages of the ViT, hyperspectral features are injected into the token stream via deformable cross-attention, which efficiently samples from the adapter's multi-scale feature maps. A novel modality gating mechanism then computes a learned, per-token weighting that adaptively balances hyperspectral cues against ViT tokens, ensuring that the two modalities reinforce each other. The interaction is bidirectional: an extractor stage leverages the updated ViT context to refine the adapter's features through a cross-attention feedback loop. Finally, a UPerHead decoder aggregates the enriched multi-scale representations into dense, pixel-wise segmentation predictions, allowing HSI-Adapter to combine the strengths of foundation models with domain-specific spectral priors on hyperspectral segmentation tasks.
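The sketch below illustrates the gated injection step. For brevity it uses standard multi-head cross-attention in place of the deformable cross-attention described above, and a sigmoid gate conditioned on both modalities for the per-token weighting; all names and dimensions are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn

class GatedInjection(nn.Module):
    """Illustrative injector: ViT tokens attend to hyperspectral features,
    and a learned per-token gate blends the attended result back into the
    token stream. Standard cross-attention stands in for the deformable
    variant used in the paper."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vit_tokens: torch.Tensor, hsi_feats: torch.Tensor) -> torch.Tensor:
        # vit_tokens: (B, N, dim) tokens from the frozen backbone
        # hsi_feats:  (B, M, dim) flattened multi-scale adapter features
        attended, _ = self.cross_attn(vit_tokens, hsi_feats, hsi_feats)
        # Per-token gate in [0, 1], conditioned on both modalities.
        g = self.gate(torch.cat([vit_tokens, attended], dim=-1))
        return vit_tokens + g * attended


# Example: 14x14 patch tokens fused with flattened hyperspectral features.
fused = GatedInjection()(torch.randn(2, 196, 768), torch.randn(2, 1029, 768))
```

The extractor direction of the bidirectional interaction would follow the same pattern with the query and key/value roles swapped, letting the updated ViT context refine the adapter's features.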

Code

For academic use, a PyTorch implementation of this project is available in our GitHub repository and is released under the GPLv3 license. For any commercial purpose, please contact the authors.

Publications

If you find our work useful, please consider citing our paper:

Juana Valeria Hurtado, Rohit Mohan, Abhinav Valada

Hyperspectral Adapter for Semantic Segmentation with Vision Foundation Models
arXiv preprint, 2025.
(PDF) (BibTeX)

Authors

Juana Valeria Hurtado
University of Freiburg

Rohit Mohan
University of Freiburg

Abhinav Valada
University of Freiburg

Acknowledgment

This work was funded by the Baden-Württemberg Stiftung gGmbH within the program "Autonome Robotik". Additionally, this research was supported by the Bosch Research collaboration on AI-driven automated driving.