
Transformers, which have demonstrated remarkable performance improvements in natural language processing, have been increasingly adopted in computer vision tasks since the introduction of the Vision Transformer (ViT). In hyperspectral image (HSI) reconstruction, Transformer-based models have gained popularity due to their ability to capture global dependencies. While these models alleviate certain limitations of convolutional neural networks (CNNs), their computational complexity scales quadratically with spatial resolution, making ultra-high-resolution reconstruction infeasible. Spectral Transformer variants have been proposed to reduce the computational burden associated with high spatial resolution, yet they still struggle with ultra-high-resolution imagery. In this work, we propose the Patched Input Spatial-Spectral Transformer (PSST), which efficiently reconstructs HSIs from ultra-high-resolution RGB images. The model places a spatial Transformer before spectral processing, enabling global context awareness while maintaining computational efficiency through in-model patch partitioning. Although performance decreases slightly for low-resolution inputs compared to state-of-the-art (SOTA) models, our method achieves the highest reconstruction quality for ultra-high-resolution inputs, with higher PSNR and significantly lower memory consumption.
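
To make the quadratic-scaling argument concrete, the following is a minimal back-of-the-envelope sketch, not the PSST implementation. It assumes one token per pixel, square non-overlapping patches, and attention computed independently within each patch. The function names (`attention_matrix_entries`, `partition_into_patches`, `global_vs_patched_cost`) are hypothetical and introduced only for illustration.

```python
def attention_matrix_entries(num_tokens: int) -> int:
    """Number of entries in one self-attention score matrix over num_tokens tokens."""
    return num_tokens * num_tokens


def partition_into_patches(height: int, width: int, patch: int) -> list[tuple[int, int]]:
    """Return the (top, left) corner of each non-overlapping patch x patch tile."""
    # Assumption of this sketch: the image divides evenly into patches.
    if height % patch or width % patch:
        raise ValueError("this sketch assumes dimensions divisible by the patch size")
    return [
        (top, left)
        for top in range(0, height, patch)
        for left in range(0, width, patch)
    ]


def global_vs_patched_cost(height: int, width: int, patch: int) -> tuple[int, int]:
    """Compare attention-matrix sizes for whole-image vs. per-patch spatial attention."""
    # Whole image: every pixel attends to every other pixel.
    full = attention_matrix_entries(height * width)
    # Patched: each tile attends only within itself (illustrative assumption).
    corners = partition_into_patches(height, width, patch)
    patched = len(corners) * attention_matrix_entries(patch * patch)
    return full, patched


if __name__ == "__main__":
    full, patched = global_vs_patched_cost(512, 512, 128)
    print(f"global attention entries:  {full}")
    print(f"patched attention entries: {patched}")
    print(f"reduction factor:          {full // patched}")
```

Under these assumptions the reduction factor equals the number of patches, so the saving grows as the input resolution increases at a fixed patch size. How PSST retains global context across patches is not captured by this sketch.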