Work Presented at Archiving 2026 FastTrack
Volume: 0 | Article ID: 050401
Hyperspectral and Multispectral Fusion Method based on Quadratic Optimization Model
Abstract

Hyperspectral super-resolution fusion technology aims to fuse hyperspectral images with multispectral images in the same scene for the super-resolution reconstruction of hyperspectral data. Current deep learning methods are usually trained by data augmentation or constructing complex encoding–decoding networks, often neglecting the physical characteristics of hyperspectral data involving width, height, and three-dimensional channel information. Traditional methods continue to play a role in super-resolution reconstruction although there are deficiencies in the fusion result. For this reason, this paper proposes a Quadratic Optimization Model (QOM) that combines deep learning and traditional mathematical methods. The model first utilizes a three-module neural network for initial fusion; designs corresponding modules for the recovery of dimensional, spatial, and spectral information; and introduces spatial and channel attention mechanisms to enhance feature extraction capability. Subsequently, the preliminary fusion results are optimized by secondary super-resolution through the traditional matrix decomposition method to further improve fusion quality. The experimental results demonstrate that the QOM achieves excellent performance on all seven datasets, exhibiting strong fusion quality while maintaining favorable computational complexity (TFLOPS: 8.6024, Params: 2.9307). Noise experiments verify its high robustness.

  Cite this article 

Zican Sang, Long Ma, "Hyperspectral and Multispectral Fusion Method based on Quadratic Optimization Model" in Journal of Imaging Science and Technology, 2026, pp 1 - 24, https://doi.org/10.2352/J.ImagingSci.Technol.2026.70.5.050401

  Copyright statement 
Copyright © Society for Imaging Science and Technology 2026
 Open access
  Article timeline 
  • received March 2025
  • accepted February 2026
JIMTE6
Journal of Imaging Science and Technology
J. Imaging Sci. Technol.
1062-3701
1943-3522
Society for Imaging Science and Technology
1. Introduction
In the field of remote sensing, spectral imaging plays an indispensable role due to its ability to capture rich information across spatial and spectral dimensions [1, 2]. Hyperspectral images (HSI) have been widely applied in astronomy [3, 4], agriculture [5, 6], feature classification [7, 8], and image classification [9, 10]. However, the acquisition of high-resolution hyperspectral images (HR_HSI) is often limited by hardware constraints: ensuring an optimal signal-to-noise ratio (SNR) requires long exposure times, which results in low spatial resolution. Conversely, low-resolution hyperspectral images (LR_HSI) and high-resolution multispectral images (HR_MSI) of the same scene are easier to obtain. This discrepancy motivates the fusion of LR_HSI and HR_MSI to reconstruct HR_HSI through computational methods [11].
The fusion of hyperspectral and multispectral data presents two major challenges. First, it requires the simultaneous preservation and transfer of spatial and spectral information, which is difficult due to the inherent trade-off between spatial resolution and spectral richness. Second, existing methods face limitations in either performance or interpretability. Traditional approaches, such as matrix decomposition and tensor-based techniques, can partially transfer spatial–spectral information but often fail to fully exploit the potential of the data. Deep learning methods, particularly convolutional neural networks (CNNs), excel in adaptive feature extraction and capturing complex nonlinear relationships, yet they often rely on deep and complex architectures, leading to high computational costs and instability under different parameter initializations. Therefore, achieving a balance among reconstruction accuracy, computational efficiency, and interpretability remains an open problem.
Motivated by these challenges, we propose a novel two-stage fusion framework, termed the Quadratic Optimization Model (QOM), which combines the strengths of deep learning and traditional methods. As illustrated in Figure 1, the QOM fully integrates the physical characteristics of hyperspectral 3D data and employs a physically interpretable and structurally clear framework for effective spatial and spectral information transfer. In the first stage, a CNN-based network integrated with spatial and channel attention mechanisms performs the initial fusion of LR_HSI and HR_MSI. This network is composed of three dedicated modules: the first module recovers the dimensional information of the input data, the second module enhances spatial details, and the third module reconstructs spectral details. Moreover, specific spatial and spectral loss functions (Lossspat and Lossspec) [12] guide the network to accurately recover both types of information while enabling rapid convergence during training. In the second stage, the initial fusion results are further refined through quadratic optimization based on matrix decomposition. By iteratively optimizing the abundance matrix with the aid of multispectral data, the dictionary and coefficient matrices are updated, leading to more accurate and physically interpretable fused hyperspectral images [13].
Figure 1.
Overview diagram of the QOM. First, the MSI and HSI undergo bilinear interpolation to recover their dimensions. Subsequently, the first fusion results are derived through a spatial and a spectral optimization network module. Finally, the fusion results are further super-resolved using matrix decomposition to obtain the secondary optimized image.
In summary, the main contributions of this paper are as follows:
(1) We propose a spectral fusion model incorporating attention mechanisms and a three-module CNN framework to fuse LR_HSI and HR_MSI, guided by spatial and spectral detail recovery loss functions to improve reconstruction accuracy.
(2) We introduce matrix decomposition for secondary optimization, combining deep learning and traditional methods to enhance interpretability and training stability.
(3) The effectiveness of QOM is validated through ablation experiments; it achieves superior performance compared with ten state-of-the-art methods and demonstrates both robustness under varying noise conditions and computational efficiency.
The rest of the paper is organized as follows. Section 2 reviews mainstream spectral fusion methods, laying the background foundation for subsequent research. Section 3 elaborates the algorithmic framework of QOM and its core mechanism. Section 4 presents experimental evaluations on seven datasets, comparing QOM with ten other methods. Section 5 examines the stability of QOM through additional experiments. Section 6 analyzes computational complexity. Section 7 assesses robustness under noisy conditions. Section 8 summarizes the work presented in this paper, and Section 9 outlines potential directions for future research.
2. Spectral Fusion Related Work
In this section, we systematically review various approaches to hyperspectral and multispectral fusion, including the traditional mathematical methods that were widely used in the early days as well as the recently acclaimed deep learning methods.
2.1 Traditional Methods
The traditional spectral fusion methods are mainly based on tensor representation, matrix decomposition, and panchromatic sharpening extension.
(1) Methods based on matrix decomposition: These methods assume that the fused image (high-resolution hyperspectral [HR_HS]) contains only a small number of pure spectral signatures, that is, it can be viewed as the product of an endmember matrix and an abundance matrix. Therefore, it is only necessary to estimate the spectral basis matrix and the sparse coefficients from the low-resolution hyperspectral (LR_HS) and high-resolution multispectral (HR_MS) images, respectively, to obtain the fusion result. Kawakami et al. [14] used a sparsity prior to learn the spectral basis matrix from LR_HS and then estimated the corresponding sparse coefficients from HR_MS. In addition, Yokoya et al. [15] proposed a spectral unmixing method based on nonnegative matrix decomposition, learning the endmember matrix and the abundance matrix from LR_HS and HR_MS, respectively, and combining them to generate a new HR_HS. Dong et al. [16] exploited the nonlocal spatial self-similarity of the LR_HS and proposed a hyperspectral super-resolution method based on dictionary learning and sparse representation. Lin et al. [17] proposed the CO-CNMF algorithm, based on the alternating direction method of multipliers (ADMM), for solving the ill-posed inverse problem in spectral fusion.
(2) Methods based on panchromatic sharpening extension: These methods obtain HR_HSI by fusing LR_HSI with high-resolution panchromatic images [18–21]. Panchromatic sharpening is conducted by unmixing, Bayesian, component substitution (CS), and multiresolution analysis (MRA) methods, among others. The MRA and CS methods were generalized to solve the fusion problem of LR_HS and HR_MS. Aiazzi et al. [22] considered the effect of the spectral response function on component substitution and extended it to the fusion of LR_HS and HR_MS by constructing pan-sharpening subproblems, where each subproblem combines one band of the HR_MS with the corresponding multiple bands of the LR_HS. Based on this, Selva et al. [23] built a multiple linear regression model that combines one band in HR_MS with the corresponding multiple bands in LR_HS to obtain the final HR_HS image. Usually, the low spectral resolution of panchromatic images causes a large spectral distortion in the fusion result. Many scholars later extended the panchromatic sharpening method to hyperspectral and multispectral fusion studies.
(3) Methods based on tensor decomposition: As a spectral image is essentially three-dimensional data, treating it as a tensor and processing it with tensor operations is a natural approach. Usually, the fusion result HR_HS can be regarded as the product of a core tensor with three factor matrices. Dian et al. [24] used sparse tensor decomposition and nonlocal spatial similarity to group LR_HS and HR_MS correspondingly, so that the HR_HS data in the same group are sparsely decomposed over the same dictionaries. For each group, the dictionaries of the three modes are learned, and sparse coding is then used to estimate the corresponding core tensor and reconstruct HR_HS. Zhang et al. [25] combined regularization with tensor decomposition to establish a spatial–spectral-graph-regularized low-rank tensor decomposition framework, which can effectively preserve spatial correlation in the fusion results.
2.2 Deep Learning Methods
In recent years, deep learning has provided effective solutions for spectral fusion. CNNs have been widely adopted for hyperspectral image (HSI) data processing due to their strong representation capability. Palsson et al. [26] introduced a 3D CNN combined with principal component analysis for fusing LR_HS and HR_MS data. Dian et al. [27] employed residual learning to alleviate overfitting in HSI sharpening, while Xie et al. [28] designed a fusion network guided by the MS/HS data generation mechanism. Subsequently, Dian et al. [29] proposed a subspace-based fusion method applicable to multiple HSI datasets without retraining. Inspired by ResNet skip connections, Xu et al. [30] developed a two-branch fusion network (HAM-MFN) with RAP loss to reduce spectral and spatial distortions.
Based on the encoder–decoder framework, Liu et al. [31] proposed TFNet and further improved it as ResTFNet. Han et al. [32] introduced ConSSFCNN for efficient spatial–spectral fusion, and Yuan et al. [33] proposed MSDCNN by integrating multiscale feature extraction and residual learning. Huang et al. [34] developed DHIF-Net to regularize spatial and spectral optimization through a multilevel network while Li et al. [35] proposed an unsupervised three-stage network (UDALN) for hyperspectral super-resolution. Wang et al. [36] utilized nearest-neighbor multispectral data to compensate for spectral unmixing, and Qin et al. [37] further enhanced fusion performance via the GAN-based ADASR.
More recently, CasFormer, proposed by Hong et al. [38], employs a cascaded Transformer for dual-camera CASSI systems, integrating spectral-aware self-attention and spatial-fused cross-attention under physical constraints. It demonstrates superior performance in alleviating blind reconstruction and hardware limitations.
3. Algorithmic Framework
This section describes the structure of the QOM in detail. As shown in Figure 2, the initial fusion is performed by a neural network composed of three modules with LR_HS and HR_MS as inputs; the fusion result is then factorized by singular value decomposition to extract the dictionary matrix of the hyperspectral image; subsequently, the dictionary and coefficient matrices are iteratively updated by solving two optimization problems; and finally the optimized matrices are multiplied together to further recover the spatial and spectral details on top of the initial fused image. To stabilize the neural network training and avoid large discrepancies between training runs, we introduce dedicated spatial and spectral loss functions as constraints, which also enhances the interpretability of the model.
Figure 2.
The QOM framework.
3.1 Formulation of the Problem
In this paper, the reference HR_HS is denoted as $R \in \mathbb{R}^{H \times W \times L}$ and the estimated HR_HS as $Z \in \mathbb{R}^{H \times W \times L}$, where $H$, $W$, and $L$ represent the height, width, and number of channels of the image, respectively. In addition, the inputs LR_HS and HR_MS are denoted as $X \in \mathbb{R}^{h \times w \times L}$ ($h < H$, $w < W$) and $Y \in \mathbb{R}^{H \times W \times l}$ ($l < L$), respectively. Based on the generation of $X$ and $Y$, the following relationships can be obtained:
(1)
$$X = ZBH + E_x$$
(2)
$$Y = RZ + E_y,$$
where $B \in \mathbb{R}^{HW \times HW}$ denotes the spatial blur operator, $H \in \mathbb{R}^{HW \times hw}$ the spatial downsampling operator, $R \in \mathbb{R}^{l \times L}$ the spectral response of the multispectral sensor, and $E_x \in \mathbb{R}^{L \times hw}$ and $E_y \in \mathbb{R}^{l \times HW}$ are the residuals. After reshaping, $Z \in \mathbb{R}^{L \times HW}$, $X \in \mathbb{R}^{L \times hw}$, and $Y \in \mathbb{R}^{l \times HW}$.
The goal of fusion is to combine $X$ and $Y$ to generate the image $Z$. Combining relations (1) and (2), the fusion problem can be initially formulated as the following optimization problem:
(3)
$$\min_{Z} \ \frac{1}{2}\left\| X - ZBH \right\|_F^2 + \frac{1}{2}\left\| Y - RZ \right\|_F^2,$$
where $\|\cdot\|_F$ denotes the Frobenius norm.
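As a minimal illustration of relations (1)–(3), the following NumPy sketch builds toy operators and evaluates the fusion objective. The sizes, the identity blur, the block-averaging downsampler `H_down`, and the random spectral response are all assumptions chosen for illustration; they are not the operators used in the experiments.

```python
import numpy as np

def fusion_objective(Z, X, Y, B, H_down, R):
    """Objective of Eq. (3): 0.5*||X - Z B H||_F^2 + 0.5*||Y - R Z||_F^2."""
    spatial_term = 0.5 * np.linalg.norm(X - Z @ B @ H_down, "fro") ** 2
    spectral_term = 0.5 * np.linalg.norm(Y - R @ Z, "fro") ** 2
    return spatial_term + spectral_term

# Toy sizes (hypothetical): L bands, HW high-res pixels, hw low-res pixels, l MS bands.
rng = np.random.default_rng(0)
L, HW, hw, l = 6, 16, 4, 3
Z_true = rng.random((L, HW))                      # reshaped HR_HS, L x HW

B = np.eye(HW)                                    # stand-in blur operator (identity)
# Stand-in downsampler (HW x hw): each low-res pixel averages HW/hw neighbours.
H_down = np.kron(np.eye(hw), np.full((HW // hw, 1), hw / HW))
R = rng.random((l, L))                            # stand-in spectral response

# Noise-free observations, i.e. E_x = 0 and E_y = 0 in Eqs. (1) and (2).
X = Z_true @ B @ H_down                           # LR_HS, L x hw
Y = R @ Z_true                                    # HR_MS, l x HW

value_at_truth = fusion_objective(Z_true, X, Y, B, H_down, R)
value_perturbed = fusion_objective(Z_true + 0.1, X, Y, B, H_down, R)
```

In this noise-free setting the objective vanishes at the true image and is strictly positive for the perturbed one, which is the behaviour Eq. (3) relies on.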
3.2 Dimensional Recovery
Due to limitations of the experimental conditions and other factors, it is difficult to directly acquire hyperspectral and multispectral images of the same region with the available equipment. Therefore, in this paper, we downsample $Z$ in the spatial and spectral dimensions to obtain the simulated inputs LR_HS and HR_MS by using the following formulas:
(4)
$$Z \xrightarrow{\text{Gaussian}} \tilde{Z} \xrightarrow{\text{bilinear},\ 1/r} X, \qquad Y^{(i)} = Z^{(B_i)}, \quad i \in \{1, \ldots, l\}.$$
Specifically, a Gaussian filter is first applied to $Z$ to blur it (giving $\tilde{Z}$), followed by spatial downsampling by bilinear interpolation with ratio $r$ to obtain $X$. Here, $Y^{(i)}$ denotes the $i$th band of $Y$ and $B_i$ is the band index of $R$ in the spectral dimension, which is calculated as follows:
(5)
$$B_i = \frac{(i-1)L}{l-1}, \quad B_i \in \{B_1, \ldots, B_l\}.$$
In this way, downsampling of $Z$ in the spectral dimension is achieved to obtain $Y$. Moreover, $X$ and $Y$ are used as simulated LR_HS and HR_MS input data before subsequent spectral fusion experiments are performed.
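The spectral part of this simulation can be sketched in a few lines of NumPy. The sketch covers only the band selection of Eqs. (4) and (5); the Gaussian blur and spatial downsampling are omitted. Flooring fractional values and clipping to the valid 0-based range are assumptions, since the text does not say how non-integer or out-of-range values of $B_i$ are handled, and the function names are hypothetical.

```python
import numpy as np

def band_indices(L, l):
    """Eq. (5), B_i = (i-1) * L / (l-1) for i = 1..l, as 0-based band indices.

    Assumption: values are floored and clipped to [0, L-1].
    """
    i = np.arange(1, l + 1)
    raw = (i - 1) * L / (l - 1)
    return np.clip(np.floor(raw).astype(int), 0, L - 1)

def spectral_downsample(Z, l):
    """Simulate HR_MS from an (H, W, L) cube by keeping l bands: Y^(i) = Z^(B_i)."""
    return Z[:, :, band_indices(Z.shape[2], l)]

# Example with hypothetical sizes: a 103-band cube reduced to 4 bands.
Z = np.random.default_rng(0).random((5, 5, 103))
idx = band_indices(103, 4)
Y = spectral_downsample(Z, 4)
```

With these assumptions the selected bands are spread evenly across the spectrum, with the first and last bands of $Z$ always included.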
Since the inconsistent dimensions of the input images make it difficult to train the neural network directly, this module aims to perform the preliminary fusion of X and Y and improve the spatial resolution of X to be consistent with Y so as to meet the training requirements. While recovering the dimensionality, the relative positions between the bands must be kept constant. To this end, a bilinear interpolation method is used to ensure the consistency between LR_HS and HR_MS in terms of spatial and spectral positions so as to obtain the preliminary fusion results. The process can be expressed as follows:
(6)
$$Z_{bil}^{(i)} = \begin{cases} Y^{(i)}, & i \in \{B_1, \ldots, B_l\} \\ \text{Bilinear}(X, r)^{(i)}, & \text{otherwise}, \end{cases}$$
where $Z_{bil}$ denotes the result of the initial fusion in the first layer of the network and $i$ denotes the $i$th band, so that the relative positions of the LR_HS and HR_MS bands are kept constant.
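Eq. (6) can be sketched as follows, assuming `X` is stored as an (h, w, L) array and `Y` as an (H, W, l) array. The pixel-centre alignment and edge clamping in `bilinear_upsample` are assumptions; the interpolation routine actually used may treat borders differently. Both function names are hypothetical.

```python
import numpy as np

def _interp_axis(A, pos, axis):
    """Linear interpolation of A along one axis at fractional positions pos."""
    n = A.shape[axis]
    pos = np.clip(pos, 0, n - 1)                 # clamp samples at the borders
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    shape = [1] * A.ndim
    shape[axis] = -1
    t = (pos - lo).reshape(shape)                # interpolation weights
    return np.take(A, lo, axis=axis) * (1 - t) + np.take(A, hi, axis=axis) * t

def bilinear_upsample(X, r):
    """Upsample an (h, w, L) cube by an integer factor r (separable bilinear)."""
    h, w, _ = X.shape
    rows = (np.arange(h * r) + 0.5) / r - 0.5    # target centres in source coords
    cols = (np.arange(w * r) + 0.5) / r - 0.5
    return _interp_axis(_interp_axis(X, rows, 0), cols, 1)

def assemble_zbil(X_up, Y, idx):
    """Eq. (6): keep the upsampled LR_HS bands, but overwrite the bands listed
    in idx (0-based, unique) with the HR_MS bands."""
    Z_bil = X_up.copy()
    Z_bil[:, :, idx] = Y
    return Z_bil

# Toy example: 2 x 3 x 5 LR_HS, ratio 2, two HR_MS bands placed at bands 0 and 4.
X = np.arange(30, dtype=float).reshape(2, 3, 5)
X_up = bilinear_upsample(X, 2)
Y = np.full((4, 6, 2), 7.0)
Z_bil = assemble_zbil(X_up, Y, np.array([0, 4]))
```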
To extract features from the preliminary fusion result, a convolutional layer with a kernel size of 3 and a stride of 1 in both the width and height directions is added in the subsequent processing. To provide the nonlinear mapping of the model, a ReLU activation function is attached after this convolutional layer; the process is as follows:
(7)
$$Z_{bil} = \text{Conv}(Z_{bil}), \qquad Z_{bil} = \text{ReLU}(Z_{bil}).$$
3.3 Spatial Detail Recovery
3.3.1 Spatial Information Recovery Network
To ensure that the fusion result spatially recovers the original details as much as possible, a convolutional layer with a kernel size of 3 and a stride of 1 in width and height is applied to the output of the previous layer for further feature extraction. A ReLU activation function is then applied to introduce nonlinearity, which is represented as follows:
(8)
$$Z_{spat} = Z_{bil} + \text{Conv}(Z_{bil}), \qquad Z_{spat} = \text{ReLU}(Z_{spat}).$$
3.3.2 Spatial Loss Function
Given the inherent uncertainty of neural networks in the feature extraction process, the resulting model from training often lacks sufficient interpretability. To enhance the model’s ability to capture spatial edge information, accelerate the convergence of training toward spatial detail recovery, and improve the interpretability at the physical level, as shown in Figure 3, this module introduces a specialized spatial loss function to constrain the training process.
Figure 3.
Schematic diagram of spatial and spectral loss functions.
In this case, the loss function $Loss_{spat1}$ for the height dimension is calculated as follows:
(9)
$$\begin{aligned} E_{spat1}(k,v,i) &= Z_{spat}(k+1,v,i) - Z_{spat}(k,v,i), \\ \bar{E}_{spat1}(k,v,i) &= Z(k+1,v,i) - Z(k,v,i), \\ Loss_{spat1} &= \frac{\sum_{k=1}^{H-1}\sum_{v=1}^{W}\sum_{i=1}^{L} \left\| E_{spat1}(k,v,i) - \bar{E}_{spat1}(k,v,i) \right\|_2^2}{WL(H-1)}, \end{aligned}$$
where $E_{spat1}$ denotes the edge mapping of the training result, $\bar{E}_{spat1}$ the edge mapping of the reference image, and $Loss_{spat1}$ the loss of the training result in the height dimension. Furthermore, $k$, $v$, and $i$ index the height, width, and channel dimensions of the image, respectively. Similarly, the loss function $Loss_{spat2}$ for the width dimension is computed as follows:
(10)
$$\begin{aligned} E_{spat2}(k,v,i) &= Z_{spat}(k,v+1,i) - Z_{spat}(k,v,i), \\ \bar{E}_{spat2}(k,v,i) &= Z(k,v+1,i) - Z(k,v,i), \\ Loss_{spat2} &= \frac{\sum_{k=1}^{H}\sum_{v=1}^{W-1}\sum_{i=1}^{L} \left\| E_{spat2}(k,v,i) - \bar{E}_{spat2}(k,v,i) \right\|_2^2}{(W-1)HL}, \end{aligned}$$
where $E_{spat2}$ denotes the edge mapping of the training result, $\bar{E}_{spat2}$ the edge mapping of the reference image, and $Loss_{spat2}$ the loss of the training result in the width dimension. Both $Loss_{spat1}$ and $Loss_{spat2}$ are mean square errors between the edge maps of the training result and those of the reference image. Finally, the height and width losses are combined with equal weights to obtain the final spatial loss function $Loss_{spat}$:
(11)
$$Loss_{spat} = \frac{1}{2}Loss_{spat1} + \frac{1}{2}Loss_{spat2}.$$
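Eqs. (9)–(11) amount to a mean square error between first-order differences of the network output and of the reference. A minimal NumPy sketch, assuming both cubes are stored as (H, W, L) arrays (the function name is hypothetical, and a training implementation would use the autograd framework's tensors instead):

```python
import numpy as np

def spatial_loss(Z_spat, Z_ref):
    """Spatial edge loss of Eqs. (9)-(11) for (H, W, L) arrays."""
    # Differences between neighbouring rows (height direction), Eq. (9).
    e1 = np.diff(Z_spat, axis=0) - np.diff(Z_ref, axis=0)
    # Differences between neighbouring columns (width direction), Eq. (10).
    e2 = np.diff(Z_spat, axis=1) - np.diff(Z_ref, axis=1)
    loss1 = np.mean(e1 ** 2)   # mean over (H-1) * W * L entries
    loss2 = np.mean(e2 ** 2)   # mean over H * (W-1) * L entries
    return 0.5 * loss1 + 0.5 * loss2   # Eq. (11)

# Toy check: a unit ramp along the height axis against a flat reference.
ref = np.zeros((3, 3, 2))
ramp = np.arange(3.0)[:, None, None] * np.ones((3, 3, 2))
loss_ramp = spatial_loss(ramp, ref)        # height term 1, width term 0
loss_offset = spatial_loss(ref + 5.0, ref)
```

Because the loss compares differences rather than intensities, a constant offset between output and reference is not penalized; it constrains edges only.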
3.3.3 Spatial Attention Module
To enhance the neural network's ability to capture spatial edge details, this paper introduces a dedicated spatial attention module into the spatial information recovery module. Specifically, for the feature map $M$, this module uses two operations, average pooling and maximum pooling, to extract features from the height and width directions of the image and generate the descriptors $M^s_{avg}$ and $M^s_{max}$, respectively. Subsequently, the two are concatenated and higher-order spatial information is further extracted through a convolutional layer to finally obtain the spatial attention map $M_s(M) \in \mathbb{R}^{H \times W}$, which assigns each spatial location a weight indicating its importance in the image, thus guiding the model to focus its attention on the key regions. The process is shown in Figure 4.
Figure 4.
Spatial Attention Module.
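The process can be expressed by the following equation:
(12)
$$M_s(M) = \sigma\left(\text{Conv}_{n \times n}\left([\text{MaxPool}(M);\ \text{AvgPool}(M)]\right)\right).$$
Furthermore,
(13)
$$M_s(M) = \sigma\left(\text{Conv}_{n \times n}\left([M^s_{max};\ M^s_{avg}]\right)\right),$$
where $\sigma$ denotes the activation function and $\text{Conv}_{n \times n}$ the convolution operation with a kernel size of $n$.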
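A forward pass of Eq. (13) can be sketched in NumPy as below. Three points are assumptions: the pooling is taken across the channel axis (which is what yields an $H \times W$ map), $\sigma$ is taken to be the sigmoid, and zero padding is used so that the output keeps the input size. The kernel would be learned in practice; here it is passed in explicitly.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(M, kernel):
    """M: (H, W, C) feature map; kernel: (n, n, 2) weights with n odd.
    Returns the (H, W) attention map of Eq. (13)."""
    m_max = M.max(axis=2)                              # max pooling over channels
    m_avg = M.mean(axis=2)                             # average pooling over channels
    stacked = np.stack([m_max, m_avg], axis=2)         # (H, W, 2)
    n = kernel.shape[0]
    p = n // 2
    padded = np.pad(stacked, ((p, p), (p, p), (0, 0)))  # zero padding
    # windows[h, w, c, i, j] = padded[h + i, w + j, c]
    windows = sliding_window_view(padded, (n, n), axis=(0, 1))
    response = np.einsum("hwcij,ijc->hw", windows, kernel)
    return sigmoid(response)

M = np.random.default_rng(0).random((4, 5, 3))
att_zero = spatial_attention(M, np.zeros((3, 3, 2)))   # untrained: all weights 0.5
pick_max = np.zeros((3, 3, 2))
pick_max[1, 1, 0] = 1.0                                # centre tap on the max channel
att_max = spatial_attention(M, pick_max)
```

The resulting map is multiplied elementwise with the feature map (broadcast over channels) to reweight spatial locations.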
3.4 Spectral Detail Recovery
3.4.1 Spectral Information Recovery Network
Similar to spatial detail recovery, to ensure that the fusion result recovers as much original detail as possible in the spectral dimension, a convolutional layer with a kernel size of 3 and a stride of 1 in width and height is applied to the output of the previous layer to further extract features. A ReLU activation function is then applied to introduce nonlinearity. The process is represented as follows:
(14)
$$Z_{spec} = Z_{spat} + \text{Conv}(Z_{spat}), \qquad Z_{spec} = \text{ReLU}(Z_{spec}).$$
3.4.2 Spectral Loss Function
Similarly, as shown in Figure 3, to enhance the extraction of spectral edge information by the neural network, accelerate the convergence of training toward spectral detail recovery, and improve the interpretability of the network at the physical level, this layer employs a dedicated spectral loss function to constrain the training. The spectral loss function $Loss_{spec}$ is
(15)
$$\begin{aligned} E_{spec}(k,v,i) &= Z_{spec}(k,v,i+1) - Z_{spec}(k,v,i), \\ \bar{E}_{spec}(k,v,i) &= Z(k,v,i+1) - Z(k,v,i), \\ Loss_{spec} &= \frac{\sum_{k=1}^{H}\sum_{v=1}^{W}\sum_{i=1}^{L-1} \left\| E_{spec}(k,v,i) - \bar{E}_{spec}(k,v,i) \right\|_2^2}{HW(L-1)}, \end{aligned}$$
where $k$, $v$, and $i$ index the height, width, and channel dimensions of the image, respectively; $E_{spec}$ is the edge mapping of the training result in the spectral dimension and $\bar{E}_{spec}$ is the spectral edge mapping of the reference image. The spectral loss function $Loss_{spec}$ measures the reconstruction error in the spectral dimension as the mean square error between the spectral edge maps of the training result and the reference image.
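Eq. (15) follows the same pattern along the band axis. A minimal NumPy sketch, again assuming (H, W, L) arrays and a hypothetical function name:

```python
import numpy as np

def spectral_loss(Z_spec, Z_ref):
    """Spectral edge loss of Eq. (15) for (H, W, L) arrays."""
    # Differences between neighbouring bands of the output and of the reference.
    e = np.diff(Z_spec, axis=2) - np.diff(Z_ref, axis=2)
    return np.mean(e ** 2)     # mean over H * W * (L-1) entries

# Toy check: a spectrum rising by 2 per band against a flat reference.
ref = np.zeros((2, 2, 4))
slope = 2.0 * np.arange(4.0)[None, None, :] * np.ones((2, 2, 4))
loss_slope = spectral_loss(slope, ref)
loss_same = spectral_loss(slope, slope)
```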
3.4.3 Channel Attention Module
To improve the ability of the neural network to capture the details of spectral edges, this paper introduces the channel attention module into the spectral information recovery module. First, average pooling and maximum pooling are used to aggregate spatial information on the feature maps to obtain the descriptors $M^c_{avg}$ and $M^c_{max}$, respectively, where the former captures the global average information and the latter highlights salient features, so that spatial information is extracted from different angles.
Subsequently, these two descriptors are fed into a shared network structure to generate the final channel attention map $M_c \in \mathbb{R}^{L \times 1 \times 1}$. The network consists of a multilayer perceptron (MLP) with a hidden layer, which is responsible for extracting key channel information. To reduce the computational overhead, the hidden layer dimension is set as $\mathbb{R}^{L/r \times 1 \times 1}$, where $r$ is the reduction rate.
Finally, element-by-element summation of the processed feature vectors is performed to fuse the information of average pooling and maximum pooling to enhance the model’s focus on critical channels. The process is shown in Figure 5.
Figure 5.
Channel attention module.
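The process can be expressed by the following formula:
(16)
$$M_c(M) = \sigma\left(\text{MLP}(\text{MaxPool}(M)) + \text{MLP}(\text{AvgPool}(M))\right).$$
Furthermore,
(17)
$$M_c(M) = \sigma\left(\omega_1(\omega_0(M^c_{avg})) + \omega_1(\omega_0(M^c_{max}))\right),$$
where $\sigma$ denotes the activation function, $\omega_0 \in \mathbb{R}^{L/r \times L}$, and $\omega_1 \in \mathbb{R}^{L \times L/r}$. Here, the MLP weights $\omega_0$ and $\omega_1$ are shared for both inputs and $\omega_0$ is followed by the ReLU activation function.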
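Eq. (17) can be sketched as a shared two-layer MLP applied to both pooled descriptors. The sketch assumes an (H, W, L) feature map, takes $\sigma$ to be the sigmoid, and omits bias terms; the weights `w0` and `w1` would be learned in practice and are set by hand here only to make the arithmetic easy to follow.

```python
import numpy as np

def channel_attention(M, w0, w1):
    """M: (H, W, L) feature map; w0: (L // r, L); w1: (L, L // r).
    Returns the (L,) channel weights of Eq. (17)."""
    m_avg = M.mean(axis=(0, 1))            # global average pooling, shape (L,)
    m_max = M.max(axis=(0, 1))             # global max pooling, shape (L,)

    def mlp(v):
        # Shared MLP: w0, then ReLU, then w1.
        return w1 @ np.maximum(w0 @ v, 0.0)

    return 1.0 / (1.0 + np.exp(-(mlp(m_avg) + mlp(m_max))))

# Toy example with L = 4 channels and reduction rate r = 2.
M = np.ones((3, 3, 4))
w0 = np.ones((2, 4))
w1 = np.ones((4, 2))
weights = channel_attention(M, w0, w1)     # each branch gives 8, so sigmoid(16)
weights_zero = channel_attention(M, np.zeros((2, 4)), np.zeros((4, 2)))
```

The weights are then broadcast over the spatial dimensions and multiplied with the feature map to emphasize informative bands.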
Eventually, the fusion result Zspec (HR_HSIcnn) can be obtained after the training of the three neural network modules.
3.5 Hyperspectral Super-resolution
Since the first optimization yields a preliminary fusion result HR_HSIcnn, the final fusion problem can be recast as a new optimization problem:
(18)
$$\min_{Z} \ \frac{1}{2}\left\| ZBH - X \right\|_F^2 + \frac{1}{2}\left\| RZ - Y \right\|_F^2 + \lambda\left\| AS - \mathrm{HR\_HSI}_{cnn} \right\|_F^2,$$
where $\lambda$ is the regularization parameter. To solve this problem, it can be split into two steps: spatial optimization and spectral optimization. To make the result globally optimal, this paper adopts the traditional matrix decomposition method, where the estimate $Z$ is decomposed into two submatrices and the optimal $Z$ is obtained by solving for the optimal submatrices. The linear spectral mixing model (LSMM) assumes that HSI vectors can be expressed as linear combinations of a few distinct spectral signatures, formulated as the product of endmember and abundance matrices. Given an HSI image $Z \in \mathbb{R}^{L \times W \times H}$ containing $J$ endmembers, the LSMM can be represented as
(19)
$$Z = AS + E.$$
Here, $A \in \mathbb{R}^{L \times J}$ ($L > J$) denotes the endmember matrix, whose columns are the spectral signatures of the pure materials present in the hyperspectral image. The matrix $S \in \mathbb{R}^{J \times HW}$ ($HW > J$) denotes the abundance matrix, which gives the proportion of each constituent at every pixel of the hyperspectral image. Moreover, $E \in \mathbb{R}^{L \times HW}$ is the residual matrix. Thus the problem is transformed into the following:
(20)
$$\min_{A,S} \ \frac{1}{2}\left\| ASBH - X \right\|_F^2 + \frac{1}{2}\left\| RAS - Y \right\|_F^2 + \lambda\left\| AS - \mathrm{HR\_HSI}_{cnn} \right\|_F^2.$$
Based on this theory, the fusion result of the neural network can be further optimized for better performance. Since HR_HSIcnn is highly correlated with $Z$ in the spectral dimension, a singular value decomposition is used to initialize $A$. Then $A$ and $S$ are updated by solving the corresponding subproblems separately. The specific steps are as follows.
First, singular value decomposition is applied to the HR_HSIcnn result fitted by the neural network to obtain the initial endmember matrix $A$:
(21)
$$U, \Sigma, A^{T} = \text{SVD}\left(\mathrm{HR\_HSI}_{cnn}^{T}\right).$$
The abundance matrix $S$ is obtained by solving the following optimization problem:
(22)
$$\min_{S} \ \left\| RAS - Y \right\|_F^2 + \lambda_1\left\| AS - \mathrm{HR\_HSI}_{cnn} \right\|_F^2,$$
where $\|\cdot\|_F$ is the Frobenius norm. This optimization problem is solved by iterative updates to obtain $S$. Here, $\lambda_1$ is the regularization parameter and $\lambda_1 > 0$.
Finally, the endmember matrix $A$ is updated by solving the following optimization problem:
(23)
$$\min_{A} \ \left\| ASBH - X \right\|_F^2 + \mu_1\left\| AS - \mathrm{HR\_HSI}_{cnn} \right\|_F^2,$$
where $\mu_1$ is the regularization parameter and $\mu_1 > 0$. The final estimate HR_HSI can be obtained by computing $Z = AS$:
(24)
$$\mathrm{HR\_HSI} = Z = AS.$$
In summary, $X$ and $Y$ are first fused by the neural network to obtain HR_HSIcnn, which is then further optimized by the matrix decomposition technique to obtain the final fused hyperspectral image HR_HSI.
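The three steps of Eqs. (21)–(24) can be sketched in NumPy as below. Two points are assumptions rather than the procedure described above: each subproblem is solved in closed form through its normal equations instead of by iterative updates, and the toy data are noise-free with the CNN output `C` set equal to the true image. The combined operator `BH` and all sizes are hypothetical.

```python
import numpy as np

def init_endmembers(C, J):
    """Eq. (21): take A from the leading right singular vectors of C^T (C is L x HW)."""
    _, _, Vt = np.linalg.svd(C.T, full_matrices=False)
    return Vt[:J].T                                   # (L, J), orthonormal columns

def update_S(A, R, Y, C, lam1):
    """Minimiser of Eq. (22): ((RA)^T RA + lam1 A^T A) S = (RA)^T Y + lam1 A^T C."""
    RA = R @ A
    lhs = RA.T @ RA + lam1 * (A.T @ A)
    rhs = RA.T @ Y + lam1 * (A.T @ C)
    return np.linalg.solve(lhs, rhs)

def update_A(S, BH, X, C, mu1):
    """Minimiser of Eq. (23): A (D D^T + mu1 S S^T) = X D^T + mu1 C S^T, D = S BH."""
    D = S @ BH
    lhs = D @ D.T + mu1 * (S @ S.T)                   # symmetric (J, J)
    rhs = X @ D.T + mu1 * (C @ S.T)                   # (L, J)
    return np.linalg.solve(lhs, rhs.T).T

# Toy problem: a rank-J image observed without noise.
rng = np.random.default_rng(0)
L, J, HW, hw, l = 8, 3, 16, 4, 4
Z_true = rng.random((L, J)) @ rng.random((J, HW))
BH = np.kron(np.eye(hw), np.full((HW // hw, 1), hw / HW))   # blur + downsample stand-in
R = rng.random((l, L))
X, Y, C = Z_true @ BH, R @ Z_true, Z_true.copy()

A = init_endmembers(C, J)
S = update_S(A, R, Y, C, lam1=1.0)
A = update_A(S, BH, X, C, mu1=1.0)
Z = A @ S                                             # Eq. (24)
```

In this idealized case the product $AS$ reproduces the true image, which only checks that the updates are consistent with Eqs. (22) and (23). With a real, imperfect HR_HSIcnn the two regularization weights trade off fidelity to the observations against fidelity to the network output.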
3.6 Variants of the QOM
3.6.1 Ablation Study on Core Modules of QOM
To systematically evaluate the contribution of each component in the proposed QOM, a series of ablation experiments are conducted by decomposing the overall framework into several representative submodels. These submodels are designed to isolate the effects of spatial fusion, spectral fusion, and quadratic optimization, thereby verifying the necessity of each module in the complete architecture.
Specifically, the following submodels are constructed: QOMspat retains only the spatial fusion branch and is optimized using the spatial loss function Lossspat, focusing on spatial detail reconstruction; QOMspec preserves only the spectral fusion branch and is supervised by the spectral loss function Lossspec to emphasize spectral fidelity; QOMquad corresponds to the quadratic optimization stage based on matrix decomposition, which refines the initial fusion results through iterative updates of the abundance, dictionary, and coefficient matrices without introducing an explicit learning-based loss function. The full QOM integrates the spatial fusion module, spectral fusion module, and quadratic optimization stage into a unified framework, jointly constrained by both Lossspat and Lossspec.
In addition, the bilinear interpolation operation employed at the first layer of QOM to restore the spatial and spectral dimensions of the inputs is denoted as QOMinsert for clarity. Although this component is not treated as an independent submodel, it serves as a common preprocessing step for all configurations.
3.6.2 Ablation Study on Attention Mechanisms
To further investigate the role of attention mechanisms in spatial–spectral feature interaction, additional ablation experiments are conducted by selectively enabling or disabling the spatial attention and spectral (channel) attention modules within the fusion network. These experiments aim to quantify the contribution of attention-guided feature enhancement to the overall fusion performance.
Based on the full QOM architecture, two attention-related submodels are introduced. The QOM_SA removes the spatial attention module while retaining the spectral attention mechanism, allowing the model to assess the impact of spatial attention on spatial detail recovery and spatial–spectral alignment. Conversely, QOM_SE removes the spectral attention module while preserving spatial attention, focusing on the influence of channel-wise feature weighting on spectral reconstruction accuracy. Both QOM_SA and QOM_SE retain the spatial and spectral fusion branches and are jointly supervised by Lossspat and Lossspec to ensure consistent optimization objectives.
Through comparisons among QOMspat, QOMspec, QOM_SA, QOM_SE, and the complete QOM, the effectiveness of individual fusion branches and attention mechanisms can be comprehensively analyzed. The detailed network configurations and corresponding loss functions of all submodels are summarized in Table I.
Table I.
Ablation settings of QOM with different network structures, branches, attention mechanisms, and loss functions.
Submodel | Input | Network structure | Spatial branch | Spectral branch | Spatial attention | Spectral attention | Loss function
QOMspat | QOMinsert | Conv(3,1,1), Conv(3,1,1), ReLU() | ✓ | × | × | × | Lossspat
QOMspec | QOMinsert | Conv(3,1,1), Conv(3,1,1), ReLU() | × | ✓ | × | × | Lossspec
QOM_SA | QOMinsert | Conv(3,1,1), Conv(3,1,1), Conv(3,1,1), ReLU() | ✓ | ✓ | × | ✓ | Lossspat + Lossspec
QOM_SE | QOMinsert | Conv(3,1,1), Conv(3,1,1), Conv(3,1,1), ReLU() | ✓ | ✓ | ✓ | × | Lossspat + Lossspec
QOMquad | QOMinsert | - | × | × | × | × | -
QOM | QOMinsert | Conv(3,1,1), Conv(3,1,1), Conv(3,1,1), ReLU() | ✓ | ✓ | ✓ | ✓ | Lossspat + Lossspec
4. Experiments
4.1 Datasets
In this paper, seven datasets are used to validate the proposed quadratic optimization model to evaluate its effectiveness. They are PaviaU, Botswana, Pavia, IndianP, Washington DC, Berlin, and Augsburg.
(1) Pavia University (PaviaU): This dataset is part of the hyperspectral data acquired in 2003 over the city of Pavia, Italy, by the German Reflective Optics System Imaging Spectrometer (ROSIS) airborne sensor. The imager captured 115 bands in the 0.43–0.86 μm wavelength range with a spatial resolution of 1.3 m. Twelve bands were removed due to noise interference, leaving 103 spectral bands. The image measures 610 × 340 pixels, for a total of 207,400 pixels.
(2) Pavia Center: Like PaviaU, this dataset was captured by ROSIS but with different imaging settings. In contrast to the Pavia University dataset, it comprises 102 bands, one fewer than the former, each measuring 1096 × 1096 pixels.
(3) Botswana: This dataset originates from NASA's EO-1 satellite. The sensors on board the EO-1 satellite captured data in 242 bands, acquired at 10 nm intervals from 400 to 2500 nm, with a pixel resolution of 30 m in each band. Following the exclusion of noisy bands, 145 bands were preserved, each measuring 1476 × 256 pixels.
(4) Indian Pines (IndianP): This classic dataset was collected by the AVIRIS sensor over the Indian Pines test site in Indiana, capturing spectral images over a total of 224 bands in the range of 0.4–2.5 μm. After removing the unusable bands located in the absorption regions, 200 spectral bands remain, each measuring 145 × 145 pixels.
(5) Washington DC: This dataset includes an aerial hyperspectral image captured using the HYDICE sensor, containing 191 bands from 0.4 to 2.4 μm. The dataset dimensions are 1208 × 307 pixels.
(6) Berlin: This dataset is a multimodal urban scene over Berlin, Germany, composed of hyperspectral and SAR data. The hyperspectral image is a simulated EnMAP product generated from HyMap data, containing 244 spectral bands ranging from 0.4 to 2.5 μm with a spatial resolution of 30 m and a scene size of 797 × 220 pixels. The corresponding SAR data are acquired from Sentinel-1 dual-polarization (VV–VH) observations and preprocessed into PolSAR covariance features [39].
(7) Augsburg: This dataset integrates hyperspectral, SAR, and DSM data collected over the city of Augsburg, Germany. The hyperspectral image is acquired by the HySpex sensor with 180 bands covering 0.4–2.5 μm, while the SAR data are obtained from Sentinel-1 dual-polarization measurements and represented by four PolSAR features. A single-band DSM is additionally included to provide elevation information. All modalities are resampled to a unified spatial resolution of 30 m, resulting in images of 332 × 485 pixels [39].
4.2
Evaluation Metrics
In our experiments, we use four evaluation metrics common in the field of spectral fusion to compare the performance of QOM with that of ten other fusion methods. In this section, H, W, and L denote the height, width, and number of channels of the fused image, and k, v, and i index its rows, columns, and bands, respectively.
(1)
Peak signal-to-noise ratio (PSNR): PSNR evaluates the spatial quality of the fusion result band by band. For the ith band, it is defined as
(25)
$$\mathrm{PSNR}_i = 10\log_{10}\left(\frac{\max(R_i)^2}{\frac{1}{HW}\lVert R_i - Z_i\rVert_2^2}\right),$$
where R_i and Z_i denote the reference and estimated images of the ith band, respectively, and ∥·∥_2 refers to the l2-norm or Euclidean norm. The final PSNR is the average of the PSNRs of all the bands. Higher values of PSNR indicate better fusion results.
(2)
Root mean square error (RMSE): RMSE can be used to measure the difference between the reference image and the estimated image. It is defined as
(26)
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{k=1}^{H}\sum_{v=1}^{W}\sum_{i=1}^{L}\left(Z_i(k,v) - R_i(k,v)\right)^2}{LHW}},$$
where the smaller the RMSE value, the better the fusion.
(3)
Spectral Angle Mapper (SAM): SAM is used to evaluate the similarity of spectral information at each pixel. For the pixel at position (k, v), it is defined as
(27)
$$\mathrm{SAM}(k,v) = \arccos\left(\frac{\langle Z(k,v), R(k,v)\rangle}{\lVert R(k,v)\rVert_2\,\lVert Z(k,v)\rVert_2}\right),$$
where R(k, v) and Z(k, v) denote the spectral vectors of the reference and estimated images at position (k, v), respectively, and ⟨·,·⟩ denotes the inner product operation. The final SAM is the average of the SAMs of all the pixels. The smaller the SAM value, the better the fusion.
(4)
Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS): ERGAS is used to evaluate the quality of the fusion result, which is defined as
(28)
$$\mathrm{ERGAS} = \frac{100}{r}\sqrt{\frac{1}{L}\sum_{i=1}^{L}\frac{\lVert R_i - Z_i\rVert_2^2}{\mu^2(R_i)}},$$
where r is the sampling ratio during downsampling from HR_HSI to LR_HSI and μ(Ri) denotes the average value of the reference image in the ith band. The smaller the ERGAS value, the better the fusion.
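As an illustration, the four metrics of Eqs. (25)–(28) can be sketched in NumPy as follows. This is a minimal sketch, not the evaluation code used in the experiments. It assumes images stored as arrays of shape (H, W, L), reports SAM in degrees, adds a small `eps` to avoid division by zero, and uses the per-band mean squared error inside ERGAS; the function names and these conventions are assumptions made for the example.

```python
import numpy as np

def psnr(ref, est):
    """Mean of the per-band PSNR values; ref and est have shape (H, W, L)."""
    mse = np.mean((ref - est) ** 2, axis=(0, 1))   # (1/HW) * ||R_i - Z_i||^2 per band
    peak = ref.max(axis=(0, 1))                    # max(R_i) per band
    return float(np.mean(10.0 * np.log10(peak ** 2 / mse)))

def rmse(ref, est):
    """Root mean square error over all pixels and bands."""
    return float(np.sqrt(np.mean((est - ref) ** 2)))

def sam(ref, est, eps=1e-12):
    """Mean spectral angle over all pixels, in degrees (unit is an assumption)."""
    dot = np.sum(ref * est, axis=2)
    norms = np.linalg.norm(ref, axis=2) * np.linalg.norm(est, axis=2)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return float(np.degrees(np.mean(np.arccos(cos))))

def ergas(ref, est, r=4):
    """ERGAS with sampling ratio r, using the per-band MSE (assumed normalization)."""
    mse = np.mean((ref - est) ** 2, axis=(0, 1))
    mean_sq = np.mean(ref, axis=(0, 1)) ** 2       # mu^2(R_i) per band
    return float(100.0 / r * np.sqrt(np.mean(mse / mean_sq)))
```

With a constant reference of 1 and an estimate of 0.9 everywhere, the per-band MSE is 0.01, so the sketch gives a PSNR of 20 dB, an RMSE of 0.1, and an ERGAS of 2.5 for r = 4.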
4.3
Experimental Settings
Data division in spectral fusion research has been difficult due to limited access to datasets. To fully utilize the existing data, this paper draws on the research method of Zhang et al. [12] to divide the training and test sets on the same dataset in the following manner.
For the IndianP dataset, due to its limited spatial size, we select the center region (64 × 64) as the test data and the remaining part as the training data. For the other six datasets, the center region (128 × 128) is cropped as the test set and the remaining region is used for training. During training, subregions of the same size as the test data are randomly cropped from the training region at each iteration, and the region cut out for testing is filled with 0 to maintain data integrity.
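The division described above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the image is stored as an (H, W, L) array, the test window is centered by integer division, and random training patches are allowed to overlap the zero-filled area. The helper names `split_center` and `random_patch` are hypothetical.

```python
import numpy as np

def split_center(hsi, test_size):
    """Cut the central test_size x test_size window out as the test set.

    Returns (train, test); the cut-out window is filled with 0 in the training copy.
    """
    h, w, _ = hsi.shape
    top, left = (h - test_size) // 2, (w - test_size) // 2
    test = hsi[top:top + test_size, left:left + test_size].copy()
    train = hsi.copy()
    train[top:top + test_size, left:left + test_size] = 0
    return train, test

def random_patch(train, size, rng):
    """Randomly crop one size x size training patch (may overlap the zeroed window)."""
    h, w, _ = train.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return train[top:top + size, left:left + size]

# Example with IndianP-like dimensions (145 x 145 pixels, 200 bands, 64 x 64 test window).
rng = np.random.default_rng(0)
cube = rng.random((145, 145, 200))
train, test = split_center(cube, 64)
patch = random_patch(train, 64, rng)
```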
In addition, when generating LR_HSI, we use a Gaussian filter of size 5 × 5 with standard deviation 2 for downsampling and set the downsampling ratio r = 4. The HR_MSI consists of five bands selected at equal intervals from HR_HSI. Figure 6 shows how the training and test sets are divided, taking the PaviaU dataset as an example.
Figure 6.
The training and test sets are divided (using the PaviaU dataset as an example), and the blanks that cut the training set are filled with 0.
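The generation of LR_HSI and HR_MSI from HR_HSI can be sketched as below. The 5 × 5 Gaussian kernel with standard deviation 2, the ratio r = 4, and the five equally spaced bands follow the settings given in this section. The reflect padding at the borders, the subsampling offset, and the rounding of band indices are assumptions of this sketch, since they are not specified in the text.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=2.0):
    """Normalized size x size Gaussian kernel."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-ax ** 2 / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def make_lr_hsi(hr_hsi, r=4, size=5, sigma=2.0):
    """Blur every band with the Gaussian kernel, then keep every r-th pixel."""
    k = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(hr_hsi, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    h, w, _ = hr_hsi.shape
    blurred = np.zeros(hr_hsi.shape, dtype=float)
    for dy in range(size):          # the kernel is symmetric, so correlation
        for dx in range(size):      # and convolution give the same result
            blurred += k[dy, dx] * padded[dy:dy + h, dx:dx + w, :]
    return blurred[::r, ::r, :]

def make_hr_msi(hr_hsi, n_bands=5):
    """Select n_bands bands at (approximately) equal spectral intervals."""
    idx = np.linspace(0, hr_hsi.shape[2] - 1, n_bands).round().astype(int)
    return hr_hsi[:, :, idx], idx
```

For a 128 × 128 test region with 103 bands, this yields an LR_HSI of 32 × 32 × 103 and an HR_MSI of 128 × 128 × 5.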
In the hyperspectral super-resolution task, the corresponding B and H matrices can be obtained directly from the generation of LR_HSI and HR_MSI while the R matrix can be estimated by the HySure method [40, 41]. The regularization parameters λ1 and μ1 are set at 0.002 in the subsequent optimization solution process. The activation function of both the spatial and channel attention modules uses ReLU. The convolution parameter n is the number of channels of the corresponding feature map.
The methods compared in this experiment include two traditional methods, LTTR [42] and CNMF [15], as well as eight advanced deep learning methods proposed in recent years: MSDCNN [33], ResTFNet [31], ConSSFCNN [32], DHIF-Net [34], UDALN [35], RECF [36], ADASR [37], and CasFormer [38]. To ensure fairness, all methods use the same data preprocessing, evaluation metric calculation, and other parameter settings. Each method is trained for 10,000 iterations with the Adam optimizer and a learning rate of 1e-4.
In addition, this experiment was implemented based on PyTorch 2.0.1+cu118, using Python 3.9.13 as the programming language. All neural-network-related computations were performed on an NVIDIA GeForce RTX 4090 GPU with 24 GB of graphics memory.
4.4
Ablation Experiments with QOM
Following the variant definitions, ablation experiments are conducted on the Pavia Center dataset. Results are analyzed in terms of both core module contributions and attention mechanism effects. All experiments are performed under identical training and evaluation conditions.
4.4.1
Ablation Results on Core Functional Modules
The contribution of spatial fusion, spectral fusion, and quadratic optimization is assessed by comparing QOMspat, QOMspec, QOMquad, and the complete QOM. Quantitative results are summarized in Table II, the corresponding fusion images and difference maps are displayed in Figure 7, and the convergence curves are shown in Figure 8.
Table II.
Comparison of QOM ablation experiments on the Pavia Center dataset. The optimal values for each evaluated metric are highlighted in bold.
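| Method (Pavia Center) | PSNR | RMSE | SAM | ERGAS |
| --- | --- | --- | --- | --- |
| QOMspat | 24.4523 | 15.2730 | 11.8616 | 14.8537 |
| QOMspec | 36.5408 | 3.7975 | 4.1254 | 4.3109 |
| QOMquad | 31.9327 | 6.4551 | 4.8480 | 6.1569 |
| QOM | 38.3234 | 3.0929 | 3.5479 | 3.5748 |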
Figure 7.
Results of QOM ablation experiments on the Pavia dataset: (a) QOMspat, (b) QOMspec, (c) QOMquad, and (d) QOM. The upper row shows the pseudo-RGB images obtained by QOMspat, QOMspec, QOMquad, and QOM, respectively, where the R, G, and B channels are set at 66, 28, and 0, respectively. The lower row shows the pseudo-RGB image of the difference between the results of each method and the reference image under the same channel settings.
Figure 8.
Changes in the evaluation metrics of submodels in ablation experiments: (a) SAM, (b) PSNR, (c) RMSE, and (d) ERGAS.
The QOMspat and QOMspec both improve fusion performance over the interpolation baseline, demonstrating the effectiveness of learning-based fusion. The QOMspec consistently outperforms QOMspat, indicating that spectral reconstruction is relatively easier than fine-grained spatial recovery. The QOMquad enhances spatial structures compared with QOMspat but remains inferior to QOMspec in spectral metrics, highlighting the advantage of deep learning in modeling complex spectral correlations.
The full QOM, integrating spatial and spectral fusion with quadratic optimization, achieves the best performance across all metrics and exhibits faster, more stable convergence. This confirms the complementarity of the three modules and the importance of combining learning-based and model-driven strategies for high-quality fusion.
4.4.2
Ablation Results on Attention Mechanisms
The impact of spatial and spectral attention is analyzed by comparing QOM_SA, QOM_SE, and the full QOM. The results are reported in Table III together with visual examples in Figure 9.
Table III.
Comparison of attention ablation experiments on the Pavia Center dataset. The optimal values for each metric are highlighted in bold.
| Method (Pavia Center) | PSNR | RMSE | SAM | ERGAS |
| --- | --- | --- | --- | --- |
| QOM_SA | 34.9676 | 4.5516 | 4.6287 | 4.9676 |
| QOM_SE | 36.9200 | 3.6500 | 4.7200 | 4.6100 |
| QOM | 38.2374 | 3.1237 | 3.5572 | 3.5851 |
Figure 9.
Results of attention ablation experiments on the Pavia Center dataset: (a) QOM_SA, (b) QOM_SE, and (c) QOM. The upper row shows pseudo-RGB images obtained by QOM_SA, QOM_SE, and the full QOM, where the R, G, and B channels are set at 66, 28, and 0, respectively. The lower row shows the corresponding difference images between each result and the reference image using the same channels.
The QOM_SA significantly outperforms QOMspat, showing that spectral attention alone improves spectral consistency and suppresses redundant features. However, its performance is lower than the complete QOM, particularly in PSNR and ERGAS, indicating the necessity of spatial attention for fine structural recovery. The QOM_SE performs comparably to QOMspec in PSNR and RMSE but shows degradation in SAM and ERGAS, suggesting that spatial attention enhances spatial details but cannot fully compensate for the lack of spectral attention.
The full QOM consistently achieves the best values across all metrics, confirming that spatial and spectral attention mechanisms are complementary rather than interchangeable. By jointly exploiting spatially adaptive and channel-wise feature reweighting, these mechanisms substantially improve spatial–spectral feature representation.
While attention mechanisms are commonly used in other fusion networks, our experiments quantitatively demonstrate their complementary contribution within the QOM framework, improving both spatial structure and spectral fidelity beyond the baseline submodels.
4.4.3
Overview
The ablation experiments confirm the following: (1) each core module—spatial fusion, spectral fusion, and quadratic optimization—contributes meaningfully to the final performance; and (2) spatial and spectral attention mechanisms are complementary and critical for optimal spatial–spectral fusion. These findings reveal that the full QOM design effectively integrates learning-based and model-driven strategies with attention-guided feature enhancement to achieve leading performance in hyperspectral and multispectral image fusion.
4.5
Performance Comparison
To validate the performance of the proposed model, comparison experiments are conducted on seven datasets: PaviaU, Botswana, Pavia, IndianP, Washington DC, Berlin, and Augsburg. The two traditional methods in the comparison, LTTR and CNMF, are implemented using the idea of matrix decomposition.
(1) Results of PaviaU: Table IV reports the experimental results of QOM and other state-of-the-art methods on the PaviaU dataset. The QOM achieves the highest PSNR of 44.0235, the lowest RMSE of 1.5474, the smallest SAM of 1.8544, and the lowest ERGAS of 1.1703, outperforming all competing methods. Traditional approaches such as LTTR and CNMF lag behind, reflecting their limited capacity to extract latent spatial–spectral features from hyperspectral data. Among deep learning methods, ADASR achieves a PSNR of 42.9031, mainly benefiting from GAN-based data augmentation, while CasFormer attains 43.0503 by using a cascaded Transformer architecture with dual-imaging fusion, spatial coherence alignment, and decoupling-based loss that enforces spatial consistency and spectral fidelity. The QOM surpasses CasFormer by 0.97 dB in PSNR and shows consistent improvement across RMSE, SAM, and ERGAS. This advantage stems from QOM’s spatial and channel attention modules, which enable precise focus on key features and effective suppression of irrelevant information, enhancing both spatial detail and spectral fidelity. Figure 10 visualizes pseudo-RGB images and difference maps. The QOM clearly preserves fine spatial structures and reduces spectral distortion compared with other methods, further validating its superior performance in hyperspectral image fusion.
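To put the PSNR margin over CasFormer into perspective, the gap can be converted into an error ratio. The worked step below is an illustrative calculation. It assumes that both methods are scored with Eq. (25) against the same reference, so that the peak term max(R_i) cancels and the difference in band-averaged PSNR depends only on the per-band mean squared errors MSE_i:

```latex
\Delta\mathrm{PSNR} = 44.0235 - 43.0503 = 0.9732\ \mathrm{dB},
\qquad
\left(\prod_{i=1}^{L}\frac{\mathrm{MSE}_i^{\mathrm{QOM}}}{\mathrm{MSE}_i^{\mathrm{CasFormer}}}\right)^{1/L}
= 10^{-\Delta\mathrm{PSNR}/10} = 10^{-0.09732} \approx 0.80 .
```

Under these assumptions, a gain of roughly 1 dB corresponds to a per-band mean squared error that is about 20% lower in the geometric mean over bands.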
Table IV.
Experimental results of QOM and other state-of-the-art methods on the PaviaU dataset. The optimal values corresponding to each evaluated metric are highlighted in bold.
| Method (Pavia University) | PSNR | RMSE | SAM | ERGAS |
| --- | --- | --- | --- | --- |
| LTTR [42] | 28.6496 | 9.4202 | 5.1883 | 5.5504 |
| CNMF [15] | 30.5356 | 7.5816 | 2.2469 | 3.8939 |
| MSDCNN [33] | 40.7610 | 2.3863 | 2.6160 | 1.6779 |
| ResTFNet [31] | 41.4482 | 2.0815 | 2.3620 | 1.5178 |
| ConSSFCNN [32] | 39.9988 | 2.4595 | 2.4773 | 1.6721 |
| DHIF-Net [34] | 41.2316 | 2.1340 | 2.0781 | 1.4830 |
| UDALN [35] | 41.7774 | 2.0041 | 2.0308 | 1.5733 |
| RECF [36] | 42.0245 | 1.9479 | 2.0250 | 1.3729 |
| ADASR [37] | 42.9031 | 2.7605 | 2.9931 | 1.4674 |
| CasFormer [38] | 43.0503 | 1.7309 | 2.0022 | 1.2504 |
| Our QOM | 44.0235 | 1.5474 | 1.8544 | 1.1703 |
Figure 10.
Comparison of experimental results between QOM and other methods on the PaviaU dataset: (a) LTTR, (b) CNMF, (c) MSDCNN, (d) ResTFNet, (e) ConSSFCNN, (f) DHIF-Net, (g) UDALN, (h) RECF, (i) ADASR, (j) CasFormer, (k) QOM, and (l) GT. The first row shows the pseudo-RGB images computed by each method, where the RGB channels are selected as 68, 28, and 0 band data, respectively. The second row shows the pseudo-RGB images of the difference between the results of each method and the reference image under the same band settings.
(2) Results of Pavia: Figure 11 shows the pseudo-RGB images obtained by the QOM and other methods on the Pavia dataset. Table V reports four key evaluation metrics for each method. Notably, the Pavia and PaviaU datasets share similar dimensionality and acquisition conditions, making them highly comparable. From Table V, the QOM achieves the best performance across all metrics, with PSNR = 39.1249, RMSE = 3.1144, SAM = 3.1366, and ERGAS = 3.2364. Compared with CasFormer (PSNR = 37.5247, RMSE = 3.3908), the QOM improves PSNR by 1.6 dB and reduces RMSE by 0.28, indicating superior spatial and spectral reconstruction. The effectiveness of the QOM can be attributed to its spatial and channel attention modules, which enhance spatial resolution, preserve spectral fidelity, and reduce spectral aliasing. In contrast, CasFormer leverages a cascaded Transformer with dual-imaging fusion, aligning spatial coherence and recovering spectral sequences under physical constraints, which explains its competitive performance. Other deep learning baselines, such as DHIF-Net, UDALN, and ADASR, achieve moderate improvements, but none match the joint spatial–spectral optimization of the QOM. Figure 12 further illustrates the convergence trends over 10,000 training iterations. The QOM demonstrates faster and more stable convergence than the other methods, reflecting the efficiency of its hierarchical network in capturing both spatial details and spectral characteristics.
Figure 11.
Comparison of experimental results between QOM and other methods on the Pavia dataset: (a) LTTR, (b) CNMF, (c) MSDCNN, (d) ResTFNet, (e) ConSSFCNN, (f) DHIF-Net, (g) UDALN, (h) RECF, (i) ADASR, (j) CasFormer, (k) QOM, and (l) GT. The first row displays the pseudo-RGB image generated from the fusion results of each method, where the RGB channels are selected in the 68th, 28th, and 0th bands, respectively. The second row displays the pseudo-RGB image generated from the difference between the results of each method and the reference image under the same channel settings.
Figure 12.
Changes in four assessment metrics during the Pavia dataset experiment: (a) PSNR, (b) RMSE, (c) SAM, and (d) ERGAS.
Table V.
Experimental results of QOM and other state-of-the-art methods on the Pavia Center dataset. The optimal values corresponding to each evaluated metric are highlighted in bold.
| Method (Pavia Center) | PSNR | RMSE | SAM | ERGAS |
| --- | --- | --- | --- | --- |
| LTTR [42] | 25.2221 | 28.3373 | 17.0800 | 31.1229 |
| CNMF [15] | 19.0836 | 13.9777 | 4.3635 | 11.4360 |
| MSDCNN [33] | 35.6519 | 4.2067 | 5.2700 | 4.7723 |
| ResTFNet [31] | 36.2587 | 3.9229 | 4.6407 | 4.4106 |
| ConSSFCNN [32] | 34.9643 | 4.5533 | 4.7537 | 4.9828 |
| DHIF-Net [34] | 36.7664 | 3.7002 | 3.7281 | 3.8520 |
| UDALN [35] | 36.9261 | 3.6748 | 3.6303 | 3.7476 |
| RECF [36] | 37.1532 | 3.5390 | 3.6152 | 3.7161 |
| ADASR [37] | 37.4259 | 3.4296 | 3.6723 | 3.9191 |
| CasFormer [38] | 37.5247 | 3.3908 | 3.8754 | 3.8608 |
| Our QOM | 39.1249 | 3.1144 | 3.1366 | 3.2364 |
(3) Results of Botswana: Figure 13 shows the experimental results of the QOM and other compared methods on the Botswana dataset, and Table VI lists the corresponding quantitative metrics. Due to the large spatial and spectral dimensions of this dataset, QOM’s dual loss design—combining spatial and spectral constraints—effectively captures both spatial structures and spectral signatures, leading to superior fusion performance. This is reflected in Table VI, where the QOM achieves the highest PSNR of 39.5868, the lowest RMSE of 0.3892, and the best SAM and ERGAS values (1.7527 and 1.4249, respectively), outperforming all other methods by a clear margin. Deep-learning-based methods generally show stronger performance than traditional approaches like LTTR and CNMF, which face challenges from limited feature modeling. For instance, ResTFNet leverages skip connections to stabilize training on large datasets, achieving a PSNR of 37.6264 and an RMSE of 0.4786, while DHIF-Net and ADASR also demonstrate robust spectral reconstruction (SAM ≈ 2.48). CasFormer, a cascaded Transformer model designed for spatial–spectral fusion, achieves moderate improvement over conventional CNN-based methods (PSNR 35.2622, RMSE 0.6284). Its cascade-attention blocks help align spatial details from RGB images, but the method is still limited by the dual-imaging reconstruction process in capturing fine spectral information, which explains why its SAM (2.4944) and ERGAS (4.8574) values are higher than those of the QOM. Overall, the results indicate that although advanced network architectures (ResTFNet, CasFormer) improve spatial fidelity, the combined spatial–spectral loss in the QOM provides more balanced and accurate fusion, resulting in consistent improvement across all metrics.
Figure 13.
Comparison of experimental results between QOM and other methods on the Botswana dataset: (a) LTTR, (b) CNMF, (c) MSDCNN, (d) ResTFNet, (e) ConSSFCNN, (f) DHIF-Net, (g) UDALN, (h) RECF, (i) ADASR, (j) CasFormer, (k) QOM, and (l) GT. The first row displays the pseudo-RGB image generated by each method, where the RGB channels are selected from the 47th, 14th, and 3rd band data, respectively. The second row displays the pseudo-RGB image of the difference between the results of each method and the reference image under the same channel settings.
Table VI.
Experimental results of QOM and other state-of-the-art methods on the Botswana dataset. The optimal values corresponding to each evaluated metric are highlighted in bold.
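| Method (Botswana) | PSNR | RMSE | SAM | ERGAS |
| --- | --- | --- | --- | --- |
| LTTR [42] | 28.9000 | 9.1525 | 4.8682 | 14.2396 |
| CNMF [15] | 19.7166 | 26.3457 | 2.4866 | 9.4849 |
| MSDCNN [33] | 35.7677 | 0.5928 | 2.7644 | 3.3538 |
| ResTFNet [31] | 37.6264 | 0.4786 | 2.2786 | 2.7114 |
| ConSSFCNN [32] | 30.8918 | 1.0393 | 4.6218 | 15.4482 |
| DHIF-Net [34] | 36.9839 | 0.5153 | 2.4841 | 5.3765 |
| UDALN [35] | 36.8645 | 0.5207 | 2.7743 | 6.1752 |
| RECF [36] | 36.9427 | 0.5178 | 2.4820 | 5.3825 |
| ADASR [37] | 37.2645 | 0.5016 | 2.4753 | 5.3754 |
| CasFormer [38] | 35.2622 | 0.6284 | 2.4944 | 4.8574 |
| Our QOM | 39.5868 | 0.3892 | 1.7527 | 1.4249 |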
(4) Results of Indian Pines: Figure 14 presents the experimental comparison between QOM and other methods on the IndianP dataset, with Table VII summarizing the evaluation metrics. The QOM achieves the best performance on PSNR (35.3801), RMSE (4.3405), and ERGAS (3.1664), demonstrating its strong overall fusion capability. Although QOM’s SAM (3.2541) is slightly higher than that of CNMF (2.8280), it remains competitive, reflecting that for low spatial resolution datasets like IndianP, methods that explicitly model spectral relationships—such as CNMF—can have a marginal advantage in pure spectral similarity assessment. CasFormer, a cascaded-Transformer-based approach designed for spatial–spectral fusion, performs well with PSNR 34.7364, RMSE 4.6743, and SAM 3.4601. Its cascade-attention blocks and dual-imaging mechanism enhance spatial coherence and spectral recovery, which is consistent with its superior PSNR and RMSE. However, its SAM is slightly worse than QOM and CNMF, likely due to the relatively low spatial resolution of the IndianP dataset limiting the effectiveness of its spatially driven spectral refinement. Overall, the QOM demonstrates balanced improvements across all metrics, outperforming other methods in key areas, which highlights the effectiveness of its attention-based fusion mechanism for hyperspectral and multispectral image enhancement. The results indicate that while Transformer-based methods like CasFormer excel in leveraging spatial–spectral correlations, QOM’s design achieves stronger overall accuracy in both spatial and spectral domains for datasets of this resolution.
Figure 14.
Comparison of experimental results between QOM and other methods on the IndianP dataset: (a) LTTR, (b) CNMF, (c) MSDCNN, (d) ResTFNet, (e) ConSSFCNN, (f) DHIF-Net, (g) UDALN, (h) RECF, (i) ADASR, (j) CasFormer, (k) QOM, and (l) GT. The first row shows the pseudo-RGB images generated by each method, where the RGB channels correspond to 28, 14, and 3 bands of data, respectively. The second row shows the pseudo-RGB difference image obtained by comparing the fused image generated by each method with the reference image under the same channel settings.
Table VII.
Experimental results of QOM and other state-of-the-art methods on the IndianP dataset. The optimal values corresponding to each evaluation metric are highlighted in bold.
| Method (Indian Pines) | PSNR | RMSE | SAM | ERGAS |
| --- | --- | --- | --- | --- |
| LTTR [42] | 15.2404 | 44.1083 | 19.0087 | 226.8429 |
| CNMF [15] | 27.944 | 10.2174 | 2.8280 | 12.0293 |
| MSDCNN [33] | 33.0271 | 5.691 | 4.0022 | 4.6999 |
| ResTFNet [31] | 33.5312 | 5.6831 | 3.9657 | 4.8726 |
| ConSSFCNN [32] | 28.6414 | 9.4291 | 7.4877 | 17.4377 |
| DHIF-Net [34] | 26.0383 | 12.7241 | 11.0054 | 16.5274 |
| UDALN [35] | 33.2583 | 6.2315 | 5.7416 | 4.5359 |
| RECF [36] | 31.9475 | 6.4442 | 4.2364 | 12.2007 |
| ADASR [37] | 33.9384 | 5.1834 | 3.7671 | 4.1725 |
| CasFormer [38] | 34.7364 | 4.6743 | 3.4601 | 12.0412 |
| Our QOM | 35.3801 | 4.3405 | 3.2541 | 3.1664 |
(5) Results of Washington DC: Figure 15 and Table VIII present the experimental results of QOM and other state-of-the-art methods on the Washington DC dataset. The QOM achieves the best performance across all metrics (PSNR = 49.1421, RMSE = 0.6544, SAM = 0.2219, ERGAS = 0.1201), slightly surpassing CasFormer (PSNR = 48.5839, RMSE = 0.6984, SAM = 0.2837, ERGAS = 0.1205). The high spatial and spectral resolution of the Washington DC dataset provides abundant information for feature extraction. CasFormer’s cascaded Transformer architecture effectively exploits this by aligning high-resolution RGB images spatially and recovering spectral sequences, which explains its strong PSNR and low RMSE. However, its SAM remains higher (0.2837 versus 0.2219) and ERGAS marginally above QOM, indicating less precise spectral fidelity. The QOM further improves performance by more effectively integrating spatial features and preserving spectral details, particularly benefiting spectral accuracy (SAM reduced by 0.0618) while slightly boosting PSNR. This demonstrates that while CasFormer excels at spatial–spectral fusion via its cascade-attention mechanism, QOM’s design better balances spatial reconstruction and spectral consistency, leading to superior overall fusion quality on high-resolution hyperspectral datasets.
Figure 15.
Comparison of experimental results between QOM and other methods on the Washington DC dataset: (a) LTTR, (b) CNMF, (c) MSDCNN, (d) ResTFNet, (e) ConSSFCNN, (f) DHIF-Net, (g) UDALN, (h) RECF, (i) ADASR, (j) CasFormer, (k) QOM, and (l) GT. The first row shows the pseudo-RGB images generated by each method, where the RGB channels correspond to 54, 34, and 10 bands of data, respectively. The second row shows the pseudo-RGB difference image obtained by comparing the fused image generated by each method with the reference image under the same channel settings.
Table VIII.
Experimental results of QOM and other state-of-the-art methods on the Washington DC dataset. The best values corresponding to each evaluated metric are highlighted in bold.
| Method (Washington DC) | PSNR | RMSE | SAM | ERGAS |
| --- | --- | --- | --- | --- |
| LTTR [42] | 12.8979 | 57.7621 | 32.6896 | 1356.291 |
| CNMF [15] | 22.7396 | 18.6021 | 3.3933 | 301.23 |
| MSDCNN [33] | 35.3805 | 3.1938 | 1.0483 | 0.5582 |
| ResTFNet [31] | 41.0333 | 1.666 | 0.5463 | 0.2902 |
| ConSSFCNN [32] | 20.8088 | 17.0958 | 7.1725 | 3.1691 |
| DHIF-Net [34] | 22.7989 | 13.5952 | 5.6995 | 2.5856 |
| UDALN [35] | 38.8725 | 2.0633 | 0.9413 | 0.4579 |
| RECF [36] | 40.9977 | 1.6728 | 0.5473 | 0.2914 |
| ADASR [37] | 42.3516 | 1.4361 | 0.5162 | 0.2312 |
| CasFormer [38] | 48.5839 | 0.6984 | 0.2837 | 0.1205 |
| Our QOM | 49.1421 | 0.6544 | 0.2219 | 0.1201 |
(6) Results of Berlin: Table IX presents the quantitative results on the Berlin dataset, which features complex urban scenes with strong spatial heterogeneity and diverse spectral signatures. Figure 16 presents the experimental comparison between the QOM and other methods. As shown in Table IX, the QOM achieves the best overall performance, obtaining the highest PSNR (41.3218) and the lowest RMSE (1.1564) and ERGAS (0.5940), demonstrating its strong spatial reconstruction accuracy. Although RECF attains a lower SAM (0.7618), indicating better spectral angle preservation, its PSNR and RMSE are inferior to QOM, suggesting that spectral consistency alone is insufficient to fully recover fine spatial structures in highly structured urban scenes. Traditional methods such as LTTR and CNMF perform poorly due to their limited capacity to model nonlinear spatial–spectral correlations. Among deep learning methods, CNN-based approaches (e.g., MSDCNN and ResTFNet) show moderate improvements but struggle to resolve spatial discontinuities while Transformer-based or hybrid methods (DHIF-Net, UDALN, and CasFormer) enhance spectral fidelity but remain limited in spatial reconstruction. Benefiting from its spatial and channel attention mechanisms combined with quadratic optimization, the QOM effectively balances spatial detail preservation and spectral consistency, leading to superior fusion performance on the Berlin dataset.
Figure 16.
Comparison of experimental results between QOM and other methods on the Berlin dataset: (a) LTTR, (b) CNMF, (c) MSDCNN, (d) ResTFNet, (e) ConSSFCNN, (f) DHIF-Net, (g) UDALN, (h) RECF, (i) ADASR, (j) CasFormer, (k) QOM, and (l) GT. The first row shows the pseudo-RGB images generated by each method, where the RGB channels correspond to 29, 17, and 5 bands of data, respectively. The second row shows the pseudo-RGB difference image obtained by comparing the fused image generated by each method with the reference image under the same channel settings.
Table IX.
Experimental results of QOM and other state-of-the-art methods on the Berlin dataset. The best values corresponding to each evaluated metric are highlighted in bold.
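| Method (Berlin) | PSNR | RMSE | SAM | ERGAS |
| --- | --- | --- | --- | --- |
| LTTR [42] | 16.4852 | 20.1815 | 17.9607 | 9.6548 |
| CNMF [15] | 21.4391 | 11.4093 | 11.5444 | 4.3092 |
| MSDCNN [33] | 25.3342 | 7.2862 | 4.5307 | 3.5947 |
| ResTFNet [31] | 25.3171 | 7.3006 | 6.9273 | 4.0238 |
| ConSSFCNN [32] | 34.0470 | 2.6722 | 2.5415 | 1.8093 |
| DHIF-Net [34] | 36.1304 | 2.1023 | 1.5392 | 1.0482 |
| UDALN [35] | 36.3384 | 2.0525 | 1.3641 | 0.9697 |
| RECF [36] | 41.3218 | 1.1564 | 0.9143 | 0.5940 |
| ADASR [37] | 40.3079 | 1.2996 | 0.9252 | 0.6641 |
| CasFormer [38] | 41.3069 | 1.1584 | 0.8239 | 0.6235 |
| Our QOM | 42.3475 | 1.0276 | 0.7618 | 0.5469 |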
(7) Results of Augsburg: Table X presents the quantitative results on the Augsburg dataset, which contains complex urban–rural mixed scenes with fine spatial structures and diverse land-cover materials. Figure 17 shows the experimental comparison between the QOM and other methods. As indicated in Table X, CasFormer achieves the best performance across all metrics, with the highest PSNR (42.0116), lowest RMSE (0.8437), lowest SAM (1.1264), and lowest ERGAS (0.8569). The QOM demonstrates competitive performance, achieving PSNR (40.6246), RMSE (0.9897), SAM (1.2827), and ERGAS (0.9842) that are close to state-of-the-art values while offering a well-balanced reconstruction of spatial details and spectral consistency. Traditional methods such as LTTR and CNMF perform poorly due to their limited capability in modeling complex spatial–spectral correlations. The CNN-based approaches (MSDCNN, ResTFNet) moderately improve reconstruction but are limited by scene variability. Hybrid methods such as DHIF-Net, UDALN, and RECF enhance spectral fidelity as reflected by reduced SAM, but their spatial detail recovery remains less balanced. By leveraging spatial and channel attention combined with quadratic optimization, the QOM achieves a robust trade-off between spatial and spectral reconstruction, providing stable fusion performance in diverse urban–rural scenarios.
Figure 17.
Comparison of experimental results between QOM and other methods on the Augsburg dataset: (a) LTTR, (b) CNMF, (c) MSDCNN, (d) ResTFNet, (e) ConSSFCNN, (f) DHIF-Net, (g) UDALN, (h) RECF, (i) ADASR, (j) CasFormer, (k) QOM, and (l) GT. The first row shows the pseudo-RGB images generated by each method, where the RGB channels correspond to 21, 13, and 4 bands of data, respectively. The second row shows the pseudo-RGB difference image obtained by comparing the fused image generated by each method with the reference image under the same channel settings.
Table X.
Experimental results of QOM and other state-of-the-art methods on the Augsburg dataset. The best values corresponding to each evaluated metric are highlighted in bold.
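| Method (Augsburg) | PSNR | RMSE | SAM | ERGAS |
| --- | --- | --- | --- | --- |
| LTTR [42] | 18.0956 | 13.2426 | 18.4602 | 10.9219 |
| CNMF [15] | 27.8369 | 4.3143 | 4.5980 | 3.5916 |
| MSDCNN [33] | 27.4863 | 4.4920 | 7.3521 | 3.0637 |
| ResTFNet [31] | 25.8113 | 5.4474 | 8.6640 | 5.1811 |
| ConSSFCNN [32] | 31.8704 | 2.7117 | 4.2292 | 1.9887 |
| DHIF-Net [34] | 36.3092 | 1.6266 | 2.0815 | 1.6312 |
| UDALN [35] | 38.6915 | 1.2365 | 1.5199 | 1.2244 |
| RECF [36] | 38.7949 | 1.2218 | 1.5752 | 1.2367 |
| ADASR [37] | 38.8036 | 1.2206 | 1.6087 | 1.1877 |
| CasFormer [38] | 42.0116 | 0.8437 | 1.1264 | 0.8569 |
| Our QOM | 40.6246 | 0.9897 | 1.2827 | 0.9842 |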
5.
Model Stability Validation
Due to the complexity of network structure and the impact of parameter initialization on deep learning methods, many deep learning models may produce different results or even large performance fluctuations when multiple experiments are conducted on the same dataset. Based on this fact, we conducted ten independent experiments on the Pavia dataset using the same experimental setup for the proposed QOM model and calculated the four evaluation metrics and their standard deviations for the fusion results obtained from each experiment. The specific values are listed in Table XI, from which it can be seen that the standard deviations of the metrics are small, indicating that the results of the QOM are always stable and have superior performance over the ten experiments.
Table XI.
QOM changes and corresponding standard deviations of the four evaluation metrics over ten experiments on the Pavia dataset.
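| Metric | Exp. 1 | Exp. 2 | Exp. 3 | Exp. 4 | Exp. 5 | Exp. 6 | Exp. 7 | Exp. 8 | Exp. 9 | Exp. 10 | Std. dev. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PSNR | 38.2374 | 38.7698 | 38.6698 | 38.4604 | 38.2508 | 38.3687 | 38.4681 | 38.5742 | 38.6820 | 38.5714 | 0.1813 |
| RMSE | 3.1237 | 3.2265 | 3.1346 | 3.2160 | 3.1995 | 3.1523 | 3.1130 | 3.1716 | 3.1300 | 3.1727 | 0.0400 |
| SAM | 3.5572 | 3.5909 | 3.5041 | 3.5789 | 3.5244 | 3.5524 | 3.5837 | 3.5019 | 3.5130 | 3.5317 | 0.0335 |
| ERGAS | 3.5851 | 3.5527 | 3.5237 | 3.5137 | 3.4859 | 3.5701 | 3.5943 | 3.5338 | 3.5058 | 3.5461 | 0.0352 |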
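The standard deviations in Table XI can be recomputed from the ten per-run values. The short sketch below does this with NumPy. The values are copied from the table, and the use of the sample standard deviation (n − 1 in the denominator, `ddof=1`) is an assumption of the sketch, since the text does not state which estimator was used. The tabulated figures appear consistent with that choice.

```python
import numpy as np

# Per-run values copied from Table XI (ten runs on the Pavia dataset).
runs = {
    "PSNR":  [38.2374, 38.7698, 38.6698, 38.4604, 38.2508,
              38.3687, 38.4681, 38.5742, 38.6820, 38.5714],
    "RMSE":  [3.1237, 3.2265, 3.1346, 3.2160, 3.1995,
              3.1523, 3.1130, 3.1716, 3.1300, 3.1727],
    "SAM":   [3.5572, 3.5909, 3.5041, 3.5789, 3.5244,
              3.5524, 3.5837, 3.5019, 3.5130, 3.5317],
    "ERGAS": [3.5851, 3.5527, 3.5237, 3.5137, 3.4859,
              3.5701, 3.5943, 3.5338, 3.5058, 3.5461],
}
# Standard deviations as printed in the last column of Table XI.
reported_std = {"PSNR": 0.1813, "RMSE": 0.0400, "SAM": 0.0335, "ERGAS": 0.0352}

def sample_std(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    return float(np.std(values, ddof=1))

for name, values in runs.items():
    print(f"{name}: mean={np.mean(values):.4f}  "
          f"std={sample_std(values):.4f}  reported={reported_std[name]:.4f}")
```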
In addition, to analyze the volatility of the experimental results more intuitively, we plotted box plots of the four evaluation metrics (see Figure 18). The figure shows that the box of each metric is narrow, the median lies close to the center of the box, there are no outliers, and the overall distribution is uniform. This further validates the stability of the QOM. This characteristic is closely related to the QOM’s network design that effectively combines the physical characteristics of hyperspectral data. Since both the network structure and loss function of the QOM are highly interpretable, its experimental results show a high degree of stability and reliability.
Figure 18.
Boxplot of QOM’s change in four evaluation metrics over ten experiments on the Pavia dataset.
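The statement that the box plots contain no outliers can be checked numerically. The sketch below applies the common 1.5 × IQR (Tukey) fence to the ten PSNR values of Table XI. Both the fence factor and the linear-interpolation quartiles of `np.percentile` are assumptions, because the text does not say which box plot convention was used, and `tukey_outliers` is a hypothetical helper name.

```python
import numpy as np

def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return v[(v < lower) | (v > upper)]

# PSNR of the ten runs, copied from Table XI.
psnr_runs = [38.2374, 38.7698, 38.6698, 38.4604, 38.2508,
             38.3687, 38.4681, 38.5742, 38.6820, 38.5714]
print(tukey_outliers(psnr_runs))  # an empty array means no outliers under this rule
```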
6.
Model Complexity Analysis
In addition to experimental performance, computational efficiency is a crucial factor for the real-world applicability of hyperspectral image fusion models. Table XII compares the QOM with several representative deep learning methods on the Pavia dataset in terms of floating-point operations (TFLOPS) and the number of parameters (Params).
Table XII.
TFLOPS and Params comparison of QOM with other deep learning methods.
Method | TFLOPS | Params (M)
MSDCNN [33] | 20.3587 | 1.2426
ResTFNet [31] | 8.4587 | 2.3241
ConSSFCNN [32] | 12.6819 | 0.9686
DHIF-Net [34] | 3.6200 | 2.6700
UDALN [35] | 7.3942 | 2.5681
RECF [36] | 5.6493 | 3.5246
ADASR [37] | 8.3746 | 3.3275
CasFormer [38] | 10.7000 | 4.1746
Our QOM | 8.6024 | 2.9307
As Table XII shows, traditional CNN-based methods such as MSDCNN have relatively few parameters (1.2426M) but the highest TFLOPS in the table (20.3587), indicating that shallow networks with large convolutional kernels or dense feature maps incur substantial computational overhead during inference. ResTFNet, a residual two-stream fusion network [31], reduces TFLOPS to 8.4587 by leveraging skip connections and attention layers for more efficient feature reuse, although its parameter count is higher (2.3241M). ConSSFCNN has the fewest parameters (0.9686M) owing to its compact convolutional design, yet its TFLOPS (12.6819) is the second highest in the table because of repeated spatial–spectral feature extraction layers.
Hybrid methods that integrate attention mechanisms, such as DHIF-Net and UDALN, achieve improved feature modeling with relatively low TFLOPS (3.6200 and 7.3942, respectively) but require more parameters (2.6700M and 2.5681M) to encode multiscale spatial–spectral dependencies. RECF and ADASR employ deeper networks and multibranch structures, which further increase the parameter counts (3.5246M and 3.3275M) while keeping TFLOPS moderate (5.6493 and 8.3746), reflecting the trade-off between modeling capacity and efficiency. CasFormer, a cascaded-Transformer-based method, has the highest parameter count (4.1746M) and the third-highest TFLOPS (10.7000), both of which arise from its multistage attention and dual-imaging fusion modules.
The QOM has a mid-range complexity profile, with 2.9307M parameters and 8.6024 TFLOPS; both values are the sixth lowest among the nine methods in Table XII. Its design incorporates spatial and channel attention modules together with quadratic optimization, allowing the network to focus selectively on informative features and suppress irrelevant information. Compared with simpler CNNs, the QOM attains significantly higher PSNR and lower RMSE, SAM, and ERGAS, demonstrating superior fusion quality. Compared with large-scale Transformers such as CasFormer, the QOM maintains competitive performance while using fewer computational resources, illustrating that its hierarchical attention mechanism and lightweight optimization balance model complexity and reconstruction accuracy. Overall, these results indicate that the QOM achieves a favorable trade-off between efficiency and performance, which supports the soundness of its network design for real-world hyperspectral image fusion.
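The position of each method in Table XII can be checked by ranking the two columns. The sketch below copies the table values and ranks them in ascending order (rank 1 is the cheapest); the names `TABLE_XII` and `rank_ascending` are illustrative and not part of the paper.

```python
# (TFLOPS, Params in millions) copied from Table XII.
TABLE_XII = {
    "MSDCNN": (20.3587, 1.2426),
    "ResTFNet": (8.4587, 2.3241),
    "ConSSFCNN": (12.6819, 0.9686),
    "DHIF-Net": (3.6200, 2.6700),
    "UDALN": (7.3942, 2.5681),
    "RECF": (5.6493, 3.5246),
    "ADASR": (8.3746, 3.3275),
    "CasFormer": (10.7000, 4.1746),
    "QOM": (8.6024, 2.9307),
}

TFLOPS, PARAMS = 0, 1  # column indices into the tuples above


def rank_ascending(table, column):
    """Return {method: rank}, where rank 1 is the smallest value in the column."""
    ordered = sorted(table, key=lambda name: table[name][column])
    return {name: position for position, name in enumerate(ordered, start=1)}


if __name__ == "__main__":
    flops_rank = rank_ascending(TABLE_XII, TFLOPS)
    params_rank = rank_ascending(TABLE_XII, PARAMS)
    for name in TABLE_XII:
        print(f"{name:10s} TFLOPS rank {flops_rank[name]}  "
              f"Params rank {params_rank[name]}")
```

The QOM ranks sixth of nine in both columns, MSDCNN has the largest TFLOPS, and CasFormer has the largest parameter count.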
7.
Noise Sensitivity Studies
7.1
Noise-Aware Fusion Mechanism of QOM
In real-world hyperspectral imaging, LR_HS and HR_MS observations are inevitably corrupted by noise, which propagates through both spatial degradation and spectral response processes as described in Eqs. (1)–(3). Robust fusion under such conditions therefore requires effective suppression of noise while preserving spatial structures and spectral consistency.
The robustness of the QOM mainly stems from its two-stage design. In the first stage, the CNN-based initial fusion exploits spatial and spectral redundancy in the data. With the aid of spatial and channel attention, the network suppresses noise-dominated responses while retaining informative structures, yielding a stable initial estimate even under increased noise levels.
In the second stage, the quadratic optimization refinement enforces global consistency with the physical degradation model. By constraining the fused image to simultaneously agree with LR_HS and HR_MS observations, noise components that are inconsistent with the imaging process are further attenuated, leading to improved stability and fidelity.
By combining attention-guided feature learning with model-based constraints, the QOM effectively balances noise suppression and information preservation. This hybrid mechanism explains the robustness of the QOM under various noise conditions, which is validated in the following experiments.
7.2
Experimental Evaluation under Noisy Conditions
7.2.1
Comparison of Noise Robustness across Different Methods
In real-world imaging scenarios, hyperspectral and multispectral observations are inevitably affected by noise, which can significantly degrade fusion performance. To systematically evaluate the noise robustness of different fusion methods, comparative experiments are conducted on the Botswana dataset by injecting additive noise with increasing intensity into both LR_HS and HR_MS images.
Considering the resolution discrepancy between the two modalities, the SNRs of LR_HS and HR_MS are jointly controlled with a fixed 10 dB offset to ensure comparable degradation severity, following the same setting as in Section 4. A total of six noise levels are designed, covering a wide range of SNR conditions. For each noise setting, the fusion results of all competing methods are evaluated using PSNR, and the corresponding performance variations are illustrated in Figure 19.
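The noise-injection step can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the paper: the noise is zero-mean white Gaussian, images are flattened to plain lists of pixel values, and the HR_MS image receives the SNR that is 10 dB higher. The function names and the `offset_db` parameter are hypothetical.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian noise whose expected power gives the target SNR.

    SNR (dB) = 10 * log10(signal power / noise power).
    """
    power = sum(x * x for x in signal) / len(signal)
    sigma = math.sqrt(power / 10 ** (snr_db / 10))
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB of a noisy signal relative to its clean version."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10 * math.log10(signal_power / noise_power)


def corrupt_pair(lr_hs, hr_ms, lr_hs_snr_db, offset_db=10.0, seed=0):
    """Corrupt both inputs with jointly controlled SNRs.

    Assumption: HR_MS is the cleaner input (SNR higher by offset_db).
    """
    rng = random.Random(seed)
    noisy_hs = add_gaussian_noise(lr_hs, lr_hs_snr_db, rng)
    noisy_ms = add_gaussian_noise(hr_ms, lr_hs_snr_db + offset_db, rng)
    return noisy_hs, noisy_ms


if __name__ == "__main__":
    # Synthetic stand-ins for flattened image data.
    hs = [2.0 + math.sin(0.01 * i) for i in range(20000)]
    ms = [1.5 + math.cos(0.02 * i) for i in range(20000)]
    noisy_hs, noisy_ms = corrupt_pair(hs, ms, lr_hs_snr_db=20.0)
    print(round(measured_snr_db(hs, noisy_hs), 2),
          round(measured_snr_db(ms, noisy_ms), 2))
```

Scaling the noise standard deviation from the measured signal power keeps the degradation comparable across images with different dynamic ranges, which is the purpose of controlling the two SNRs jointly.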
Figure 19.
PSNR (dB) of the QOM and the competing methods in the noise experiments. The slope of each curve indicates the robustness of the corresponding method.
The PSNR of all methods decreases as the SNR drops, but the rate of decline varies, reflecting differences in noise robustness. Traditional matrix factorization and shallow methods (e.g., CNMF and LTTR) degrade significantly at high noise levels, which shows that they are sensitive to noise. Some CNN-based methods (e.g., MSDCNN and ConSSFCNN) maintain moderate performance at mid-level SNR but underperform at low SNR because of noise accumulation.
In contrast, the PSNR of the QOM declines more gradually across noise levels and remains relatively high in the low-to-mid SNR range. This robustness stems from the spatial and spectral constraints in its initial fusion stage, which effectively mitigate the impact of noise. CasFormer follows a similar trend but consistently outperforms the QOM, especially at low SNR, suggesting that Transformer-based global modeling better captures long-range dependencies and thereby enhances fusion robustness.
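The slope criterion used to read Figure 19 can be made concrete as a least-squares fit of PSNR against input SNR: the smaller the slope, the less PSNR is lost per decibel of added noise. The sketch below is illustrative only. The SNR levels and the two PSNR curves are synthetic placeholders, not values read from Figure 19, and `psnr_slope` is a hypothetical helper.

```python
def psnr_slope(snr_db, psnr_db):
    """Least-squares slope of PSNR (dB) versus input SNR (dB)."""
    n = len(snr_db)
    mean_x = sum(snr_db) / n
    mean_y = sum(psnr_db) / n
    sxx = sum((x - mean_x) ** 2 for x in snr_db)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(snr_db, psnr_db))
    return sxy / sxx


# Synthetic placeholders: six noise levels and two made-up methods.
SNR_LEVELS = [10, 15, 20, 25, 30, 35]
CURVES = {
    "method_a": [20.0 + 0.5 * s for s in SNR_LEVELS],  # flatter curve
    "method_b": [10.0 + 0.8 * s for s in SNR_LEVELS],  # steeper curve
}

if __name__ == "__main__":
    for name, curve in CURVES.items():
        print(f"{name}: {psnr_slope(SNR_LEVELS, curve):.3f} dB of PSNR per dB of SNR")
```

In this toy example, method_a loses 0.5 dB of PSNR for every decibel of SNR reduction and method_b loses 0.8 dB, so method_a would be judged the more robust of the two by the slope criterion.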
7.2.2
Robustness of QOM under Different Noise Types
To further evaluate the generalization ability of the QOM, we test its performance under different noise models commonly encountered in hyperspectral imaging, namely Gaussian noise, Poisson noise, and speckle noise. The experimental setup is the same as in the preceding experiment, and all noise types are applied to both LR_HS and HR_MS images at comparable intensity levels. The results are shown in Figure 20.
Figure 20.
The PSNR (dB) variation of QOM under different noise conditions.
The results indicate that the QOM maintains stable reconstruction quality under different noise models. Although performance degradation is observed for more complex noise types such as speckle noise, the QOM consistently preserves spectral consistency and spatial structures. This robustness can be attributed to the complementary roles of the CNN-based initial fusion and the model-based quadratic refinement, which together reduce sensitivity to specific noise distributions.
8.
Summary
To address the intrinsic challenges of hyperspectral and multispectral image fusion, this paper proposes a QOM that effectively integrates deep learning with traditional model-driven optimization. Unlike existing methods that rely solely on data-driven learning or handcrafted mathematical models, the QOM establishes a unified fusion framework in which deep neural networks and quadratic optimization are mutually reinforcing rather than independently applied.
Specifically, the QOM introduces a three-module deep neural network designed for hyperspectral dimension recovery, spatial detail reconstruction, and spectral detail enhancement. To improve training efficiency, convergence speed, and result stability, specially designed spatial loss (Loss_spat) and spectral loss (Loss_spec) functions are incorporated together with embedded spatial attention and channel attention mechanisms. These designs explicitly guide the network to preserve both spatial structures and spectral fidelity, which directly addresses the spectral–spatial trade-off commonly observed in existing fusion approaches.
Importantly, unlike purely deep-learning-based fusion methods that treat network outputs as final results, the QOM performs a secondary optimization stage by embedding matrix decomposition techniques from hyperspectral super-resolution. Through iterative updates of the endmember and abundance matrices, the proposed framework refines the initial network predictions in a physically interpretable and mathematically constrained manner. This learning–optimization coupling constitutes a key methodological difference from existing approaches and significantly enhances fusion accuracy, stability, and robustness.
The significance of the proposed method lies in its ability to bridge the gap between learning-based flexibility and model-driven reliability. By combining the strong representation capability of deep networks with the interpretability and convergence guarantees of quadratic optimization, the QOM overcomes the limitations of existing methods that often face challenges from spectral distortion, insufficient spatial detail preservation, or poor generalization under noise and large-scale data conditions.
Extensive experiments further validate these contributions. Ablation studies confirm the necessity and effectiveness of each network module and loss design. Comparisons with ten representative deep learning and traditional fusion methods demonstrate that the QOM achieves superior performance in most quantitative and qualitative evaluations. Ten repeated experiments on the Pavia dataset show low metric fluctuations (standard deviations of 0.1813 for PSNR and at most 0.0400 for RMSE, SAM, and ERGAS), highlighting the stability of the proposed framework. In addition, the computational complexity analysis shows that the QOM has a moderate cost (8.6024 TFLOPS and 2.9307M parameters), which supports its feasibility for large-scale fusion tasks, and the noise experiments confirm its strong robustness against noise.
9.
Future Work
Future work can be pursued along several complementary directions. Richer representations of spatial and spectral information remain worthy of further investigation together with more refined spatial and spectral attention mechanisms and improved loss function formulations.
More advanced matrix factorization techniques and iterative optimization strategies could be incorporated to further strengthen the integration between deep learning models and mathematical optimization frameworks. The two-stage optimization scheme in the QOM also has the potential to be generalized and applied to other image fusion tasks.
Robustness across different sensors and imaging conditions may be further improved by introducing transfer learning or domain adaptation strategies into the QOM framework.
An extension toward multitemporal hyperspectral and multispectral data processing would support dynamic monitoring tasks such as crop growth analysis and urban change detection, increasing the operational applicability of the method in real remote sensing scenarios.
References
[1] G. Vivone, "Multispectral and hyperspectral image fusion in remote sensing: a survey," Inf. Fusion 89, 405–17 (2023). doi:10.1016/j.inffus.2022.08.032
[2] A. Shukla and R. Kot, "An overview of hyperspectral remote sensing and its applications in various disciplines," IRA Int. J. Appl. Sci. 5, 85–90 (2016).
[3] E. K. Hege, D. O’Connell, W. Johnson, S. Basty, and E. L. Dereniak, "Hyperspectral imaging for astronomy and space surveillance," Proc. SPIE 5159, 380–91 (2004).
[4] C. Guilloteau, T. Oberlin, O. Berné, and N. Dobigeon, "Hyperspectral and multispectral image fusion under spectrally varying spatial blurs – application to high dimensional infrared astronomical imaging," IEEE Trans. Comput. Imaging 6, 1362–74 (2020). doi:10.1109/TCI.2020.3022825
[5] B. U. Rekha, V. V. Desai, P. S. Ajawan, and S. K. Jha, "Remote sensing technology and applications in agriculture," 2018 Int’l. Conf. on Computational Techniques, Electronics and Mechanical Systems (CTEMS), 193–7 (IEEE, Piscataway, NJ, 2018). doi:10.1109/CTEMS.2018.8769124
[6] B. Lu, P. D. Dao, J. Liu, Y. He, and J. Shang, "Recent advances of hyperspectral imaging technology and applications in agriculture," Remote Sens. 12, 2659 (2020). doi:10.3390/rs12162659
[7] A. D. Vibhute, K. V. Kale, R. K. Dhumal, and S. C. Mehrotra, "Soil type classification and mapping using hyperspectral remote sensing data," 2015 Int’l. Conf. on Man and Machine Interfacing (MAMI), 1–4 (IEEE, Piscataway, NJ, 2015). doi:10.1109/MAMI.2015.7456607
[8] Z. Wang and S. Tian, "Ground object information extraction from hyperspectral remote sensing images using deep learning algorithm," Microprocess. Microsyst. 87, 104394 (2021). doi:10.1016/j.micpro.2021.104394
[9] F. Luo, L. Zhang, B. Du, and L. Zhang, "Dimensionality reduction with enhanced hybrid-graph discriminant learning for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens. 58, 5336–53 (2020). doi:10.1109/TGRS.2020.2963848
[10] L. Mou, P. Ghamisi, and X. X. Zhu, "Deep recurrent neural networks for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens. 55, 3639–55 (2017). doi:10.1109/TGRS.2016.2636241
[11] R. Dian, S. Li, B. Sun, and A. Guo, "Recent advances and new guidelines on hyperspectral and multispectral image fusion," Inf. Fusion 69, 40–51 (2021). doi:10.1016/j.inffus.2020.11.001
[12] X. Zhang, W. Huang, Q. Wang, and X. Li, "SSR-NET: spatial–spectral reconstruction network for hyperspectral and multispectral image fusion," IEEE Trans. Geosci. Remote Sens. 59, 5953–65 (2020). doi:10.1109/TGRS.2020.3018732
[13] D. Shen, J. Liu, Z. Xiao, J. Yang, and L. Xiao, "A twice optimizing net with matrix decomposition for hyperspectral and multispectral image fusion," IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 13, 4095–110 (2020). doi:10.1109/JSTARS.2020.3009250
[14] R. Kawakami, Y. Matsushita, J. Wright, M. Ben-Ezra, Y.-W. Tai, and K. Ikeuchi, "High-resolution hyperspectral imaging via matrix factorization," CVPR 2011, 2329–36 (IEEE, Piscataway, NJ, 2011). doi:10.1109/CVPR.2011.5995457
[15] N. Yokoya, T. Yairi, and A. Iwasaki, "Coupled nonnegative matrix factorization unmixing for hyperspectral and multispectral data fusion," IEEE Trans. Geosci. Remote Sens. 50, 528–37 (2011). doi:10.1109/TGRS.2011.2161320
[16] W. Dong, F. Fu, G. Shi, X. Cao, J. Wu, G. Li, and X. Li, "Hyperspectral image super-resolution via non-negative structured sparse representation," IEEE Trans. Image Process. 25, 2337–52 (2016). doi:10.1109/TIP.2016.2542360
[17] C.-H. Lin, F. Ma, C.-Y. Chi, and C.-H. Hsieh, "A convex optimization-based coupled nonnegative matrix factorization algorithm for hyperspectral and multispectral data fusion," IEEE Trans. Geosci. Remote Sens. 56, 1652–67 (2017). doi:10.1109/TGRS.2017.2766080
[18] P. Liu, L. Xiao, and T. Li, "A variational pan-sharpening method based on spatial fractional-order geometry and spectral–spatial low-rank priors," IEEE Trans. Geosci. Remote Sens. 56, 1788–802 (2017). doi:10.1109/TGRS.2017.2768386
[19] P. Liu, L. Xiao, J. Zhang, and B. Naz, "Spatial-Hessian-feature-guided variational model for pan-sharpening," IEEE Trans. Geosci. Remote Sens. 54, 2235–53 (2015). doi:10.1109/TGRS.2015.2497966
[20] J. Yang, X. Fu, Y. Hu, Y. Huang, X. Ding, and J. Paisley, "PanNet: a deep network architecture for pan-sharpening," Proc. IEEE Int’l. Conf. on Computer Vision, 5449–57 (IEEE, Piscataway, NJ, 2017). doi:10.1109/ICCV.2017.193
[21] Y. Zhang, C. Liu, M. Sun, and Y. Ou, "Pan-sharpening using an efficient bidirectional pyramid network," IEEE Trans. Geosci. Remote Sens. 57, 5549–63 (2019). doi:10.1109/TGRS.2019.2900419
[22] B. Aiazzi, S. Baronti, and M. Selva, "Improving component substitution pansharpening through multivariate regression of MS + Pan data," IEEE Trans. Geosci. Remote Sens. 45, 3230–9 (2007). doi:10.1109/TGRS.2007.901007
[23] M. Selva, B. Aiazzi, F. Butera, L. Chiarantini, and S. Baronti, "Hyper-sharpening: a first approach on SIM-GA data," IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 8, 3008–24 (2015). doi:10.1109/JSTARS.2015.2440092
[24] R. Dian, L. Fang, and S. Li, "Hyperspectral image super-resolution via non-local sparse tensor factorization," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 5344–53 (IEEE, Piscataway, NJ, 2017). doi:10.1109/CVPR.2017.411
[25] K. Zhang, M. Wang, S. Yang, and L. Jiao, "Spatial–spectral-graph-regularized low-rank tensor decomposition for multispectral and hyperspectral image fusion," IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 11, 1030–40 (2018). doi:10.1109/JSTARS.2017.2785411
[26] F. Palsson, J. R. Sveinsson, and M. O. Ulfarsson, "Multispectral and hyperspectral image fusion using a 3-D-convolutional neural network," IEEE Geosci. Remote Sens. Lett. 14, 639–43 (2017). doi:10.1109/LGRS.2017.2668299
[27] R. Dian, S. Li, A. Guo, and L. Fang, "Deep hyperspectral image sharpening," IEEE Trans. Neural Netw. Learn. Syst. 29, 5345–55 (2018). doi:10.1109/TNNLS.2018.2798162
[28] Q. Xie, M. Zhou, Q. Zhao, D. Meng, W. Zuo, and Z. Xu, "Multispectral and hyperspectral image fusion by MS/HS fusion net," Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 1585–94 (IEEE, Piscataway, NJ, 2019). doi:10.1109/CVPR.2019.00168
[29] R. Dian, S. Li, and X. Kang, "Regularizing hyperspectral and multispectral image fusion by CNN denoiser," IEEE Trans. Neural Netw. Learn. Syst. 32, 1124–35 (2020). doi:10.1109/TNNLS.2020.2980398
[30] S. Xu, O. Amira, J. Liu, C.-X. Zhang, J. Zhang, and G. Li, "HAM-MFN: hyperspectral and multispectral image multiscale fusion network with RAP loss," IEEE Trans. Geosci. Remote Sens. 58, 4618–28 (2020). doi:10.1109/TGRS.2020.2964777
[31] X. Liu, Q. Liu, and Y. Wang, "Remote sensing image fusion based on two-stream fusion network," Inf. Fusion 55, 1–15 (2020). doi:10.1016/j.inffus.2019.07.010
[32] X.-H. Han, B. Shi, and Y. Zheng, "SSF-CNN: spatial and spectral fusion with CNN for hyperspectral image super-resolution," 2018 25th IEEE Int’l. Conf. on Image Processing (ICIP), 2506–10 (IEEE, Piscataway, NJ, 2018). doi:10.1109/ICIP.2018.8451142
[33] Q. Yuan, Y. Wei, X. Meng, H. Shen, and L. Zhang, "A multiscale and multidepth convolutional neural network for remote sensing imagery pan-sharpening," IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 11, 978–89 (2018). doi:10.1109/JSTARS.2018.2794888
[34] T. Huang, W. Dong, J. Wu, L. Li, X. Li, and G. Shi, "Deep hyperspectral image fusion network with iterative spatio-spectral regularization," IEEE Trans. Comput. Imaging 8, 201–14 (2022). doi:10.1109/TCI.2022.3152700
[35] J. Li, K. Zheng, J. Yao, L. Gao, and D. Hong, "Deep unsupervised blind hyperspectral and multispectral data fusion," IEEE Geosci. Remote Sens. Lett. 19, 1–5 (2022).
[36] Y. Wang, J. Chen, X. Mou, T. Chen, J. Chen, J. Liu, X. Feng, H. Li, G. Zhang, S. Wang, S. Li, and Y. Liu, "Fusion of hyperspectral and multispectral images with radiance extreme area compensation," Remote Sens. 16, 1248 (2024). doi:10.3390/rs16071248
[37] J. Qin, L. Fang, R. Lu, L. Lin, and Y. Shi, "ADASR: an adversarial auto-augmentation framework for hyperspectral and multispectral data fusion," IEEE Geosci. Remote Sens. Lett. 20, 5002705 (2023).
[38] C. Li, B. Zhang, D. Hong, J. Zhou, G. Vivone, S. Li, and J. Chanussot, "CasFormer: cascaded transformers for fusion-aware computational hyperspectral imaging," Inf. Fusion 108, 102408 (2024). doi:10.1016/j.inffus.2024.102408
[39] D. Hong, B. Zhang, H. Li, Y. Li, J. Yao, C. Li, M. Werner, J. Chanussot, A. Zipf, and X. X. Zhu, "Cross-city matters: a multimodal remote sensing benchmark dataset for cross-city semantic segmentation using high-resolution domain adaptation networks," Remote Sens. Environ. 299, 17 (2023). doi:10.3390/rs16010017
[40] M. Simões, J. Bioucas-Dias, L. B. Almeida, and J. Chanussot, "Hyperspectral image superresolution: an edge-preserving convex formulation," 2014 IEEE Int’l. Conf. on Image Processing (ICIP), 4166–70 (IEEE, Piscataway, NJ, 2014). doi:10.1109/ICIP.2014.7025846
[41] M. Simões, J. Bioucas-Dias, L. B. Almeida, and J. Chanussot, "A convex formulation for hyperspectral image superresolution via subspace-based regularization," IEEE Trans. Geosci. Remote Sens. 53, 3373–88 (2014).
[42] R. Dian, S. Li, and L. Fang, "Learning a low tensor-train rank representation for hyperspectral image super-resolution," IEEE Trans. Neural Netw. Learn. Syst. 30, 2672–83 (2019).