Work Presented at CIC31: Color and Imaging 2023
Volume: 67 | Article ID: 050409
Multi-Attention Guided SKFHDRNet For HDR Video Reconstruction
DOI: 10.2352/J.ImagingSci.Technol.2023.67.5.050409 | Published Online: September 2023
Abstract
We propose a three-stage learning-based approach for High Dynamic Range (HDR) video reconstruction with alternating exposures. The first stage performs alignment of neighboring frames to the reference frame by estimating the flows between them, the second stage is composed of multi-attention modules and a pyramid cascading deformable alignment module to refine the aligned features, and the final stage merges and estimates the final HDR scene using a series of dilated selective kernel fusion residual dense blocks (DSKFRDBs) to fill the over-exposed regions with details. The proposed model variants give HDR-VDP-2 values on a dynamic dataset of 79.12, 78.49, and 78.89, respectively, compared to 79.09 for Chen et al. [“HDR video reconstruction: A coarse-to-fine network and a real-world benchmark dataset,” Proc. IEEE/CVF Int’l. Conf. on Computer Vision (IEEE, Piscataway, NJ, 2021), pp. 2502–2511], 78.69 for Yan et al. [“Attention-guided network for ghost-free high dynamic range imaging,” Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (IEEE, Piscataway, NJ, 2019), pp. 1751–1760], 70.36 for Kalantari et al. [“Patch-based high dynamic range video,” ACM Trans. Graph. 32 (2013) 202–1], and 77.91 for Kalantari et al. [“Deep hdr video from sequences with alternating exposures,” Computer Graphics Forum (Wiley Online Library, 2019), Vol. 38, pp. 193–205]. We achieve better detail reproduction and alignment in over-exposed regions compared to state-of-the-art methods, and with a smaller number of parameters.

  Cite this article 

Ehsan Ullah, Marius Pedersen, Kjartan Sebastian Waaseth, Bernt-Erik Baltzersen, "Multi-Attention Guided SKFHDRNet For HDR Video Reconstruction," in Journal of Imaging Science and Technology, 2023, pp. 1–19, https://doi.org/10.2352/J.ImagingSci.Technol.2023.67.5.050409

  Copyright statement 
Copyright © Society for Imaging Science and Technology 2023
 Open access
  Article timeline 
  • received June 2023
  • accepted August 2023
  • published September 2023
1. Introduction
There is a mismatch of dynamic range information when capturing a physical scene: more visual information is available in the scene than can be captured and reproduced, because conventional camera systems are limited in simultaneously covering the wide range of luminance in a single exposure. Additionally, a large part of the digital content currently used is stored and captured using 8-bit integer values, offering 2^8 = 256 distinct levels. These device-referred formats, such as JPEG, PNG, and TIFF, are constructed according to the limitations of display devices and the capabilities of the imaging device, with minimal care for loss of the visual information that cannot be displayed [1].
High Dynamic Range (HDR) video can be created through reconstruction using single or multiple Low Dynamic Range (LDR) frames captured with conventional cameras by alternating the exposure of each frame using software solutions, or by using specialized single-shot HDR cameras. HDR reconstruction using a single exposure is further divided into three sub-problems: decontouring (Daly and Feng [2], Song et al. [3], Luzardo et al. [4], Mukherjee et al. [5]), tone expansion (Banterle et al. [6, 7], De Simone et al. [8], Masia et al. [9]), and filling of details in over-exposed regions from adjacent non-saturated pixels [10, 11]. Time-sequential multi-exposure techniques are another way to capture HDR images, by taking a sequence of images with different exposures. Although an LDR sensor may record only a small portion of the whole luminance range of a scene at any given time, it has a functional range with the potential to include the entire luminance range by adjusting the exposure of each capture. The images are then combined to generate an image with a higher dynamic range. For video, one can alternate exposures between subsequent frames, which has led to multi-exposure techniques for video. In the case of HDR reconstruction of video, the problem of frame alignment to compensate for camera and object motion arises. This is often solved by methods that rely on pixel-level alignment with optical flow [12–15]. Recently, several learning-based methods have been used for reconstructing HDR video. Refs. [13–15] address the problem by using Convolutional Neural Networks (CNNs) together with optical flow to learn the HDR video reconstruction. Wu et al. [16] aligned LDR frames by performing homography, which is a non-flow-based approach. Yan et al. [17] applied an attention mechanism for content alignment, giving importance only to features that are similar to the reference image and excluding regions with motion and severe saturation. Later, they introduced a non-local neural network [18]. Despite these approaches, it still remains a challenge to reconstruct ghost-free HDR videos from sequences with alternating exposures.
In this paper, we introduce a learning-based approach to address the issue of HDR video reconstruction with two alternating exposures. The goal is to obtain ghost-free videos with good detail preservation. Our approach has three main stages: the first stage performs alignment of neighboring frames to the current frame by estimating the flows between them, recovering a large part of the missing details from the input LDR images; the second stage is composed of multi-attention modules and a Pyramid Cascading Deformable (PCD) alignment module [19] that refine the previously aligned features by performing a more sophisticated feature alignment; the final stage performs merging by estimating the final HDR scene using a series of Dilated Selective Kernel Fusion Residual Dense Blocks (DSKFRDBs) with a global residual learning strategy [17, 20] that allows the network to fill the over-exposed regions with rich details. The entire network is trained in an end-to-end fashion to reconstruct HDR video. We employ an L1 and a combined L1MS–SSIM [21] loss function to minimize the error between the reconstructed and original HDR frames.
The major contributions of our work for HDR video reconstruction are as follows:
Introduction of multi-attention blocks (particularly using a selective kernel fusion module) with the goal of proper image alignment by extracting rich information spatially and channel-wise, and by attending to the scale of the content in the input frames.
For effective HDR video reconstruction, we employ robust DSKFRDBs in the merge network for recovering details in over- and under-exposed regions.
Our proposed model has fewer network parameters than previous learning-based techniques.
Model training is performed using L1 and a combined L1MS–SSIM loss to guide the optimization algorithm by learning more refined network weight parameters for HDR video reconstruction.
Our proposed multi-attention selective kernel fusion HDR network (SKFHDRNet) shows a fair improvement over existing techniques and enables HDR video reconstruction from LDR frames with alternating exposures.
2. Related Work
Different approaches have been proposed for hardware-based HDR video acquisition and computationally-based HDR reconstruction. Nayar and Mitsunaga [22] and Nayar et al. [23] proposed different types of per-pixel changeable optical density masks that were used to vary the spatial exposure to capture the scene at different exposures. Others [24–26] were able to successfully capture a wider range of HDR video through internal/external beam-splitters. The dynamic range capabilities of sensors were improved by [27], while some sensors compute the logarithm of the irradiance in the analog domain using the logarithmic response of the sensor [28, 29].
Many single-exposure, computationally-based inverse tone mapping operators attempt to solve the issue by applying a separate expansion to pixels classified as saturated, recovering details in over-exposed regions [6, 30–34]. Didyk et al. [35] decomposed video frames into diffuse components, reflections, and light sources using a semi-manual classifier, while Zhang and Brainard [10] and Xu et al. [11] performed pixel-level image processing. A dithering-based approach was proposed that adds noise to mask banding artifacts due to quantization [2, 5]. More recently, several methods have employed deep learning strategies for single-exposure HDR image reconstruction. Eilertsen et al. [36] used a CNN-based encoder–decoder architecture to reconstruct colors, intensities, and details in saturated regions. By merging bracketed LDR images, Endo et al. [37] indirectly recreated an HDR image from a single LDR input. Liu et al. [38] developed three deep networks for dequantization, linearization, and hallucination of missing details in over-exposed regions.
Kang et al. [12] proposed the first HDR video reconstruction algorithm for sequences with alternating exposures using optical flow. Mangiat and Gibson [39] improved the approach of Kang et al. [12] using a block-based motion estimation method coupled with a refinement stage. In follow-up work, Mangiat and Gibson [40] proposed to filter regions with large motion to reduce blocking artifacts. Kalantari et al. [41] proposed a patch-based optimization system to synthesize the missing exposures at each frame. Gryaditskaya et al. [42] improved the method of Kalantari et al. [41] by adaptively adjusting the exposures. Li et al. [43] formulated the HDR video reconstruction problem as maximum a posteriori estimation. Kalantari and Ramamoorthi [14] addressed the drawbacks of their previous approach [13] by proposing to use CNNs to learn the HDR video reconstruction process. Eilertsen et al. [44] improved the temporal stability of CNNs by introducing a regularization approach that encourages the network to produce consistent results for consecutive frames in a video. Yan et al. [17] proposed an attention-guided deep neural network with an attention mechanism for frame alignment for HDR imaging. Kim et al. [45] addressed the reconstruction of ultra high definition (UHD) HDR videos by simultaneously performing content super-resolution and inverse tone-mapping, introducing a GAN (Generative Adversarial Network) based architecture with multiple subnets for specific tasks. The super-resolution and inverse tone-mapping (SR-ITM) framework was further extended by utilizing multi-scale information to enhance the network’s local receptive fields; the approach downsamples image features at various scales, enabling the network to capture complex image patterns using varied local receptive field sizes [46]. Chen et al. [47] suggested a deep learning pipeline composed of adaptive global color mapping, local enhancement, and highlight generation. For adaptive global color mapping, they introduced a color condition block that extracts global image priors and adapts them to different images; a ResNet architecture and a GAN model were used for local enhancement and highlight generation, respectively. Similarly, a GAN-based framework for HDR video reconstruction from LDR sequences with alternating exposures was adopted by Anand et al. [48]. Yang et al. [49] introduced a multimodal learning framework for reconstructing HDR videos based on three components: one component aligns the frames, the second is a fusion component based on confidence-guided multimodal fusion, and the last component suppresses flicker. Yang et al. [50] proposed a lightweight, efficient network based on structural re-parameterization, and a motion alignment loss to reduce motion artifacts. Cogalan et al. [51] proposed a CNN method for HDR image and video reconstruction that works both for single-shot acquisition with spatially-interleaving exposures and for multi-shot acquisition with spatially-interleaving and temporally-alternating exposures. Their method uses optical flow and is stated to work well for non-linear motion as well. Liu et al. [52] focused on optical flow estimation for LDR images of different exposures, and proposed an unsupervised approach that incorporates a model-based algorithm and a data-driven deep network.
Martorell and Buades [53] proposed a variational temporal approach to optical flow estimation with data and spatial smoothness terms, as well as a temporal smoothness term, to match pixels from different frames. Jiang et al. [54] introduced tri-exposure quad-Bayer sensors, which distribute a larger number of exposure sets uniformly over each frame, providing robustness to noise and spatial artifacts. Ref. [55] produced HDR video using dual-exposure sensors, which capture differently exposed and spatially interleaved half-frames in a single shot, eliminating the need for exposure alignment; neural networks are employed for denoising, deblurring, and upsampling, and optical flow is utilized for precise warping. Recently, Chen et al. [15] proposed a two-stage coarse-to-fine framework for HDR video reconstruction. Their first stage aligns images using optical flow and blending in the image space. Their second stage performs a more sophisticated alignment and fusion for HDR video using deformable convolution [56] in a PCD module, as well as performing fusion temporally.
However, most single-exposure techniques are not built to handle video and cannot handle noise in the dark regions, while hallucinating only small saturated regions. Similarly, solving the issues of frame alignment and the temporal aspects of HDR video reconstruction through a single attention mechanism is challenging, and recent models with optical flow have a large number of parameters and struggle on examples with large motions.
3. Multi-Attention Guided SKFHDRNet for HDR Video Reconstruction
Given input LDR video frames {I_i | i = 1, …, n} with alternating exposures {t_i | i = 1, …, n}, the multi-attention SKFHDRNet reconstructs a high-quality HDR video {H_i | i = 1, …, n}. Similar to [13–15], input frames in the linear and LDR domains are stacked and passed to the network for HDR video reconstruction, as shown in Figure 1.
Figure 1.
Representation of three consecutive frames with two alternating exposures of the carousel firework scene in the HDR dataset of Ref. [26]. Each frame in the three-frame input is missing some content: frames Fi−1 and Fi+1 contain noise in the darker regions due to acquisition with low exposure, whereas Fi, which was taken with high exposure, lacks details in over-saturated and bright regions. The missing content of the final HDR image has to be reconstructed from neighboring frames with alternating exposures. For our full model we also use the neighboring frames Fi−2 and Fi+2.
3.1 Data Preprocessing
Similar to the work of [13–15], the camera response function of the input frames Ii is assumed to be known. As in Refs. [14, 15], we replace the camera response function of the input images with a fixed gamma curve:
(1)
$$F_i = \mathrm{lin}_i(I_i) = (I_i\, t_i)^{1/\gamma},$$
where γ is set to 2.2 and lin_i is the function that transfers the image I_i from the linear HDR domain into the LDR domain at exposure t_i. Similarity transforms that include rotation, translation, and isometric scaling are applied to globally align adjacent frames and simplify the learning process of our proposed model.
Real-world cameras often produce noisy images and are difficult to calibrate. It is necessary for the training dataset to represent these limitations of conventional camera systems to enable the learning-based model to perform and generalize effectively on scenes captured with conventional consumer cameras. Refs. [13–15] imitate the flaws of common consumer cameras by introducing noise and altering the tone of the images in their synthetic training datasets, ensuring the generalizability of their networks at inference time. Image acquisition with conventional digital cameras usually produces noisy pixels in dark regions; the information in those darker regions should then be taken from the high-exposure image, which has more detail there. The input LDR synthetic training dataset usually has the same amount of noise for both exposures. Using the dataset directly without modification, the content of the high-exposure image in the dark regions would be unused, which eventually produces noisy results on real scenes [14]. Similar to Kalantari and Ramamoorthi [14] and Chen et al. [15], zero-mean Gaussian noise was added to the input LDR images with low exposure, making the model use the information in the dark regions of a clean high-exposure image. The zero-mean Gaussian noise was specifically applied to the images in the linear domain, with the intention of magnifying the noise in the dark regions after transforming the image into the LDR domain. To account for noise variation, similar to [14, 15], the standard deviation of the Gaussian noise was randomly sampled between 10^−3 and 3 × 10^−3, and the tone of the reference image was perturbed with a γ = exp(d) function, where d is randomly selected from the range [−0.7, 0.7], to simulate an inaccurate camera response function. Cropped patches of size 256 × 256 were given as input to the proposed model, along with random horizontal/vertical flipping and rotation.
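As a rough illustration of this preprocessing (not the authors' released code), the following sketch assumes ground-truth HDR frames already in the linear domain with values in [0, 1]; all function and parameter names are ours.

```python
import numpy as np

def hdr_to_ldr(hdr_linear, exposure, gamma=2.2):
    """Map a linear-domain HDR frame to the LDR domain at a given exposure (Eq. (1))."""
    return np.clip((hdr_linear * exposure) ** (1.0 / gamma), 0.0, 1.0)

def simulate_camera(hdr_linear, exposure, is_low_exposure, rng, gamma=2.2):
    """Simulate an imperfect consumer camera, as done for the synthetic training data."""
    frame = hdr_linear.copy()
    if is_low_exposure:
        # Zero-mean Gaussian noise added in the linear domain, so it is amplified
        # in dark regions once the frame is mapped to the LDR domain.
        sigma = rng.uniform(1e-3, 3e-3)
        frame = frame + rng.normal(0.0, sigma, size=frame.shape)
    ldr = hdr_to_ldr(np.clip(frame, 0.0, None), exposure, gamma)
    # Perturb the tone to simulate an inaccurate camera response (gamma = exp(d)).
    d = rng.uniform(-0.7, 0.7)
    return ldr ** np.exp(d)

rng = np.random.default_rng(0)
hdr = rng.random((256, 256, 3)).astype(np.float32)    # stand-in linear HDR patch
low = simulate_camera(hdr, exposure=1.0, is_low_exposure=True, rng=rng)
high = simulate_camera(hdr, exposure=4.0, is_low_exposure=False, rng=rng)
```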
Figure 2.
Visualization of the network architecture of our proposed multi-attention SKFHDRNet for HDR video reconstruction with two alternating exposures.
3.2 Pipeline
As shown in Figure 2, the multi-attention SKFHDRNet comprises two primary sub-networks. These sub-networks are designed to align and recover missing content in the reference (center) frame using attention modules, incorporating spatial attention, channel attention, and attention through adaptive kernel selection and fusion mechanisms. The multi-attention blocks focus solely on the features relevant to the center frame. To achieve this, neighboring frame features are fused with the reference frame, and the resulting features are passed through the multi-attention blocks to extract missing content from surrounding frames in relation to the center frame. Furthermore, to enhance temporal coherence and alignment, the aligned features are passed through the PCD [19] alignment module. These refined features are then fed into the merge network, which is composed of a series of DSKFRDBs. The dilated convolutions in the DSKFRDBs help in recovering details lost due to over-exposure and object motion by enlarging the receptive field, ultimately estimating high-quality HDR video.
Motivated by the work of Ledig et al. [20] and Yan et al. [17], a global residual learning strategy was adopted by adding the shallow reference frame feature Fr to OF5 before reconstructing the final HDR frame. Our proposed method predicts blending weights (see Section 4) and produces a 15-channel output. The input images are averaged using their blending weights to obtain the final HDR image HDRi at frame i.
3.3 Image Alignment Using Optical Flow
We adopted the optical flow network of Chen et al. [15] for efficient frame alignment. In this initial stage, the neighboring frames are aligned to the reference frame Li. Flows are estimated for the neighboring frames Li−1 and Li+1 in relation to the reference frame Li. The frames Li−1 and Li+1 are then warped with the two estimated flows to obtain a set of aligned images Li−1,i and Li+1,i in relation to the reference frame Li, providing efficient treatment of non-rigid motion and of the inaccuracies introduced by global alignment.
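The flow network itself is taken from Chen et al. [15]; the backward-warping helper below is only our illustration of the warping step, implemented with PyTorch's grid_sample.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (B, C, H, W) towards the reference using `flow` (B, 2, H, W)."""
    b, _, h, w = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)       # (2, H, W)
    coords = grid.unsqueeze(0) + flow                                  # displaced coordinates
    # Normalize to [-1, 1] as required by grid_sample (x first, then y).
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)              # (B, H, W, 2)
    return F.grid_sample(frame, grid_norm, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Example: warp a neighbouring frame towards the reference frame.
neighbour = torch.rand(1, 3, 128, 128)
flow_to_ref = torch.zeros(1, 2, 128, 128)    # flow estimated by the flow network
aligned = warp(neighbour, flow_to_ref)
```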
3.4 Multi-Attention Guided Feature Alignment
The attention-guided blocks are given five 6-channel input frames in the linear and LDR domains, Fi, where i = 1, 2, 3, 4, 5. First, the neighbouring input frames Fi−2, Fi−1 and Fi+1, Fi+2 are concatenated and fused (see Fig. 2) before being passed to the attention blocks.
3.4.1 Channel Attention
We make use of the channel attention proposed by Woo et al. [57] to exploit dependencies among features across channels. The architecture of the channel attention network is represented in Figure 3.
Figure 3.
The channel attention sub-module uses a combination of max and average pooling, alongside a shared MLP network.
In the channel attention blocks, spatial information is collected from the feature maps through both average and max-pooling operations, resulting in two sets of features Favg and Fmax (refer to Eq. (2)). These two sets of features are then fed into a shared Multi-Layer Perceptron (MLP) network with one hidden layer, providing attention-guided weights for each channel, represented as W ∈ R^{C×1×1}. The size of the MLP’s hidden layer is set to R^{C∕r×1×1}, where r (the reduction ratio) is used to reduce and control the number of parameters in the hidden layer. Finally, the output feature vectors from the shared MLP corresponding to the Favg and Fmax features are combined using element-wise summation.
(2)
$$A_i = \sigma\big(\mathrm{MLP}(F_{\mathrm{avg}}(F_{ir})) + \mathrm{MLP}(F_{\mathrm{max}}(F_{ir}))\big) = \sigma\big(W_1(W_0(F_{\mathrm{avg}}(F_i, F_r))) + W_1(W_0(F_{\mathrm{max}}(F_i, F_r)))\big),$$
where σ denotes the sigmoid function, W0 ∈ R^{C∕r×C} and W1 ∈ R^{C×C∕r} represent the MLP layer weights, and Fir is the feature obtained by concatenating and fusing Fi and Fr. The estimated attention maps are point-wise multiplied to attend the features of the non-reference frames via Eq. (3):
(3)
$$F_i'' = A_i \odot F_i, \quad (i = 1, 3),$$
where ⊙ denotes the point-wise multiplication between Ai and Fi, (i = 1, 3). The attention-guided features F″i−1 and F″i+1 are concatenated and fused with the reference frame feature Fr to get the final stack of channel attention-guided features Fca using Eq. (4):
(4)
$$F_{ca} = \mathrm{Concat}(F_{i-1}'', F_r, F_{i+1}''),$$
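A minimal PyTorch sketch of this CBAM-style channel attention (Eqs. (2)–(4)); the module and layer sizes are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention computed from the fused feature F_ir = Concat(F_i, F_r) (Eq. (2))."""
    def __init__(self, in_channels: int = 128, out_channels: int = 64, reduction: int = 16):
        super().__init__()
        hidden = max(in_channels // reduction, 4)
        # Shared MLP (implemented with 1x1 convolutions) applied to both pooled vectors.
        self.mlp = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, out_channels, 1, bias=False),
        )

    def forward(self, f_ir):
        avg = torch.mean(f_ir, dim=(2, 3), keepdim=True)     # F_avg
        mx = torch.amax(f_ir, dim=(2, 3), keepdim=True)      # F_max
        return torch.sigmoid(self.mlp(avg) + self.mlp(mx))   # A_i

f_i, f_r = torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32)
a_i = ChannelAttention()(torch.cat((f_i, f_r), dim=1))       # (1, 64, 1, 1)
f_i_attended = a_i * f_i                                     # Eq. (3)
```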
3.4.2 Soft Attention Using Selective Kernel Fusion
We utilize the work proposed by [58] as an adaptive soft attention technique. This method involves employing multiple kernels with varying receptive field sizes to effectively capture information from objects of different scales within the input. The selective kernel fusion block consists of three main operations: splitting, fusing, and selecting, as depicted in Figure 4.
Figure 4.
Represents the selective kernel fusion attention block, involving three main operations: split, fuse, and select.
3.4.3 Split
Through the split operation, the incoming features Fi, Fr of size H × W × C are transformed into features U3 and U5 with receptive field sizes of 3 × 3 and 5 × 5, by applying efficient depthwise convolutions [59] followed by the ReLU activation function, where the larger receptive field is obtained with a convolution of dilation size 2.
3.4.4 Fuse
The fuse module adaptively controls the flow of information at the different scales of the two branches, which have different receptive fields, into the activation functions of the following layer.
The data from the two branches is combined via element-wise summation. Following this, global average pooling is applied to incorporate global information and produce channel-wise statistics represented as S ∈ R^C (see Eq. (5)).
(5)
$$S = F_{gp}(U) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U(i, j),$$
The feature vector obtained from global average pooling is then fed into a fully connected layer to enable accurate and adaptive feature selection, resulting in Z ∈ R^{d×1}. Additionally, a dimensionality reduction parameter is incorporated in Eq. (6) to improve the efficiency of the attention block.
(6)
$$Z = F_{fc}(S) = \delta(WS),$$
where δ is the ReLU function and W ∈ R^{d×C} represents the fully connected (fc) layer parameters.
(7)
$$d = \max\!\left(\frac{C}{r},\, L\right),$$
where C represents the number of channels and d is the reduced dimension, controlled by the reduction ratio r that modifies the parameter size of the fully connected layer; L = 32 represents the minimal value of d.
3.4.5 Select
The last step involves the adaptive selection of informative content from the guided feature descriptor Z by applying a channel-wise softmax operator, as described in Eq. (8), focusing on valuable information at different scales.
(8)
$$a = \mathrm{softmax}(Z), \quad b = \mathrm{softmax}(Z),$$
The softmax-based attention-guided feature maps are multiplied with the U3 and U5 features, which were retrieved previously through the split process, and then summed to obtain the final attention-guided feature map using Eq. (9).
(9)
$$A_i = a \cdot U_3 + b \cdot U_5,$$
where Ai represents soft attention-guided features that are then pointwise multiplied with non-reference features Fi using Eq. (10).
(10)
$$F_i'' = A_i \odot F_i, \quad (i = 1, 3),$$
The attention-guided features F″i−1 and F″i+1 are concatenated and fused with the reference frame feature Fr to get the selective kernel fusion based soft attention-guided features Fsk using Eq. (11):
(11)
$$F_{sk} = \mathrm{Concat}(F_{i-1}'', F_r, F_{i+1}''),$$
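A minimal PyTorch sketch of the split–fuse–select operations (Eqs. (5)–(9)); the exact layer widths and branch configurations are our assumptions.

```python
import torch
import torch.nn as nn

class SelectiveKernelFusion(nn.Module):
    """Two-branch selective kernel attention: split, fuse, and select."""
    def __init__(self, channels: int = 64, reduction: int = 16, min_dim: int = 32):
        super().__init__()
        # Split: a 3x3 branch and an effective 5x5 branch (3x3 depthwise conv with dilation 2).
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2, groups=channels, bias=False),
            nn.ReLU(inplace=True))
        d = max(channels // reduction, min_dim)                 # Eq. (7)
        self.fc = nn.Sequential(nn.Linear(channels, d), nn.ReLU(inplace=True))  # Eq. (6)
        self.fc_a = nn.Linear(d, channels)                      # branch-wise guidance vectors
        self.fc_b = nn.Linear(d, channels)

    def forward(self, x):
        u3, u5 = self.branch3(x), self.branch5(x)               # split
        s = torch.mean(u3 + u5, dim=(2, 3))                     # fuse: global average pooling (Eq. (5))
        z = self.fc(s)
        # Select: softmax across the two branches for each channel (Eq. (8)).
        weights = torch.softmax(torch.stack((self.fc_a(z), self.fc_b(z)), dim=1), dim=1)
        a = weights[:, 0].unsqueeze(-1).unsqueeze(-1)
        b = weights[:, 1].unsqueeze(-1).unsqueeze(-1)
        return a * u3 + b * u5                                  # Eq. (9)

a_i = SelectiveKernelFusion()(torch.rand(1, 64, 32, 32))
```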
3.4.6 Spatial Attention
We also utilize the findings of [17] to acquire spatial attention maps for the non-reference frames as depicted in Fig. 2. Fused features Fi, i = 1,3 of the non-reference images are introduced to the convolutional attention module ai(⋅), i = 1,3 along with the reference frame feature map Fr, obtaining attention maps Ai, i = 1,3 for the non-reference frames using Eq. (12).
(12)
$$A_i = a_i(F_i, F_r), \quad (i = 1, 3).$$
The predicted attention maps are used to attend to the features of the non-reference images via Eq. (13):
(13)
$$F_i'' = A_i \odot F_i, \quad (i = 1, 3),$$
where ⊙ denotes the point-wise multiplication between Ai and Fi, (i = 1, 3), and F″i denotes the feature maps with attention guidance. The reference feature map Fr and the attention-guided features of the non-reference images F″i−1 and F″i+1 are stacked and fused to get the final 64-channel attention-guided feature map Fs.
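A minimal sketch of the AHDRNet-style spatial attention module a_i(·) of Eq. (12); the specific layer choices below (two convolutions with a sigmoid output) are our assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Predicts a per-pixel attention map for a non-reference feature given the reference feature."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, f_i, f_r):
        a_i = self.net(torch.cat((f_i, f_r), dim=1))   # Eq. (12)
        return a_i * f_i                               # Eq. (13)

f_i, f_r = torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32)
f_i_attended = SpatialAttention()(f_i, f_r)
```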
3.5 Refined Deformable Feature Alignment
Recently, for the task of video super-resolution, researchers [56] introduced deformable convolution, which has been effectively employed by [19] and [60]. The fundamental idea behind deformable alignment is to predict an offset with an offset prediction module, defined by Eq. (14). This module employs general convolutional layers and takes two inputs: our fused features Fs, Fca, and Fsk, and the reference frame feature map Fi.
(14)
$$\Delta p_{i1} = \mathrm{func}\big([\mathrm{fused}(F_s, F_{ca}, F_{sk}),\, F_i]\big),$$
After acquiring the learned offset, the fused multi-attention-guided features Fs, Fca, and Fsk can be sampled and aligned to the reference frame Fi with the deformable convolution introduced by [56], via Eq. (15):
(15)
$$\tilde{F}_i = \mathrm{DFConv}\big(\mathrm{fused}(F_s, F_{ca}, F_{sk}),\, \Delta p_{i1}\big).$$
Figure 5.
Represents the architecture of the PCD [19] alignment module.
The overall structure of the PCD alignment module is represented in Figure 5, where the alignment is performed at multiple scales between the fused refined features and the reference frame. The final HDR video reconstruction benefits from the implicit learning of deformable convolution offsets in this alignment process.
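A minimal single-scale sketch of the deformable alignment step (Eqs. (14)–(15)) using torchvision's DeformConv2d; the full PCD module of [19] repeats this at three pyramid levels with cascading refinement, and the layer sizes below are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableAlign(nn.Module):
    """Aligns fused multi-attention features to the reference feature with learned offsets."""
    def __init__(self, channels: int = 64, kernel_size: int = 3):
        super().__init__()
        # Offset prediction from the concatenated [fused, reference] features (Eq. (14)).
        self.offset_conv = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        # Deformable convolution sampling the fused features at the predicted offsets (Eq. (15)).
        self.deform_conv = DeformConv2d(channels, channels, kernel_size,
                                        padding=kernel_size // 2)

    def forward(self, fused, ref):
        offset = self.offset_conv(torch.cat((fused, ref), dim=1))
        return self.deform_conv(fused, offset)

fused = torch.rand(1, 64, 32, 32)   # fusion of F_s, F_ca, F_sk
ref = torch.rand(1, 64, 32, 32)     # reference frame feature F_i
aligned = DeformableAlign()(fused, ref)
```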
3.6 Merge Network for HDR Image Reconstruction
The primary goal of the merge network is to reconstruct a high-quality HDR frame using attention-guided aligned features. This network is designed to identify and eliminate any alignment artifacts that may still be present in the registered images and to restore missing content in the over and under-exposed regions, resulting in the final HDR image.
We introduce the selective kernel fusion network, which is based on a residual dense network architecture, similar to the approach presented in Ref. [61]. Our merge network comprises convolution layers and DSKFRDBs with the incorporation of skip connections, as illustrated in Figure 6.
Figure 6.
Represents the merge network, composed of a series of dilated selective kernel fusion residual dense blocks with skip connections.
The merge network takes the stacked features from the PCD alignment module and first applies a convolution layer to produce 64-channel feature maps. These feature maps are then passed through three DSKFRDBs, producing three corresponding feature maps OF1, OF2, and OF3. All three feature maps are concatenated to get OF4, and convolution operations are applied to extract the most relevant information from the merged feature maps, yielding OF5.
3.6.1 Global Residual Learning With the Reference Features
Motivated by the work of [17, 20], a global residual learning strategy was adopted by adding the shallow reference frame feature Fr to OF5, integrating the original reference information before reconstructing the final HDR image from OF5 to improve the accuracy of the model.
(16)
$$OF_6 = OF_5 + F_r,$$
The final feature map OF6 contains nearly all the ingredients for reconstructing the final HDR image without ghosting artifacts and with details recovered in over- and under-exposed regions with large motion. The final HDR image is estimated in the HDR domain after two convolution layers followed by an activation function.
3.6.2 Dilated Selective Kernel Fusion Residual Dense Block
The merge network requires a larger receptive field for hallucinating details, since the reconstruction of some local regions of the HDR images cannot receive enough information from the LDR images due to the occlusion of moving objects and saturation. Therefore, we use a DSKFRDB having two branches with dilation. The proposed DSKFRDB, represented in Figure 7, performs the final HDR video reconstruction through adaptive feature selection over two different receptive fields using the split, fuse, and select strategy, with dense concatenation-based skip connections where the input to each layer is the concatenation of all feature maps from preceding layers.
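A minimal sketch of one such block: each layer densely concatenates all preceding feature maps, passes them through two dilated branches that are adaptively fused, and the block output is added back to its input. The layer widths, growth rate, and branch configuration are our assumptions.

```python
import torch
import torch.nn as nn

class DSKFLayer(nn.Module):
    """One dense layer: two dilated branches fused by channel-wise soft selection."""
    def __init__(self, in_channels: int, growth: int = 32, reduction: int = 4):
        super().__init__()
        self.branch_a = nn.Conv2d(in_channels, growth, 3, padding=1, dilation=1)
        self.branch_b = nn.Conv2d(in_channels, growth, 3, padding=2, dilation=2)
        d = max(growth // reduction, 8)
        self.fc = nn.Sequential(nn.Linear(growth, d), nn.ReLU(inplace=True))
        self.fc_a, self.fc_b = nn.Linear(d, growth), nn.Linear(d, growth)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        ua, ub = self.branch_a(x), self.branch_b(x)
        z = self.fc(torch.mean(ua + ub, dim=(2, 3)))
        w = torch.softmax(torch.stack((self.fc_a(z), self.fc_b(z)), dim=1), dim=1)
        out = w[:, 0, :, None, None] * ua + w[:, 1, :, None, None] * ub
        return self.act(out)

class DSKFRDB(nn.Module):
    """Dilated selective kernel fusion residual dense block (3 layers, local residual)."""
    def __init__(self, channels: int = 64, growth: int = 32, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            [DSKFLayer(channels + i * growth, growth) for i in range(num_layers)])
        self.fuse = nn.Conv2d(channels + num_layers * growth, channels, 1)  # local feature fusion

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))    # dense concatenation
        return x + self.fuse(torch.cat(feats, dim=1))       # local residual learning

of1 = DSKFRDB()(torch.rand(1, 64, 32, 32))
```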
Figure 7.
Illustration of a three-layer dilated selective kernel fusion residual dense block structure following the residual dense block strategy of [61] as a framework.
4. Pixel Blending
Our full multi-attention SKFHDRNet is provided with five 6-channel input images in both the LDR and linear domains, making a 30-channel input. For these five images, our network predicts the blending weights and produces a 15-channel output. To effectively utilize the information in each color channel, we estimate blending weights for each color channel in a manner similar to the methods proposed by [41, 62]. The five input images are averaged using their blending weights to get the final HDR image HDRi at frame i by using Eq. (17):
(17)
$$\mathrm{HDR}_i = \frac{w_1 L_{i-1} + w_2 \hat{L}_{i-1} + w_3 L_i + w_4 \hat{L}_{i+1} + w_5 L_{i+1}}{\sum_{k=1}^{5} w_k},$$
where wk is the estimated blending weight for each image.
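A minimal sketch of this per-channel blending: the 15-channel network output is split into five 3-channel weight maps, one per input image, and applied as in Eq. (17). Clamping the weights to be non-negative is our assumption for the sketch.

```python
import torch

def blend(weights, images, eps: float = 1e-8):
    """weights: (B, 15, H, W) network output; images: list of five (B, 3, H, W) linear-domain frames."""
    # Split into five per-image, per-channel weight maps (non-negativity is assumed here).
    w = torch.relu(weights).reshape(weights.shape[0], 5, 3, *weights.shape[2:])
    numerator = sum(w[:, k] * images[k] for k in range(5))
    return numerator / (w.sum(dim=1) + eps)        # Eq. (17)

# The five inputs follow Eq. (17): the reference frame plus its (raw and flow-aligned) neighbours.
frames = [torch.rand(1, 3, 64, 64) for _ in range(5)]
hdr_i = blend(torch.rand(1, 15, 64, 64), frames)
```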
5. Loss Function
Following the works of [14, 15, 36], the linear HDR images are transformed into the log domain to boost the pixel values in the dark regions of the image. Directly applying the loss function to images in the linear HDR domain would produce inaccuracies by underestimating the error in the pixel values of the dark regions. We specifically employ the differentiable μ-law function of Eq. (18):
(18)
$$T_i = \frac{\log(1 + \mu\, \mathrm{HDR}_i)}{\log(1 + \mu)},$$
where HDRi represents the linear HDR frame with pixel values in the range [0, 1]. The parameter μ is set to 5000 to control the rate of range compression. The model parameters are updated by minimizing the L1 distance between the estimated HDR frame T̂i and the ground truth Ti in the log domain with Eq. (19):
(19)
$$E = \big\| \hat{T}_i - T_i \big\|_1 .$$
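A minimal sketch of the μ-law compression (Eq. (18)) and the L1 loss on the compressed frames (Eq. (19)); averaging over pixels is our choice for the sketch.

```python
import math
import torch

MU = 5000.0

def mu_law(hdr, mu: float = MU):
    """Compress linear HDR values in [0, 1] with the differentiable mu-law (Eq. (18))."""
    return torch.log(1.0 + mu * hdr) / math.log(1.0 + mu)

def l1_tonemapped_loss(pred_hdr, gt_hdr):
    """Mean L1 distance between mu-law compressed prediction and ground truth (Eq. (19))."""
    return torch.mean(torch.abs(mu_law(pred_hdr) - mu_law(gt_hdr)))

loss = l1_tonemapped_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```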
5.1 L1MS–SSIM Loss Function
According to [21], MS–SSIM preserves the contrast in high-frequency regions better than other loss functions. On the other hand, L1 preserves colors and luminance and weights errors equally regardless of the local structure, but does not produce quite the same contrast as MS–SSIM. To capture the best characteristics of both error functions, [21] proposes a combined L1MS–SSIM loss function, represented by Eq. (20):
(20)
$$L_{\mathrm{mix}} = \alpha\, L_{\mathrm{MS\text{-}SSIM}} + (1 - \alpha)\, G_{\sigma_M^G} \odot L_{1},$$
where α is empirically set to 0.84 and ⊙ denotes the point-wise multiplication between G_{σ_M^G} and the L1 term; G_{σ_M^G} represents a Gaussian filter used to compute means and standard deviations. We adopted the work of [21] to optimize the training of our model; the network weights are updated with the gradients of this loss iteratively until convergence.
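A minimal sketch of the combined loss (Eq. (20)); we assume an `ms_ssim` helper such as the one provided by the third-party pytorch_msssim package, and we approximate the Gaussian-weighted L1 term of [21] with a plain L1 term for brevity.

```python
import torch
from pytorch_msssim import ms_ssim   # assumed third-party helper

ALPHA = 0.84

def mix_loss(pred_tm, gt_tm):
    """alpha * (1 - MS-SSIM) + (1 - alpha) * L1 on tonemapped frames (simplified Eq. (20))."""
    l_msssim = 1.0 - ms_ssim(pred_tm, gt_tm, data_range=1.0)
    l_1 = torch.mean(torch.abs(pred_tm - gt_tm))
    return ALPHA * l_msssim + (1.0 - ALPHA) * l_1
```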
6. Implementation Details
The PyTorch framework was used to implement the multi-attention SKFHDRNet model architecture. We integrated the flow network implemented by [15] in PyTorch into our pipeline for HDR video reconstruction. End-to-end training is done for both the optical flow network and the multi-attention SKFHDRNet. The technique of [63] is used to initialize the network parameters. ADAM with default settings of β1 = 0.9 and β2 = 0.999 and a learning rate of 0.0001 is used to solve the optimization problem. The approach of Mantiuk et al. [64] was used for tone-mapping the results. Given the training images, we randomly crop patches of size 256 × 256 for training. The model was trained for 20 epochs on two NVIDIA Tesla V100 32 GB GPUs of the NTNU cluster [65].
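A minimal training-setup sketch reflecting the implementation details above (He initialization [63], Adam with β1 = 0.9, β2 = 0.999, learning rate 1e-4, 20 epochs on 256 × 256 crops); the model and dataloader are placeholders, not the released code.

```python
import torch
import torch.nn as nn

def init_weights(module):
    """He (Kaiming) initialization for convolutional layers, as in [63]."""
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Conv2d(30, 15, 3, padding=1)        # placeholder for the full SKFHDRNet
model.apply(init_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

for epoch in range(20):                        # 20 epochs on 256 x 256 crops
    for ldr_stack, gt_hdr in []:               # placeholder dataloader of cropped patches
        optimizer.zero_grad()
        loss = torch.mean(torch.abs(model(ldr_stack) - gt_hdr))
        loss.backward()
        optimizer.step()
```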
7. Experiment Results
We conducted experiments and performed an evaluation on synthetic test HDR scenes and on a real-world dataset (dynamic and static scenes from [15], under the CC BY-NC-SA 4.0 license) to verify the effectiveness of the proposed method. All models are visually compared, and the predicted HDR frames are evaluated in terms of multiple image quality metrics. We specifically used μ-law tone-mapped PSNR, HDR-VDP-2 [40] and HDR-VQM [66] (HDR-VQM for the full model comparison). We followed the HDR-VQM design of [15] to assess the quality of HDR videos. Additionally, all models were evaluated based on the color difference error between estimated and ground truth HDR using CIEDE2000 [67]. All visual results in the experiments are tone-mapped using the tone-mapping method of Mantiuk et al. [64].
7.1 Evaluation of Baseline Models
We performed our initial comparisons with [17] in the case of no optical flow and no pixel blending, where the model estimates a 3-channel final HDR image. This was specifically done to check and compare the effectiveness of our proposed attention modules against the AHDRNet of [17]. The effectiveness of the proposed attention modules is represented in Figure 8, indicating better performance in frame alignment against the reference frame, with fewer ghosting artifacts in comparison to the AHDRNet [17] attention module. In Fig. 8, this is seen especially in the hand and racket, with fewer ghosting artifacts.
Figure 8.
Visualization of the models’ outputs for consecutive input frames with local motion, together with convolution-layer feature maps after passing through the attention modules.
Similarly, the robustness of our proposed DSKFRDBs in filling rich details of over-exposed regions is illustrated in Figure 9 against AHDRNet [17]. Our proposed DSKFRDBs enable the model to produce results with rich details while achieving more accurate content in over-saturated areas. This can be seen by the proposed model having less color difference in the highlights compared to AHDRNet.
Figure 9.
Dynamic scene Ground Truth (GT) test sample and the HDR scenes estimated by AHDRNet [17] and our proposed multi-attention SKFHDRNet. The top shows the full image, the middle images show a zoomed-in area, and the bottom shows the CIEDE2000 color difference map.
Similarly, the zoomed regions of the CAROUSEL FIREWORKS frame represented in Figure 10 show poor performance of Yan et al. [17] AHDRNet. It struggles to reduce ghosting artifacts due to large motion, which ultimately introduces higher color difference errors, as can be seen in the color difference maps of the images. However, our proposed multi-attention SKFHDRNet performed better alignment in the case of large motions and produced a smaller color difference error in relation to the ground truth HDR frame.
Figure 10.
Represents visual and color difference error results of the baseline models on the synthetic dataset (CAROUSEL FIREWORKS) scene.
Our baseline model SKFHDRNet performed fairly well in the case of the static dataset. From the visual results, the Yan et al. [17] model struggled to recover details in over-exposed regions, as illustrated in the zoomed regions of the static dataset scene in Figure 11. Multi-attention SKFHDRNet recovers much of the missing information in the over-exposed regions with a small color difference error, as shown in Fig. 11. This indicates that using DSKFRDBs in the merge network for filling missing content in the over-exposed regions works better than the dilated residual dense block of [17].
Figure 11.
Represents visual and color difference error results on the static dataset scene.
Quantitative results in terms of μPSNR and HDR-VDP-2 are presented in Table I. Our multi-attention SKFHDRNet showed better performance in terms of visual results as well as image/video quality metrics, with values higher than those of Yan et al. on all datasets for both μPSNR and HDR-VDP-2. This indicates the efficiency of our multi-attention modules, which guide more relevant features from the neighbouring frames in relation to the reference frame, and the robustness of our DSKFRDBs in the merge network in filling missing content in the over-exposed regions.
Table I.
Quantitative results of our baseline Multi-attention SKFHDRNet and Yan et al. [17] AHDRNet on test datasets are represented. Bold text indicates the better among models.
Model performance on the synthetic dataset
Models | μPSNR | HDR-VDP-2
Yan et al. [17] | 28.78 | 63.56
Multi-attention SKFHDRNet | 32.11 | 65.65
Model performance on the dynamic dataset
Yan et al. [17] | 34.68 | 68.42
Multi-attention SKFHDRNet | 40.77 | 73.81
Model performance on the static dataset
Yan et al. [17] | 33.06 | 69.81
Multi-attention SKFHDRNet | 36.76 | 71.34
7.2 Per Frame Objective Metric Results Visualization of Our Baseline Model Without Optical Flow and Pixel Blending
Figure 12 represents our baseline model performance in relation to Yan et al. [17] AHDRNet on all three datasets. Blue violin plots represent the [17] model and orange violin plots represent our baseline multi-attention SKFHDRNet. The data points represent per-frame image quality metric results, specifically μPSNR and HDR-VDP-2. The median is represented by the red point, and the first and third quartiles are represented by the black bar, where the lower end of the bar represents the first quartile and the upper end represents the third quartile. Our baseline model produces better per-frame quality metric results, with the median of the violin plot higher than that of Yan et al. [17] AHDRNet on all three datasets. From the results, an intersection between data points can clearly be seen, especially for the synthetic and dynamic datasets. This reflects the performance of the models on low- and high-exposure samples: the models show higher performance for samples with a low-exposure center frame, which are mostly represented in the third-quartile region of the violin plot above the median, while samples with a high-exposure center frame are represented below the median red point in the first-quartile region. It is worth noting that the proposed model is able to generate higher values on the synthetic dataset: for HDR-VDP-2 the lowest values are approximately the same between the two models, but the proposed model has higher maximum values. In μPSNR we see a shift from a bottom-heavy distribution to generally increased values. For the static dataset, we see a similar behaviour for HDR-VDP-2, with a larger concentration of values towards the top end, while in μPSNR there is a general upward shift. Lastly, for the dynamic dataset, the proposed model shifts the μPSNR values upwards with more values concentrated towards the higher end, while in HDR-VDP-2 the values have a larger spread with more values above the highest values of AHDRNet. In general, our model performance based on μPSNR and HDR-VDP-2 was higher than that of Yan et al. [17] AHDRNet.
Figure 12.
Per-frame image quality objective metric results on all three datasets, shown as violin plots of our baseline architecture (orange) against Yan et al. [17] AHDRNet (blue).
7.3 Evaluation of Our Full Model
We compared our full model performance with [13, 14, 17] and [15], along with its individual networks CoarseNet and RefineNet. We re-implemented the method of Yan et al. [17] for alternating-exposure HDR video reconstruction and used the already trained network parameters of Chen et al. [15] for comparison. For Kalantari et al. [13] and Kalantari and Ramamoorthi [14], we took the results of the models from [15], since the same datasets are used for comparison. All models are visually compared, and the predicted HDR image is evaluated in terms of multiple image quality metrics. We specifically used μ-law tone-mapped PSNR, HDR-VDP-2 [40] and HDR-VQM [66]. Additionally, all models were evaluated based on the color difference error between estimated and ground truth HDR using CIEDE2000 [67].
7.4 Synthetic Dataset for Training
Following the work of [13–15], we used 13 HDR video scenes from [26] and eight downsampled video scenes of resolution 1280 × 720 from [68] for training purposes. Furthermore, we also used the high-quality Vimeo-90K [69] dataset as training samples, similar to [15], due to the limited size of the training HDR video dataset.
7.5 Evaluation on Synthetic Dataset
Our proposed multi-attention SKFHDRNet, together with the re-implemented AHDRNet [17], is evaluated on a synthetic test dataset composed of two HDR videos (POKER FULLSHOT and CAROUSEL FIREWORKS) from the [26] HDR dataset, with random Gaussian noise added to the low-exposure images as in [15].
Figure 13.
Visual and color difference error results on the synthetic dataset.
Figure 13 illustrates the model performance on the POKER FULLSHOT HDR scene. From the visual results, the color difference error is most prominent in the HDR image estimated by Yan et al. [17] AHDRNet: the reconstructed scene is noisy and the color difference map shows errors across the scene. Similarly, there is a higher color difference error in the saturated regions, specifically the edges and the table curtain, in the scene reconstructed by [15], where some pixels are still over-saturated, which is detected by the CIEDE2000 color difference metric. However, the HDR scenes reconstructed by our model variants have fewer over-saturated pixels in the edges and the table curtain. This indicates the robustness of the DSKFRDBs in filling the over-exposed regions with rich details, with 50% fewer model parameters compared to Chen et al. [15], while providing better accuracy.
Quantitative results using HDR-VDP2, HDR-VQM and μPSNR of our multi-attention SKFHDRNet variants on the synthetic dataset are presented in Table II.
Our multi-attention SKFHDRNet showed better performance on all three image and video quality metrics. This indicates our multi-attention modules’ efficiency regarding noise reduction and filling details in over-exposed regions.
Figure 14.
Represents visual and color difference error results on the static dataset.
7.6 Evaluation on Real World Static Dataset
We test our multi-attention SKFHDRNet variants on a static dataset that contains random global motion. Random translation was applied to each frame in the range of [0, 5] pixels. For all methods, no pre-alignment is performed on the input frames, similar to Chen et al. [15], to evaluate their robustness to input with inaccurate global alignment. The Yan et al. [17] model produces noisy results, and the error is captured and visualised in the color difference error map in Figure 14. The Chen et al. [15] model produces results without noise in the reconstructed frame but shows a higher color difference error in the over-saturated regions of the scene, which can be seen in the color difference error maps in Fig. 14. Our model variants handle noise better and fill the over-saturated regions with rich details, producing a smaller color difference error.
Similarly, the [15] model struggles to perform proper alignment in the zoomed and highlighted regions in Figure 15: straight lines are distorted in the highlighted region of the frame reconstructed by [15]. In the case of the Yan et al. [17] model, apart from distortions of the straight lines, there are also more prominent color fringe patterns in the highlighted and zoomed region shown in Fig. 15. However, our proposed model variants show better performance, with reduced distortion and without prominent color fringe patterns in the highlighted region of the reconstructed frame, and the error is recorded by the CIEDE2000 color difference error maps.
Figure 15.
Represents visual and color difference error results on the static dataset.
Our multi-attention SKFHDRNet variants performed better than Yan et al. [17] AHDRNet and the learning-based methods of [13, 14] on the objective image and video quality metrics represented in Table II. Our models also performed better than the single models of [15] (CoarseNet and RefineNet). However, the full model of Chen et al. [15] showed slightly better results than our multi-attention SKFHDRNet variants based on the image/video quality metrics.
Our proposed model showed comparable results on static scenes in comparison to prior work, with half the number of network parameters of the full model of [15], as can be seen in Table II.
7.7 Evaluation on Real World Dynamic Dataset
The dynamic dataset contains large local motions, making it challenging for the models to perform well. Figure 16 visualizes the results of our multi-attention SKFHDRNet variants along with the [17] and [15] models. All of our models clearly show high performance in the large local motion regions of the dynamic dataset scene, apart from our model variant SKFHDRNet with the L1 and MS-SSIM loss, which can be seen in the zoomed region of the dynamic dataset scene in Fig. 16. The arrow points to regions where ghosting artifacts and blur can be seen in the scene reconstructed by [15]. Similarly, there is a ghosting artifact of the whole racket in the scene reconstructed by [17]. This shows the effectiveness of our multi-attention and PCD modules regarding feature alignment of neighbouring frames to the reference frame. The color difference error maps also show large deviations in color information from the original HDR image in the motion regions of the HDR frames estimated by the [17] and [15] models.
Figure 16.
Represents visual and color difference error results on the dynamic dataset.
The performance of our proposed model variants was better than that of Yan et al. [17] AHDRNet, [13, 14], and the learning-based method of Chen et al. [15] on the objective image/video quality metrics for the dynamic dataset, represented in Table II. Our models also showed better performance than the single models of [15] (CoarseNet and RefineNet).
This again indicates our model’s DSKFRDB fusion block effectiveness in filling the missing content in large over-exposed regions with local motion (see results in Table II).
Table II.
Quantitative results of our multi-attention SKFHDRNet variants on all three datasets. The best model is represented with red text, the second best model is represented by blue text, and the third best model is represented by green text.
7.8 Per Frame Objective Metric Results Visualization of Our Full Architecture
Figure 17 presents violin plots of our multi-attention SKFHDRNet variants: multi-attention SKFHDRNet with L1 loss, multi-attention SKFHDRNet with L1 loss and the PCD alignment module, and multi-attention SKFHDRNet with the L1MS–SSIM loss along with the PCD alignment module. The performance of these models is compared to the [15] network.
In Fig. 17, the blue violin plots represent our multi-attention SKFHDRNet with L1 loss, the orange violin plots represent multi-attention SKFHDRNet with L1 loss and the PCD alignment module, the yellow violin plots represent multi-attention SKFHDRNet with the L1MS–SSIM loss function and the PCD alignment module, and the purple violin plots represent the results of the Chen et al. [15] model. Our model variants produce consistent, or in some cases better, results than the full model of [15] considering the per-frame μPSNR and HDR-VDP-2 image quality results. Looking at the median point in red, the performance of all models appears almost equivalent. However, in some cases, such as our multi-attention SKFHDRNet with PCD and L1 loss (orange) in terms of the HDR-VDP-2 image quality metric, better results are produced considering the median (red point) of the violin plot. It is also worth noting that for the dynamic dataset (bottom), the Chen et al. [15] model produces higher minimum values than the others for HDR-VDP-2, but the others have slightly higher maximum HDR-VDP-2 values. A similar behaviour can also be seen on the synthetic dataset (top). Overall, the behaviour of all models was similar: all models performed well on HDR test scenes with an under-exposed center frame, while producing inferior results on scenes with a highly over-exposed center frame and large motions.
Figure 17.
Per-frame image quality objective metric results on all three datasets, shown as violin plots of our multi-attention SKFHDRNet variants: blue represents multi-attention SKFHDRNet with L1 loss, orange represents multi-attention SKFHDRNet with L1 loss and the PCD alignment module, and yellow represents multi-attention SKFHDRNet with the L1MS–SSIM loss function and the PCD alignment module, against the purple violins of the Chen et al. [15] model results.
8. Network Parameters Analysis
The full model of Chen et al. [15] is composed of 6.1 million parameters, with 3.1M parameters for CoarseNet and 3.0M for RefineNet, while the Yan et al. [17] model contains 1.9M parameters and the Kalantari and Ramamoorthi [14] model has 9.0M parameters, as mentioned by [15]. In contrast, our full model without the PCD module has 1.3M parameters, and our model variants with the PCD module have 2.9M parameters, providing similar or even better performance than the Chen et al. [15] model, whose network has more than twice as many parameters as our model. However, our full model variants had a higher inference time on the test images, as represented in Table III.
Table III.
Inference time per frame and number of network parameters of the proposed network and the compared methods.
Models | Synthetic dataset (1920 × 1080) | Dynamic dataset (1476 × 753) | Static dataset (1536 × 813) | Network parameters
Kalantari et al. [13] | 185 s | – | – | –
Yan et al. [17] | 0.82 s | 0.46 s | 0.52 s | 1.9M
Kalantari and Ramamoorthi [14] | 0.59 s | – | – | 9.0M
Chen et al. [15] full model | 0.84 s | 0.63 s | 0.89 s | 6.1M
Our SKFHDRNet (L1) | 1.21 s | 0.68 s | 0.97 s | 1.3M
Our SKFHDRNet (PCD + L1) | 2.02 s | 1.15 s | 1.29 s | 2.9M
Our SKFHDRNet (PCD + L1 + MS–SSIM) | 2.04 s | 1.16 s | 1.30 s | 2.9M
9. Limitations of Our Proposed Methodology
In general, our approach performs well and produces high-quality HDR video. However, some use cases are harder, and the model struggles to produce satisfactory HDR video reconstruction. A typical example of poor performance is observed in cases where the center (reference) frame has highly over-exposed regions and there is large movement of objects across consecutive frames with large occlusion. As can be seen in Figure 18, our method produces ghosting and other distortions, such as decolorized pixels. The other methods also encounter difficulties in these regions and produce estimated HDR frames with similar artifacts.
Figure 18.
The top row represents the estimated HDR scenes for the CAROUSEL FIREWORKS scene using two alternating exposures. The bottom row shows the zoomed region where all the models introduce decolorized pixels. Looking at the model inputs, the center (reference) frame Li is over-exposed in the highlighted region, and the missing content should be recovered from the neighboring frames with low exposure, Li−2, Li−1 and Li+1, Li+2. Because of the significant displacement of objects due to large motion, along with the high exposure in that region, none of the methods is able to properly register and reconstruct details in that region of the image, producing ghosting artifacts, as can be seen in the bottom row. Our method is therefore similar to the other approaches and contains artifacts in this region.
Moreover, in cases where the center (reference) image has low exposure and the neighboring frames with high exposure contain dark pixels in the same region, it is harder for the models to recover detail in darker regions, because the information is very limited in all the frames, which produces noise in those regions. This is illustrated in the zoomed region of the static dataset scene in Figure 19. However, our full model results are still considerably better than those of the other learning-based techniques.
Figure 19.
The top row represents the estimated HDR scenes for a static scene using two alternating exposures. The bottom row shows the zoomed region where all the models introduce noise in the dark region. Looking at the model inputs, the center (reference) frame Li is under-exposed and the highlighted region has very dark pixels. In addition, the neighboring frames with high exposure, Li−2, Li−1 and Li+1, Li+2, also have dark pixel values in the same regions. Due to the limited information in the middle as well as the neighbouring frames, the models produce a noisy texture in those regions, which is visualized in the zoomed sections in the bottom row. Therefore, our method, similar to the other approaches, contains artifacts in this region. However, our multi-attention SKFHDRNet variants produce a less noisy estimated HDR scene than the other methods.
10. Future Work
Considering real-time scenarios, further research is needed to make the model more interactive by minimizing its inference time. For example, performing HDR video estimation without an optical flow network would further reduce the model inference time.
Although our methodology showed improved performance in recovering details in over-exposed regions of LDR images, further improvement is required, as most prior work, similar to our proposed method, shows inferior performance in recovering missing details in challenging over-exposed examples.
In the future, we will extend the evaluation by conducting a psychophysical study to evaluate model performance. Additionally, it would be interesting to modify our system to work with different types of capturing setups, for example, stereo cameras with various exposures.
11. Conclusion
We proposed a learning-based technique with optical flow, multi-attention, and PCD alignment modules for improved image alignment and reduced ghosting artifacts. For recovering lost details in under- and over-exposed regions, we merged the previously refined aligned features using a series of DSKFRDBs to estimate high-quality final HDR scenes. We demonstrated the performance of our method on a number of HDR test datasets containing challenging cases with over-exposed regions and large motions. Our learning-based method achieves better results in most cases than recent state-of-the-art methods, with model parameters half the size of the most recent state-of-the-art method.
References
1. R. K. Mantiuk, K. Myszkowski, and H.-P. Seidel, "High Dynamic Range Imaging," Wiley Encyclopedia of Electrical and Electronics Engineering (2015). doi:10.1002/047134608X.W8265
2. S. J. Daly and X. Feng, "Bit-depth extension using spatiotemporal microdither based on models of the equivalent input noise of the visual system," Proc. SPIE 5008, 455–466 (2003). doi:10.1117/12.472016
3. Q. Song, G.-M. Su, and P. C. Cosman, "Hardware-efficient debanding and visual enhancement filter for inverse tone mapped high dynamic range images and videos," 2016 IEEE Int'l. Conf. on Image Processing (ICIP) (IEEE, Piscataway, NJ, 2016), pp. 3299–3303. doi:10.1109/ICIP.2016.7532970
4. G. Luzardo, J. Aelterman, H. Luong, W. Philips, and D. Ochoa, "Real-time false-contours removal for inverse tone mapped HDR content," Proc. 25th ACM Int'l. Conf. on Multimedia (ACM, New York, NY, 2017), pp. 1472–1479. doi:10.1145/3123266.3123400
5. S. Mukherjee, G.-M. Su, and I. Cheng, "Adaptive dithering using curved Markov–Gaussian noise in the quantized domain for mapping SDR to HDR image," Int'l. Conf. on Smart Multimedia (Springer, Cham, 2018), pp. 193–203. doi:10.1007/978-3-030-04375-9_17
6. F. Banterle, P. Ledda, K. Debattista, and A. Chalmers, "Inverse tone mapping," Proc. 4th Int'l. Conf. on Computer Graphics and Interactive Techniques in Australasia and Southeast Asia (ACM, New York, NY, 2006), pp. 349–356. doi:10.1145/1174429.1174489
7. F. Banterle, P. Ledda, K. Debattista, A. Chalmers, and M. Bloj, "A framework for inverse tone mapping," Vis. Comput. 23, 467–478 (2007). doi:10.1007/s00371-007-0124-9
8. F. De Simone, G. Valenzise, P. Lauga, F. Dufaux, and F. Banterle, "Dynamic range expansion of video sequences: A subjective quality assessment study," 2014 IEEE Global Conf. on Signal and Information Processing (GlobalSIP) (IEEE, Piscataway, NJ, 2014), pp. 1063–1067. doi:10.1109/GlobalSIP.2014.7032284
9. B. Masia, A. Serrano, and D. Gutierrez, "Dynamic range expansion based on image statistics," Multimedia Tools Appl. 76, 631–648 (2017). doi:10.1007/s11042-015-3036-0
10. X. Zhang and D. H. Brainard, "Estimation of saturated pixel values in digital color imaging," J. Opt. Soc. Am. A 21, 2301–2310 (2004). doi:10.1364/JOSAA.21.002301
11. D. Xu, C. Doutre, and P. Nasiopoulos, "Correction of clipped pixels in color images," IEEE Trans. Vis. Comput. Graph. 17, 333–344 (2011). doi:10.1109/TVCG.2010.63
12. S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, "High dynamic range video," ACM Trans. Graph. (TOG) 22, 319–325 (2003). doi:10.1145/882262.882270
13. N. K. Kalantari and R. Ramamoorthi, "Deep high dynamic range imaging of dynamic scenes," ACM Trans. Graph. 36, 1–12 (2017). doi:10.1145/3072959.3073609
14. N. K. Kalantari and R. Ramamoorthi, "Deep HDR video from sequences with alternating exposures," Computer Graphics Forum (Wiley, Hoboken, NJ, 2019), Vol. 38, pp. 193–205. doi:10.1111/cgf.13630
15. G. Chen, C. Chen, S. Guo, Z. Liang, K.-Y. K. Wong, and L. Zhang, "HDR video reconstruction: A coarse-to-fine network and a real-world benchmark dataset," Proc. IEEE/CVF Int'l. Conf. on Computer Vision (IEEE, Piscataway, NJ, 2021), pp. 2502–2511. doi:10.1109/ICCV48922.2021.00250
16. S. Wu, J. Xu, Y.-W. Tai, and C.-K. Tang, "Deep high dynamic range imaging with large foreground motions," Computer Vision – ECCV 2018 (Springer, Cham, 2018), pp. 120–135. doi:10.1007/978-3-030-01216-8_8
17. Q. Yan, D. Gong, Q. Shi, A. v. d. Hengel, C. Shen, I. Reid, and Y. Zhang, "Attention-guided network for ghost-free high dynamic range imaging," Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (IEEE, Piscataway, NJ, 2019), pp. 1751–1760. doi:10.1109/CVPR.2019.00185
18. Q. Yan, L. Zhang, Y. Liu, Y. Zhu, J. Sun, Q. Shi, and Y. Zhang, "Deep HDR imaging via a non-local network," IEEE Trans. Image Process. 29, 4308–4322 (2020). doi:10.1109/TIP.2020.2971346
19. X. Wang, K. C. K. Chan, K. Yu, C. Dong, and C. C. Loy, "EDVR: Video restoration with enhanced deformable convolutional networks," Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition Workshops (IEEE, Piscataway, NJ, 2019), pp. 1954–1963. doi:10.1109/CVPRW.2019.00247
20. C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-realistic single image super-resolution using a generative adversarial network," Proc. IEEE Conf. on Computer Vision and Pattern Recognition (IEEE, Piscataway, NJ, 2017), pp. 105–114. doi:10.1109/CVPR.2017.19
21. H. Zhao, O. Gallo, I. Frosio, and J. Kautz, "Loss functions for image restoration with neural networks," IEEE Trans. Computational Imaging 3, 47–57 (2016). doi:10.1109/TCI.2016.2644865
22. S. K. Nayar and T. Mitsunaga, "High dynamic range imaging: Spatially varying pixel exposures," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, CVPR 2000 (IEEE, Piscataway, NJ, 2000), Vol. 1, pp. 472–479. doi:10.1109/CVPR.2000.855857
23. S. K. Nayar, V. Branzoi, and T. E. Boult, "Programmable imaging using a digital micromirror array," Proc. 2004 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, CVPR 2004 (IEEE, Piscataway, NJ, 2004), Vol. 1, pp. I–I. doi:10.1109/CVPR.2004.1315065
24. M. D. Tocci, C. Kiser, N. Tocci, and P. Sen, "A versatile HDR video production system," ACM Trans. Graph. (TOG) 30, 1–10 (2011). doi:10.1145/2010324.1964936
25. J. Kronander, S. Gustavson, G. Bonnet, and J. Unger, "Unified HDR reconstruction from raw CFA data," IEEE Int'l. Conf. on Computational Photography (ICCP) (IEEE, Piscataway, NJ, 2013), pp. 1–9. doi:10.1109/ICCPhot.2013.6528315
26. J. Froehlich, S. Grandinetti, B. Eberhardt, S. Walter, A. Schilling, and H. Brendel, "Creating cinematic wide gamut HDR-video for the evaluation of tone mapping operators and HDR-displays," Proc. SPIE 9023, 279–288 (2014). doi:10.1117/12.2040003
27. T. Lulé, H. Keller, M. Wagner, and M. Böhm, "LARS II – a high dynamic range image sensor with a-Si:H photo conversion layer," 1999 IEEE Workshop on Charge-Coupled Devices and Advanced Image Sensors (IEEE, Piscataway, NJ, 1999), Nagano, Japan.
28. U. Seger, U. Apel, and B. Höfflinger, "HDRC-imagers for natural visual perception," Handbook Comput. Vis. Appl. 1, 2 (1999).
29KavadiasS.DierickxB.SchefferD.AlaertsA.UwaertsD.BogaertsJ.2000A logarithmic response CMOS image sensor with on-chip calibrationIEEE J. Solid-State Circuits35114611521146–5210.1109/4.859503
30ReinhardE.StarkM.ShirleyP.FerwerdaJ.2002Photographic tone reproduction for digital imagesProc. 29th Annual Conf. on Computer Graphics and Interactive Techniques267276267–76ACMNew York, NY10.1145/566654.566575
31MeylanL.DalyS.SüsstrunkS.2006The reproduction of specular highlights on high dynamic range displaysProc. IS&T/SID CIC14: Fourteenth Color Imaging Conf.333338333–8IS&TSpringfield, VA10.2352/CIC.2006.14.1.art00061
32RempelA. G.TrentacosteM.SeetzenH.YoungH. D.HeidrichW.WhiteheadL.WardG.2007Ldr2hdr: on-the-fly reverse tone mapping of legacy video and photographsACM Trans. Graph. (TOG)2610.1145/1276377.127642639-es
33BanterleF.LeddaP.DebattistaK.ChalmersA.2008Expanding low dynamic range videos for high dynamic range applicationsProc. 24th Spring Conf. on Computer Graphics334133–41ACMNew York, NY10.1145/1921264.1921275
34KovaleskiR. P.OliveiraM. M.2014High-quality reverse tone mapping for a wide range of exposures2014 27th SIBGRAPI Conf. on Graphics, Patterns and Images495649–56IEEEPiscataway, NJ10.1109/SIBGRAPI.2014.29
35DidykP.MantiukR.HeinM.SeidelH.-P.2008Enhancement of bright video features for HDR displaysComputer Graphics ForumVol. 27126512741265–74WileyHoboken, NJ10.1111/j.1467-8659.2008.01265.x
36EilertsenG.KronanderJ.DenesG.MantiukR. K.UngerJ.2017HDR image reconstruction from a single exposure using deep CNNsACM Trans. Graph. (TOG)361151–1510.1145/3130800.3130816
37EndoY.KanamoriY.MitaniJ.2017Deep reverse tone mappingACM Trans. Graph.3610.1145/3130800.3130834177–1
38LiuY.-L.LaiW.-S.ChenY.-S.KaoY.-L.YangM.-H.ChuangY.-Y.HuangJ.-B.2020Single-image HDR reconstruction by learning to reverse the camera pipelineProc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition165116601651–60IEEEPiscataway, NJ10.1109/CVPR42600.2020.00172
39MangiatS.GibsonJ.2010High dynamic range video with ghost removalProc. SPIE779877981210.1117/12.862492
40MangiatS.GibsonJ.2011Spatially adaptive filtering for registration artifact removal in HDR video2011 18th IEEE Int’l. Conf. on Image Processing131713201317–20IEEEPiscataway, NJ10.1109/ICIP.2011.6115678
41KalantariN. K.ShechtmanE.BarnesC.DarabiS.GoldmanD. B.SenP.2013Patch-based high dynamic range videoACM Trans. Graph.32181–810.1145/2508363.2508402
42GryaditskayaY.PouliT.ReinhardE.MyszkowskiK.SeidelH.-P.2015Motion aware exposure bracketing for HDR videoComputer Graphics ForumVol. 34119130119–30WileyHoboken, NJ10.1111/cgf.12684
43LiY.LeeC.MongaV.2016A maximum a posteriori estimation framework for robust high dynamic range video synthesisIEEE Trans. Image Process.26114311571143–5710.1109/TIP.2016.2642790
44EilertsenG.MantiukR. K.UngerJ.2019Single-frame regularization for temporally stable cnnsProc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition111761118511176–85IEEEPiscataway, NJ10.1109/CVPR.2019.01143
45KimS. Y.OhJ.KimM.2020Jsi-gan: Gan-based joint super-resolution and inverse tone-mapping with pixel-wise task-specific filters for uhd hdr videoProc. AAAI Conf. on Artificial IntelligenceVol. 34112871129511287–95AAAIWashington, DC
46ZhangH.SongL.GanW.XieR.2023Multi-scale-based joint super-resolution and inverse tone-mapping with data synthesis for UHD HDR videoDisplays7910249210.1016/j.displa.2023.102492
47ChenX.ZhangZ.RenJ. S.TianL.QiaoY.DongC.2021A new journey from SDRTV to HDRTVProc. IEEE/CVF Int’l. Conf. on Computer Vision450045094500–9IEEEPiscataway, NJ10.1109/ICCV48922.2021.00446
48AnandM.HarilalN.KumarC.RamanS.2021HDRVideo-GAN: deep generative HDR video reconstructionProc. Twelfth Indian Conf. on Computer Vision, Graphics and Image Processing191–9ACMNew York, NY10.1145/3490035.3490266
49YangY.HanJ.LiangJ.SatoI.ShiB.“Learning event guided high dynamic range video reconstruction,” Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (IEEE, Piscataway, NJ, 2023), pp. 13924–13934
50YangQ.LiuY.YangJ.Efficient HDR reconstruction from real-world raw images, arXiv preprint arXiv:2306.10311 (2023), 10 pages
51CogalanU.BemanaM.MyszkowskiK.SeidelH.-P.RitschelT.2022Learning HDR video reconstruction for dual-exposure sensors with temporally-alternating exposuresComput. Graph.105577257–7210.1016/j.cag.2022.04.008
52LiuZ.LiZ.ChenW.WuX.LiuZ.2023Unsupervised optical flow estimation for differently exposed images in ldr domainIEEE Trans. Circuits Syst. Video Technol.111–10.1109/TCSVT.2023.3252007
53MartorellO.BuadesA.2022Variational temporal optical flow for multi-exposure videoVISIGRAPP (4: VISAPP)666673666–73SciTePressSetubal
54JiangY.ChoiI.JiangJ.GuJ.HDR video reconstruction with tri-exposure quad-bayer sensors, arXiv preprint arXiv:2103.10982 (2021), 10 pages
55CogalanU.BemanaM.MyszkowskiK.SeidelH.-P.RitschelT.2022Learning HDR video reconstruction for dual-exposure sensors with temporally-alternating exposuresComput. Graph.105577257–7210.1016/j.cag.2022.04.008https://www.sciencedirect.com/science/article/pii/S0097849322000607
56DaiJ.QiH.XiongY.LiY.ZhangG.HuH.WeiY.2017Deformable convolutional networksProc. IEEE Int’l. Conf. on Computer Vision764773764–73IEEEPiscataway, NJ10.1109/ICCV.2017.89
57WooS.ParkJ.LeeJ.-Y.KweonI. S.2018Cbam: Convolutional block attention moduleProc. European Conf. on Computer Vision (ECCV)3193–19SpringerCham10.1007/978-3-030-01234-2_1
58LiX.WangW.HuX.YangJ.2019Selective kernel networksProc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition510519510–9IEEEPiscataway, NJ10.1109/CVPR.2019.00060
59HowardA. G.ZhuM.ChenB.KalenichenkoD.WangW.WeyandT.AndreettoM.AdamH.Mobilenets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861 (2017), 9 pages
60TianY.ZhangY.FuY.XuC.2020Tdan: Temporally-deformable alignment network for video super-resolutionProc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition336033693360–9IEEEPiscataway, NJ10.1109/CVPR42600.2020.00342
61ZhangY.TianY.KongY.ZhongB.FuY.2018Residual dense network for image super-resolutionProc. IEEE Conf. on Computer Vision and Pattern Recognition247224812472–81IEEEPiscataway, NJ10.1109/CVPR.2018.00262
62DebevecP. E.MalikJ.1997Recovering high dynamic range radiance maps from photographsSIGGRAPH ’97: Proc. 24th Annual Conf. on Computer Graphics and Interactive Techniques369378369–78ACMNew York, NY10.1145/258734.258884
63GlorotX.BengioY.2010Understanding the difficulty of training deep feedforward neural networksProc. Thirteenth Int’l. Conf. on Artificial Intelligence and Statistics, JMLR Workshop and Conf. Proc.249256249–56JMLRCambridge, MA
64MantiukR.DalyS.KerofskyL.2008Display adaptive tone mappingACM Trans. Graphics271101–1010.1145/1360612.1360667
65SjälanderM.JahreM.TufteG.ReissmannN.EPIC: an energy-efficient, high-performance GPGPU computing research infrastructure, arXiv:1912.05848 (2019), 6 pages
66NarwariaM.Da SilvaM. P.Le CalletP.2015HDR-VQM: An objective quality measure for high dynamic range videoSignal Process., Image Commun.35466046–6010.1016/j.image.2015.04.009
67LuoM. R.CuiG.RiggB.2001The development of the CIE 2000 colour-difference formula: CIEDE2000Color Res. Appl.26340350340–50
68KronanderJ.GustavsonS.BonnetG.YnnermanA.UngerJ.2014A unified framework for multi-sensor HDR video reconstructionSignal Process., Image Commun.29203215203–1510.1016/j.image.2013.08.018
69XueT.ChenB.WuJ.WeiD.T. FreemanW.2019Video enhancement with task-oriented flowInt. J. Comput. Vis.127110611251106–2510.1007/s11263-018-01144-2