Work Presented at CIC33: Color and Imaging Conference 2025 FastTrack
Automatic Image Colorization with Semantic Segmentation and Multipath Deep Networks
Abstract

A fully automated colorization model that integrates image segmentation features to enhance both the accuracy and diversity of colorization is proposed. The model employs a multipath architecture in which each path addresses a specific objective in processing grayscale input images. The context path utilizes a pretrained ResNet50 model to identify object classes, while the spatial path determines the locations of these objects. ResNet50 is a 50-layer deep convolutional neural network (CNN) that uses skip connections to address the challenges of training deep models and is widely applied in image classification and feature extraction. The outputs from both paths are fused and fed into the colorization network to ensure precise representation of image structures and to prevent color spillover across object boundaries. The colorization network is designed to handle high-resolution inputs, enabling accurate colorization of small objects and enhancing overall color diversity. The proposed model demonstrates robust performance even when trained on small datasets. Comparative evaluations with CNN-based and diffusion-based colorization approaches show that the proposed model significantly improves colorization quality.

  Cite this article 

Jie-Sen Wang, Hung-Chung Li, Pei-Li Sun, "Automatic Image Colorization with Semantic Segmentation and Multipath Deep Networks," in Journal of Imaging Science and Technology, 2025, pp. 1–14, https://doi.org/10.2352/J.ImagingSci.Technol.2025.69.5.050402

  Copyright statement 
Copyright © Society for Imaging Science and Technology 2025
 Open access
  Article timeline 
  • received May 2025
  • accepted October 2025
1. Introduction
Since the advent of photography, the colorization of grayscale images has been a topic of considerable interest. This technology can provide additional semantic information, enhancing the readability and interpretability of image content while also improving visual effects. Traditional grayscale image colorization methods typically require users to manually provide color and image information for the process [1–3]. However, these approaches are labor-intensive and carry the risk of inaccuracies when users supply erroneous color information.
Driven by the rapid development and technological breakthroughs of deep learning, automatic colorization of grayscale images has become an important research topic in recent years. Early convolutional neural network (CNN) architectures for colorization used simple and straightforward designs [4–6], primarily consisting of networks with increased depth achieved by stacking multiple convolutional layers. Although these architectures were well designed, they required large datasets for effective learning, limiting their practicality in scenarios with limited data availability. In subsequent studies, some approaches reformulated the colorization problem as a classification task by learning color distributions from large-scale natural image datasets [7]. Other methods employed pixel histogram modeling to capture multimodal color possibilities and avoid single-point estimation [8]. In addition, research combining local and global semantic features has effectively improved color consistency for objects such as buildings and the sky [9]. Exemplar-based methods, on the other hand, transfer colors from reference images to grayscale inputs, thereby enhancing the realism of specific objects in street scenes [10].
In CNN-based colorization methods, user inputs in the form of dots or doodles are often incorporated [11–14]. However, this approach is associated with an increased workload and requires a certain level of expertise from the user. Consequently, the process can be time-consuming. The employment of generative adversarial networks (GANs) or variational autoencoders is a common practice in achieving diverse colorization. GANs utilize a competitive framework in which the generator endeavors to generate colors that are indistinguishable from real ones to the discriminator, while the discriminator's objective is to differentiate between genuine and generated colors [15–19]. However, GAN-based methods often exhibit suboptimal performance when dealing with objects that have consistent colors. Additionally, they are characterized by high computational cost and significant resource consumption.
Notwithstanding the elevated memory demands associated with the training process, multipath neural networks demonstrate a remarkable ability to accurately capture semantic information [9, 20–24]. By capitalizing on both local and global image features, these networks achieve enhanced colorization precision in grayscale images, thereby improving the overall quality of the colorization process. Transformer-based models [25–27] have recently garnered considerable attention due to their capacity to extract salient image features through multihead attention mechanisms. Another notable approach is the diffusion model [28, 29], which incorporates incremental noise during training to augment image diversity through denoising. However, both methods are highly data-dependent and necessitate substantial datasets for effective implementation.
With the advancement of generative models, diffusion models and multimodal conditional control methods have been introduced into the field of automatic colorization, further enhancing detail and texture. However, recent studies have also revealed several limitations. First, the one-to-many nature of mapping grayscale to color persists: models in open-domain street scenes tend to generate low-saturation or conservative colors, with limited ability to recover rare hues, as shown in Figure 1 [28, 30, 31]. Second, insufficient semantic understanding often leads to incorrect color predictions when the model encounters uncommon objects or distinctive signs, reducing the realism of the results [9, 28]. Furthermore, although exemplar-based methods can improve colorization accuracy, they are highly dependent on the structural similarity and alignment quality of the reference image. When significant differences exist, these methods can cause color shifts or unnatural transfers [10].
Figure 1.
The architecture of the proposed colorization model.
Another challenge lies in the trade-off between controllability and stability. Although conditions such as text descriptions, strokes, and reference images allow users greater control over colorization results, current methods still suffer from issues like color bleeding and unstable condition alignment [32, 33]. On the data side, most existing approaches rely on natural image datasets such as ImageNet and Places, which lack training sources specifically tailored for street scene colorization. As a result, their generalization ability in cross-domain applications, such as low-light environments and historical photographs, remains limited [31]. Although some studies have begun to adopt street scene datasets, such as Cityscapes [34] and Mapillary Vistas [35], these datasets were initially designed for semantic segmentation and autonomous driving, rather than being optimized for colorization tasks.
In terms of colorization evaluation, existing assessments still mainly rely on the peak signal-to-noise ratio and the structural similarity index. However, these pixel-level metrics cannot adequately reflect the realism and usability of street scene colorization [36–40], for instance, whether traffic light colors are correctly reproduced or how the results affect autonomous driving tasks. Therefore, future research should not only establish benchmark datasets specifically for street scene colorization but also introduce evaluation metrics grounded in human visual perception. Furthermore, integrating human subjective assessments with task-oriented performance measures will be essential for a more comprehensive evaluation of the practical value of these models in real-world applications.
In image colorization, a thorough understanding of semantic information is crucial to ensure the authenticity of the results. Such understanding enables sensible color assignments, for example, recognizing that cats are unlikely to be blue while leaves are typically green. In the realm of image segmentation, network architectures proposed in [41–45] exhibit a close correlation with semantic and spatial location information despite adopting divergent design approaches. Based on these concepts, this study proposes a similar framework to support and enhance image colorization.
To meet the application needs of autonomous driving and intelligent transportation, the automatic colorization of grayscale street scene images has gradually attracted increasing attention. However, in the extant literature on grayscale image colorization, user-provided information or reference images are often required, with limited emphasis placed on road-specific colorization. This study introduces an automated method for colorizing grayscale road images. Image segmentation was incorporated into the design to address the challenge of road colorization due to the presence of various artificial objects. By capturing multiple local textures and objects and integrating this information into the colorization network, our model can effectively colorize elements such as buildings, trucks, and the sky without human intervention.
Although diffusion models have gained popularity in colorization, our CNN-based model exhibits superior learning capability, achieving improved performance on both small and large datasets. Our approach leverages the CIELAB color space to predict chromaticity components of an image. The proposed model comprises three key elements: a contextual network, a spatial network, and a colorization network. The contextual network learns semantic information about objects, the spatial network identifies their positions, and the colorization network integrates this information to generate the final colorized output. Experimental results demonstrate that our method outperforms the state-of-the-art diffusion model, achieving superior colorization accuracy and diversity.
The motivation and objective of this study are to enable the practical application of colorization in real-world scenarios while delivering high-quality results. However, due to the diversity of real-world colors, this task presents considerable challenges. Our research primarily employs the proposed model training architecture to first validate its effectiveness on a specific dataset, with subsequent work focusing on conducting a more comprehensive investigation and optimization of the model’s generalization capability.
2. Methods
This section presents a comprehensive account of the proposed CNN architecture illustrated in Fig. 1. The architecture comprises three principal components: the context path, the spatial path, and the colorization path. The context path furnishes information about the content of the image, such as the sky, buildings, and trees, whereas the spatial path provides the exact spatial position of these contextual elements within the image. After important features were extracted by Efficient Channel Attention (ECA-Net) [46] and Squeeze and Excitation (SE-Net) [47], which enhance channel-wise attention by adaptively weighting feature maps according to their importance, the outputs of the context and spatial paths were integrated to facilitate effective semantic segmentation before being passed to the colorization path. A detailed explanation of ECA-Net and SE-Net is provided in Section 2.3. The incorporation of semantic information into the colorization network yielded three primary benefits: (1) improved precision in color prediction, (2) mitigation of color overflow problems, and (3) enhanced diversity in color applications.
In the early stages of this study, a single-path architecture was tested, but the recognition and segmentation performance on grayscale images was found to be suboptimal. This was likely due to the absence of color information, which typically provided important cues for semantic discrimination. To address this limitation, a dual-path architecture was adopted. The contextual path, based on an ImageNet pretrained model, enhanced semantic understanding and improved generalization across diverse scenes. The spatial path focused on preserving edge and structural details, thereby helping to prevent color bleeding during the colorization process.
In the proposed model, the process of downsampling (DS) was achieved through the implementation of convolution (Conv2d) [48], a layer that extracts spatial features by sliding filters over the input to generate feature maps. For upsampling (US), transposed convolution (ConvT2d) [49] was employed, which reverses the convolution operation to increase spatial resolution in a learnable manner. Following each convolutional layer, batch normalization (BN) [50] was applied to normalize the activations within each mini-batch, thereby accelerating training and improving model stability. To introduce nonlinearity, the rectified linear unit (ReLU) [51] was used.
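As a concrete illustration, a minimal PyTorch sketch of these DS and US building blocks is given below; the layer grouping and channel counts are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

def ds_block(in_ch, out_ch, stride=1):
    # Downsampling unit: Conv2d -> BatchNorm -> ReLU (stride 2 halves the resolution)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def us_block(in_ch, out_ch, stride=1):
    # Upsampling unit: transposed convolution -> BatchNorm -> ReLU.
    # output_padding = stride - 1 makes a stride-2 layer exactly double the spatial size.
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=stride,
                           padding=1, output_padding=stride - 1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

x = torch.randn(1, 3, 896, 896)        # grayscale input replicated to 3 channels
y = ds_block(3, 32, stride=1)(x)        # 896 x 896, 32 channels
z = ds_block(32, 64, stride=2)(y)       # 448 x 448, 64 channels
w = us_block(64, 32, stride=2)(z)       # back to 896 x 896, 32 channels
```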
The CNN pathways were connected and optimized via an end-to-end training process. The entire framework operated in a CIELAB-type color space, which consists of three channels: the lightness channel (L) and two chromatic channels (a and b). The input of the three pathways was a grayscale image. The input grayscale image was derived from an RGB-to-grayscale conversion based on the ITU-R BT.709 standard (Y709 = 0.2126R + 0.7152G + 0.0722B), which is closely aligned with the sRGB standard. The outputs from the contextual and spatial paths represented semantic segmentation results while the chromatic path produced the image planes for a and b color channels, separately. Since the primary objective of image colorization is to generate perceptually realistic chromatic channels from a grayscale image, this study adopted the method proposed by Iizuka [9], where the input grayscale image is treated as the lightness channel in the LAB color space transformation and the calculation of color differences. The final colorized images are represented in the sRGB color space. The conversion from CIELAB to sRGB was carried out in OpenCV-Python using the cvtColor function with the COLOR_Lab2RGB flag. This process involved two steps: (1) transforming CIELAB values into CIE 1931 XYZ values with the D65 illuminant as the reference white and (2) converting the XYZ values into sRGB in compliance with the IEC 61966-2-1:1999 standard [52].
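The CIELAB-to-sRGB step can be sketched with OpenCV-Python as follows; the array layout and value ranges are assumptions consistent with OpenCV's float32 conventions, not the authors' exact code.

```python
import cv2
import numpy as np

def lab_to_srgb(L, ab):
    """L: lightness plane (H x W, 0-100 scale) taken from the input grayscale image.
    ab: predicted chroma planes (H x W x 2) in CIELAB units (roughly -127..127)."""
    lab = np.dstack([L, ab]).astype(np.float32)   # H x W x 3 CIELAB image
    # For float32 input, OpenCV expects L in [0, 100] and a, b in [-127, 127];
    # the result is sRGB in [0, 1] (D65 white, per IEC 61966-2-1).
    rgb = cv2.cvtColor(lab, cv2.COLOR_Lab2RGB)
    return np.clip(rgb, 0.0, 1.0)
```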
To leverage the superior feature extraction capabilities of the pretrained ResNet50 [53] model, this study adopted the recommended input size of 224 × 224 × 3 in the semantic segmentation network. However, using the same resolution in the coloring network may result in the loss of details of small objects (such as brake lights and traffic signals). To address this issue, this study adopted a high-resolution input of 896 × 896 × 3 in the coloring network to preserve the details of small objects and improve coloring accuracy. This study assumed that the semantic information of large objects is sufficient to provide the overall contextual guidance required for coloring.
2.1 Context Path
An image is typically composed of both foreground and background elements, with the background often occupying a significant portion of the image area. In CNNs, extracting features at multiresolution enables the capture of both global and local information. This is typically achieved by adjusting the stride, which is a parameter that determines how the convolutional filter moves across the image, thereby affecting the resolution of the extracted feature maps. To this end, a pretrained ResNet50 model was employed to construct the contextual pathways. This approach can effectively capture image features even when the input image is in grayscale.
Specifically, image features were extracted at resolutions of 7 × 7 and 28 × 28 and then upsampled to 56 × 56, thereby ensuring a consistent resolution for subsequent processing. This multiscale approach allowed detailed image features to be extracted while retaining the broader contextual information essential for subsequent processing, as presented in Table I. This architectural configuration aimed to enhance the accuracy of image recognition. To this end, ECA-Net and SE-Net were incorporated to refine and selectively enhance relevant image features, thereby ensuring the optimal representation of both foreground and background information and ultimately improving model performance.
Table I.
Context path network architecture.
Output size | Operator | Stride | Filter
112 × 112 | Conv2d | 2 | 7 × 7 × 3 × 64
56 × 56 | max pool | 2 | 3 × 3
 | Conv2d | 1 | ResNet50 layer1
28 × 28 | max pool | 2 | 3 × 3
 | Conv2d | 1 | ResNet50 layer2
14 × 14 | max pool | 2 | 3 × 3
 | Conv2d | 1 | ResNet50 layer3
7 × 7 | max pool | 2 | 3 × 3
 | Conv2d | 1 | ResNet50 layer4
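A hedged sketch of how multi-resolution context features could be tapped from a pretrained torchvision ResNet50 and brought back to 56 × 56, loosely following Table I; the exact taps and the fusion step are assumptions, and the weights API requires torchvision ≥ 0.13.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

class ContextPath(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)  # 224 -> 56
        self.layer1 = backbone.layer1   # 56 x 56 features
        self.layer2 = backbone.layer2   # 28 x 28 features
        self.layer3 = backbone.layer3   # 14 x 14 features
        self.layer4 = backbone.layer4   # 7 x 7 features

    def forward(self, x):               # x: N x 3 x 224 x 224 (grayscale replicated)
        f56 = self.layer1(self.stem(x))
        f28 = self.layer2(f56)
        f7 = self.layer4(self.layer3(f28))
        # Upsample the 28 x 28 and 7 x 7 maps back to 56 x 56 so they can be fused
        u28 = F.interpolate(f28, size=56, mode='bilinear', align_corners=False)
        u7 = F.interpolate(f7, size=56, mode='bilinear', align_corners=False)
        return f56, u28, u7
```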
2.2 Spatial Path
Images commonly consist of multiple objects, whose spatial arrangements are crucial for accurate interpretation. As the depth of a neural network increases, its capacity to preserve absolute position information decreases. To address this issue, we have devised a wide and shallow architectural configuration comprising just four convolutional layers in the spatial path as detailed in Table II. The initial convolutional layer employed a kernel size of 5 × 5 and a stride of 2, enabling the capture of greater spatial detail at an early stage while minimizing positional loss. This architectural configuration is advantageous because it preserves absolute positional information, which is vital for tasks that necessitate precise localization. Furthermore, our design addresses color overflow issues, guaranteeing accurate color differentiation between adjacent objects or regions within the image.
Table II.
Spatial path network architecture.
Output size | Operator | Stride | Filter size
112 × 112 | Conv2d, BN, ReLU | 2 | 5 × 5 × 3 × 32
112 × 112 | Conv2d, BN, ReLU | 1 | 3 × 3 × 32 × 32
56 × 56 | Conv2d, BN, ReLU | 2 | 3 × 3 × 32 × 64
56 × 56 | Conv2d, BN, ReLU | 1 | 3 × 3 × 64 × 64
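Table II maps almost directly onto a small PyTorch module; the sketch below assumes a 3-channel 224 × 224 input, mirroring the context path.

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=stride, padding=k // 2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Wide, shallow spatial path per Table II: four layers, 224 x 224 -> 56 x 56
spatial_path = nn.Sequential(
    conv_bn_relu(3, 32, k=5, stride=2),    # 112 x 112 x 32
    conv_bn_relu(32, 32, k=3, stride=1),   # 112 x 112 x 32
    conv_bn_relu(32, 64, k=3, stride=2),   # 56 x 56 x 64
    conv_bn_relu(64, 64, k=3, stride=1),   # 56 x 56 x 64
)
```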
2.3 Fusing Context and Spatial Features
In the context of the color model, the efficient transmission of both spatial and contextual information is paramount. This was accomplished by integrating spatial and contextual data and applying an attention model to selectively extract the most relevant information. At the intermediate resolution level of 56 × 56, two branches were created: one directed towards the color neural network and the other dedicated to image segmentation. This dual-branch structure ensured that the position and content of each object within the image are accurately represented, enhancing both the precision of color application and the clarity of object boundaries.
The ECA-Net has been shown to improve the accuracy of classification results. The channel attention mechanism was employed in the branch of the context path, and a 1 × 1 convolution was used to make the number of channels in the two branches equivalent. After the spatial path was connected to the context path, a 1 × 1 convolution and SE-Net were employed to determine the significance of different channels and to enhance salient features. The attention calculation is shown in Eq. (1). The ECA-Net and SE-Net weights can be calculated by Eqs. (2) and (3).
(1)
$\tilde{X}_c = \alpha_c X_c, \quad c \in \{1, 2, \ldots, C\},$
where Xc refers to the original input feature map of the cth channel and X̃c denotes the output feature map of the cth channel after being weighted by the attention coefficient αc. The attention weight αc is computed differently depending on the method.
The input feature of ECA-Net and SE-Net is $X \in \mathbb{R}^{H \times W \times C}$, whose spatial information is aggregated into a channel descriptor $z \in \mathbb{R}^{C}$ through global average pooling. The ECA-Net module introduces a lightweight and effective mechanism to capture channel-wise dependencies without dimensionality reduction.
(2)
$\alpha = \sigma\!\left(\mathrm{Conv1D}_k(z)\right), \qquad z_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_{i,j,c}, \qquad k = \left|\frac{\log_2 C}{\gamma} + \frac{b}{\gamma}\right|_{\mathrm{odd}},$
where $\mathrm{Conv1D}_k(\cdot)$ denotes a 1D convolution with a kernel size of $k$ applied along the channel dimension, $\gamma$ and $b$ are hyperparameters typically set at 2 and 1, respectively, and $|\cdot|_{\mathrm{odd}}$ denotes rounding to the nearest odd integer. The parameter $X_{i,j,c}$ represents the value at spatial position $(i, j)$ in the $c$th channel of the input feature map $X \in \mathbb{R}^{H \times W \times C}$. Global average pooling is applied across the spatial dimensions. This formulation ensured that the kernel size scales reasonably with increasing channel numbers while preserving efficient local cross-channel interaction.
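A compact PyTorch rendering of the ECA attention of Eq. (2), following the published ECA-Net design; the module granularity is an assumption, not the authors' exact code.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Adaptive kernel size: nearest odd integer to (log2(C) + b) / gamma
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.pool = nn.AdaptiveAvgPool2d(1)                  # global average pooling
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                    # x: N x C x H x W
        z = self.pool(x)                                     # N x C x 1 x 1
        # 1D convolution across the channel dimension (no dimensionality reduction)
        a = self.conv(z.squeeze(-1).transpose(-1, -2))       # N x 1 x C
        a = self.sigmoid(a.transpose(-1, -2).unsqueeze(-1))  # N x C x 1 x 1
        return x * a                                         # channel-wise reweighting
```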
The SE-Net block enhanced channel-wise feature representations by modeling inter-channel dependencies. The SE block first applied a squeeze operation. This was followed by an excitation operation that captures channel-wise dependencies using two fully connected (FC) layers with a ReLU activation.
(3)
$\alpha = \sigma\!\left(W_2\,\delta(W_1 z)\right), \qquad z_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_{i,j,c},$
where $W_1 \in \mathbb{R}^{(C/r) \times C}$ and $W_2 \in \mathbb{R}^{C \times (C/r)}$ are the weight matrices of the FC layers, $\delta(\cdot)$ denotes the ReLU function, and $\sigma(\cdot)$ is the sigmoid [54] function, which maps input values to the range (0, 1) to represent probabilities or attention weights. The parameter $X_{i,j,c}$ represents the value at spatial position $(i, j)$ in the $c$th channel of the input feature map $X \in \mathbb{R}^{H \times W \times C}$. Global average pooling is applied across the spatial dimensions. The reduction ratio $r$ is a hyperparameter (typically set at 16) that controls the capacity and complexity of the excitation operation. The ECA-Net and SE-Net attention weights $\alpha \in \mathbb{R}^{C}$ were obtained after applying the sigmoid activation $\sigma(\cdot)$.
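The SE attention of Eq. (3) can be sketched analogously, with the reduction ratio r = 16 noted above.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global average pooling
        self.fc = nn.Sequential(                     # excitation: FC -> ReLU -> FC -> sigmoid
            nn.Linear(channels, channels // r, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):                            # x: N x C x H x W
        n, c, _, _ = x.shape
        alpha = self.fc(self.pool(x).view(n, c))     # N x C attention weights in (0, 1)
        return x * alpha.view(n, c, 1, 1)
```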
As both modules are lightweight, they did not result in a substantial increase in the computational complexity of the model. At this stage, the feature size is 56 × 56 × 256. Finally, the output was upsampled and passed through separate 1 × 1 convolutional layers for image segmentation and for integration with the colorization network. The segmentation was performed using the Softmax activation function [55], which converts raw output values into a normalized probability distribution across classes.
2.4 Colorization Network
Three design principles are particularly important in the coloring model’s design: high-resolution image input, U-Net architecture with instance normalization (IN) [56], and optimal timing for incorporating image segmentation information.
If the same 224 × 224 × 3 input size is employed as in the context and spatial paths, essential details of smaller objects, such as brake lights, will be at risk of being lost due to lower resolution. To address this issue, an input image size of 896 × 896 was employed within the colorization network, ensuring that even small objects retain sufficient detail for accurate color application.
The U-Net architecture has been employed extensively across a range of domains. In CNNs, downsampling is typically accompanied by doubling the number of channels to compensate for the loss of information caused by the reduction in image resolution. However, this approach cannot fully preserve all feature details, often resulting in incomplete feature recovery during upsampling. Instance normalization, followed by concatenation with the upsampling layers, was adopted to address this issue. This strategy helped restore lost information and proved beneficial in high-contrast industrial scenes or when working with semantic masks containing uniform color regions.
In the colorization model, the input image was initially downsampled to a resolution of 112 × 112 while texture, edges, contrast, and related attributes were extracted at this stage. This information was inadequate for accurate colorization. To address this limitation, semantic features were incorporated, facilitating a more comprehensive reconstruction of colorization results during the upsampling process. The aforementioned three details are presented in Table III.
Table III.
Colorization network architecture.
Item | Output size | Operator | Stride | Filter size
DS1 | 896 × 896 | Conv2d, BN, ReLU | 1 | 3 × 3 × 3 × 32
DS2 | | Conv2d, BN, ReLU | 1 | 3 × 3 × 32 × 32
DS3 | 448 × 448 | Conv2d, BN, ReLU | 2 | 3 × 3 × 32 × 64
DS4 | | Conv2d, BN, ReLU | 1 | 3 × 3 × 64 × 64
DS5 | | Conv2d, BN, ReLU | 1 | 3 × 3 × 64 × 64
DS6 | 224 × 224 | Conv2d, BN, ReLU | 2 | 3 × 3 × 64 × 128
DS7 | | Conv2d, BN, ReLU | 1 | 3 × 3 × 128 × 128
DS8 | | Conv2d, BN, ReLU | 1 | 3 × 3 × 128 × 128
Concatenate | | Concatenate with segmentation information | |
DS9 | 112 × 112 | Conv2d, BN, ReLU | 2 | 3 × 3 × 256 × 256
DS10 | | Conv2d, BN, ReLU | 1 | 3 × 3 × 256 × 256
DS11 | | Conv2d, BN, ReLU | 1 | 3 × 3 × 256 × 256
DS12 | | Conv2d, BN, ReLU | 1 | 3 × 3 × 256 × 256
US1 | 224 × 224 | ConvT2d, BN, ReLU | 2 | 3 × 3 × 256 × 128
Concatenate | | Concatenate with (DS8 + IN) | |
US2 | | ConvT2d, BN, ReLU | 1 | 3 × 3 × 256 × 128
US3 | | ConvT2d, BN, ReLU | 1 | 3 × 3 × 128 × 128
US4 | 448 × 448 | ConvT2d, BN, ReLU | 2 | 3 × 3 × 128 × 64
Concatenate | | Concatenate with (DS5 + IN) | |
US5 | | ConvT2d, BN, ReLU | 1 | 3 × 3 × 128 × 64
US6 | | ConvT2d, BN, ReLU | 1 | 3 × 3 × 64 × 64
US7 | 896 × 896 | ConvT2d, BN, ReLU | 2 | 3 × 3 × 64 × 32
Concatenate | | Concatenate with (DS2 + IN) | |
US8 | | ConvT2d, BN, ReLU | 1 | 3 × 3 × 64 × 32
US9 | | ConvT2d, BN, ReLU | 1 | 3 × 3 × 32 × 32
US10 | | Conv2d, BN, sigmoid | 1 | 1 × 1 × 32 × 2
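One plausible realization of the "(DSk + IN) then concatenate" pattern in Table III is sketched below in PyTorch; the module boundaries are assumptions, while the channel counts follow the US1/DS8 rows of the table.

```python
import torch
import torch.nn as nn

class INSkipConcat(nn.Module):
    """Apply instance normalization to an encoder feature map, then
    concatenate it with the corresponding decoder feature map."""
    def __init__(self, skip_channels):
        super().__init__()
        self.inorm = nn.InstanceNorm2d(skip_channels, affine=True)

    def forward(self, decoder_feat, encoder_feat):
        return torch.cat([decoder_feat, self.inorm(encoder_feat)], dim=1)

# Example for the US1 stage of Table III: the 224 x 224 x 128 decoder output is
# concatenated with the instance-normalized DS8 map (224 x 224 x 128),
# giving the 256 input channels expected by US2.
skip8 = INSkipConcat(128)
dec = torch.randn(1, 128, 224, 224)   # output of US1
enc = torch.randn(1, 128, 224, 224)   # DS8 feature map
fused = skip8(dec, enc)               # 1 x 256 x 224 x 224
```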
The colorization network was trained using the Huber loss function, with the hyperparameter δ set at 0.5. However, incorporating semantic information from the images, such as the presence of specific objects like a bus or a building, is necessary for optimal coloring results. The colorization and semantic segmentation networks were therefore trained concurrently. The training process utilized 11 categories of data, with segmentation labels provided for each category. These labels enabled the division of an image into multiple local regions, which was especially advantageous for accurate local image coloring. The 11 categories included building, bus, car, road, sidewalk, sky, traffic sign, tree, truck, vegetation, and wall.
The final colorized image was generated by combining semantic segmentation and training with the Huber loss function. However, the extent of colorization in an image was contingent upon psychophysical factors. To enhance the perceptual quality of the results, we incorporated perceptual loss into the training process. Unlike the traditional mean square error (MSE), perceptual loss is more aligned with subjective visual perception. In the proposed algorithm, the colorized CIELAB image was first converted to the sRGB color space using the procedure described earlier. Then, a perceptual loss was applied for further refinement, ensuring that the final output aligned closely with human visual expectations. The Huber loss LHuber, cross-entropy loss LCE, and perceptual loss Lperceptual can be calculated by Eqs. (4)–(6). The total loss calculation is shown in Eq. (7).
(4)
$L_{\mathrm{Huber}}(x, y) = \begin{cases} \dfrac{1}{2}(x - y)^2, & \text{if } |x - y| < \delta \\[4pt] \delta\,|x - y| - \dfrac{1}{2}\delta^2, & \text{otherwise}, \end{cases}$
where x is the colorized image while y represents the ground truth; δ is an adjustable parameter, which is set at 0.5 in this study.
(5)
$L_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c),$
where C is the number of categories, yc denotes the one-hot encoded indicator for the true class, and ŷc represents the model’s predicted probability for class c.
(6)
$L_{\mathrm{perceptual}}(x, y) = \frac{1}{C_j H_j W_j} \left\| \phi_j(x) - \phi_j(y) \right\|_2^2,$
where $\phi_j(x)$ and $\phi_j(y)$ denote the feature maps of the predicted image and the ground-truth image extracted from the $j$th layer of a VGG16 pretrained network. Parameters $C_j$, $H_j$, and $W_j$ represent the number of channels, height, and width of the feature maps at the $j$th layer, respectively. The 16th layer was used in this study so that the coloring results over a wide area would be more consistent with psychophysical perception.
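A sketch of the VGG16 feature-space loss of Eq. (6) using torchvision; the layer index, the batching, and the omission of ImageNet input normalization are simplifying assumptions rather than the authors' exact setup.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class PerceptualLoss(nn.Module):
    def __init__(self, layer=16):
        super().__init__()
        features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features
        # VGG16 is used only as a fixed feature extractor up to the chosen layer
        self.extractor = nn.Sequential(*list(features.children())[: layer + 1]).eval()
        for p in self.extractor.parameters():
            p.requires_grad_(False)

    def forward(self, pred_rgb, target_rgb):         # sRGB tensors, N x 3 x H x W in [0, 1]
        fp = self.extractor(pred_rgb)
        ft = self.extractor(target_rgb)
        c, h, w = fp.shape[1:]
        # Squared L2 distance between feature maps, normalized by C_j * H_j * W_j
        # and averaged over the batch
        return ((fp - ft) ** 2).sum(dim=(1, 2, 3)).mean() / (c * h * w)
```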
(7)
$L_{\mathrm{total}} = \lambda_1 L_{\mathrm{Huber}}(x, y) + \lambda_2 L_{\mathrm{CE}} + \lambda_3 L_{\mathrm{perceptual}}(x, y).$
The loss weight ratio of λ1, λ2, and λ3 was 10,000:8:15.
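Combining Eqs. (4), (5), and (7) with the stated weights, a minimal sketch follows; it reuses the PerceptualLoss sketch above, and all tensor names are illustrative.

```python
import torch.nn as nn

huber = nn.HuberLoss(delta=0.5)          # Eq. (4) with delta = 0.5
ce = nn.CrossEntropyLoss()               # Eq. (5) over the 11 segmentation classes
perceptual = PerceptualLoss(layer=16)    # Eq. (6), defined in the sketch above

l1, l2, l3 = 10000.0, 8.0, 15.0          # loss weight ratio from Eq. (7)

def total_loss(pred_ab, gt_ab, seg_logits, seg_labels, pred_rgb, gt_rgb):
    # pred_ab/gt_ab: N x 2 x H x W chroma planes; seg_logits: N x 11 x H x W;
    # seg_labels: N x H x W class indices; pred_rgb/gt_rgb: N x 3 x H x W sRGB
    return (l1 * huber(pred_ab, gt_ab)
            + l2 * ce(seg_logits, seg_labels)
            + l3 * perceptual(pred_rgb, gt_rgb))
```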
The analysis of previous image colorization results revealed that the chroma of generated images is often lower than that of the ground truth. To address this, the a and b channels were each scaled by a factor of 1.3, effectively increasing the chroma (C) of the output images. This factor was determined empirically, as larger values may result in unnatural colorization.
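Because chroma is $C^* = \sqrt{a^2 + b^2}$, boosting it by 1.3 amounts to scaling the predicted a and b planes jointly, which leaves the hue angle unchanged; a one-function sketch:

```python
import numpy as np

def boost_chroma(ab, factor=1.3):
    # Scaling the a and b planes together multiplies chroma C* = sqrt(a^2 + b^2)
    # by the same factor while leaving the hue angle h = atan2(b, a) unchanged.
    return np.asarray(ab, dtype=np.float32) * factor
```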
The network was trained with the AdaDelta [57] optimizer, using a learning rate of 0.004 and a batch size of 4. To prevent overfitting, early stopping was employed: training was halted once the test loss had not decreased for 16 consecutive evaluations, and the model with the lowest loss was retained to produce the colorization results.
2.5 GTA5 Dataset
The GTA5 [43] dataset consists of 24,966 synthetic images with pixel-level semantic annotations, rendered using the open-world video game Grand Theft Auto 5. These images depict street scenes in a virtual American city from the perspective of a vehicle. The dataset encompasses 19 semantic categories. Following a rigorous evaluation of the dataset, 11 categories were selected for further analysis in this study. The selection was based on two criteria: the frequency of segmented objects and the necessity of colorization. The following categories are included: building, bus, car, road, sidewalk, sky, traffic sign, tree, truck, vegetation, and wall. The remaining unused categories are bicycle, person, fence, motorcycle, pole, rider, traffic light, and train. The image resolution of the dataset is 1914 × 1052. Prior to being fed into the model, the images were resized to match the input dimensions used in this study, and the colorized outputs were subsequently compared with the original images at the same resolution.
The GTA5 dataset offers three notable advantages: (1) a diverse and extensive collection of artifacts, (2) highly complex and varied scenes, and (3) comprehensive label information for complete images. These characteristics make it an optimal choice for training models that require both detailed object recognition and contextual understanding. The training dataset consisted of 20,466 images while the testing dataset comprised 4500 images.
To enhance the model’s robustness, a random horizontal flip of the input images was applied with 50% probability. This data augmentation technique helped to reduce overfitting and improve the model’s generalization across diverse scenarios.
To validate the performance of the proposed neural network architecture, we conducted a series of experiments under various configurations. Six configurations were evaluated in this study. Method 1 employed the baseline architecture. Method 2 incorporated a pretrained model within the context paths. Method 3 modified the input resolution of the colorization network to 896 × 896 × 3. Method 4 added a perceptual loss to the training objective. Method 5 applied instance normalization before the concatenation operation in the colorization network. Method 6 increased the chroma component in the LAB color space by a factor of 1.3.
It is widely recognized that deep learning models generally require large-scale datasets to achieve optimal performance. Nevertheless, demonstrating robust performance under small-data conditions is also meaningful, as it highlights the model’s ability to generalize in resource-constrained scenarios. Therefore, an additional experiment was conducted under a reduced protocol with 2000 training images and 500 validation images, following the settings adopted by Zabari and Iizuka. Under this protocol, the Zabari and Iizuka models and the proposed method (Method 6) were evaluated under identical conditions. The corresponding comparative results are presented and analyzed in Section 4.
The implementation was developed in Python 3.7.16 with PyTorch 1.12.1 (CUDA 11.3), cuDNN 8.3.2, and OpenCV 3.4.17. Training and inference were conducted on a Windows 10 system equipped with one NVIDIA GeForce RTX 2080 Ti (11 GB), an Intel Core i7-9700F CPU, and 32 GB RAM.
3. Results
Figure 2 illustrates the colorization results on the GTA5 validation set, highlighting the model’s ability to colorize small objects with diverse color distributions accurately. Leveraging image segmentation, the model achieved realistic colorization for categories such as pedestrians and vehicles. The entire process was fully automated, requiring no human intervention.
Figure 2.
Colorization results on the GTA5 validation set using the proposed model (Method 6).
Previous research on the perception of color differences in large printed images [58] demonstrated that statistical measures of extreme color deviations correlate more strongly with perceived image color differences than mean color differences do. The results of image colorization using the proposed methods, in terms of the mean and 95th percentile of CAM16-UCS color differences (denoted as ΔECAM16-UCS) [59] between the colorized images and the corresponding ground-truth images, are presented in Table IV. The global color difference refers to the mean and the 95th percentile of ΔECAM16-UCS computed across all pixels in the image. In contrast, the traffic light color difference refers explicitly to the mean and the 95th percentile of ΔECAM16-UCS computed only over pixels corresponding to traffic signal lights. In Methods 1 and 2, a ResNet50 pretrained model was employed to enhance model diversity and to ensure optimal performance even with limited training data. The results are illustrated in Figure 3(a). In Methods 3 and 4, the ability to colorize small objects, such as red traffic signs with traffic horns, was enhanced by increasing the resolution and reducing the perceptual loss as illustrated in Fig. 3(b). Figure 3(c) demonstrates that the ability to colorize artificial objects can be enhanced by instance normalization and by increasing the chroma of the output images by a factor of 1.3 in Methods 4–6.
Figure 3.
Colorization results of Methods 1–6.
Table IV.
Mean and 95th percentile color differences between the colorized images generated by Methods 1–6 and the ground-truth images (averaged over 4500 test images).
Methods | Configuration (Section 2.5) | Global color difference (test data) | Traffic light color difference (test data)
Method 1 | Basic architecture | 3.8 / 10.3 | 20.9 / 27.6
Method 2 | Pretrained model | 3.5 / 8.4 | 21.2 / 27.7
Method 3 | High-resolution input | 3.0 / 8.7 | 20.6 / 25.3
Method 4 | Perceptual loss | 2.8 / 8.0 | 0.66 / 5.4
Method 5 | IN | 2.7 / 7.6 | 0.61 / 5.8
Method 6 | Chroma (C) scaled by 1.3 | 2.7 / 7.3 | 0.52 / 5.0
(Metrics: mean ΔECAM16-UCS/95th percentile of ΔECAM16-UCS)
4. Discussion
To further contextualize these findings, it is necessary to compare the method with representative prior studies. Zabari proposed a text-guided latent diffusion framework for image colorization, which integrates Cold Diffusion with a CLIP (Contrastive Language–Image Pretraining; Radford et al., 2021) [60]-based ranking mechanism to provide flexible and diverse results, albeit at a relatively high computational cost. In contrast, Iizuka designed a CNN-based architecture that performs colorization by fusing global scene priors with local features through image recognition. Their model benefits from implicit semantic guidance via scene classification, enabling natural colorization across a wide variety of images. As shown in Figure 4, the proposed model (Method 6) is compared with Zabari's diffusion model [29] and Iizuka's model [9] for image colorization. The model's performance was evaluated using 2000 training samples and 500 validation samples. Although the original implementations of these models employed a greater number of training samples, this strategy is not consistently practical because the data collection process is often characterized by its labor-intensive and time-consuming nature, particularly in real-world applications. Notably, the ability to achieve competitive results with a reduced dataset underscores the efficiency of the proposed method and indicates its robust generalization capabilities while substantially reducing training resource demands.
Figure 4.
A comparison of Zabari’s model, Iizuka’s model, and the proposed model (Method 6) in image colorization.
The results indicate that elements such as trees and artifacts are not effectively colorized in the Zabari and Iizuka models. Although the sky exhibits some blue tones, the colorized regions remain imprecise. The proposed model has been demonstrated to exhibit superior performance in identifying both natural and artificial objects. The results of the ΔECAM16-UCS color difference comparison for various semantic categories are presented in Table V.
Table V.
Comparison of mean color differences among various semantic categories.
Category | Iizuka [9] | Zabari [29] | Ours
Building | 9.9 | 8.6 | 4.5
Bus | 8.9 | 7.8 | 0.26
Car | 9.7 | 7.7 | 4.8
Road | 8.9 | 5.4 | 3.1
Sidewalk | 9.8 | 7.6 | 3.9
Sky | 8.8 | 8.2 | 3.2
Traffic sign | 11.5 | 10.4 | 1.4
Tree | 10.0 | 8.7 | 4.6
Truck | 7.3 | 6.9 | 3.4
Vegetation | 8.8 | 7.2 | 3.6
Wall | 9.4 | 7.9 | 4.1
(Unit: ΔECAM16-UCS)
The results indicate that the proposed model achieves superior color restoration for categories such as sky and trees. This improvement can be attributed to the relatively consistent spatial positions, features, and color distributions of elements like the sky, trees, and vegetation.
In terms of artificial objects, the model employed in this study utilizes a multipath neural network, enabling precise identification of the necessity for distinct colors to denote varied semantic objects.
In the context of image segmentation, the proposed model’s categorization of vehicles, including cars, trucks, and buses, enables the discernment of color variations across different categories. This capability is supported by the neural network’s learning process, which integrates diverse annotated data. Analysis revealed that the color differences between buses and trucks were within acceptable limits. However, the outcome for cars was less optimal, likely due to the inability to distinguish between different cars solely through image segmentation when multiple cars were present in a single image. However, the Zabari and Iizuka models demonstrated even poorer performance, with the color of the cars often being different from the ground truth and being unable to distinguish the saturated red of the brake lights.
In the case of roads, walls, and sidewalks, these features in the image were very close to each other. Therefore, image segmentation is necessary to determine the location of the road and the presence of people. The proposed model outperformed the competitor’s model, as it improved the visibility of roads, walls, and sidewalks.
In the coloring of small objects, such as traffic lights and brake lights, the proposed model significantly outperformed the other models owing to the high-resolution U-Net architecture in the colorization path, which mitigated the information loss of image encoding by concatenating the encoder and decoder features.
Coloring buildings presents a unique challenge compared to other artifacts due to the wide range of styles and colors found in this category. To address this challenge, the proposed model segments images into various categories, thereby enhancing learning. This approach aimed to increase the diversity of building colors while minimizing the impact on other categories. The results demonstrated the efficacy of this method, surpassing the performance of the competing models.
The ΔECAM16-UCS calculation method in this study is as follows. First, convert both the colorized image and the ground-truth image from the sRGB space to the CAM16-UCS J'a'b' space, using a D65 reference white and an adapting field luminance of 20 cd/m² under dim surround conditions. Then, compute the Euclidean distance between them in the J'a'b' space. Statistics are computed only for pixels belonging to the corresponding category in the segmentation mask (i.e., ΔECAM16-UCS is calculated for all pixels of each object), and no low-pass filtering is applied, primarily to preserve image detail. However, such pixel-wise statistics cannot account for perceptual phenomena such as visual masking or color assimilation. To further validate whether the model aligns with human subjective perception and to address these limitations, psychophysical experiments can serve as a valuable extension of this study.
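A hedged sketch of the per-pixel ΔE computation using the open-source colour-science package; colour.convert and its 'Output-Referred RGB'/'CAM16UCS' graph nodes are that library's API, but the default viewing conditions and the scale-1 output convention assumed here may differ slightly from the paper's D65, 20 cd/m², dim-surround setup.

```python
import numpy as np
import colour  # colour-science package

def delta_e_cam16ucs(rgb_pred, rgb_gt, mask=None):
    """Per-pixel Euclidean distance in CAM16-UCS J'a'b' between two sRGB images
    (float arrays in [0, 1]), optionally restricted to one category's boolean mask."""
    jab_pred = colour.convert(rgb_pred, 'Output-Referred RGB', 'CAM16UCS')
    jab_gt = colour.convert(rgb_gt, 'Output-Referred RGB', 'CAM16UCS')
    # colour.convert works in domain-range scale 1, so J'a'b' come out roughly in
    # [0, 1]; multiplying by 100 reports Delta E on the conventional 0-100 scale.
    de = 100.0 * np.linalg.norm(jab_pred - jab_gt, axis=-1)
    if mask is not None:
        de = de[mask]
    return de.mean(), np.percentile(de, 95)   # mean and 95th percentile statistics
```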
The proposed method has two primary limitations. First, although the coloring accuracy of small objects can be improved by using a high-resolution U-Net structure, certain inaccuracies still persist. For instance, as shown in Figure 5(a), sign lights that are originally yellow may be incorrectly colored as green in the output. Second, referring to Figs. 5(b) and 5(c), in more complex scenes, the model sometimes applies unnatural colors, such as gray or red, which negatively impacts the overall realism of the image.
Figure 5.
Examples of failed colorization using the proposed model (Method 6).
Although the proposed model performs well on the synthetic image dataset GTA5, it is still necessary to further validate its performance on real-world images. The BDD100K [61] dataset was released by the Berkeley Artificial Intelligence Research laboratory in collaboration with the Berkeley DeepDrive Industrial Consortium. It is one of the largest and most diverse publicly available datasets of driving videos. The dataset consists of 720p-resolution driving videos collected across multiple regions of the United States, including metropolitan areas such as New York City and the San Francisco Bay Area. The model (Method 6) proposed in this study achieves satisfactory colorization performance on this real-world scene dataset, as shown in Figure 6. However, the color differences between the colorized outputs and the ground-truth images are still relatively high. For memory colors (e.g., trees, roads, sky), the colorization results are satisfactory, likely because these categories exhibit relatively consistent canonical colors across scenes. An additional observation is that sky regions exhibit lower chroma, likely due to a dataset-induced bias in GTA5, where skies tend to appear less vividly blue. In the coloring of small objects, the red color of brake lights is consistently present, indicating that the high-resolution input of the proposed model effectively enhances the coloring results, particularly ensuring accuracy in coloring small objects on the road. Although the coloring of nonmemory colors is less accurate, the overall image still demonstrates noticeable color diversity. Additionally, based on the colorization performance of the proposed model, transfer learning can be applied to train the model on different application datasets, thereby improving colorization performance and reducing color differences. Furthermore, incorporating panoptic segmentation may also enhance the coloring of vehicles.
Figure 6.
Colorization results for the BDD100K dataset using the proposed model (Method 6).
5. Conclusions
In this study, a novel automatic image colorization method was proposed, which integrates a multipath neural network with semantic segmentation to enhance the accuracy of color prediction. The experimental results on the GTA5 dataset demonstrated that the proposed method significantly improved color fidelity and object edge preservation compared to existing CNN and diffusion models. Using a high-resolution training dataset, the proposed method achieved small global color differences on 4500 test images relative to the ground truth, with mean ΔECAM16-UCS = 2.7 and 95th percentile ΔECAM16-UCS = 7.3. Moreover, even when trained on a small dataset, the method consistently outperformed the CNN and diffusion models across all categories. However, in highly complex scenes, this method did not produce ideal coloring results.
Future work will focus on improving image segmentation through two main directions. One direction is the application of panoptic segmentation to achieve more comprehensive scene understanding. Another direction is the incorporation of semantic guidance for objects with characteristic colors, such as taxis or airplanes of specific airlines. These strategies are expected to reduce color misclassification and further enhance color accuracy and diversity in complex scenes. In addition, existing color difference formulas are considered insufficient to fully capture human perception in complex images. To address this limitation, a preliminary experimental framework has been designed to recruit participants for subjective evaluations. Specifically, participants will compare the colorized results generated by the deep learning model with the corresponding ground-truth color images and provide naturalness ratings. This evaluation aims to assess the perceptual performance of colorization models from the perspective of human visual perception. This study can help us gain a deeper understanding of the key aspects of perceived color differences in image colorization.
Availability of Materials
Code and model availability: The source code and trained model can be provided upon reasonable request for research use. Requests can be directed to D10822501@mail.ntust.edu.tw or plsun@mail.ntust.edu.tw. Redistribution and commercial use are prohibited.
References
1. Chen T., Wang Y., Schillings V., Meinel C., "Grayscale image matting and colorization," Proc. Asian Conf. Comput. Vis. (Springer, Jeju, Korea, 2004), pp. 1164–1169.
2. Yatziv L., Sapiro G., "Fast image and video colorization using chrominance blending," IEEE Trans. Image Process. 15, 1120 (2006). DOI: 10.1109/TIP.2005.864231.
3. Chia A. Y. S., Zhuo S., Gupta R. K., Tai Y. W., Cho S. Y., Tan P., Lin S., "Semantic colorization with internet images," ACM Trans. Graph. 30, 1 (2011). DOI: 10.1145/2070781.2024190.
4. Carlucci F. M., Russo P., Caputo B., "(DE)2CO: deep depth colorization," IEEE Robot. Autom. Lett. 3, 2386 (2018). DOI: 10.1109/LRA.2018.2812225.
5. Cheng Z., Yang Q., Sheng B., "Deep colorization," Proc. IEEE Int'l. Conf. Comput. Vis. (IEEE, Piscataway, NJ, 2015), pp. 415–423. DOI: 10.1109/ICCV.2015.55.
6. Hu Z., Shkurat O., Kasner M., "Grayscale image colorization method based on U-net network," Int. J. Image, Graph. Signal Process. 16, 70 (2024).
7. Zhang R., Isola P., Efros A. A., "Colorful image colorization," Proc. Eur. Conf. Comput. Vis. (Springer, Amsterdam, Netherlands, 2016), pp. 649–666. DOI: 10.1007/978-3-319-46487-9_40.
8. Larsson G., Maire M., Shakhnarovich G., "Learning representations for automatic colorization," Proc. Eur. Conf. Comput. Vis. (Springer, Amsterdam, Netherlands, 2016), pp. 577–593. DOI: 10.1007/978-3-319-46493-0_35.
9. Iizuka S., Simo-Serra E., Ishikawa H., "Let there be color! Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification," ACM Trans. Graph. 35, 1 (2016).
10. He M., Chen D., Liao J., Sander P. V., Yuan L., "Deep exemplar-based colorization," ACM Trans. Graph. 37, 1 (2018).
11. Sangkloy P., Lu J., Fang C., Yu F., Hays J., "Scribbler: controlling deep image synthesis with sketch and color," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (IEEE, Piscataway, NJ, 2017), pp. 5400–5409. DOI: 10.1109/CVPR.2017.723.
12. Zhang R., Zhu J. Y., Isola P., Geng X., Lin A. S., Yu T., Efros A. A., "Real-time user-guided image colorization with learned deep priors," ACM Trans. Graph. 36, 1 (2017).
13. Xiao Y., Zhou P., Zheng Y., Leung C. S., "Interactive deep colorization using simultaneous global and local inputs," Proc. IEEE Int'l. Conf. Acoust., Speech, Signal Process. (IEEE, Piscataway, NJ, 2019), pp. 1887–1891. DOI: 10.1109/ICASSP.2019.8683686.
14. Ci Y., Ma X., Wang Z., Li H., Luo Z., "User-guided deep anime line art colorization with conditional adversarial networks," Proc. ACM Int'l. Conf. Multimedia (ACM, Seoul, Korea, 2018), pp. 1536–1544. DOI: 10.1145/3240508.3240661.
15. Deshpande A., Lu J., Yeh M. C., Chong M. J., Forsyth D., "Learning diverse image colorization," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (IEEE, Piscataway, NJ, 2017), pp. 6837–6845. DOI: 10.1109/CVPR.2017.307.
16. Frans K., "Outline colorization through tandem adversarial networks," Preprint, arXiv:1704.08834 (2017).
17. Nazeri K., Ng E., Ebrahimi M., "Image colorization using generative adversarial networks," Proc. Int'l. Conf. Articulated Motion and Deformable Objects (Springer, Palma de Mallorca, Spain, 2018), pp. 85–94. DOI: 10.1007/978-3-319-94544-6_9.
18. Vitoria P., Raad L., Ballester C., "ChromaGAN: adversarial picture colorization with semantic class distribution," Proc. IEEE Winter Conf. Appl. Comput. Vis. (IEEE, Piscataway, NJ, 2020), pp. 2445–2454. DOI: 10.1109/WACV45572.2020.9093389.
19. Li B., Lu Y., Pang W., Xu H., "Image colorization using CycleGAN with semantic and spatial rationality," Multimed. Tools Appl. 82, 1 (2023). DOI: 10.1007/s11042-022-12047-3.
20. Guadarrama S., Dahl R., Bieber D., Norouzi M., Shlens J., Murphy K., "PixColor: pixel recursive colorization," Proc. Brit. Mach. Vis. Conf. (BMVA, London, UK, 2017), pp. 1–12.
21. Zhao J., Han J., Shao L., Snoek C. G., "Pixelated semantic colorization," Int. J. Comput. Vis. 128, 818 (2020). DOI: 10.1007/s11263-019-01271-4.
22. Zhao J., Liu L., Snoek C., Han J., Shao L., "Pixel-level semantics guided image colorization," Proc. Brit. Mach. Vis. Conf. (BMVA, Newcastle, UK, 2018), p. 156.
23. He M., Chen D., Liao J., Sander P. V., Yuan L., "Deep exemplar-based colorization," ACM Trans. Graph. 37, 47 (2018).
24. Su J. W., Chu H. K., Huang J. B., "Instance-aware image colorization," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (IEEE, Piscataway, NJ, 2020), pp. 7968–7977. DOI: 10.1109/CVPR42600.2020.00799.
25. Kumar M., Weissenborn D., Kalchbrenner N., "Colorization transformer," Proc. Int'l. Conf. Learn. Represent. (ICLR, Virtual, Vienna, Austria, 2021).
26. Weng S., Sun J., Li Y., Li S., Shi B., "CT2: colorization transformer via color tokens," Proc. Eur. Conf. Comput. Vis. (Springer, Tel Aviv, Israel, 2022), pp. 1–10. DOI: 10.1007/978-3-031-20071-7_1.
27. Shafiq H., Lee B., "Transforming color: a novel image colorization method," Electronics 13, 2511 (2024). DOI: 10.3390/electronics13132511.
28. Wang H., Chai X., Wang Y., Zhang Y., Xie R., Song L., "Multimodal semantic-aware automatic colorization with diffusion prior," Proc. IEEE Int'l. Conf. Multimedia Expo Workshops (IEEE, Piscataway, NJ, 2024), pp. 1–6.
29. Zabari N., Azulay A., Gorkor A., Halperin T., Fried O., "Diffusing colors: image colorization with text guided diffusion," Proc. SIGGRAPH Asia Conf. Papers (ACM, Sydney, Australia, 2023), pp. 1–11. DOI: 10.1145/3610548.3618180.
30. Su J. W., Chu H. K., Huang J. B., "Instance-aware image colorization," Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (IEEE, Piscataway, NJ, 2020), pp. 7968–7977. DOI: 10.1109/CVPR42600.2020.00799.
31. Anwar S., Tahir M., Li C., Mian A., Khan F. S., Muzaffar A. W., "Image colorization: a survey and dataset," Inf. Fusion 114, 102720 (2025). DOI: 10.1016/j.inffus.2024.102720.
32. Liang Z., Li Z., Zhou S., Li C., Loy C. C., "Control color: multimodal diffusion-based interactive image colorization," Int. J. Comput. Vis. (2025).
33. García R., Randall G., Raad L., "A short analysis of BigColor for image colorization," Image Process. Online 14, 144 (2024). DOI: 10.5201/ipol.2024.542.
34. Cordts M., Omran M., Ramos S., Rehfeld T., Enzweiler M., Benenson R., …, Schiele B., "The Cityscapes dataset for semantic urban scene understanding," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (IEEE, Piscataway, NJ, 2016), pp. 3213–3223. DOI: 10.1109/CVPR.2016.350.
35. Neuhold G., Ollmann T., Bulo S. R., Kontschieder P., "The Mapillary Vistas dataset for semantic understanding of street scenes," Proc. IEEE Int'l. Conf. Comput. Vis. (IEEE, Piscataway, NJ, 2017), pp. 4990–4999. DOI: 10.1109/ICCV.2017.534.
36. Bajbaa K., Usman M., Anwar S., Radwan I., Bais A., "Bird's-eye view to street-view: a survey," Preprint, arXiv:2405.08961 (2024).
37. Li Y., Yang S., Liu J., "Language-based image colorization: a benchmark and beyond," Preprint, arXiv:2503.14974 (2025).
38. Ma C., Shi Z., Lu Z., Xie S., Chao F., Sui Y., "A survey on image quality assessment: insights, analysis, and future outlook," Preprint, arXiv:2502.08540 (2025).
39. Xu M., "Image colorization based on transformer," Sci. Rep. 15, 21311 (2025). DOI: 10.1038/s41598-025-05485-0.
40. Xu Z., Geng C., "Color restoration of mural images based on a reversible neural network," Heritage Sci. 12, 351 (2024). DOI: 10.1186/s40494-024-01471-3.
41. Yu C., Gao C., Wang J., Yu G., Shen C., Sang N., "BiSeNet V2: bilateral network with guided aggregation for real-time semantic segmentation," Int. J. Comput. Vis. 129, 3051 (2021). DOI: 10.1007/s11263-021-01515-2.
42. Zhao H., Shi J., Qi X., Wang X., Jia J., "Pyramid scene parsing network," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (IEEE, Piscataway, NJ, 2017), pp. 2881–2890. DOI: 10.1109/CVPR.2017.660.
43. Lin G., Milan A., Shen C., Reid I., "RefineNet: multi-path refinement networks for high-resolution semantic segmentation," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (IEEE, Piscataway, NJ, 2017), pp. 1925–1934. DOI: 10.1109/CVPR.2017.549.
44. Chen L. C., Zhu Y., Papandreou G., Schroff F., Adam H., "Encoder-decoder with atrous separable convolution for semantic image segmentation," Proc. Eur. Conf. Comput. Vis. (Springer, Munich, Germany, 2018), pp. 801–818. DOI: 10.1007/978-3-030-01234-2_49.
45. Ronneberger O., Fischer P., Brox T., "U-net: convolutional networks for biomedical image segmentation," Proc. Med. Image Comput. Comput.-Assist. Interv. (MICCAI) (Springer, Munich, Germany, 2015), pp. 234–241. DOI: 10.1007/978-3-319-24574-4_28.
46. Wang Q., Wu B., Zhu P., Li P., Zuo W., Hu Q., "ECA-Net: efficient channel attention for deep convolutional neural networks," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (IEEE, Piscataway, NJ, 2020), pp. 11534–11542. DOI: 10.1109/CVPR42600.2020.01155.
47. Hu J., Shen L., Albanie S., Sun G., Wu E., "Squeeze-and-excitation networks," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (IEEE, Piscataway, NJ, 2018), pp. 7132–7141. DOI: 10.1109/TPAMI.2019.2913372.
48. LeCun Y., Bottou L., Bengio Y., Haffner P., "Gradient-based learning applied to document recognition," Proc. IEEE 86, 2278 (1998). DOI: 10.1109/5.726791.
49. Zeiler M. D., Taylor G. W., Fergus R., "Adaptive deconvolutional networks for mid and high-level feature learning," Proc. IEEE Int'l. Conf. Comput. Vis. (IEEE, Piscataway, NJ, 2011), pp. 2018–2025. DOI: 10.1109/ICCV.2011.6126474.
50. Ioffe S., Szegedy C., "Batch normalization: accelerating deep network training by reducing internal covariate shift," Proc. Int'l. Conf. Mach. Learn. (PMLR, Lille, France, 2015), pp. 448–456.
51. Glorot X., Bordes A., Bengio Y., "Deep sparse rectifier neural networks," Proc. Int'l. Conf. Artif. Intell. Stat. (JMLR, Fort Lauderdale, FL, 2011), pp. 770–778.
52. International Electrotechnical Commission, "IEC 61966-2-1: Multimedia systems and equipment – Colour measurement and management – Part 2-1: Colour management – Default RGB colour space – sRGB," Standard IEC 61966-2-1 (1999).
53. He K., Zhang X., Ren S., Sun J., "Deep residual learning for image recognition," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (IEEE, Piscataway, NJ, 2016), pp. 770–778. DOI: 10.1109/CVPR.2016.90.
54. Han J., Moraga C., "The influence of the sigmoid function parameters on the speed of backpropagation learning," Proc. Int'l. Workshop Artif. Neural Netw. (Springer, Sitges, Spain, 1995), pp. 195–201. DOI: 10.1007/3-540-59497-3_175.
55. Bridle J. S., "Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition," Neurocomputing 68, 227 (1990).
56. Ulyanov D., Vedaldi A., Lempitsky V., "Instance normalization: the missing ingredient for fast stylization," Preprint, arXiv:1607.08022 (2016).
57. Zeiler M. D., "ADADELTA: an adaptive learning rate method," Preprint, arXiv:1212.5701 (2012).
58. Uroz J., Luo M. R., Morovic J., "Perception of colour differences in large printed images," Colour Image Science: Exploiting Digital Media, edited by MacDonald L. and Luo M. R. (John Wiley & Sons, Chichester, UK, 2002), pp. 49–73.
59. Li C. J., Li Z. Q., Wang Z. F., Xu Y., Luo M. R., Cui G. H., Melgosa M., Brill H., Pointer M., "Comprehensive color solutions: CAM16, CAT16, and CAM16-UCS," Color Res. Appl. 42, 703 (2017). DOI: 10.1002/col.22131.
60. Radford A., Kim J. W., Hallacy C., Ramesh A., Goh G., Agarwal S., Sastry G., Askell A., Mishkin P., Clark J., Krueger G., Sutskever I., "Learning transferable visual models from natural language supervision," Proc. Int'l. Conf. Mach. Learn. (PMLR, Vienna, Austria, 2021), pp. 8748–8763.
61. Yu F., Chen H., Wang X., Xian W., Chen Y., Mu F., Koltun V., Darrell T., "BDD100K: a diverse driving video database with scalable annotation tooling," Preprint, arXiv:1805.04687 (2018).