IS&T | Library

Towards an image-computable visual text quality metric using deep neural networks

Abstract

The 3D extension of the High Efficiency Video Coding (3D-HEVC) standard has improved the coding efficiency for 3D videos significantly. However, this improvement has been achieved with a significant rise in computational complexity. Specifically, the encoding process for the depth map in the 3D-HEVC standard occupies 84% of the total encoding time. This extended time is primarily due to the need to traverse coding unit (CU) depth levels in depth map encoding to determine the most suitable CU size. Acknowledging the evident texture distribution patterns within a depth map and the strong correlation between encoding size selection and the texture complexity of the current encoding block, an adaptive depth early termination convolutional neural network, named ADET-CNN, is designed for the depth map in this paper. It takes an original 64 × 64 coding tree unit (CTU) as the input and provides segmentation probabilities for various CU sizes within the CTU, which eliminates the need for exhaustive calculation and comparison to determine the optimal CU size, thereby enabling faster intra-coding for the depth map. Experimental results indicate that the proposed method achieves a time saving of 58% in depth map encoding while maintaining the quality of synthetic views.

Digital Library: JIST

Published Online: March 2025

Article

text quality
quality metric
deep learning
convolutional neural network

Ling-Qi Zhang, Minjung Kim, James Hillis, Trisha Lian

DOI

10.2352/EI.2023.35.8.IQSP-310

Volume 35

Issue 8

Limitations of CNNs for Approximating the Ideal Observer Despite Quantity of Training Data or Depth of Network

Abstract

Image quality metrics have become invaluable tools for image processing and display system development. These metrics are typically developed for and tested on images and videos of natural content. Text, on the other hand, has unique features and supports a distinct visual function: reading. It is therefore not clear if these image quality metrics are efficient or optimal as measures of text quality. Here, we developed a domain-specific image quality metric for text and compared its performance against quality metrics developed for natural images. To develop our metric, we first trained a deep neural network to perform text classification on a data set of distorted letter images. We then computed the responses of internal layers of the network to uncorrupted and corrupted images of text, respectively. We used the cosine dissimilarity between the responses as a measure of text quality. Preliminary results indicate that both our model and more established quality metrics (e.g., SSIM) are able to predict general trends in participants’ text quality ratings. In some cases, our model is able to outperform SSIM. We further developed our model to predict response data in a two-alternative forced choice experiment, on which only our model achieved very high accuracy.
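
As a rough illustration of the cosine dissimilarity measure mentioned in this abstract, the sketch below compares two feature vectors. It is not the authors' implementation: the function name `cosine_dissimilarity` and the example vectors are hypothetical, and the definition 1 - cos(θ) is an assumed convention, since the abstract does not spell one out.

```python
import math


def cosine_dissimilarity(a, b):
    """Return 1 - cos(theta) for two equal-length feature vectors.

    Assumed convention: 0 means the vectors point in the same direction,
    1 means they are orthogonal.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine dissimilarity is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical internal-layer responses to a clean and a distorted letter image.
reference_response = [0.2, 1.5, 0.0, 0.7]
distorted_response = [0.1, 1.1, 0.3, 0.9]

# A larger score would be read as lower predicted text quality.
score = cosine_dissimilarity(reference_response, distorted_response)
print(f"cosine dissimilarity: {score:.4f}")
```

In practice the vectors would be flattened activations from a chosen layer of the trained network; which layer to use is a design choice the abstract leaves open.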

Digital Library: EI

Published Online: January 2023

JIST-first

computational imaging
ideal observer
convolutional neural network

Khalid Omer, Luca Caucci, Meredith Kupinski

Pages 60408-1 - 60408-11, November 2020, © Society for Imaging Science and Technology 2021

DOI

10.2352/J.ImagingSci.Technol.2020.64.6.060408

Volume 33

Issue 15

The performance of a convolutional neural network (CNN) on an image texture detection task as a function of linear image processing and the number of training images is investigated. Performance is quantified by the area under the receiver operating characteristic (ROC) curve (AUC). The Ideal Observer (IO) maximizes AUC but depends on high-dimensional image likelihoods. In many cases, the CNN performance can approximate the IO performance. This work demonstrates counterexamples where a full-rank linear transform degrades the CNN performance below the IO in the limit of large quantities of training data and network layers. A subsequent linear transform changes the images’ correlation structure, improves the AUC, and again demonstrates the CNN dependence on linear processing. Compression strictly decreases or maintains the IO detection performance, while compression can increase the CNN performance, especially for small quantities of training data. Results indicate an optimal compression ratio for the CNN based on task difficulty, compression method, and number of training images. © 2020 Society for Imaging Science and Technology.
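
For readers unfamiliar with the AUC figure of merit used in this abstract, here is a minimal sketch of an empirical AUC estimate. It relies on the standard identity that the AUC equals the probability that a randomly chosen signal-present score exceeds a randomly chosen signal-absent score, with ties counted as one half. The function name `auc_from_scores` and the example scores are hypothetical and are not taken from the paper.

```python
def auc_from_scores(signal_present, signal_absent):
    """Empirical AUC from two lists of observer scores.

    Counts the fraction of (present, absent) pairs in which the
    signal-present score is larger; ties contribute 0.5.
    """
    if not signal_present or not signal_absent:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in signal_present:
        for n in signal_absent:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(signal_present) * len(signal_absent))


# Hypothetical detector outputs for texture-present and texture-absent images.
present_scores = [0.8, 0.4]
absent_scores = [0.6, 0.2]

# Three of the four pairs are ordered correctly.
print(auc_from_scores(present_scores, absent_scores))  # prints 0.75
```

The pairwise loop is quadratic in the number of images; for large test sets a rank-based computation gives the same value more efficiently.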

Digital Library: EI

Published Online: November 2020

Real-world fence removal from a single-image via deep neural network

de-fencing
deep learning
image restoration
object removal
convolutional neural network

Takuro Matsui, Takuro Yamaguchi, Masaaki Ikehara

Pages 26-1 - 26-7, January 2020, © Society for Imaging Science and Technology 2020

DOI

10.2352/ISSN.2470-1173.2020.10.IPAS-026

Volume 32

Issue 10

In public spaces such as zoos and sports facilities, the presence of a fence often annoys tourists and professional photographers. There is a demand for a post-processing tool to produce a non-occluded view from an image or video. This “de-fencing” task is divided into two stages: one is to detect fence regions and the other is to fill in the missing parts. For a decade or more, various methods have been proposed for video-based de-fencing. However, only a few single-image-based methods have been proposed. In this paper, we mainly focus on single-image fence removal. Conventional approaches suffer from inaccurate and non-robust fence detection and inpainting because less content information is available. To solve these problems, we combine novel methods based on a deep convolutional neural network (CNN) with classical domain knowledge in image processing. The training process requires both fence images and corresponding non-fence ground truth images. Therefore, we synthesize natural fence images from real images. Moreover, spatial filtering (e.g., a Laplacian filter and a Gaussian filter) improves the performance of the CNN for detection and inpainting. Our proposed method can automatically detect a fence and generate a clean image without any user input. Experimental results demonstrate that our method is effective for a broad range of fence images.

Digital Library: EI

Published Online: January 2020

A Local-Global Aggregate Network for Facial Landmark Localization

facial landmark localization
face alignment
convolutional neural network

Ruiyi Mao, Qian Lin, Jan P. Allebach

Pages 185-1 - 185-6, January 2020, © Society for Imaging Science and Technology 2020

DOI

10.2352/ISSN.2470-1173.2020.8.IMAWM-185

Volume 32

Issue 8

Deep Image Demosaicing for Submicron Image Sensors

Facial landmark localization plays a critical role in many face analysis tasks. In this paper, we present a novel local-global aggregate network (LGA-Net) for robust facial landmark localization of faces in the wild. The network consists of two convolutional neural network levels which aggregate local and global information for better prediction accuracy and robustness. Experimental results show our method overcomes typical problems of cascaded networks and outperforms state-of-the-art methods on the 300-W [1] benchmark.

Digital Library: EI

Published Online: January 2020

JIST-first

demosaicing
image processing
image signal processor (ISP)
deep learning
convolutional neural network
Tetracell image sensor
image restoration
Quad Bayer Color Filter Array

Irina Kim, Seongwook Song, Soonkeun Chang, Sukhwan Lim, Kai Guo

DOI

10.2352/J.ImagingSci.Technol.2019.63.6.060410

Volume 32

Issue 7

The latest trend in image sensor technology, which allows submicron pixel sizes for high-end mobile devices, comes with very high image resolutions and an irregularly sampled Quad Bayer color filter array (CFA). Sustaining image quality becomes a challenge for the image signal processor (ISP), particularly for demosaicing. Inspired by the success of the deep learning approach to standard Bayer demosaicing, we aim to investigate how the artifact-prone Quad Bayer array can benefit from it. We found that deeper networks are capable of improving image quality and reducing artifacts; however, deeper networks can hardly be deployed on mobile devices given very high image resolutions: 24MP, 36MP, 48MP. In this article, we propose an efficient end-to-end solution to bridge this gap: a duplex pyramid network (DPN). Deep hierarchical structure, residual learning, and linear feature map depth growth allow a very large receptive field, yielding better detail restoration and artifact reduction, while staying computationally efficient. Experiments show that the proposed network outperforms the state of the art for standard and Quad Bayer demosaicing. For the challenging Quad Bayer CFA, the proposed method reduces visual artifacts better than state-of-the-art deep networks, including artifacts existing in conventional commercial solutions. While superior in image quality, it is 2–25 times faster than state-of-the-art deep neural networks and therefore feasible for deployment on mobile devices, paving the way for a new era of on-device deep ISPs.

Digital Library: EI

Published Online: November 2019

Construction of facial emotion database through subjective experiments and its application to deep learning-based facial image processing

facial emotion image
big data
good data
convolutional neural network
generative adversarial network
accurate emotion tag

Tomoyuki Takanashi, Keita Hirai, Takahiko Horiuchi

DOI

10.2352/ISSN.2470-1173.2019.11.IPAS-267

Volume 31

Issue 11

With the development of interactive robots and machines, studies on understanding and reproducing facial emotions by computer have become important research areas. To achieve this goal, several deep learning-based facial image analysis and synthesis techniques have recently been proposed. However, it is difficult to construct a facial image dataset with accurate emotion tags (annotations, metadata), because such emotion tags depend significantly on human perception and cognition. In this study, we constructed a facial image dataset with accurate emotion tags through subjective experiments. First, based on image retrieval using emotion terms, we collected more than 1,600,000 facial images from SNS. Next, using face detection, we obtained approximately 380,000 facial region images as “big data.” Then, through subjective experiments, we manually checked the facial expressions and the corresponding emotion tags of the facial regions. Finally, we obtained approximately 5,500 facial images with accurate emotion tags as “good data.” To validate our facial image dataset in deep learning-based facial image analysis and synthesis, we applied it to CNN-based facial emotion recognition and GAN-based facial emotion reconstruction. Through these experiments, we confirmed the feasibility of our facial image dataset for deep learning-based emotion recognition and reconstruction.

Digital Library: EI

Published Online: January 2019

Pixelwise JPEG compression detection and quality factor estimation based on convolutional neural network

JPEG quality factor estimation
JPEG compression detection
convolutional neural network
image forensic analysis

Kazutaka Uchida, Masayuki Tanaka, Masatoshi Okutomi

DOI

10.2352/ISSN.2470-1173.2019.11.IPAS-276

Volume 31

Issue 11