A relatively recent thrust in IQA research has focused on estimating the quality of a distorted image without access to the original (reference) image. Algorithms for this so-called no-reference IQA (NR IQA) have made great strides over the last several years, with some NR algorithms rivaling full-reference (FR) algorithms in terms of prediction accuracy. However, a large gap remains in terms of runtime performance; NR algorithms are significantly slower than FR algorithms, owing largely to their reliance on natural-scene statistics and other ensemble-based computations. To address this issue, this paper presents a GPGPU implementation, using NVIDIA's CUDA platform, of the popular Blind Image Integrity Notator using DCT Statistics (BLIINDS-II) algorithm [8], a state-of-the-art NR IQA algorithm. The image is transferred to the GPU, where both the DCT and the statistical modeling are computed, with these operations executed in parallel across all 5×5 pixel windows. We evaluated the implementation using the NVIDIA Visual Profiler and compared it to a previously optimized CPU C++ implementation. By employing suitable code optimizations, we reduced the runtime for each 512×512 image from approximately 270 ms to approximately 9 ms, including the time for all data transfers across the PCIe bus. We discuss our implementation of BLIINDS-II designed specifically for the GPU, the insights gained from the runtime analyses, and how the GPGPU techniques developed here can be adapted to other NR IQA algorithms.
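The core primitive parallelized above is a 2D DCT computed independently for each 5×5 pixel window. As a minimal illustrative sketch (not the paper's CUDA code; the function name is hypothetical), a naive per-window DCT-II looks like the following, with the GPU version assigning windows to threads so all such transforms run concurrently:

```python
import math

def dct2_5x5(block):
    """Naive 2D DCT-II of a single 5x5 window (illustrative sketch;
    BLIINDS-II computes such a transform for every 5x5 window, and the
    GPU implementation evaluates the windows in parallel)."""
    n = 5
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                          * math.cos(math.pi * (2 * y + 1) * v / (2 * n)))
            # Orthonormal scaling factors
            cu = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
            cv = math.sqrt(1.0 / n) if v == 0 else math.sqrt(2.0 / n)
            out[u][v] = cu * cv * s
    return out
```

Because every window's transform is independent of its neighbors, the computation maps naturally onto the GPU's thread hierarchy, which is what makes the reported 270 ms to 9 ms reduction attainable.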
Due to the massive popularity of digital images and videos over the past several decades, the need for automated quality assessment (QA) is greater than ever. Accordingly, QA research has focused largely on improving prediction accuracy. However, for many application areas, such as consumer electronics, runtime performance and related computational considerations are equally as important as accuracy. Most modern QA algorithms exhibit large computational complexity; however, this complexity does not necessarily preclude low runtimes if hardware resources are used appropriately. GPUs, which offer a large amount of parallelism and a specialized memory hierarchy, are well suited to QA algorithm deployment. In this paper, we analyze a massively parallel GPU implementation of the Most Apparent Distortion (MAD) full-reference image QA algorithm, with optimizations guided by a microarchitectural analysis. A shared-memory-based implementation of the local statistics computation yields a 25% speedup over the original implementation. We describe the optimizations that produce the best results and justify our recommendations with descriptions of their microarchitectural underpinnings. Although our study focuses on a single algorithm, the image-processing primitives it uses are fundamentally similar to those used in most modern QA algorithms.
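The local statistics computation mentioned above reduces to sliding-window means and standard deviations over the image. A serial pure-Python sketch of that primitive is shown below (the window size and function name are assumptions for illustration); the shared-memory GPU version accelerates exactly this pattern by staging each image tile in on-chip memory so that overlapping windows reuse loaded pixels instead of re-reading global memory:

```python
def local_stats(img, k=8):
    """Local mean and standard deviation over k x k sliding windows
    (illustrative serial sketch of the local-statistics primitive;
    not the paper's GPU code)."""
    h, w = len(img), len(img[0])
    means, stds = [], []
    for i in range(h - k + 1):
        mrow, srow = [], []
        for j in range(w - k + 1):
            # Gather the k x k neighborhood anchored at (i, j)
            vals = [img[i + di][j + dj] for di in range(k) for dj in range(k)]
            m = sum(vals) / len(vals)
            var = sum((v - m) ** 2 for v in vals) / len(vals)
            mrow.append(m)
            srow.append(var ** 0.5)
        means.append(mrow)
        stds.append(srow)
    return means, stds
```

The heavy data reuse between adjacent windows is precisely what a shared-memory tiling exploits, which is why this stage benefits from the microarchitecturally guided optimization the abstract reports.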
As datasets continue to increase in size and complexity, new techniques are required to visualize surface flow effectively. In this work, we introduce a novel technique for visualizing flow on arbitrary surface meshes. The method utilizes the closest point method (CPM), an embedding technique for solving partial differential equations (PDEs) on surfaces. The CPM operates by extending values off the surface into the surrounding grid and using standard three-dimensional PDE stencils to solve embedded two-dimensional surface problems. To adapt unsteady flow visualization to the CPM, unsteady flow line integral convolution (UFLIC) is applied in three dimensions to the embedded surface in the grid, visualizing flow on an arbitrary surface. To address the increased size and complexity of datasets, we introduce the closest point sparse octree, which represents an embedded surface in a memory-efficient manner. Further, various techniques, such as a Laplacian filter, can be applied more easily to the embedded surface because of the CPM. Finally, the memory efficiency of our new sparse octree approach allows grids up to 8,192³ in size to be constructed on a GPU with 12 GB of RAM.
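The extension step at the heart of the CPM can be illustrated with a surface whose closest-point map is analytic. The sketch below (a hypothetical toy example, not the paper's implementation) uses the unit circle: each grid node looks up the surface value at its closest point, producing a field that is constant along surface normals, which is what allows standard 3D stencils to act as surface stencils:

```python
import math

def closest_point_on_circle(x, y, r=1.0):
    """Closest point on the circle of radius r centered at the origin
    (a toy surface with an analytic closest-point map)."""
    d = math.hypot(x, y)
    if d == 0.0:
        return (r, 0.0)  # the center is equidistant; pick one point
    return (r * x / d, r * y / d)

def cpm_extend(u_on_surface, grid_pts, r=1.0):
    """CPM extension step (illustrative sketch): assign each grid node
    the surface value at its closest point, so the embedded field is
    constant in the direction normal to the surface."""
    ext = {}
    for (x, y) in grid_pts:
        cx, cy = closest_point_on_circle(x, y, r)
        ext[(x, y)] = u_on_surface(math.atan2(cy, cx))
    return ext
```

On general meshes the closest-point map is precomputed and stored per grid node; since only nodes near the surface are needed, a sparse structure such as the closest point sparse octree described above keeps the memory footprint manageable at large grid resolutions.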