This paper presents a self-supervised monocular visual odometry (VO) method guided by attention feature maps, aimed at effectively mitigating the impact of redundant pixels during network training. Existing self-supervised VO methods typically treat all pixels equally when computing photometric error, which can lead to increased sensitivity to noise from irrelevant pixels and, consequently, training errors. To address this issue, the authors adopt a soft-attention mechanism to generate attention feature maps that allow the model to focus on more relevant pixels while downweighting the influence of disruptive ones. This approach enhances the robustness and accuracy of depth estimation and pose tracking. The proposed method achieves competitive results on the KITTI dataset, with Sequences 09 and 10 demonstrating relative rotation errors of 0.022 and 0.032 and relative translation errors of 5.56 and 7.29, respectively.
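As a rough illustration of the idea described above, the following sketch shows how a soft-attention map might downweight redundant pixels in a self-supervised photometric loss. The tensor shapes, the L1 photometric error, and the normalization are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def attention_weighted_photometric_loss(target, warped, attention):
    """Photometric error weighted by a soft-attention map.

    target, warped: (B, 3, H, W) target frame and view-synthesized frame.
    attention:      (B, 1, H, W) soft-attention map in [0, 1]; high values mark
                    pixels the network should trust, low values downweight
                    redundant or disruptive pixels.
    """
    # Per-pixel L1 photometric error, averaged over color channels.
    error = (target - warped).abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)
    # Weight the error by the attention map so irrelevant pixels contribute less.
    weighted = attention * error
    # Normalize by the attention mass to keep the loss scale stable.
    return weighted.sum() / (attention.sum() + 1e-7)
```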
For the automated analysis of metaphase chromosome images, the chromosomes in the images must first be segmented. However, the segmentation results often contain non-chromosome objects, and eliminating them is essential for automated chromosome image analysis. This study aims to exclude non-chromosome objects from segmented chromosome candidates before further analysis. A feature-based method was developed to eliminate non-chromosome objects from metaphase chromosome images. The chromosome candidates were first segmented by thresholding, and four classes of features, namely area, density-based features, roughness-based features, and widths, were extracted from the segmented candidates to discriminate between chromosomes and non-chromosome objects. Seven classifiers combining the extracted features were applied and compared. The experimental results show that the combination of extracted features is useful for distinguishing chromosomes from non-chromosome objects. The proposed method can effectively separate non-chromosome objects from chromosomes and could serve as a preprocessing step for chromosome image analysis.
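A minimal sketch of such a pipeline, assuming scikit-image and scikit-learn: the specific features below (area, mean intensity as a density proxy, a perimeter-based roughness proxy, and the minor axis length as a width proxy) are simplified stand-ins for the paper's four feature classes, and the random forest is only one of several classifiers the study compared.

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label, regionprops
from sklearn.ensemble import RandomForestClassifier

def candidate_features(gray_image):
    """Segment chromosome candidates by a global threshold and extract
    simple per-object features (area, density, roughness, width proxies)."""
    mask = gray_image < threshold_otsu(gray_image)      # chromosomes are darker
    feats = []
    for region in regionprops(label(mask), intensity_image=gray_image):
        area = region.area
        density = region.mean_intensity                  # density-based proxy
        roughness = region.perimeter / np.sqrt(area)     # boundary roughness proxy
        width = region.minor_axis_length                 # width proxy
        feats.append([area, density, roughness, width])
    return np.array(feats)

# Training: X holds feature rows for labeled candidates, y is 1 for chromosome,
# 0 for a non-chromosome object (nucleus, debris, stain artifacts, ...).
# clf = RandomForestClassifier(n_estimators=200).fit(X, y)
# keep = clf.predict(candidate_features(new_image)) == 1
```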
Recent advancements in artificial intelligence have significantly impacted many color imaging applications, including skin segmentation and enhancement. Although state-of-the-art methods emphasize geometric image augmentations to improve model performance, the role of color-based augmentations and color spaces in enhancing skin segmentation accuracy remains underexplored. This study addresses this gap by systematically evaluating the impact of various color-based image augmentations and color spaces on skin segmentation models based on convolutional neural networks (CNNs). We investigate the effects of color transformations, including brightness, contrast, and saturation adjustments, in three color spaces: sRGB, YCbCr, and CIELab. As a representative CNN model, an existing semantic segmentation network is trained with these color augmentations on a custom dataset of 900 images with annotated skin masks, covering diverse skin tones and lighting conditions. Our findings reveal that current training practices, which rely primarily on single-color augmentation in the sRGB space and focus mainly on geometric augmentations, limit model generalization in color-related applications such as skin segmentation. Models trained with a greater variety of color augmentations show improved skin segmentation, particularly under over- and underexposure conditions. Additionally, models trained in YCbCr outperform those trained in sRGB when combined with color augmentation, while CIELab yields performance comparable to sRGB. We also observe significant performance discrepancies across skin tones, highlighting challenges in achieving consistent segmentation under varying lighting. This study highlights gaps in existing image augmentation approaches and provides insights into the role of various color augmentations and color spaces in improving the accuracy and inclusivity of skin segmentation models.
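The sketch below illustrates the kind of color augmentation and color-space conversion being compared, using torchvision and OpenCV; the jitter ranges are illustrative, not the values used in the study.

```python
import cv2
import numpy as np
from torchvision import transforms

# Color jitter applied in sRGB before an optional conversion to the training
# color space. The ranges below are placeholders, not the study's settings.
color_jitter = transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4)

def to_color_space(rgb_uint8: np.ndarray, space: str = "YCbCr") -> np.ndarray:
    """Convert an sRGB uint8 image (H, W, 3) to the requested training color space."""
    if space == "YCbCr":
        # OpenCV returns channels in Y, Cr, Cb order.
        return cv2.cvtColor(rgb_uint8, cv2.COLOR_RGB2YCrCb)
    if space == "CIELab":
        return cv2.cvtColor(rgb_uint8, cv2.COLOR_RGB2LAB)
    return rgb_uint8  # stay in sRGB
```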
Seal-related tasks in document processing—such as seal segmentation, authenticity verification, seal removal, and text recognition under seals—hold substantial commercial importance. However, progress in these areas has been hindered by the scarcity of labeled document seal datasets, which are essential for supervised learning. To address this limitation, we propose Seal2Real, a novel generative framework designed to synthesize large-scale labeled document seal data. As part of this work, we also present Seal-DB, a comprehensive dataset containing 20,000 labeled images to support seal-related research. Seal2Real introduces a prompt prior learning architecture built upon a pretrained Stable Diffusion model, effectively transferring its generative capability to the unsupervised domain of seal image synthesis. By producing highly realistic synthetic seal images, Seal2Real significantly enhances the performance of downstream seal-related tasks on real-world data. Experimental evaluations on the Seal-DB dataset demonstrate the effectiveness and practical value of the proposed framework.
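As a hedged sketch only: generating seal-like images from an off-the-shelf pretrained Stable Diffusion checkpoint with the Hugging Face diffusers library. The checkpoint name and prompt are placeholders, and this does not reproduce Seal2Real's prompt prior learning architecture, which the paper builds on top of the pretrained model.

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder checkpoint; Seal2Real attaches its prompt prior learning to a
# pretrained Stable Diffusion model, which this sketch only loads off the shelf.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a red circular official document seal stamped over printed text"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("synthetic_seal.png")
```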
The tomato leaf is a significant organ that reflects the health and growth of tomato plants. Early detection of leaf diseases is crucial to both crop yield and the income of farmers. However, the global distribution of diseases across tomato leaves, coupled with fine-grained differences among various diseases, poses significant challenges for accurate disease detection. To tackle these obstacles, we propose an accurate tomato leaf disease identification method based on an improved Swin Transformer. The proposed method consists of three parts: the Swin Transformer backbone, a Local Feature Perception (LFP) module, and a Spatial Texture Attention (STA) module. The backbone models long-range dependencies of leaf diseases to obtain representative features, while the LFP module adopts a multi-scale aggregation strategy to enhance the capability of the Swin Transformer in local feature extraction. Moreover, the STA module integrates hierarchical features from different stages of the Swin Transformer to capture fine-grained features for the classification head and boost overall performance. Extensive experiments are conducted on the public LBFtomato dataset, and the results demonstrate the superior performance of our proposed method, which achieves 99.28% Accuracy, 99.07% Precision, 99.36% Recall, and a 99.24% F1-score.
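A minimal PyTorch sketch of a multi-scale local aggregation block in the spirit of the LFP module; the kernel sizes, channel counts, and residual fusion are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiScaleLocalBlock(nn.Module):
    """Aggregates local context at several receptive fields and fuses it,
    complementing the global modeling of a Swin Transformer backbone."""
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)
        )
        self.fuse = nn.Conv2d(3 * channels, channels, 1)
        self.act = nn.GELU()

    def forward(self, x):                        # x: (B, C, H, W) feature map
        multi = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.act(self.fuse(multi))    # residual multi-scale aggregation
```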
We propose an efficient multi-scale residual network that integrates 3D face alignment with head pose estimation from an RGB image. Existing methods excel at performing each task independently but often fail to acknowledge the interdependence between them. Additionally, these approaches lack a progressive fine-tuning process for 3D face alignment, as such a process would otherwise require excessive computational resources and memory. To address these limitations, we introduce a hierarchical network that incorporates a frontal face constraint, significantly enhancing the accuracy of both tasks. Moreover, we implement a multi-scale residual merging process that allows for multi-stage refinement without compromising the efficiency of the model. Our experimental results demonstrate the superiority of our method compared to state-of-the-art approaches.
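A minimal sketch of residual merging across scales for stage-wise refinement; the bilinear upsampling, single refinement convolution, and channel assumptions are illustrative rather than the authors' exact merging process.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualMerge(nn.Module):
    """Merges a coarse-scale feature map into a finer one through a residual
    connection, enabling multi-stage refinement at low extra cost."""
    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, fine, coarse):
        # Upsample the coarse features to the fine resolution, refine, and add.
        up = F.interpolate(coarse, size=fine.shape[-2:], mode="bilinear",
                           align_corners=False)
        return fine + self.refine(up)
```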
As pets now outnumber newborns in households, the demand for pet medical care and attention has surged, placing a significant burden on pet owners. To address this, our experiment uses image recognition technology to preliminarily assess the health condition of dogs, providing a rapid and economical health assessment method. Through collaboration, we collected 2613 stool photos, which were augmented to a total of 6079 images and analyzed using LabVIEW and the YOLOv8 segmentation model. The model performed excellently, achieving a precision of 86.805%, a recall of 74.672%, and an mAP50 of 83.354%, demonstrating a high recognition rate in determining the condition of dog stools. With the advancement of technology and the proliferation of mobile devices, this experiment aims to develop an application that allows pet owners to assess their pets’ health anytime and manage it more conveniently. Additionally, the experiment aims to expand the database through cloud computing, optimize the model, and establish a global pet health interactive community. These developments not only propel innovation in the field of pet medical care but also provide practical health management tools for pet families, potentially offering substantial help to more pet owners in the future.
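For orientation, a minimal sketch of training and running a YOLOv8 segmentation model with the ultralytics package; the dataset YAML path, image size, epoch count, and confidence threshold are placeholders, not the experiment's settings.

```python
from ultralytics import YOLO

# Placeholder dataset config; the study used roughly 6k augmented stool photos.
model = YOLO("yolov8n-seg.pt")                  # pretrained segmentation weights
model.train(data="dog_stool.yaml", epochs=100, imgsz=640)

# Inference on a new photo; each predicted mask/box carries a stool-condition class.
results = model.predict("stool_photo.jpg", conf=0.25)
```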
Multichannel methods have attracted much attention in color image denoising. These methods combine the low-rankness of a matrix with the nonlocal self-similarity of a natural image, and they apply to color images whose noise intensity differs in each color channel. Denoising methods based on the low-rankness of tensors, which extend matrices to higher dimensions, have also attracted attention in recent years. Many tensor-based methods have been proposed as extensions of matrix-based methods and have achieved higher denoising performance. Tensor-based methods perform denoising using an approximation function of the tensor rank. However, unlike multichannel methods, tensor-based methods do not assume different noise intensities for each channel. Meanwhile, the tensor nuclear norm minus Frobenius norm (TNNFN) has been proposed in the domain of traffic data completion. The TNNFN is a tensor rank approximation function known to perform well in traffic data completion, but it has not been applied to image restoration. In this paper, we propose MC-TNNFN, a tensor-based multichannel method that uses the TNNFN to remove noise from a tensor constructed from similar patches and then estimates the original image. Experimental results on natural images show that the proposed method outperforms existing methods both objectively and subjectively.
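A minimal NumPy sketch of evaluating the TNNFN regularizer for a 3-way tensor, with the tensor nuclear norm computed from the t-SVD (FFT along the third mode, then the sum of singular values of every frontal slice); the 1/n3 normalization and the weight on the Frobenius term are common conventions assumed here, not necessarily those of the cited formulation.

```python
import numpy as np

def tnn_minus_fro(tensor: np.ndarray, alpha: float = 1.0) -> float:
    """Tensor nuclear norm minus (a scaled) Frobenius norm of a 3-way tensor.

    The TNN is computed via the t-SVD: FFT along mode 3, then the nuclear norm
    of each frontal slice in the Fourier domain, averaged over slices.
    """
    X = np.fft.fft(tensor, axis=2)
    n3 = tensor.shape[2]
    tnn = sum(np.linalg.svd(X[:, :, k], compute_uv=False).sum()
              for k in range(n3)) / n3
    fro = np.linalg.norm(tensor)                 # Frobenius norm of the tensor
    return tnn - alpha * fro
```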
Underwater images suffer from dynamic blur, low illumination, poor contrast, and noise interference, hampering the accuracy of underwater robot proximity detection and its application in marine development. This study introduces a solution based on the MIMO-UNet network. The network integrates an Atrous Spatial Pyramid Pooling module between the encoder and the decoder to strengthen feature extraction and contextual information retrieval, and a channel attention module added to the decoder enhances detailed feature extraction. A combined loss of multi-scale content loss, frequency loss, and mean squared error loss is used to optimize network weight updates, emphasize high-frequency information, and ensure network convergence. The effectiveness of the method is assessed on the UIEB dataset. Ablation experiments confirm the efficacy and rationale of each module design, while performance comparisons demonstrate the algorithm’s superiority over other underwater enhancement methods.
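A minimal PyTorch sketch of such a combined loss over multi-scale outputs, with an FFT-magnitude term standing in for the frequency loss; the loss weights and the exact form of each term are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def enhancement_loss(preds, targets, w_freq=0.1, w_mse=1.0):
    """Combined loss over multi-scale outputs: L1 content loss, a frequency-domain
    loss on the FFT magnitude, and pixel-wise MSE.

    preds, targets: lists of tensors, one pair per output scale of the network.
    Weights are illustrative placeholders.
    """
    loss = 0.0
    for pred, target in zip(preds, targets):
        content = F.l1_loss(pred, target)
        freq = F.l1_loss(torch.abs(torch.fft.rfft2(pred)),
                         torch.abs(torch.fft.rfft2(target)))
        mse = F.mse_loss(pred, target)
        loss = loss + content + w_freq * freq + w_mse * mse
    return loss
```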
We propose a new convolutional neural network, the Physics-guided Encoder–Decoder Network (PEDNet), for end-to-end single image dehazing. The network embeds a reformulated atmospheric scattering model so that dehazing is learned end to end. The overall structure is an encoder–decoder that extracts and fuses contextual information from four different scales through skip connections. In addition, in view of the uneven spread of haze in the real world, we design a Res2FA module based on Res2Net, which introduces a Feature Attention block able to focus on important information at a finer granularity. Because it employs a physically driven dehazing model, PEDNet adapts more readily to various types of hazy images. Ablation results demonstrate the efficacy of every network module, and experiments on both synthetic and real-world datasets show that our method outperforms current state-of-the-art approaches.
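For context, a minimal sketch of the classical atmospheric scattering model that physics-guided dehazing networks build on; the clamping constant and the assumption that the network predicts the transmission map and atmospheric light separately are illustrative, and PEDNet's specific reformulation may differ.

```python
import torch

def recover_clear_image(hazy, transmission, airlight, t_min=0.05):
    """Invert the atmospheric scattering model I = J * t + A * (1 - t):
    J = (I - A) / max(t, t_min) + A.

    hazy:         (B, 3, H, W) hazy image I
    transmission: (B, 1, H, W) predicted transmission map t
    airlight:     (B, 3, 1, 1) predicted global atmospheric light A
    """
    t = transmission.clamp(min=t_min)   # avoid division blow-up in dense haze
    return (hazy - airlight) / t + airlight
```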