Steel surface defect detection in industrial quality control has long been a challenging object detection task in computer vision. Unlike other detection problems, some surface defects on steel are small relative to the entire inspected object, so their features are less prominent during detection. To address this issue, we propose a YOLOv5-based steel defect detection method enhanced with multi-scale feature extraction and contextual augmentation (MSCA-YOLO). Specifically, adopting YOLOv5 as the backbone network, we first add the C3-RFE module to expand the receptive field. Then, we design a neck network structure that incorporates multi-scale guided upsampling, which enhances the model’s ability to handle multi-scale features and improves its feature extraction for small defects. Finally, we propose a context mechanism that gives the model a deeper context analysis capability and richer contextual information. Experiments on the NEU-DET dataset show that MSCA-YOLO achieves a mean Average Precision (mAP) of 0.645 at an Intersection over Union (IoU) threshold of 0.5 while maintaining rapid detection. It also improves Precision over YOLOv5 on six defect types: Crazing (18.5% increase), Inclusion (1.2% increase), Patches (1.9% increase), Pitted_Surface (7.8% increase), Rolled-in_Scale (8.9% increase), and Scratches (6.5% increase). These results demonstrate the efficiency and reliability of MSCA-YOLO in automated steel surface defect detection and provide a new solution for real-time inspection of steel surface defects.
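For readers unfamiliar with the Intersection over Union threshold of 0.5 mentioned above, the following is a minimal sketch of how a predicted box is typically matched against a ground-truth box. It is not taken from MSCA-YOLO; the function names `iou` and `is_match` and the `(x1, y1, x2, y2)` corner format are assumptions made for illustration.

```python
# Illustrative sketch only: names and box format are assumptions,
# not part of the MSCA-YOLO implementation.

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a true positive when its IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    gt = (0, 0, 10, 10)
    print(iou(gt, (5, 0, 15, 10)))       # half-overlapping boxes
    print(is_match((2, 0, 12, 10), gt))  # larger overlap
```

Small defects are sensitive to this criterion: a shift of a few pixels removes a large fraction of the overlap when the box itself is only a few pixels wide.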
Object detection is used in a wide range of industries. For example, in autonomous driving, the task of object detection is to accurately and efficiently identify and locate instances of many predefined object classes (vehicles, pedestrians, traffic signs, etc.) in road videos. In robotics, an industrial robot needs to recognize specific machine elements. In the security field, a camera should accurately recognize people’s faces. With the wide application of deep learning, the accuracy and efficiency of object detection have greatly improved, but deep-learning-based object detection still faces challenges. Different applications have different requirements, including highly accurate detection, multi-category object detection, real-time detection, and robustness to occlusions. To address these challenges, and based on an extensive literature review, this paper analyzes methods for improving and optimizing mainstream object detection algorithms from the perspective of the evolution of one-stage and two-stage detectors. Furthermore, this paper proposes a method for improving object detection accuracy by changing receptive fields. The new model is based on the original YOLOv5 (You Only Look Once) with some modifications: the structure of the head of YOLOv5 is modified by adding asymmetrical pooling layers. As a result, the accuracy of the algorithm is improved while its speed is maintained. The performance of the new model is compared with that of the original YOLOv5 model and analyzed in terms of several parameters. In addition, the new model is evaluated under four scenarios. Finally, the problems that remain to be solved and future research directions are summarized.
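To make the link between asymmetrical pooling and the receptive field concrete, the sketch below applies the standard receptive-field recurrence separately along height and width. The abstract does not give the actual kernel sizes or strides of the added layers, so both layer stacks here are hypothetical, as is the function name `receptive_field`.

```python
# Illustrative sketch only: the layer stacks below are hypothetical and do not
# reproduce the modified YOLOv5 head described in the abstract.

def receptive_field(layers):
    """Receptive field (height, width) of a stack of conv/pool layers.

    layers: list of ((kernel_h, kernel_w), (stride_h, stride_w)) tuples,
    ordered from input to output.
    """
    rf = [1, 1]    # receptive field per axis, starting from a single pixel
    jump = [1, 1]  # distance in input pixels between adjacent output positions
    for kernel, stride in layers:
        for axis in (0, 1):
            rf[axis] += (kernel[axis] - 1) * jump[axis]
            jump[axis] *= stride[axis]
    return tuple(rf)


if __name__ == "__main__":
    # A 3x3 layer followed by a square 5x5 pooling layer (stride 1).
    square = [((3, 3), (1, 1)), ((5, 5), (1, 1))]
    # The same 3x3 layer followed by an asymmetrical 1x5 pooling layer.
    asymmetric = [((3, 3), (1, 1)), ((1, 5), (1, 1))]

    print(receptive_field(square))      # grows equally in both directions
    print(receptive_field(asymmetric))  # grows only along the width
```

An asymmetrical kernel therefore enlarges the receptive field along one axis at a lower cost than a square kernel of the same extent, which is one plausible reason such layers can raise accuracy without a large speed penalty.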
Lack of scale invariance and high missed-detection rates for small objects are challenging issues in object detection and often lead to inaccurate results. This research aims to provide an accurate detection model for crowd counting by focusing on human head detection in natural scenes drawn from the publicly available Casablanca, Hollywood-Heads, and Scut-head datasets. In this study, we fine-tuned YOLOv5, an object detection architecture based on a deep convolutional neural network (CNN), using a transfer learning approach, and then evaluated the model using mean average precision (mAP), precision, and recall. Training on one dataset and testing on another leads to inaccurate results because the datasets contain different types of heads. Another main contribution of our research is therefore combining the three datasets into a single dataset that covers small, medium, and large heads. The experimental results show that the fine-tuned YOLOv5 architecture significantly improves small-head detection in crowded scenes compared with baseline approaches such as Faster R-CNN and the VGG-16-based SSD.
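The evaluation metrics named above can be sketched as follows for a single class. This is an illustration under stated assumptions, not the evaluation code used in the study: each detection is assumed to have already been marked as a true or false positive, average precision is computed with all-point interpolation (one of several conventions), and the function names are hypothetical.

```python
# Illustrative sketch only: function names and the all-point interpolation
# convention are assumptions, not the study's evaluation protocol.

def precision_recall(num_tp, num_fp, num_gt):
    """Precision and recall from true positives, false positives, and ground-truth count."""
    detections = num_tp + num_fp
    precision = num_tp / detections if detections > 0 else 0.0
    recall = num_tp / num_gt if num_gt > 0 else 0.0
    return precision, recall


def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for one class.

    scores: confidence of each detection
    is_tp:  True where the detection matched a ground-truth head
    num_gt: number of ground-truth heads
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        p, r = precision_recall(tp, fp, num_gt)
        precisions.append(p)
        recalls.append(r)

    # Replace each precision by the maximum precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


if __name__ == "__main__":
    print(precision_recall(num_tp=2, num_fp=1, num_gt=2))
    print(average_precision([0.9, 0.8, 0.7], [True, False, True], num_gt=2))
```

The mAP score is then the mean of the per-class average precision values; with a single "head" class it coincides with that class's average precision.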