Steel surface defect detection for industrial quality control remains a challenging object detection task in computer vision. Unlike many other detection problems, some steel surface defects are small relative to the inspected object, so their features are less prominent during detection. To address this issue, we propose a YOLOv5-based steel defect detection method enhanced with multi-scale feature extraction and contextual augmentation (MSCA-YOLO). Specifically, using YOLOv5 as the base network, we first add the C3-RFE module to expand the receptive field. Then, we design a neck network structure that incorporates multi-scale guided upsampling, which strengthens the model’s handling of multi-scale features and improves its feature extraction for small defects. Finally, we propose a context mechanism that gives the model a deeper context analysis capability and supplies richer contextual information. Experiments on the NEU-DET dataset show that MSCA-YOLO achieves a mean Average Precision (mAP) of 0.645 while maintaining rapid detection, performing especially well at an Intersection over Union (IoU) threshold of 0.5. It also improves Precision over YOLOv5 across all six defect types: Crazing (18.5% increase), Inclusion (1.2% increase), Patches (1.9% increase), Pitted_Surface (7.8% increase), Rolled-in_Scale (8.9% increase), and Scratches (6.5% increase). These results demonstrate the efficiency and reliability of MSCA-YOLO in automated steel surface defect detection and provide a new solution for real-time inspection of steel surface defects.
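
For readers unfamiliar with the Intersection over Union threshold of 0.5 mentioned above, the following minimal sketch shows how such a threshold is commonly applied when deciding whether a predicted box matches a ground-truth defect. It is an illustration only, not the evaluation code used for MSCA-YOLO: the function names `iou` and `is_true_positive` and the `(x1, y1, x2, y2)` corner format are assumptions introduced here.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are assumed to be (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    This corner format is an assumption made for illustration.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union area = sum of both areas minus the doubly counted overlap.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """Hypothetical helper: a prediction counts as a match when IoU >= threshold."""
    return iou(pred_box, gt_box) >= threshold
```

Under this criterion, a predicted box that covers most of a small defect but is shifted by half its width would fall below the 0.5 threshold, which is one reason small defects are hard to score well on.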