
This paper proposes a deep-learning-based method for detecting color defects in book covers, which integrates an improved Residual Network-18 (ResNet-18) architecture with the squeeze-and-excitation (SE) module for feature optimization. To address the color quality monitoring needs of industrial scenarios, a three-stage optimization strategy is adopted. At the data level, diversified samples are generated by combining Hue, Saturation, and Value (HSV) color space perturbation with mixed data augmentation techniques, and the synthetic minority oversampling technique (SMOTE) is applied to mitigate class imbalance. At the model level, the fully connected layers of ResNet-18 are reconstructed, and the SE channel attention mechanism is embedded to enhance feature representation. At the training level, a binary cross-entropy loss function is combined with dynamic learning rate scheduling, and K-fold cross-validation is used to ensure model stability. Experimental results show that the proposed method achieves a detection accuracy of 99.82% on the test set (RMSE = 0.1490), with a processing time of only 57.66 ms per image. Its classification performance, robustness, and computational efficiency significantly outperform those of traditional pixel analysis methods, support vector machines, and backpropagation neural networks, providing an efficient solution for intelligent printing quality detection.
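
The SE channel attention mentioned above can be illustrated with a minimal NumPy sketch of a generic squeeze-and-excitation block. This is not the authors' implementation: the reduction ratio `r`, the random weights `w1` and `w2`, the omission of bias terms, and the toy tensor sizes are all assumptions chosen for illustration only.

```python
import numpy as np


def se_block(x, w1, w2):
    """Generic squeeze-and-excitation over a (C, H, W) feature map.

    w1 has shape (C // r, C) and w2 has shape (C, C // r); both are
    hypothetical weights, and bias terms are omitted for brevity.
    """
    z = x.mean(axis=(1, 2))                   # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0.0)               # excitation, step 1: FC + ReLU -> (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ s)))   # excitation, step 2: FC + sigmoid -> (C,)
    return x * gates[:, None, None], gates    # rescale each channel by its gate


# Toy example with assumed sizes: 8 channels, reduction ratio 4.
rng = np.random.default_rng(0)
C, r = 8, 4
x = rng.standard_normal((C, 5, 5))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y, gates = se_block(x, w1, w2)
print(y.shape, gates.shape)
```

Each gate lies strictly between 0 and 1, so the block can only attenuate or preserve a channel relative to the others; in a ResNet-style network such a block would typically be applied to the output of a residual branch before the skip connection is added.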

Facial Expression Recognition (FER) models based on the Vision Transformer (ViT) have demonstrated promising performance on diverse datasets. However, the transformer encoder is computationally expensive, which poses challenges when strong computational resources are unavailable. Using large feature maps enriches the expression information but significantly increases the number of tokens N, and the computational complexity of attention grows quadratically with the number of tokens, as O(N²). Tasks involving large feature maps, such as high-resolution FER, therefore encounter computational bottlenecks. To alleviate these challenges, we propose the Additively Comprised Class Attention Encoder as a substitute for the original ViT encoder, which reduces the complexity of the attention computation from O(N²) to O(N). Additionally, we introduce a novel token-level Squeeze-and-Excitation method to help the model learn more efficient representations. Experimental evaluations on the RAF-DB and FERplus datasets show that our approach improves running speed by at least 27% (for 7 × 7 feature maps) while maintaining comparable accuracy, and that the gain increases on larger feature maps (about 49% speedup for 14 × 14 feature maps, and triple the speed for 28 × 28 feature maps).
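
To make the scaling argument concrete, the following short sketch counts tokens for the three feature map sizes mentioned above and compares the number of quadratic (pairwise) and linear attention terms. It is only a back-of-the-envelope illustration: the helper `attention_cost` is hypothetical, the class token and the embedding dimension are ignored, and the counts are not measured runtimes.

```python
def attention_cost(side):
    """Return (N, N^2, N) for a side x side feature map.

    N is the number of tokens, N^2 the number of pairwise terms in
    standard self-attention, and N the number of terms in a
    linear-complexity attention scheme.
    """
    n = side * side
    return n, n * n, n


for side in (7, 14, 28):
    n, quadratic, linear = attention_cost(side)
    print(f"{side} x {side}: N = {n}, O(N^2) terms = {quadratic}, O(N) terms = {linear}")
```

Doubling the side length of the feature map quadruples N and multiplies the quadratic term count by 16, which is consistent with the observation that the speedup from a linear-complexity encoder grows with the feature map size.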