Work Presented at Electronic Imaging 2025 FastTrack
Volume: 69 | Issue: 4 | Article ID: 040403
ExtremeMETA: High-speed Lightweight Image Segmentation Model by Remodeling Multi-channel Metamaterial Imagers
Abstract

Deep neural networks (DNNs) have heavily relied on traditional computational units, such as CPUs and GPUs. However, this conventional approach brings significant computational burden, latency issues, and high power consumption, limiting their effectiveness. This has sparked the need for lightweight networks such as ExtremeC3Net. Meanwhile, there have been notable advancements in optical computational units, particularly with metamaterials, offering the exciting prospect of energy-efficient neural networks operating at the speed of light. Yet, the digital design of metamaterial neural networks (MNNs) faces precision, noise, and bandwidth challenges, limiting their application to intuitive tasks and low-resolution images. In this study, we proposed a large kernel lightweight segmentation model, ExtremeMETA. Based on ExtremeC3Net, our proposed model, ExtremeMETA, maximized the ability of the first convolution layer by exploring a larger convolution kernel and multiple processing paths. With the large kernel convolution model, we extended the optic neural network application boundary to the segmentation task. To further lighten the computation burden of the digital processing part, a set of model compression methods was applied to improve model efficiency in the inference stage. The experimental results on three publicly available datasets demonstrated that the optimized efficient design improved segmentation performance from 92.45% to 95.97% mIoU while reducing computational FLOPs from 461.07 MMacs to 166.03 MMacs. The large kernel lightweight model ExtremeMETA showcased the hybrid design’s ability on complex tasks.

  Cite this article 

Quan Liu, Brandon T. Swartz, Ivan Kravchenko, Jason G. Valentine, Yuankai Huo, "ExtremeMETA: High-speed Lightweight Image Segmentation Model by Remodeling Multi-channel Metamaterial Imagers," in Journal of Imaging Science and Technology, 2025, pp. 1–10, https://doi.org/10.2352/J.ImagingSci.Technol.2025.69.4.040403

  Copyright statement 
Copyright © Society for Imaging Science and Technology 2025
 Open access
  Article timeline 
  • received June 2024
  • accepted November 2024
1.
Introduction
In the realm of modern computer vision, digital neural networks play a pivotal role. Arguably, the convolutional neural network (CNN) stands out as the most extensively employed AI approach, particularly in tasks like image classification, segmentation, and detection. Traditional CNNs face several challenges when deployed in resource-constrained environments, such as those found in IoT devices, edge computing systems, and drone operations. These applications demand real-time performance with minimal power consumption, low latency, and efficient processing capabilities, which are difficult to achieve with standard CNN architectures due to their computational complexity and large memory requirements. As IoT and edge computing continue to expand in fields like smart cities, autonomous vehicles, and drone-based surveillance, it is critical to develop CNN models that operate effectively in these environments. Addressing these challenges not only enhances the scalability and adaptability of CNN-based solutions but also enables more efficient and reliable system operation in real-time applications. Despite the advent of vision transformer-based models, convolution remains integral for extracting local image features. Presently, CNNs are typically implemented on computational units like CPUs and GPUs. However, this conventional design approach brings forth substantial challenges, including a formidable computational load, notable latency issues, and heightened power consumption. These limitations are especially prominent in drone operations, IoT, and edge computing applications, which underscores the need for lightweight models that can analyze data efficiently. Recognizing the critical need for DNN models with reduced energy consumption and lower latency, the AI community has embarked on a quest for more efficient solutions. Despite these efforts, achieving DNNs that are both lightweight and low in power consumption remains an elusive goal.
Recent breakthroughs in optical computational units, including metamaterials (refer to Figure 1), have brought to light the potential for neural networks that operate with minimal energy consumption and at unprecedented speeds. The current cutting-edge metamaterial neural network (MNN) takes on a hybrid form, leveraging optical processors as a lightspeed, energy-free front-end convolutional operator alongside a digital feature aggregator. This novel approach significantly reduces computational latency. By assigning the convolution operations to optical units, more than 90 percent of the floating-point operations (FLOPs) inherent in conventional CNN backbones such as VGG and ResNet are effectively offloaded. This marks a noteworthy departure from traditional architectures, opening up new avenues for efficient and high-performance neural network designs. However, the hybrid design is fundamentally constrained by the physical structure, including the limited kernel size and channel number. Moreover, the hybrid system is also limited by what can be fabricated as the first optical layer of the neural network.
Figure 1.
This study provides a hybrid pipeline for designing and optimizing a large kernel digital neural network. The proposed ExtremeMETA is efficient for segmentation tasks, requiring fewer computational FLOPs.
Based on our proposed LMNN (large kernel metamaterial neural network) model, the hybrid design achieved promising performance on the classification task. However, the LMNN has a few limitations: (1) it can only perform image classification rather than more complex tasks such as image segmentation and object detection; (2) its input images are of low resolution (28 × 28); and (3) although it shifts the computational burden to the optical part, the digital part still requires efficiency improvements, such as model compression, in the inference stage. While the LMNN reduces computational complexity by offloading much of the burden to the first layer using a metaoptic lens, it faces limitations in segmentation and object detection, where fine-grained spatial understanding is required; the loss of this capability may be due to the early emphasis on feature extraction in the LMNN. In practical applications, such as autonomous driving or medical imaging, this limitation can affect the network’s ability to deliver accurate pixel-level segmentation or precise object localization.
In this study, we propose a novel large kernel lightweight segmentation model, ExtremeMETA, which maximizes the efficiency advantages of optical signal computation while compressing the digital processing model to further improve segmentation efficiency. To adapt to segmentation on large images, the proposed lightweight large kernel model achieves larger receptive fields, can analyze larger images, and covers general vision tasks: image classification, segmentation, and detection. Furthermore, the complexity of the model’s digital processing part is explicitly addressed via a set of model compression methods. We evaluated our design on image segmentation tasks using three public datasets: the portrait dataset, the Stanford Car dataset, and the KITTI dataset. The proposed lightweight large kernel model achieved superior segmentation accuracy compared with state-of-the-art (SOTA) segmentation models. Overall, the contributions of this work are as follows:
We propose a new large convolution kernel CNN network that achieves a large receptive field, lower energy consumption, and reduced latency.
We introduce model reparameterization to improve large convolution kernel performance and a sparse convolution kernel compression mechanism that compresses the multi-branch sparse-convolution design into a single layer for the hybrid system implementation. The model compression mechanism improves model efficiency for digital processing.
The task limitations of large convolution hybrid models are explicitly addressed by performing segmentation tasks on multiple datasets from different categories.
The rest of the article is organized as follows. In Section 2, we present the background and related research relevant to large kernel convolution, model compression, and optical neural networks (ONNs) for image processing tasks. In Section 3, our proposed lightweight, lightspeed model is presented, including the large kernel reparameterization, sparse convolution compression, and multipath model compression. Section 4 details the datasets and implementation details. Section 5 analyzes the experimental results and ablation study. Then, in Sections 6 and 7, we provide the discussion and conclude our work.
2.
Related Work
2.1
Large Kernel Convolution Design
In the realm of CNNs, the design and utilization of large kernel convolutions have garnered significant attention in recent years. Numerous studies have explored the benefits of using larger convolutional kernels, such as 7 × 7 or 11 × 11, to capture broader spatial contexts and more intricate patterns within images [1, 2]. Early research efforts focused on understanding the impact of kernel size on model performance, with findings suggesting that larger kernels can lead to improved feature extraction and recognition accuracy, especially for complex visual tasks [3].
Building on these findings, subsequent studies proposed various strategies to incorporate large kernel convolutions into CNN architectures effectively. These strategies often involved modifying network architectures, adjusting kernel sizes, or integrating multi-scale features to enhance the robustness and versatility of CNN models [4, 5]. Additionally, advancements in hardware acceleration and parallel processing have facilitated the efficient implementation of large kernel convolutions, enabling their widespread adoption across diverse computer vision applications [6, 7].
Overall, the related work on large kernel convolution design underscores its pivotal role in advancing the capabilities of CNNs for tackling increasingly complex and demanding visual recognition tasks [8, 9].
2.2
Optic Neural Network
Optic neural networks (ONNs) have emerged as a promising paradigm for accelerating neural network computations by leveraging the unique properties of optical computing. Inspired by the principles of light-based signal processing, ONNs exploit the parallelism, high bandwidth, and low energy consumption inherent in optical systems to achieve significant computational efficiency gains compared to traditional electronic implementations. A considerable body of research has focused on exploring various aspects of ONNs, including optical device design, system architectures, and algorithmic frameworks tailored to optical computing platforms [10–12].
Early studies laid the groundwork for ONNs by demonstrating their potential for accelerating matrix-vector multiplications, a fundamental operation in neural network inference [13, 14]. Subsequent works have extended ONN capabilities to encompass more complex neural network layers and architectures, paving the way for practical applications in tasks such as image classification, object detection, and natural language processing [15, 16].
Key challenges in ONN research include addressing optical noise, device nonlinearity, and scalability issues, which require interdisciplinary efforts spanning optics, photonics, and machine learning [17, 18]. Despite these challenges, ONNs hold great promise for enabling ultra-fast and energy-efficient neural network computations, with the potential to revolutionize various domains of artificial intelligence and computing [19, 20].
2.3
Segmentation Model
Recent advancements in segmentation techniques have introduced novel methods that improve accuracy and robustness in challenging tasks. For instance, the use of a topological loss function based on persistent homology has shown promise in improving the structural integrity of segmentation outputs, particularly in applications where shape preservation is critical [21]. Additionally, the boundary-enhanced dual-stream network has demonstrated significant improvements in semantic segmentation, particularly in high-resolution remote sensing images where fine boundary details are crucial [22]. These models offer innovative solutions for specific segmentation challenges, complementing the growing body of research on improving segmentation accuracy. Our proposed model, ExtremeMETA, builds on this foundation by providing a model that is both computationally efficient and highly accurate, making it suitable for a wide range of applications, from general-purpose segmentation to more domain-specific tasks.
2.4
Convolution Neural Network Model Compression
In the field of CNNs, model compression techniques have garnered significant attention as a means to reduce the computational complexity and memory footprint of deep learning models without sacrificing performance. A diverse range of methods has been proposed to compress CNNs, including pruning, quantization, low-rank approximation, knowledge distillation, and weight sharing. Pruning techniques aim to remove redundant or less important parameters from the network, thereby reducing its size and computational cost [23, 24]. Quantization methods reduce the precision of network parameters, often by representing weights and activations with fewer bits, to decrease memory requirements and improve inference speed [25]. Low-rank approximation techniques exploit the underlying structure of weight matrices to factorize them into smaller, more computationally efficient components [26]. Knowledge distillation involves training a compact “student” network to mimic the predictions of a larger “teacher” network, transferring knowledge from the latter to the former [27]. Additionally, weight sharing approaches reduce redundancy by sharing parameters across different parts of the network [28].
Collectively, these model compression techniques offer effective strategies for deploying CNNs on resource-constrained devices or accelerating inference in large-scale deployment scenarios. Ongoing research in this area continues to explore novel compression algorithms, optimization strategies, and application-specific considerations to further improve the efficiency and effectiveness of compressed CNN models.
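As a rough illustration of two of these techniques (not part of the proposed method), the following PyTorch sketch applies magnitude pruning and post-training dynamic quantization to toy layers; the layer sizes, pruning amount, and quantized module choice are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy convolution layer used only to demonstrate the compression techniques above.
conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)

# Magnitude pruning [23, 24]: zero out the 30% smallest-magnitude weights.
prune.l1_unstructured(conv, name="weight", amount=0.3)
sparsity = (conv.weight == 0).float().mean().item()
print(f"fraction of zeroed weights: {sparsity:.2f}")

# Post-training dynamic quantization [25]: store linear-layer weights in int8.
mlp = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
mlp_int8 = torch.quantization.quantize_dynamic(mlp, {nn.Linear}, dtype=torch.qint8)
```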
2.5
Model Efficiency Improvement
Recent studies have focused on improving the efficiency and performance of models in various signal processing and communication-related tasks, which closely align with the objectives of our work. For example, [29] introduced a manifold regularization-based deep convolutional autoencoder for unauthorized broadcasting identification, addressing a critical challenge in signal security and classification. Additionally, [30] and the multi-scale radio transformer method have advanced the field of lightweight automatic modulation classification, particularly in resource-constrained environments like drone communication systems. Similarly, [31] demonstrated how lightweight networks can achieve real-time classification of wireless communication signals, making them highly applicable for low- power devices. Furthermore, CNN-LSTM-driven methods have been proposed for real-time transformer discharge pattern recognition, showcasing the potential of combining CNNs with temporal models in complex pattern recognition tasks.
3.
Method
3.1
Problem Statement
We extensively study the trainability of large kernels on MNNs and unveil three main observations: (i) the traditional convolution kernel shows limited improvement on large images; (ii) the MNN is only available for the classification task; (iii) the metamaterial implementation limits the computation ratio that can be offloaded in segmentation models, which typically have complex structures. The model is shown in Figure 2.
Figure 2.
Lightweight segmentation model with hybrid metaoptics design. The model has two parts: CoarseNet and FineNet. The large kernel block is composed of depthwise convolution layers.
3.2
Large Convolution Design with Multiple Paths
Limited by the image size and the task for the model, our previously proposed model, LMNN, achieved its prediction performance with a kernel size of 9 × 9. Two major limitations exist when applying the large kernel design to the MNN: (1) the metamaterial implementation limits the image size to a small range; (2) only the classification task can be validated on the MNN model, as segmentation and detection tasks are too difficult to implement under the optical implementation limitations. To address these challenges, we approach our model from two perspectives: (1) for the kernel design, we employ a large convolution kernel (larger than 9 × 9) with a reparameterization design to construct the convolution layer; (2) for the model design, our proposed lightweight segmentation model is based on the multipath model structure composed of a coarse segmentation path and a light refinement path proposed by [32]. A skeletal sketch of this two-path structure is given below.
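Purely as an illustration of the two-path idea (not the exact published architecture), the following PyTorch sketch wires a coarse path and a refinement path behind large-kernel first layers; the layer widths, the 15 × 15 first kernel, and the placeholder heads are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoPathSegmenter(nn.Module):
    """Coarse path + light refinement path behind large-kernel first layers."""
    def __init__(self, in_ch=3, num_classes=2, width=48, first_kernel=15):
        super().__init__()
        pad = first_kernel // 2
        # Large-kernel first layers: the stage intended for the metaoptic front end.
        self.coarse_stem = nn.Conv2d(in_ch, width, first_kernel, stride=2, padding=pad)
        self.fine_stem = nn.Conv2d(in_ch, width, first_kernel, stride=1, padding=pad)
        # Digital back ends (placeholders standing in for the ExtremeC3-style blocks).
        self.coarse_head = nn.Conv2d(width, num_classes, 1)
        self.fine_head = nn.Conv2d(width + num_classes, num_classes, 1)

    def forward(self, x):
        coarse = self.coarse_head(F.relu(self.coarse_stem(x)))      # low-res mask
        coarse_up = F.interpolate(coarse, size=x.shape[-2:],
                                  mode="bilinear", align_corners=False)
        fine = F.relu(self.fine_stem(x))                             # full-res features
        return self.fine_head(torch.cat([fine, coarse_up], dim=1))   # refined mask

logits = TwoPathSegmenter()(torch.randn(1, 3, 224, 224))   # -> (1, 2, 224, 224)
```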
Figure 3.
Model compression on segmentation model digital processing part. The left panel shows the multipath structure of the advanced C3 block. The right panel shows the compression mechanism.
3.3
Model Compression with Sparse Convolution
Model compression is a crucial technique aimed at enhancing the efficiency of deep learning models by reducing their size and computational demands while maintaining their performance standards. Among the various strategies employed for model compression, pruning and quantization stand out as widely adopted methodologies. Pruning, a prominent model compression technique, involves the systematic removal of redundant or unnecessary parameters from neural networks. By identifying and eliminating connections that contribute minimally to the model’s performance, pruning effectively reduces the model’s size and computational requirements. This process permits a more streamlined network architecture without sacrificing accuracy, making it particularly valuable for resource-constrained environments or deployment on edge devices.
We applied model compression and reparameterization together to the sparse convolution kernel, as shown in Figure 3. Sparse convolution refers to a convolution operation where the kernel (filter) contains mostly zero values, resulting in a sparse structure. When using a kernel size of 1 × 3 (1 row and 3 columns), the convolution operation slides this kernel over the input data and performs element-wise multiplication followed by summation over the kernel window and the input channels:
(1)
$O_{h,w,c} = \sum_{i=0}^{2} \sum_{j=0}^{C-1} I_{h,\,w+i,\,j} \times K_{0,\,i,\,j,\,c},$
where $I$ is the input tensor, $K$ is the kernel tensor, $O$ is the output tensor, $C$ is the number of input channels, and $\times$ denotes element-wise multiplication within the summation.
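For readers who prefer code, the following sketch verifies Eq. (1) numerically against a standard library convolution; the PyTorch framework, the (N, C, H, W) tensor layout, and the absence of padding are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Minimal numeric check of Eq. (1), assuming a 1 x 3 kernel, stride 1, no padding.
C_in, C_out, H, W = 4, 8, 6, 10
x = torch.randn(1, C_in, H, W)          # input tensor I, layout (N, C, H, W)
k = torch.randn(C_out, C_in, 1, 3)      # kernel tensor K in PyTorch layout

# Reference: library convolution (cross-correlation, as in most DL frameworks).
ref = F.conv2d(x, k)                    # shape (1, C_out, H, W - 2)

# Explicit triple sum from Eq. (1): O[h,w,c] = sum_i sum_j I[h, w+i, j] * K[0, i, j, c]
out = torch.zeros_like(ref)
for c in range(C_out):
    for h in range(H):
        for w in range(W - 2):
            s = 0.0
            for i in range(3):           # kernel width positions
                for j in range(C_in):    # input channels
                    s += x[0, j, h, w + i] * k[c, j, 0, i]
            out[0, c, h, w] = s

assert torch.allclose(out, ref, atol=1e-5)
```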
For the ExtremeC3 block, we have three convolution paths with kernel sizes k × k, 1 × k, and k × 1. Denoting the individual kernels as $K_{k\times k}$, $K_{1\times k}$, and $K_{k\times 1}$, the compressed convolution kernel is expressed as follows:
(2)
$K_{\mathrm{combined}}(i,j) = w_{1\times k} \times K_{1\times k}(i,j) + w_{k\times k} \times K_{k\times k}(i,j) + w_{k\times 1} \times K_{k\times 1}(i,j),$
where the $w$ terms are the per-branch weights and the asymmetric kernels are aligned (zero-padded) to the k × k grid before summation.
The compressed multipath convolution block reduces computational complexity in the inference stage, as illustrated by the sketch below.
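A minimal sketch of this merging step, assuming odd kernel size, stride 1, "same" padding, no bias, and no nonlinearity between branches (PyTorch and the specific branch weights are used only for illustration):

```python
import torch
import torch.nn.functional as F

k, C_in, C_out = 5, 3, 6
x = torch.randn(1, C_in, 16, 16)

K_kk = torch.randn(C_out, C_in, k, k)   # k x k branch
K_1k = torch.randn(C_out, C_in, 1, k)   # 1 x k branch
K_k1 = torch.randn(C_out, C_in, k, 1)   # k x 1 branch
w = {"kk": 1.0, "1k": 1.0, "k1": 1.0}   # hypothetical per-branch weights

# Training-time output: sum of the three parallel branches.
y_multi = (w["kk"] * F.conv2d(x, K_kk, padding=k // 2)
           + w["1k"] * F.conv2d(x, K_1k, padding=(0, k // 2))
           + w["k1"] * F.conv2d(x, K_k1, padding=(k // 2, 0)))

# Inference-time kernel: zero-pad the asymmetric kernels into the k x k grid and add.
K_combined = w["kk"] * K_kk.clone()
K_combined[:, :, k // 2:k // 2 + 1, :] += w["1k"] * K_1k   # center row
K_combined[:, :, :, k // 2:k // 2 + 1] += w["k1"] * K_k1   # center column
y_single = F.conv2d(x, K_combined, padding=k // 2)

assert torch.allclose(y_multi, y_single, atol=1e-5)
```

Because convolution is linear in the kernel, the padded-and-summed kernel reproduces the multi-branch output exactly, so the extra branches add no cost at inference time.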
The use of sparse convolution compression in ExtremeMETA significantly improves efficiency by reducing the number of unnecessary computations, particularly in non-critical areas of the network. This technique compresses the model by introducing sparsity and a multipath structure in the convolutional layers, which leads to lower memory usage and faster inference times. In practical deployment, especially in resource-constrained environments such as edge devices or IoT systems, this results in reduced computational load, lower power consumption, and faster real-time performance without compromising model accuracy.
4.
Data and Experimental Design
4.1
Data Description
Three public datasets, EG1800 [33], the Stanford Car dataset [34], and the KITTI dataset [35], were used to evaluate the lightweight large kernel model on segmentation tasks. For the EG1800 dataset, we employed 1887 images at 600 × 800 resolution with semantic segmentation masks. The EG1800 dataset was collected from Flickr with manually annotated portrait masks. The Stanford Car dataset is composed of 16,185 RGB images of cars with coordinates of the car locations in the images. The KITTI dataset is popular in mobile robotics and autonomous driving and features diverse traffic scenarios captured using high-resolution RGB and grayscale stereo cameras and a 3D laser scanner. However, it lacks inherent ground truth annotations for semantic segmentation. To adapt both the Stanford Car dataset and the KITTI dataset to the segmentation task, this annotation limitation needs to be addressed.
4.2
Data Generation with Foundation Model
Given the lack of segmentation annotations in the Stanford Car and KITTI datasets, we employed the Segment Anything Model (SAM) [36] to generate object masks based on prompts of the object location. SAM is a foundation model with zero-shot ability to segment objects on new image distributions. The RGB images of the Stanford Car and KITTI datasets, together with the bounding box coordinates, are provided to SAM, which generates the object masks. With the help of SAM, RGB images with object mask annotations become available for model training.
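A hedged sketch of this prompt-based pseudo-labeling step using the publicly released segment-anything package; the checkpoint path, model variant, image array, and box coordinates below are placeholders, not the actual pipeline configuration.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (path and variant are placeholders).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = np.zeros((375, 1242, 3), dtype=np.uint8)   # stand-in for a KITTI RGB frame
box = np.array([100, 120, 520, 300])               # x0, y0, x1, y1 of the car prompt

predictor.set_image(image)                         # embed the image once
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
car_mask = masks[0]                                # boolean (H, W) pseudo-label
```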
4.3
Large Kernel Digital Design on Segmentation Model
The large kernel design was applied to the first convolution layer of the segmentation network. Since the first layer is designed to be substituted by the metaoptic lens in the inference stage, our large kernel design was subject to physical limitations. On the other hand, the optic lens provides lightspeed computation, which we took advantage of. Based on the multipath segmentation network, the first convolution layers of the CoarseNet and FineNet parts were redesigned with a large convolution kernel and reparameterization, following the strategy in our previous work, LMNN [37]. Since the images are large compared with the FashionMNIST images used previously, our kernel size was increased from 9 × 9 to 15 × 15, and the channel number was expanded from 12 to 48. The larger convolution kernel and channel number increase the capacity of the first layers and help handle more complex scenes.
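A quick back-of-the-envelope comparison of the first-layer capacity before and after this change, assuming a dense convolution over a 3-channel RGB input with no bias (grouped or depthwise variants would change these counts):

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a dense k x k convolution layer, bias ignored."""
    return c_in * c_out * k * k

old = conv_params(3, 12, 9)     # 9 x 9 kernel, 12 channels  -> 2,916 weights
new = conv_params(3, 48, 15)    # 15 x 15 kernel, 48 channels -> 32,400 weights
print(new / old)                # roughly 11x more first-layer capacity for the optic front end
```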
4.4
Model Design with Optic Constraints
Constrained by fabrication issues, the metaoptic layer has limitations on both the channel number and the input size. The trade-off in model performance between input size and channel number is therefore examined. The size-first design uses the largest input image size allowed by the fabrication constraint, whereas the channel-first design favors a larger channel number under the same limitation.
4.5
Model Compression Efficiency
Besides enlarging the capability of the first layer, our proposed lightweight segmentation network is compressed in its digital part. Since compression affects the model’s complexity and efficiency, we evaluated whether the compressed model loses accuracy. To assess the efficiency of the model compression strategy, we report the model FLOPs, the number of parameters, and the FLOPs ratio of the first convolution layer.
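As a back-of-the-envelope sketch of how the first-layer FLOPs ratio metric can be computed, the snippet below counts multiply-accumulate operations per layer; the layer configurations are hypothetical placeholders, not the actual ExtremeMETA configuration, so the printed ratio is illustrative only.

```python
def conv_macs(h_out, w_out, c_in, c_out, k_h, k_w, groups=1):
    """Multiply-accumulate count of a convolution layer (bias ignored)."""
    return h_out * w_out * (c_in // groups) * c_out * k_h * k_w

# Hypothetical example: ratio of first-layer MACs to total model MACs.
first = conv_macs(112, 112, 3, 48, 15, 15)     # assumed stride-2, 15 x 15 first layer
rest = sum(conv_macs(*cfg) for cfg in [        # placeholder configs for later layers
    (112, 112, 48, 48, 3, 3),
    (56, 56, 48, 96, 3, 3),
])
print(f"first-layer MACs ratio: {first / (first + rest):.1%}")
```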
5.
Result
In this section, we evaluate our proposed lightweight segmentation network, which has a simple model structure, using the EG1800, Stanford Car, and KITTI datasets. Since both the Stanford Car and KITTI datasets contain car images, we train and test the model on the two datasets together.
5.1
Segmentation Performance on Portrait Dataset
We evaluate the lightweight segmentation model on the EG1800 dataset, together with the model FLOPs and the first-convolution FLOPs ratio. As shown in Table I, the original ExtremeC3 model cannot take advantage of a large convolution kernel in the first layer: the 15 × 15 kernel showed even lower performance than the 11 × 11 kernel. The model without the first convolution layer showed a 2% drop compared with the ExtremeC3 model with a 3 × 3 kernel. Our proposed hybrid lightweight segmentation model achieved the best performance with the 15 × 15 convolution kernel while keeping the digital computation FLOPs unchanged.
Table I.
Segmentation performance on EG1800.
Model | Kernel size | 1st Conv FLOPs (%) | Model FLOPs | Digital FLOPs | Test (mIoU)
ExtremeC3 | 3 × 3 | 10.87 | 199.40 | 199.40 | 0.9249
ExtremeC3 | 11 × 11 | 62.11 | 469.14 | 469.14 | 0.9323
ExtremeC3 | 15 × 15 | 75.30 | 719.62 | 719.62 | 0.9301
Digital | N/A | N/A | 174.10 | 174.10 | 0.9086
Ours | 1 × 1 | 2.80 | 182.06 | 174.10 | 0.9137
Ours | 3 × 3 | 10.87 | 199.40 | 174.10 | 0.9234
Ours | 11 × 11 | 59.68 | 431.81 | 174.10 | 0.9415
Ours | 15 × 15 | 63.36 | 475.16 | 174.10 | 0.9418
Model FLOPs and digital FLOPs unit is MMacs.
Besides improving the model performance with the advanced design of the first convolution layer, we evaluate the model efficiency improvement from model compression. Following the experimental setting in Table I, we applied model compression, including sparse convolution kernel compression and multipath parameterization, to each model design and show the efficiency evaluation metrics in Table II. The compression method reduced the digital FLOPs without affecting model performance (mIoU).
Table II.
Segmentation performance on EG1800 after model compression.
Model | Kernel size | 1st Conv FLOPs (%) | Model FLOPs | Digital FLOPs | Test (mIoU)
ExtremeC3 | 3 × 3 | 11.33 | 191.32 | 191.32 | 0.9233
ExtremeC3 | 11 × 11 | 63.21 | 461.07 | 461.07 | 0.9315
ExtremeC3 | 15 × 15 | 76.16 | 711.55 | 711.55 | 0.9289
Digital | N/A | N/A | 166.03 | 166.03 | 0.9031
Ours | 1 × 1 | 3.17 | 174.25 | 166.03 | 0.9121
Ours | 3 × 3 | 11.33 | 191.32 | 166.03 | 0.9217
Ours | 11 × 11 | 60.81 | 423.74 | 166.03 | 0.9404
Ours | 15 × 15 | 64.45 | 467.09 | 166.03 | 0.9420
Model FLOPs and digital FLOPs unit is MMacs.
In comparison to traditional CNN architectures, ExtremeMETA achieved lower computational complexity by employing sparse convolution compression and metaoptic lens techniques, which reduced redundant operations in the early layers. This resulted in faster processing and reduced memory requirements. However, as with most efficient models, such as MobileNets [38] and EfficientNet [7], there is a trade-off between computational efficiency and segmentation accuracy. In practical applications, such as real-time image segmentation on edge devices, ExtremeMETA demonstrated improved processing speed and reduced power consumption while maintaining competitive segmentation accuracy. The trade-off is most noticeable in tasks requiring extremely fine-grained segmentation, where traditional CNNs may offer marginally better accuracy at the cost of significantly higher computational demands.
5.2
Segmentation Performance on Car Dataset
To validate our lightweight segmentation model on more datasets, we conducted experiments on the car datasets, namely the Stanford Car dataset and the KITTI dataset, with semantic segmentation masks as ground truth. Both the Stanford Car and KITTI datasets were used for model training, even though their resolutions differ. Using the same experimental setup described for Table III, we applied model compression and multipath parameterization to the model design and present the resulting efficiency evaluation metrics in Table IV.
Table III.
Segmentation performance on car dataset.
Model | Kernel size | Train (KITTI + Stanford) | Test | KITTI | Stanford
ExtremeC3 | 3 × 3 | 95.02 | 92.51 | 84.45 | 95.23
ExtremeC3 | 11 × 11 | 95.12 | 92.09 | 84.37 | 95.39
ExtremeC3 | 15 × 15 | 76.09 | 70.25 | 22.69 | 95.22
Digital | N/A | 93.31 | 89.11 | 78.47 | 94.27
Ours | 1 × 1 | 94.13 | 90.94 | 82.68 | 93.15
Ours | 3 × 3 | 94.97 | 92.01 | 85.05 | 94.77
Ours | 11 × 11 | 95.79 | 92.91 | 85.33 | 95.97
Ours | 15 × 15 | 96.05 | 93.17 | 87.41 | 95.19
All values are mIoU (%).
Table IV.
Segmentation performance on car dataset after model compression.
Model | Kernel size | 1st Conv FLOPs (%) | Model FLOPs | Digital FLOPs | Test (mIoU)
ExtremeC3 | 3 × 3 | 11.33 | 191.32 | 191.32 | 91.36
ExtremeC3 | 11 × 11 | 63.21 | 461.07 | 461.07 | 92.45
ExtremeC3 | 15 × 15 | 76.16 | 711.55 | 711.55 | 70.01
Digital | N/A | N/A | 166.03 | 166.03 | 88.97
Ours | 1 × 1 | 3.17 | 174.25 | 166.03 | 90.94
Ours | 3 × 3 | 11.33 | 191.32 | 166.03 | 94.25
Ours | 11 × 11 | 60.81 | 423.74 | 166.03 | 95.32
Ours | 15 × 15 | 64.45 | 467.09 | 166.03 | 93.05
Model FLOPs and digital FLOPs unit is MMacs.
5.3
Model Robustness
To evaluate the generalization ability of ExtremeMETA, we conducted experiments on datasets beyond those used for training, specifically the Portrait and Pet datasets. As shown in Table V, ExtremeMETA achieved an mIoU of 91.84 on the Portrait dataset and 73.87 on the Pet dataset, clearly outperforming YOLO on the Pet dataset while remaining competitive on the Portrait dataset. These results demonstrate that ExtremeMETA generalizes well across different types of images, even in tasks that involve varying levels of complexity, such as fine-grained segmentation in the Pet dataset. The model’s architecture, including sparse convolution compression and metaoptic lens techniques, allows it to adapt to different domains with minimal loss in performance, making it a versatile solution for various practical applications.
Table V.
Comparison of model performance on Portrait and Pet datasets.
Model | Portrait (mIoU) | Pet (mIoU)
YOLO | 92.67 | 70.48
Ours | 91.84 | 73.87
Figure 4.
Segmentation results on the Portrait and Pet datasets. The first column shows the original images, the second column presents the segmentation results from ExtremeMETA, and the third column displays the ground truth. The results show that ExtremeMETA effectively captured the boundaries and shapes of objects with high accuracy.
To provide a visual comparison of the segmentation results, we present qualitative examples from the Portrait and Pet datasets in Figure 4. The figure shows the original images, the segmentation outputs generated by ExtremeMETA, and the corresponding ground truth. As demonstrated, ExtremeMETA accurately segmented both human portraits and animal shapes, closely matching the ground truth in each case. These visual results further validate the effectiveness of ExtremeMETA in diverse segmentation tasks, showing that it can generalize well across different image types and maintain high segmentation accuracy.
5.4
Ablation Studies
Due to the fabrication limitations of the metalens array, the priority between channel number and input image size needs to be determined. The results of this experiment are shown in Figure 5. The left panel illustrates how increasing the input image size enhances performance compared to expanding the number of channels in a convolution layer; the gray area depicts the performance disparity expressed as mIoU. Increasing the input image size enhances the model’s ability to capture finer spatial details, which improves performance in tasks like semantic segmentation, but it also increases the computational cost. Expanding the number of channels, while boosting the model’s capacity to extract complex features, raises the risk of overfitting and the computational load. The gray area in the left panel highlights that, in this case, increasing the input size led to a greater improvement in mIoU than expanding the number of channels, indicating that capturing spatial details was more impactful for performance. The right panel shows the effectiveness of utilizing large convolution kernels. Circles of various colors represent different convolution layer architectures, with the area of each circle indicating the ratio of FLOPs for the layer when implemented using metaoptic materials. The x-axis represents the model’s FLOPs, excluding the layer intended for fabrication.
Figure 5.
Model ablation study. Left panel: trade-off between input image size and channel number of convolution layer. Right panel: model efficiency visualization comparing model FLOPs and mIoU.
5.5
Model Compression
Figure 6 demonstrates that the compressed model achieves a reduction of 8 MMacs in FLOPs, decreasing from 174.10 MMacs to 166.03 MMacs. The right panel indicates that the compressed model maintains equivalent performance to the original model. This consistency in performance establishes that ExtremeMETA not only enhances the efficiency of the digital components but also contributes to the overall optimization of the hybrid system.
Figure 6.
Model compression performance. Left panel: origin model, ExtremeMETA, and compressed model parameters comparison; right panel: model performance after compression.
6.
Discussion
Given the demonstrated superior performance of large convolution kernels in tasks such as image classification and segmentation, there exists substantial potential for their application in a wider array of complex computer vision tasks. Large convolution kernels have shown remarkable effectiveness in tasks like image classification and segmentation, primarily due to their ability to capture more extensive spatial information and intricate patterns within images. This success suggests that employing large convolution kernels in other computer vision tasks could yield significant improvements.
One such task is object detection, where accurately identifying and localizing objects within images is crucial. By utilizing large convolution kernels, the model can better discern the detailed features of objects, leading to more precise detection results. This can be particularly beneficial in scenarios with small or occluded objects, where finer details are essential for accurate recognition, as shown by the results of the experiments on the car dataset.
Furthermore, in tasks involving image generation or synthesis, such as style transfer or super-resolution, large convolution kernels can enhance the model’s ability to capture intricate textures and details, resulting in more realistic and high-fidelity output images. These kernels can effectively extract and preserve fine-grained features, which are instrumental in faithfully replicating the characteristics of the input images.
The application can be extended to video processing tasks such as action recognition or video segmentation, where large convolution kernels can enhance the model’s capability to analyze temporal and spatial dependencies across frames. By incorporating information from a broader context, these kernels enable a more robust understanding of dynamic scenes, leading to improved performance in tasks requiring temporal coherence and contextual understanding.
The adoption of large convolution kernels holds promise for advancing various complex computer vision tasks beyond traditional image classification and segmentation. Their ability to capture intricate details and spatial relationships makes them a valuable tool for enhancing the performance and capabilities of computer vision models across diverse applications.
7.
Conclusion
In this study, we presented a novel large kernel lightweight segmentation model that harnesses the efficiency advantages of optical signal computation while integrating digital processing model compression techniques to further enhance segmentation efficiency. Our model offers larger receptive fields tailored for segmentation tasks on large images, extending its applicability to various vision tasks including image classification, segmentation, and detection. Through extensive evaluations on diverse datasets, including the portrait, Stanford Car, and KITTI datasets, our proposed approach has demonstrated superior segmentation accuracy compared to state-of-the-art models. Our contributions encompass a novel large convolution kernel CNN network for larger receptive fields, reduced energy consumption, and lower latency, alongside model reparameterization and sparse convolution kernel compression mechanisms that enhance model performance and efficiency in digital processing. By explicitly addressing task limitations and conducting segmentation tasks on multiple datasets from different categories, our work represents a significant step forward in the development of efficient and effective segmentation models for a wide range of computer vision applications.
Summary of Contributions: Our work offers key advancements in computer vision by addressing the computational and practical challenges in deploying CNN-based models in real-world scenarios. We introduced an architecture that not only improves segmentation accuracy but also reduces computational complexity, making it highly suitable for resource-constrained environments such as IoT devices and edge computing. The proposed model compression techniques further contribute to lower energy consumption and faster processing times, highlighting the potential for widespread adoption across various industries, from autonomous systems to medical imaging. Our findings push the boundaries of segmentation model efficiency and performance, paving the way for future innovations in various fields.
Acknowledgment
Y.H. and Q.L. acknowledge support from NIH under contract R01DK135597. Y.H. is the corresponding author. B.T.S. and J.G.V. acknowledge support from DARPA under contract HR001118C0015, NAVAIR under contract N6893622C0030 and ONR under contract N000142112468. Metaoptic devices were manufactured as part of a user project at the Center for Nanophase Materials Sciences (CNMS), which is a US Department of Energy, Office of Science User Facility, Oak Ridge National Laboratory.
References
1. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," Preprint, arXiv:1409.1556 (2014).
2. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," Proc. IEEE Conf. on Computer Vision and Pattern Recognition (IEEE, Piscataway, NJ, 2015), pp. 1–9. doi:10.1109/CVPR.2015.7298594.
3. M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," Preprint, arXiv:1311.2901 (2014).
4. C. Szegedy, S. Ioffe, and V. Vanhoucke, "Rethinking the inception architecture for computer vision," Proc. IEEE Conf. on Computer Vision and Pattern Recognition (IEEE, Piscataway, NJ, 2016), pp. 2818–2826. doi:10.1109/CVPR.2016.308.
5. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. IEEE Conf. on Computer Vision and Pattern Recognition (IEEE, Piscataway, NJ, 2016), pp. 770–778. doi:10.1109/CVPR.2016.90.
6. X. Zhang, X. Zhou, M. Lin, and J. Sun, "Efficient and accurate approximations of nonlinear convolutional networks," Proc. IEEE Conf. on Computer Vision and Pattern Recognition (IEEE, Piscataway, NJ, 2018), pp. 1984–1992. doi:10.1109/CVPR.2015.7298809.
7. M. Sun, Z. Liu, X. Wang, W. Qiao, and K. Lin, "EfficientNet: Rethinking model scaling for convolutional neural networks," Int'l. Conf. on Machine Learning (PMLR, Stockholm, Sweden, 2019), pp. 6105–6114.
8. M. Lin, Q. Chen, and S. Yan, "Network in network," Preprint, arXiv:1312.4400 (2013).
9. G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," Proc. IEEE Conf. on Computer Vision and Pattern Recognition (IEEE, Piscataway, NJ, 2017), pp. 4700–4708.
10. Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, and D. Englund, "Deep learning with coherent nanophotonic circuits," Nature Photonics 11, 441–446 (2017). doi:10.1038/nphoton.2017.93.
11. X. Lin, Y. Rivenson, D. Teng, L. Wei, H. Günaydın, Y. Zhang, and A. Ozcan, "All-optical machine learning using diffractive deep neural networks," Science 361, 1004–1008 (2018). doi:10.1126/science.aat8084.
12. T. W. Hughes, M. Minkov, I. A. Williamson, Y. Shi, and S. Fan, "Training of photonic neural networks through in situ backpropagation and gradient measurement: supplementary material," Optica, Part F127 (2018).
13. A. N. Tait, M. A. Nahmias, B. J. Shastri, P. R. Prucnal, and J. S. Harris, "The physics of optical neural networks," Appl. Phys. Rev. 4, 021105 (2017).
14. A. N. Tait, M. A. Nahmias, B. J. Shastri, P. R. Prucnal, and J. S. Harris, "Optical implementation of deep networks," Appl. Optics 55, A71–A82 (2016). doi:10.1364/AO.55.000A71.
15. M. Miscuglio, J. Dambre, and P. Bienstman, "All-optical nonlinear activation function for photonic neural networks [invited]," Opt. Mater. Express 8, 3851–3863 (2018). doi:10.1364/OME.8.003851.
16. L. Larger, M. C. Soriano, D. Brunner, L. Appeltant, J. M. Gutiérrez, I. Fischer, and C. R. Mirasso, "Photonic information processing beyond Turing: an optoelectronic implementation of reservoir computing," Opt. Express 20, 3241–3249 (2012). doi:10.1364/OE.20.003241.
17. S. Jutamulia and F. T. S. Yu, "Overview of hybrid optical neural networks," Opt. Laser Technol. 28, 85–97 (1996). doi:10.1016/0030-3992(95)00070-4.
18. K. M. Boehm, P. Khosravi, R. Vanguri, J. Gao, and S. P. Shah, "Harnessing multimodal data integration to advance precision oncology," Nature Rev. Cancer 22, 71–88 (2022). doi:10.1038/s41568-021-00408-3.
19. M. Zhuge, D. Gao, D.-P. Fan, L. Jin, B. Chen, H. Zhou, M. Qiu, and L. Shao, "Kaleido-BERT: Vision-language pre-training on fashion domain," Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (IEEE, Piscataway, NJ, 2021), pp. 12642–12652. doi:10.1109/CVPR46437.2021.01246.
20. Y. B. Ovchinnikov, J. Müller, M. Doery, E. Vredenbregt, K. Helmerson, S. Rolston, and W. Phillips, "Diffraction of a released Bose-Einstein condensate by a pulsed standing light wave," Phys. Rev. Lett. 83, 284 (1999). doi:10.1103/PhysRevLett.83.284.
21. J. R. Clough, N. Byrne, I. Oksuz, V. A. Zimmer, J. A. Schnabel, and A. P. King, "A topological loss function for deep-learning based image segmentation using persistent homology," IEEE Trans. Pattern Anal. Mach. Intell. 44, 8766–8778 (2020). doi:10.1109/TPAMI.2020.3013679.
22. X. Li, L. Xie, C. Wang, J. Miao, H. Shen, and L. Zhang, "Boundary-enhanced dual-stream network for semantic segmentation of high-resolution remote sensing images," GIScience Remote Sens. 61, 2356355 (2024). doi:10.1080/15481603.2024.2356355.
23. S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," Advances in Neural Information Processing Systems (Curran Associates, Inc., Montréal, Canada, 2015), pp. 1135–1143. doi:10.5555/2969239.2969366.
24. P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, "Pruning convolutional neural networks for resource efficient inference," Int'l. Conf. on Learning Representations (OpenReview, Toulon, France, 2016).
25. I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," Preprint, arXiv:1609.07061 (2017).
26. E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," Advances in Neural Information Processing Systems (Curran Associates, Inc., Montréal, Canada, 2014), pp. 1269–1277.
27. G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," Preprint, arXiv:1503.02531 (2015).
28. W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," Int'l. Conf. on Machine Learning (JMLR, Lille, France, 2015), pp. 2285–2294. doi:10.5555/3045118.3045361.
29. Q. Zheng, P. Zhao, D. Zhang, and H. Wang, "MR-DCAE: Manifold regularization-based deep convolutional autoencoder for unauthorized broadcasting identification," Int. J. Intell. Syst. 36, 7204–7238 (2021). doi:10.1002/int.22586.
30. Q. Zheng, X. Tian, Z. Yu, Y. Ding, A. Elhanashi, S. Saponara, and K. Kpalma, "MobileRaT: A lightweight radio transformer method for automatic modulation classification in drone communication systems," Drones 7, 596 (2023). doi:10.3390/drones7100596.
31. S. Mehta and M. Rastegari, "MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer," Preprint, arXiv:2110.02178 (2021).
32. H. Park, L. L. Sjösund, Y. Yoo, J. Bang, and N. Kwak, "ExtremeC3Net: Extreme lightweight portrait segmentation networks using advanced C3-modules," Preprint, arXiv:1908.03093 (2019).
33. X. Shen, A. Hertzmann, J. Jia, S. Paris, B. Price, E. Shechtman, and I. Sachs, "Automatic portrait segmentation for image stylization," Computer Graphics Forum, Vol. 35 (Wiley Online Library, Hoboken, NJ, 2016), pp. 93–102.
34. J. Krause, J. Deng, M. Stark, and L. Fei-Fei, "Collecting a large-scale dataset of fine-grained cars," Proc. 1st IEEE Workshop on Fine-Grained Visual Classification (FGVC) in Conjunction with the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (IEEE, Piscataway, NJ, 2013).
35. A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," Conf. on Computer Vision and Pattern Recognition (CVPR) (IEEE, Piscataway, NJ, 2012), pp. 3354–3361. doi:10.1109/CVPR.2012.6248074.
36. A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollar, and R. Girshick, "Segment anything," Proc. IEEE/CVF Int'l. Conf. on Computer Vision (IEEE, Piscataway, NJ, 2023), pp. 4015–4026.
37. Q. Liu, H. Zheng, B. T. Swartz, Z. Asad, I. Kravchenko, J. G. Valentine, and Y. Huo, "Digital modeling on large kernel metamaterial neural network," J. Imaging Sci. Technol. 67 (2023). doi:10.2352/J.ImagingSci.Technol.2023.67.6.060404.
38. A. G. Howard, "MobileNets: Efficient convolutional neural networks for mobile vision applications," Preprint, arXiv:1704.04861 (2017).