
In recent years, multi-modal knowledge graphs (MMKGs) have emerged to enhance the representation of real-world entities through structural, textual, and visual features. However, the inherent heterogeneity among modalities poses significant challenges for entity alignment across KGs. In this study, we introduce GMFDE, a framework for multi-modal entity alignment that integrates a gated residual fusion mechanism with a knowledge distillation strategy. The fusion module adaptively balances and refines modality-specific features, while the distillation component enables unimodal encoders to learn complementary information from the fused multi-modal representation, promoting consistency across modalities. Extensive experiments on both bilingual and cross-KG datasets demonstrate that GMFDE outperforms existing leading methods, particularly in settings with limited alignment seeds.
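The idea of a gated residual fusion can be illustrated with a minimal sketch. This is not the GMFDE implementation: the function names, the softmax gate over one scalar logit per modality, and the use of the modality mean as the residual path are all assumptions made for illustration, and a real model would learn the gate logits rather than receive them as inputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_residual_fusion(features, gate_logits):
    """Fuse equal-length modality vectors with softmax gates plus a residual.

    features:    dict mapping modality name -> list of floats (same length)
    gate_logits: dict mapping modality name -> float (hypothetical; learned
                 in a real model)
    Returns (fused_vector, gates_by_modality).
    """
    names = sorted(features)
    dim = len(features[names[0]])
    gates = softmax([gate_logits[n] for n in names])

    # Gated path: convex combination of the modality features.
    gated = [sum(g * features[n][i] for g, n in zip(gates, names))
             for i in range(dim)]
    # Residual path (assumption): the plain average of the inputs.
    residual = [sum(features[n][i] for n in names) / len(names)
                for i in range(dim)]

    fused = [a + b for a, b in zip(gated, residual)]
    return fused, dict(zip(names, gates))


features = {
    "structural": [1.0, 0.0],
    "textual": [0.0, 1.0],
    "visual": [1.0, 1.0],
}
fused, gates = gated_residual_fusion(
    features, {"structural": 0.0, "textual": 0.0, "visual": 0.0}
)
```

With equal logits the gates are uniform, so the gated path reduces to the modality average; unequal logits shift the fused vector towards the favoured modality while the residual path keeps a contribution from every input.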

Object detection in varying traffic scenes presents significant challenges in real-world applications. Thermal images are widely acknowledged to improve RGB-based detection, especially under poor lighting conditions. However, fully exploiting the complementary information in RGB and thermal images remains difficult. We address this with an illumination-guided adaptive fusion of the two modalities and propose the illumination-guided with crossmodal attention transformer fusion (ICATF), a novel object detection framework that integrates features from RGB and thermal data. An illumination-guided module adapts features to the current lighting conditions, steering the learning process towards the most informative fusion. We also incorporate frequency-domain convolutions within the network’s backbone to capture spectral context and derive more nuanced features. In addition, for multispectral pedestrian detection, we fuse the differential modality features using illumination-guided feature weights and a transformer fusion architecture. Experimental results show that our method achieves state-of-the-art performance on multispectral detection datasets, including FLIR-aligned, LLVIP, and KAIST.
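A rough sketch of illumination-guided weighting follows. It is not the ICATF module, which is learned: here the mean brightness of the RGB image stands in for a learned illumination estimate, and the names `illumination_weight` and `fuse`, as well as the `midpoint` and `sharpness` parameters, are hypothetical.

```python
import math


def illumination_weight(rgb_pixels, midpoint=0.5, sharpness=10.0):
    """Return a weight in (0, 1) for the RGB branch.

    rgb_pixels: list of (r, g, b) tuples with channel values in [0, 1].
    Mean brightness is a hand-crafted proxy (assumption) for a learned
    illumination score; a sigmoid maps it to a soft weight.
    """
    brightness = sum(sum(p) / 3.0 for p in rgb_pixels) / len(rgb_pixels)
    return 1.0 / (1.0 + math.exp(-sharpness * (brightness - midpoint)))


def fuse(rgb_feat, thermal_feat, w_rgb):
    """Blend two equal-length feature vectors; thermal gets 1 - w_rgb."""
    return [w_rgb * r + (1.0 - w_rgb) * t
            for r, t in zip(rgb_feat, thermal_feat)]


night = [(0.05, 0.05, 0.05)] * 4   # dark scene: thermal should dominate
day = [(0.9, 0.9, 0.9)] * 4        # bright scene: RGB should dominate

w_night = illumination_weight(night)
w_day = illumination_weight(day)
fused_night = fuse([1.0, 1.0], [0.0, 0.0], w_night)
```

In the dark scene the RGB weight is close to zero, so the fused features come almost entirely from the thermal branch; in daylight the balance reverses.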

Object detection and video single-frame detection have seen substantial advancements in recent years, particularly with deep-learning-based approaches demonstrating strong performance. However, existing detectors often struggle in practical scenarios such as the analysis of video frames captured by unmanned aerial vehicles, especially for objects with small area, large scale variation, dense distribution, or motion blur. To address these challenges, we propose a new feature extraction network, the Attention-based Weighted Fusion Network. The network incorporates a Self-Attention Residual Block to strengthen feature extraction. To accurately locate and identify objects of interest, we introduce a Mixed Attention Module, which significantly improves detection accuracy. Additionally, we assign an adaptive learnable weight to each feature map so that feature maps of different resolutions contribute appropriately during feature fusion. We evaluate our method on two datasets: PASCAL VOC and VisDrone2019. Experimental results demonstrate that it outperforms the baseline and other detectors. Our method achieves 87.1% mean average precision on the PASCAL VOC 2007 test set and surpasses the baseline by 3.1% AP50. It also exhibits lower false detection and missed detection rates than other detectors.
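The adaptive weighting of feature maps during fusion can be sketched as follows. The abstract does not specify the normalisation, so this example assumes a non-negative, sum-normalised scheme of the kind commonly used for weighted feature fusion; the function name `weighted_fusion` is hypothetical, and the feature maps are assumed to be already resized to a common shape and flattened.

```python
def weighted_fusion(feature_maps, raw_weights, eps=1e-4):
    """Fuse same-shape (flattened) feature maps with normalised weights.

    feature_maps: list of equal-length lists of floats
    raw_weights:  one scalar per feature map (learned in a real network)
    Negative weights are clipped to zero, then all weights are divided
    by their sum (plus eps for stability) so they add up to about 1.
    Returns (fused_map, normalised_weights).
    """
    clipped = [max(0.0, w) for w in raw_weights]
    total = sum(clipped) + eps
    norm = [w / total for w in clipped]

    size = len(feature_maps[0])
    fused = [sum(w * fm[i] for w, fm in zip(norm, feature_maps))
             for i in range(size)]
    return fused, norm


maps = [[2.0, 2.0], [4.0, 4.0], [100.0, 100.0]]
fused, norm = weighted_fusion(maps, raw_weights=[1.0, 1.0, -2.0])
```

Here the third map receives a negative raw weight, so it is clipped out and the result is approximately the average of the first two maps. In a trained network the raw weights would be parameters updated by backpropagation.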