
The field of computer vision is undergoing a pivotal transformation, shifting its focus from discriminative to generative tasks. For the past two decades, the discipline was primarily defined by the discriminative imperative, which sought to enable machines to perceive, classify, and segment the visual world. However, catalyzed by the development of the Diffusion Transformer (DiT), the years 2024 and 2025 marked a Generative Turn, in which the benchmark of artificial visual intelligence shifted from mere classification to controllable simulation. The pursuit of high-fidelity, physically consistent video has led to advanced generative models capable of representing underlying physical dynamics and environmental causality through large-scale data and computation. This survey provides a comprehensive analysis of the recent emergence of high-fidelity video generation. It traces the evolution from the era of feature engineering to the current era of DiT-based generation, summarizes the present state of video generation and the technical advancements driving this period, and offers a guide detailing the architectures, data selection, and training methodologies essential for high-fidelity video generation.

Over the past few decades, facial expression recognition (FER) has been widely deployed in real-world applications. However, the collection conditions of existing datasets vary substantially, leading to significant domain shifts among datasets. Consequently, the performance of even the most advanced FER methods deteriorates in cross-domain scenarios. To address this issue, we propose a Dual-Stream Feature Disentanglement Network (DFD-Net) within the Single Domain Generalization (SDG) paradigm. DFD-Net employs the Expression Feature Extraction (EFE) module together with an attention block as the expression feature extraction branch, performing primary feature fusion and high-level feature selection. In parallel, the Expression-Irrelevant Feature Extraction (EIFE) module and the Expression-Irrelevant Feature Predictor (EIFP) constitute another branch. EIFE is pre-trained to capture expression-irrelevant features. EIFP passes the expression features through the Gradient Reversal Layer (GRL) and the Mutual Information Predictor (MIP) to compute and minimize their mutual information with the expression-irrelevant features. Extensive experiments on multiple benchmark datasets demonstrate that our method consistently outperforms existing state-of-the-art methods.
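
The abstract above does not say how the GRL is configured, so the following is only a minimal sketch of the general gradient-reversal idea: the layer acts as the identity in the forward pass and flips the sign of the gradient in the backward pass. The class name `GradientReversal`, the scaling factor `lam`, and the toy values are assumptions made for illustration and are not taken from DFD-Net.

```python
class GradientReversal:
    """Toy gradient reversal layer (hypothetical helper, not the paper's code).

    Forward pass: returns the input unchanged.
    Backward pass: multiplies the incoming gradient by -lam.
    """

    def __init__(self, lam=1.0):
        # lam is an assumed scaling factor for the reversed gradient.
        self.lam = lam

    def forward(self, features):
        # Identity mapping: the downstream predictor sees the features as-is.
        return list(features)

    def backward(self, grad_output):
        # Sign flip: whatever the downstream predictor minimizes, the
        # upstream feature extractor is pushed in the opposite direction.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)

features = [0.5, -1.0, 2.0]      # made-up expression features
out = grl.forward(features)      # same values as the input

upstream = [1.0, -2.0, 4.0]      # made-up gradient from a downstream predictor
reversed_grad = grl.backward(upstream)
print(out, reversed_grad)
```

In an autograd framework the same behaviour would normally be written as a custom differentiable function; the plain-Python version here only makes the two passes explicit.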

This paper surveys mobile agents, tracing their evolution from the early paradigm of autonomous, migrating code to the contemporary era of sophisticated agents driven by Large Multimodal Models (LMMs). We begin by establishing a taxonomy that distinguishes historical network-centric agent architectures from modern LMM-native, user-centric systems designed for mobile environments. We then analyze the operational workflows and architectural patterns that enable robust automation of complex tasks on Graphical User Interfaces (GUIs), with particular emphasis on the shift toward multi-agent frameworks, hierarchical control, and local-first execution. A critical review of representative state-of-the-art systems, including Mobile-Agent-v3.5, ClawMobile, Droidrun-appcard, and OpenClaw, is presented alongside an examination of benchmarks such as AndroidWorld that drive progress in the field. Furthermore, we discuss the transition toward edge-native multimodal models and address novel vulnerabilities unique to LMM-powered GUI automation before concluding with future research directions toward generalized mobile autonomy.

This paper presents an experimental multi-agent system developed for robust feature extraction from diverse multimedia documents, including images, PDFs, and technical drawings. Addressing the enterprise demand for structuring unstructured data, the system employs a flexible architecture that orchestrates specialized agents, ranging from Optical Character Recognition (OCR) and image processing to Large Language Models (LLMs), to achieve high-fidelity extraction. A key innovation is the system's high configurability, which keeps human experts in the loop to refine extraction logic via prompt engineering. Furthermore, the architecture supports hybrid edge-cloud deployment, allowing raw documents to be processed locally to satisfy strict data sovereignty requirements, with only non-sensitive data ingested centrally. The experimental system has shown scalability and efficiency in real-world use cases.

Zero-shot learning (ZSL) aims to classify unseen classes using semantic information from seen classes. However, existing methods often struggle with visual variations within the same attribute, leading to noisy features. We propose CRAE (Class Representation and Attribute Embedding), a novel ZSL method that combines class representation learning and attribute embedding learning for improved robustness and accuracy. CRAE introduces an adaptive softmax activation to normalize attribute feature maps, reducing noise and enhancing discriminability. It also employs attribute-level contrastive learning with hard sample selection and class-level contrastive learning to improve classification performance. Experimental results on CUB, SUN, and AWA2 demonstrate that CRAE outperforms state-of-the-art methods in zero-shot image classification.
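
The abstract does not define the adaptive softmax activation, so the sketch below only illustrates the broader idea of normalizing an attribute feature map with a softmax over spatial positions. The function name `spatial_softmax`, the fixed `temperature` parameter (standing in for whatever adaptive mechanism CRAE actually uses), and the toy map are all assumptions.

```python
import math


def spatial_softmax(feature_map, temperature=1.0):
    """Normalize a 2D map so that its entries are positive and sum to 1.

    feature_map: list of equal-length rows of floats (a made-up attribute map).
    temperature: assumed sharpness control; smaller values concentrate
                 more mass on the strongest response.
    """
    flat = [v / temperature for row in feature_map for v in row]
    peak = max(flat)                         # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in flat]
    total = sum(exps)
    probs = [e / total for e in exps]
    width = len(feature_map[0])
    # Restore the original row layout.
    return [probs[i:i + width] for i in range(0, len(probs), width)]


attribute_map = [
    [0.1, 0.2, 0.1],
    [0.3, 2.5, 0.4],
    [0.0, 0.2, 0.1],
]

normalized = spatial_softmax(attribute_map)
sharp = spatial_softmax(attribute_map, temperature=0.5)
print(normalized[1][1], sharp[1][1])
```

With the lower temperature, the central response takes a larger share of the total mass, which is one simple way weak background activations can be suppressed relative to the dominant one.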

Accurate prediction of drug target affinity (DTA) is critical for accelerating drug discovery, yet existing methods often struggle with topological diversity and insufficient feature extraction in molecular graphs. This paper proposes a novel framework, Topological Adaptive Weighted Drug Target Affinity Prediction (TAW-DTA), which integrates a Topological Adaptive Graph Convolutional Network (TAGCN) and a gated skip-connection mechanism to address these limitations. TAGCN dynamically adjusts convolution filters based on node topology, enabling robust feature extraction from drug molecular graphs and weighted protein contact maps. The gated skip-connection mechanism mitigates gradient vanishing and feature degradation in deep networks by selectively fusing multiscale features. Evaluations on benchmark datasets demonstrate state-of-the-art performance, with improvements in the concordance index (CI) and reduced prediction errors. Ablation studies confirm the efficacy of TAGCN and the skip-connection mechanism. This framework offers a scalable and interpretable solution for DTA prediction, with significant potential for practical drug development applications.
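
For readers unfamiliar with the concordance index mentioned above, the following is a minimal sketch of the pairwise definition commonly used for affinity prediction: among all pairs with different true affinities, it is the fraction whose predicted order matches the true order, with tied predictions counted as half. The function name and the toy affinity values are illustrative assumptions; the abstract does not state which CI variant TAW-DTA reports.

```python
def concordance_index(y_true, y_pred):
    """Pairwise concordance index between true and predicted affinities."""
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(n):
            # Only pairs with strictly different true affinities are comparable.
            if y_true[i] > y_true[j]:
                comparable += 1
                if y_pred[i] > y_pred[j]:
                    concordant += 1.0    # predicted order agrees with true order
                elif y_pred[i] == y_pred[j]:
                    concordant += 0.5    # tied prediction counts as half
    if comparable == 0:
        raise ValueError("no comparable pairs: all true values are equal")
    return concordant / comparable


# Made-up affinities for four drug-target pairs.
y_true = [5.0, 6.2, 7.1, 8.4]
y_pred = [5.3, 6.0, 7.5, 7.2]

ci = concordance_index(y_true, y_pred)
print(round(ci, 4))
```

In this toy case five of the six comparable pairs are ordered correctly, so the CI is 5/6. A CI of 1.0 means every comparable pair is ranked correctly, while 0.5 corresponds to random ordering. The double loop is quadratic in the number of samples, which is fine for illustration but slow on large benchmarks.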

Breast cancer pathological images are considered the “gold standard” for clinical diagnosis of breast cancer, but manual diagnosis suffers from inherent drawbacks such as low efficiency and high subjectivity. Computer-aided diagnosis (CAD) systems can provide objective decision support for clinicians by mining multi-level features such as tissue architecture and cytology from pathological images. However, current CAD systems are still challenged by complex background noise and inconsistency in cross-scale feature representation, which hinder the extraction of critical features. Therefore, this paper proposes a key feature dynamic enhancement network for breast cancer pathological image classification (KFDE), in which the channel-spatial feature enhancement module (CSFE) and the multi-scale feature dynamic fusion module (MFDF) serve as the two core components. The CSFE module suppresses background noise and highlights lesion regions through local channel variance analysis and an energy entropy-driven spatial focusing mechanism. The MFDF module employs a heterogeneous multi-branch convolutional architecture to fuse cross-scale features, addressing the information fragmentation caused by magnification variation. Experiments on the BreakHis dataset demonstrate that KFDE achieves a benign/malignant classification accuracy of 99.74% and an eight-class subtype classification accuracy of 96.35%, significantly outperforming existing mainstream models.

Video-based gait analysis has become a promising approach for assessing motor impairment in children with cerebral palsy (CP). However, existing methods usually rely on either pose sequences or handcrafted gait features alone, making it difficult to simultaneously capture spatiotemporal motion patterns and clinically meaningful biomechanical information. To address this gap, we propose a multimodal fusion framework that integrates skeleton dynamics with contribution-guided, clinically meaningful gait features. First, Grad-CAM analysis on a pre-trained ST-GCN backbone identifies the most discriminative body keypoints, providing an interpretable basis for subsequent gait feature extraction. We then build a dual-branch architecture, with one branch modeling skeleton dynamics using ST-GCN and the other encoding gait features derived from the identified keypoints. Fusing the two branches through feature cross-attention improves four-level CP motor severity classification to 70.86%, outperforming the baseline by 5.6 percentage points. Overall, we demonstrate that integrating skeleton dynamics with clinically meaningful gait descriptors can improve both prediction performance and biomechanical interpretability for video-based CP severity assessment.

Mammography is one of the most commonly used tools for early screening of breast cancer. Developing computer-aided diagnosis (CAD) based on mammographic images to assist doctors in making efficient and accurate diagnoses holds significant research value. Mass segmentation in mammograms is a core component of breast cancer CAD systems and an essential step in further qualitative analysis of breast cancer. However, significant challenges persist in mass segmentation in whole mammograms, including model misalignment due to the small proportion of mass regions and difficulty in segmenting boundaries caused by blurred edges of mass areas. To address these challenges, this paper proposes a local attention and detail-enhanced network (LADE-Net) for mass segmentation in whole mammograms. LADE-Net employs an asymmetric encoder-decoder architecture and introduces a lightweight local attention (LA) module aimed at early and precise localization of breast mass regions. Importantly, we design a new detail-enhanced fusion residual network (DEFRB) to refine and enhance the learning of edge features in breast masses. We evaluated the performance of LADE-Net on two publicly available datasets (INbreast and CBIS-DDSM). Compared to previous works, LADE-Net achieved superior performance.