Virtual background has become an increasingly important feature of online video conferencing due to the popularity of remote work in recent years. To enable virtual background, a segmentation mask of the participant needs to be extracted from the real-time video input. Most previous work has focused on image-based methods for portrait segmentation. However, portrait video segmentation poses additional challenges due to complicated backgrounds, body motion, and inter-frame consistency. In this paper, we utilize temporal guidance to improve video segmentation and propose several methods to address these challenges, including a prior mask, optical flow, and visual memory. We leverage an existing portrait segmentation model, PortraitNet, to incorporate our temporal guidance methods. Experimental results show that our methods achieve improved segmentation performance on portrait videos with minimal latency.
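As a rough illustration of how a prior mask can serve as temporal guidance, the sketch below feeds the previous frame's predicted mask to the network as an extra input channel alongside the current RGB frame. The module and variable names are hypothetical, the backbone is assumed to accept a four-channel input, and this is not the actual PortraitNet code.

```python
import torch
import torch.nn as nn

class PriorMaskSegmenter(nn.Module):
    """Sketch: use the previous frame's mask as an extra input channel
    so the segmenter can exploit temporal guidance.
    (Hypothetical module; not the actual PortraitNet implementation.)"""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        # The backbone is assumed to accept 4 channels (RGB + prior mask).
        self.backbone = backbone

    def forward(self, frame: torch.Tensor, prior_mask: torch.Tensor) -> torch.Tensor:
        # frame: (N, 3, H, W); prior_mask: (N, 1, H, W) from the previous frame
        x = torch.cat([frame, prior_mask], dim=1)  # (N, 4, H, W)
        return self.backbone(x)                    # per-pixel portrait logits

# Usage over a video stream: start from an empty mask, then reuse predictions.
# prior = torch.zeros(1, 1, H, W)
# for frame in video:
#     logits = model(frame, prior)
#     prior = torch.sigmoid(logits).detach()
```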
Deep neural networks have been utilized in an increasing number of computer vision tasks, demonstrating superior performance. Much research has focused on making deep networks more suitable for efficient hardware implementation, targeting low-power, low-latency real-time applications. In [1], Isikdogan et al. introduced a deep neural network design that provides an effective trade-off between flexibility and hardware efficiency. The proposed solution consists of fixed-topology hardware blocks, with partially frozen/partially trainable weights, that can be configured into a full network. Initial results on a few computer vision tasks were presented in [1]. In this paper, we further evaluate this network design by applying it to several additional computer vision use cases and comparing it to other hardware-friendly networks. The experimental results presented here show that the proposed semi-fixed, semi-frozen design achieves competitive performance on a variety of benchmarks, while maintaining very high hardware efficiency.
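To illustrate the idea of partially frozen/partially trainable weights, the following sketch implements a hypothetical convolution layer in which a fixed binary mask selects which kernel entries receive gradient updates. This is only an assumed approximation of the concept, not the exact block design from [1].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemiFrozenConv2d(nn.Module):
    """Sketch of a fixed-topology block whose kernel weights are partially
    frozen and partially trainable, selected by a binary mask.
    (Hypothetical layer; not the exact design from [1].)"""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3, trainable_fraction: float = 0.5):
        super().__init__()
        shape = (out_ch, in_ch, k, k)
        # Frozen part: registered as a buffer, so it never receives gradients.
        self.register_buffer("frozen_weight", torch.randn(shape) * 0.01)
        # Trainable part: an ordinary parameter.
        self.trainable_weight = nn.Parameter(torch.randn(shape) * 0.01)
        # Fixed binary mask deciding which kernel entries are trainable.
        self.register_buffer("mask", (torch.rand(shape) < trainable_fraction).float())
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Combine trainable and frozen entries into one effective kernel.
        weight = self.mask * self.trainable_weight + (1 - self.mask) * self.frozen_weight
        return F.conv2d(x, weight, self.bias, padding=1)
```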
Recently, semantic inference from images has been widely used for various applications, such as augmented reality, autonomous robots, and indoor navigation. As a pioneering work in semantic segmentation, the fully convolutional network (FCN) was introduced and outperformed traditional methods. However, since FCN only takes local contextual dependencies into account, it does not capture global contextual dependencies. In this paper, we explore variants of FCN with local and global contextual dependencies for the semantic segmentation problem. In addition, we attempt to improve the performance of semantic segmentation with extra depth information from a commercial RGBD camera. Our experimental results indicate that exploiting global contextual dependencies and additional depth information improves the quality of semantic segmentation.
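As a rough illustration of adding global contextual dependencies on top of local FCN features, the sketch below pools the feature map into a global context vector and concatenates it back before per-pixel classification, in the spirit of ParseNet-style global pooling heads. The module names are hypothetical and not taken from the paper; depth is assumed to be supplied as an extra input channel to the encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextHead(nn.Module):
    """Sketch: augment local FCN features with a global context vector
    obtained by global average pooling. (Hypothetical module names.)"""

    def __init__(self, feat_ch: int, num_classes: int):
        super().__init__()
        self.global_proj = nn.Conv2d(feat_ch, feat_ch, kernel_size=1)
        self.classifier = nn.Conv2d(2 * feat_ch, num_classes, kernel_size=1)

    def forward(self, local_feats: torch.Tensor) -> torch.Tensor:
        # local_feats: (N, C, H, W) from the FCN encoder (RGB or RGBD input)
        g = F.adaptive_avg_pool2d(local_feats, 1)        # (N, C, 1, 1) global context
        g = self.global_proj(g).expand_as(local_feats)   # broadcast over spatial dims
        fused = torch.cat([local_feats, g], dim=1)        # combine local and global cues
        return self.classifier(fused)                     # per-pixel class logits

# Extra depth information can be used by concatenating it as a fourth channel
# before the encoder, e.g.:
# x = torch.cat([rgb, depth], dim=1)   # (N, 4, H, W)
```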