In operating rooms, excessive cognitive stress can impede a surgeon's performance, while low engagement can lead to avoidable mistakes due to complacency. As a consequence, there is a strong desire in the surgical community to monitor and quantify the cognitive stress of a surgeon while performing surgical procedures. Quantitative cognitive-load-based feedback can also provide valuable insights during surgical training to optimize training efficiency and effectiveness. Various physiological measures have been evaluated for quantifying cognitive stress under different mental challenges. In this paper, we present a study that uses cognitive stress, measured by the task-evoked pupillary response extracted from time-series eye-tracking measurements, to predict task difficulty in a virtual-reality-based robotic surgery training environment. In particular, we propose a differential-task-difficulty scale, employ a comprehensive feature-extraction approach, implement a multitask learning framework, and compare the regression accuracy of the conventional single-task-based approach and three multitask approaches across subjects.
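To make the multitask setup concrete, the following is a minimal sketch of a hard-parameter-sharing multitask regressor in PyTorch, where each subject is treated as a separate task with its own regression head over a shared feature trunk. The feature dimension, hidden size, number of subjects, and loss are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: hard-parameter-sharing multitask regression of task
# difficulty from pupillary-response feature vectors, one head per subject.
# Dimensions and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class MultitaskDifficultyRegressor(nn.Module):
    def __init__(self, n_features: int, n_subjects: int, hidden: int = 64):
        super().__init__()
        # Shared trunk: representations common to all subjects.
        self.shared = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One regression head per subject (each subject is one "task").
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_subjects)])

    def forward(self, x: torch.Tensor, subject_id: int) -> torch.Tensor:
        # x: (batch, n_features) pupillary-response features.
        return self.heads[subject_id](self.shared(x)).squeeze(-1)

# Illustrative setup: sum an MSE loss over each subject's mini-batch.
model = MultitaskDifficultyRegressor(n_features=32, n_subjects=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
```

A single-task baseline would train one independent regressor per subject with no shared trunk, which is the kind of comparison the study describes.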
Imitation learning is widely used in autonomous driving to train networks that predict steering commands from frames, using annotated data collected by an expert driver. Assuming that frames taken from a front-facing camera completely mimic the driver's eyes raises the question of how the eyes, and the complex attention mechanisms of the human visual system, actually perceive the scene. This paper proposes incorporating eye-gaze information, alongside the frames, into an end-to-end deep neural network for the lane-following task. The proposed novel architecture, GG-Net, is composed of a spatial transformer network (STN) and a multitask network that predicts both the steering angle and the gaze map for the input frame. Experimental results show a 36% improvement in steering-angle prediction accuracy over the baseline, with an inference time of 0.015 seconds per frame (66 fps) on an NVIDIA K80 GPU, enabling the proposed model to operate in real time. We argue that incorporating gaze maps enhances the model's ability to generalize to unseen environments. Additionally, a novel course-to-steering-angle conversion algorithm, with an accompanying mathematical proof, is proposed.
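As an illustration of the multitask prediction described above, here is a minimal sketch of a two-headed network that regresses a steering angle and predicts a coarse gaze map from a shared convolutional encoder. The backbone, layer sizes, and the omission of the spatial transformer module are simplifying assumptions and do not reproduce GG-Net's actual architecture.

```python
# Minimal sketch: shared CNN encoder with a steering-angle head and a
# gaze-map head. Architecture details are illustrative assumptions.
import torch
import torch.nn as nn

class GazeSteeringNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared convolutional encoder over the input frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Head 1: scalar steering-angle regression.
        self.steering_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
        )
        # Head 2: coarse gaze-map prediction, upsampled to the frame size.
        self.gaze_head = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
        )

    def forward(self, frame: torch.Tensor):
        # frame: (batch, 3, H, W) with H and W divisible by 8.
        feats = self.encoder(frame)
        steering = self.steering_head(feats).squeeze(-1)
        gaze_map = torch.sigmoid(self.gaze_head(feats))
        return steering, gaze_map

# A joint objective would combine an MSE loss on the steering angle with a
# pixel-wise loss on the gaze map, weighted by an illustrative factor.
```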
Advanced perception and path planning are at the core of any self-driving vehicle. Autonomous vehicles need to understand the scene and the intentions of other road users for safe motion planning. For urban use cases, it is particularly important to perceive and predict the intentions of pedestrians, cyclists, scooter riders, and similar road users, classified as vulnerable road users (VRUs). Intent is a combination of pedestrian activities and long-term trajectories defining their future motion. In this paper we propose a multi-task learning model that predicts pedestrian actions and crossing intent and forecasts their future paths from video sequences. We train the model on the naturalistic-driving, open-source JAAD [1] dataset, which is rich in behavioral annotations and real-world scenarios. Experimental results show state-of-the-art performance on the JAAD dataset and demonstrate the benefit of jointly learning and predicting actions and trajectories using 2D human-pose features and scene context.
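The joint prediction of actions, crossing intent, and trajectory can be sketched as a shared encoder over 2D-pose sequences fused with a scene-context vector, feeding three task heads. The encoder choice (a GRU), feature dimensions, number of action classes, and prediction horizon below are illustrative assumptions rather than the paper's exact design.

```python
# Minimal sketch: shared pose/context encoder with action, crossing-intent,
# and trajectory heads. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class PedestrianIntentNet(nn.Module):
    def __init__(self, pose_dim=34, ctx_dim=128, hidden=128,
                 n_actions=4, horizon=30):
        super().__init__()
        # Temporal encoder over the 2D-pose sequence (e.g. 17 joints x 2 coords).
        self.pose_encoder = nn.GRU(pose_dim, hidden, batch_first=True)
        # Fuse the pose summary with a precomputed scene-context vector.
        self.fuse = nn.Linear(hidden + ctx_dim, hidden)
        self.action_head = nn.Linear(hidden, n_actions)   # action classification
        self.crossing_head = nn.Linear(hidden, 1)         # crossing-intent probability
        self.traj_head = nn.Linear(hidden, horizon * 2)   # future (x, y) positions
        self.horizon = horizon

    def forward(self, pose_seq: torch.Tensor, scene_ctx: torch.Tensor):
        # pose_seq: (batch, T, pose_dim); scene_ctx: (batch, ctx_dim)
        _, h = self.pose_encoder(pose_seq)
        z = torch.relu(self.fuse(torch.cat([h[-1], scene_ctx], dim=-1)))
        actions = self.action_head(z)
        crossing = torch.sigmoid(self.crossing_head(z)).squeeze(-1)
        trajectory = self.traj_head(z).view(-1, self.horizon, 2)
        return actions, crossing, trajectory
```

Training such a model would sum a classification loss on actions, a binary cross-entropy loss on crossing intent, and a regression loss on the forecast trajectory.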