Imitation learning is widely used in autonomous driving to train networks that predict steering commands from camera frames, using annotated data collected from an expert driver. Assuming that frames from a front-facing camera fully capture what the driver sees raises the question of how the eyes, and the attention mechanisms of the complex human visual system, actually perceive the scene. This paper proposes incorporating eye-gaze information alongside the frames in an end-to-end deep neural network for the lane-following task. The proposed architecture, GG-Net, is composed of a spatial transformer network (STN) and a multitask network that predicts both the steering angle and the gaze map for the input frame. Experimental results show a 36% improvement in steering-angle prediction accuracy over the baseline, with an inference time of 0.015 seconds per frame (about 66 fps) on an NVIDIA K80 GPU, enabling the proposed model to operate in real time. We argue that incorporating gaze maps enhances the model's ability to generalize to unseen environments. Additionally, a novel course-to-steering-angle conversion algorithm, together with a supporting mathematical proof, is proposed.
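To make the multitask formulation concrete, the following is a minimal PyTorch-style sketch of a shared feature map feeding two heads, one regressing the scalar steering angle and one decoding a gaze (saliency) map. It is an illustrative assumption of how such a head could be structured, not the paper's actual GG-Net implementation; all module names, channel sizes, and the loss weighting are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskHead(nn.Module):
    """Illustrative multitask head: shared features -> steering angle + gaze map."""

    def __init__(self, in_channels=64, map_size=(36, 64)):
        super().__init__()
        self.map_size = map_size
        # Regression head for the scalar steering angle.
        self.steer_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_channels, 1),
        )
        # Decoder head producing a single-channel gaze (saliency) map.
        self.gaze_head = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=1),
        )

    def forward(self, feats):
        steering = self.steer_head(feats)            # (N, 1)
        gaze = torch.sigmoid(self.gaze_head(feats))  # (N, 1, h, w)
        gaze = F.interpolate(gaze, size=self.map_size,
                             mode="bilinear", align_corners=False)
        return steering, gaze


# Joint training would combine the two objectives with some weight lambda_g, e.g.:
#   loss = mse(steering, target_angle) + lambda_g * bce(gaze, target_gaze_map)
```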