The analysis of complex structured data such as video has been a long-standing challenge for computer vision algorithms. Innovative deep learning architectures such as Convolutional Neural Networks (CNNs), however, are demonstrating remarkable performance in challenging image and video understanding tasks. In this work, we propose an architecture for the automated detection of scored points during tennis matches. We explore two CNN-based approaches for the analysis of video streams of broadcast tennis games. We first explore the two-stream approach, which extracts features from either pixel intensity values, via the analysis of grayscale frames, or motion-related information, encoded via optical flow. We then explore the use of higher-order 3D CNNs, which simultaneously encode both spatial and temporal correlations. Furthermore, we explore the late fusion of the individual streams in order to extract and encode both structural and motion spatio-temporal dynamics. We validate the merits of the proposed scheme using a novel, manually annotated dataset created from publicly available videos.