With the availability of fast internet connections and convenient imaging devices such as smartphones, videos have become increasingly popular and important content on social media platforms. They are widely adopted for purposes including, but not limited to, advertising, education, and entertainment. One important problem in understanding videos is thumbnail generation, which involves selecting one or a few images, typically frames, that are representative of the given video. These thumbnails can then be used not only as a summary display for videos, but also to represent them in downstream content models. Thus, thumbnail selection plays an important role in a user's experience when exploring and consuming videos. Given the large scale of video data, automatic thumbnail generation methods are needed, since manually selecting thumbnails for all videos is infeasible. In this paper, we propose a practical thumbnail generation method, designed to select representative, high-quality frames as thumbnails. Specifically, to capture the semantic information of video frames, we leverage the embeddings of
video frames generated by a state-of-the-art convolutional neural network pretrained in a supervised manner on external image data, and use them to find representative frames in a semantic space. To efficiently evaluate the quality of each frame, we train a linear model on top of the embeddings to predict quality instead of computing it from raw pixels. We conduct experiments on real videos and show that the proposed algorithm generates relevant and engaging thumbnails.
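As a concrete illustration of finding representative frames in a semantic space, the sketch below clusters frame embeddings and picks the frame nearest each cluster centroid. The k-means clustering step and the 2048-dimensional embeddings (e.g., from a ResNet-50 pooling layer) are assumptions made for illustration, not a specification of the full pipeline.

```python
# Illustrative sketch: select representative frames by clustering frame
# embeddings in semantic space. The k-means grouping is an assumption for
# illustration; embeddings are assumed to come from a pretrained CNN,
# one vector per sampled frame.
import numpy as np
from sklearn.cluster import KMeans


def select_representative_frames(embeddings: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k frames whose embeddings lie closest to the
    k cluster centroids, i.e., the most central frame of each cluster."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    indices = []
    for center in kmeans.cluster_centers_:
        # Frame whose embedding is nearest to this centroid.
        dists = np.linalg.norm(embeddings - center, axis=1)
        indices.append(int(np.argmin(dists)))
    return indices


# Usage: 200 sampled frames with hypothetical 2048-dim CNN embeddings.
rng = np.random.default_rng(0)
frame_embeddings = rng.normal(size=(200, 2048)).astype(np.float32)
print(select_representative_frames(frame_embeddings, k=3))
```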
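Similarly, the following sketch illustrates training a linear model on top of precomputed embeddings to score frame quality; the ridge-regression objective and the synthetic quality labels are assumptions for illustration. Because scoring reduces to a single linear map over an embedding that is already computed for representative-frame selection, quality evaluation adds negligible cost compared with computing quality from raw pixels.

```python
# Illustrative sketch: a linear quality model trained on top of frame
# embeddings. The quality labels and the ridge-regression objective are
# assumptions for illustration; the point is that scoring a frame becomes
# a single dot product over a precomputed embedding.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(1000, 2048)).astype(np.float32)
train_quality = rng.uniform(0.0, 1.0, size=1000)  # hypothetical labels

model = Ridge(alpha=1.0).fit(train_embeddings, train_quality)

# Scoring candidate frames reuses embeddings already computed upstream,
# so quality evaluation is cheap at inference time.
candidate_embeddings = rng.normal(size=(5, 2048)).astype(np.float32)
print(model.predict(candidate_embeddings))
```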