To avoid the manual collection of the huge amount of labeled image data needed to train autonomous driving models, this paper proposes a novel method for automatically collecting annotated image data for autonomous driving through a translation network that transforms simulation CG images into real-world images. The translation network is designed as an end-to-end structure containing two encoder-decoder networks. The front part of the translation network represents the structure of the original simulation CG image as a semantic segmentation. The rear part then translates the segmentation into a real-world image by applying a cGAN. After training, the translation network has learned a mapping from simulation CG pixels to real-world image pixels. To confirm the validity of the proposed system, we conducted three experiments under different learning policies, evaluating the MSE of the steering angle and the vehicle speed. The first experiment demonstrates that the L1+cGAN loss performs best among all loss functions tested in the translation network. The second experiment shows that the ResNet architecture works best among the configurations compared. The third experiment demonstrates that a model trained on the real-world images generated by the translation network still performs well in the real world. All experimental results demonstrate the validity of the proposed method.
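The L1+cGAN objective referenced above follows the familiar pix2pix-style formulation. A minimal sketch of the generator side of such a loss is shown below, assuming a generator G, a conditional discriminator D, and an L1 weighting factor; these names and the weight value are illustrative assumptions, not details taken from the paper's implementation.

```python
import torch
import torch.nn as nn

# Sketch of an L1 + cGAN generator objective (pix2pix-style).
# G, D, lambda_l1, and the tensor shapes are assumptions for illustration.
adv_criterion = nn.BCEWithLogitsLoss()  # adversarial term
l1_criterion = nn.L1Loss()              # pixel-wise reconstruction term
lambda_l1 = 100.0                       # common pix2pix weighting; assumed here

def generator_loss(G, D, segmentation, real_image):
    """Translate a semantic segmentation into an image and score it."""
    fake_image = G(segmentation)
    # The conditional discriminator sees the input segmentation and the output.
    pred_fake = D(torch.cat([segmentation, fake_image], dim=1))
    adv_loss = adv_criterion(pred_fake, torch.ones_like(pred_fake))
    rec_loss = l1_criterion(fake_image, real_image)
    return adv_loss + lambda_l1 * rec_loss
```

The L1 term keeps the translated image close to the target at the pixel level, while the adversarial term pushes it toward the distribution of real-world images; removing either term corresponds to the weaker loss variants compared in the first experiment.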
We have developed a semi-automatic annotation tool – “CVL Annotator” – for bounding-box ground-truth generation in videos. Our research is particularly motivated by the need for reference annotations of challenging nighttime traffic scenes with highly dynamic lighting conditions caused by reflections, headlights, and halos from oncoming traffic. Our tool incorporates a suite of state-of-the-art tracking algorithms to minimize the amount of human input required to generate high-quality ground-truth data. The user interface is designed around the premise of minimizing user interaction while visualizing all information relevant to the user at a glance. We performed a preliminary user study measuring the time and the number of clicks needed to produce ground-truth annotations of video traffic scenes, and we evaluated the accuracy of the final annotation results.
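The core idea of tracker-assisted annotation is that the annotator draws a box once and a tracker propagates it across subsequent frames, with manual correction only where tracking drifts. A minimal sketch of that workflow using a single off-the-shelf OpenCV tracker is shown below; the video file name and the choice of tracker are illustrative assumptions, whereas the described tool combines several trackers behind a dedicated user interface.

```python
import cv2

# Tracker-assisted bounding-box propagation: one manual box, then tracking.
video = cv2.VideoCapture("night_traffic.mp4")  # assumed input clip
ok, frame = video.read()

# Manual step: the annotator draws the initial bounding box once.
init_box = cv2.selectROI("init", frame)

tracker = cv2.TrackerCSRT_create()  # may live under cv2.legacy on some builds
tracker.init(frame, init_box)

annotations = [init_box]
while True:
    ok, frame = video.read()
    if not ok:
        break
    tracked, box = tracker.update(frame)
    if tracked:
        annotations.append(tuple(int(v) for v in box))
    else:
        # Tracking failed: flag this frame for manual correction.
        annotations.append(None)

video.release()
```

In a real annotation session, the flagged frames (and any visibly drifting boxes) are the only places where the user intervenes, which is what the time-and-clicks measurements in the user study quantify.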