This paper describes an approach to video sequence over-segmentation. The objective is to split a video into a set of disjoint spatio-temporal regions with homogeneous texture properties. We consider three possible region types: static texture, dynamic texture, and non-textured regions. Video over-segmentation is useful for a wide range of applications, including perceptual video coding, video-based object recognition, and high-level video segmentation. We treat the problem as a labeling problem on a Markov Random Field. The observed data are represented by the output of a fully-connected layer of a convolutional neural network trained on static and dynamic textures, and the hidden states of our model represent the corresponding region labels. To obtain a robust over-segmentation, we employ an energy function composed of terms that account for the similarity of neighboring voxels and the smoothness of the resulting supervoxels. We show that our approach can segment static and dynamic textures simultaneously. We have tested it on several video sequences rich in static and dynamic textures, with promising results.
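As a minimal sketch of the formulation just outlined, the labeling problem can be written as the minimization of a standard pairwise MRF energy; the symbols $D_p$, $V_{pq}$, $\lambda$, and $\mathcal{N}$ below are illustrative placeholders, not notation taken from the paper:

$$
E(\mathbf{l}) \;=\; \sum_{p} D_p(l_p) \;+\; \lambda \sum_{(p,q) \in \mathcal{N}} V_{pq}(l_p, l_q),
$$

where each voxel $p$ takes a label $l_p \in \{\text{static},\ \text{dynamic},\ \text{non-textured}\}$, the data term $D_p$ would be derived from the CNN fully-connected features described above, and the pairwise term $V_{pq}$ penalizes label disagreement between neighboring voxels, encouraging smooth supervoxels.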