A novel human–computer interface is introduced, based on tongue and lip movements and using video data from a commercially available camera. The size and direction of the movements are extracted and can be mapped to cursor actions or other control tasks. Movement detection is based on convolutional neural networks. The applicability of the proposed solution is demonstrated on the ASSISLT system [1], which supports speech therapy for adults and children with congenital and acquired motor speech disorders. The system focuses on individual treatment using exercises that improve tongue motility and thus articulation. It offers an adjustable set of exercises whose correct performance is encouraged through augmented reality. Automatic evaluation of the performed therapeutic movements allows the therapist to objectively track the progress of the treatment.
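The abstract does not detail how extracted movements are turned into cursor actions. The following is a minimal sketch of one plausible mapping, assuming (hypothetically, since the paper does not specify this interface) that the CNN yields per-frame tongue-tip pixel coordinates; the function name, threshold, and action labels are illustrative only.

```python
import math

def movement_to_action(p_prev, p_curr, min_dist=5.0):
    """Map a tongue-tip displacement between two frames to a cursor action.

    p_prev, p_curr: (x, y) pixel coordinates, assumed to come from the
    movement-detection CNN (a hypothetical interface for illustration).
    min_dist: minimum displacement in pixels; smaller moves are ignored.
    """
    dx = p_curr[0] - p_prev[0]
    dy = p_curr[1] - p_prev[1]
    size = math.hypot(dx, dy)          # movement size
    if size < min_dist:                # suppress jitter below the threshold
        return "idle"
    # Image y-axis points down, so negate dy to get conventional angles.
    angle = math.degrees(math.atan2(-dy, dx)) % 360
    if 45 <= angle < 135:
        return "move_up"
    if 135 <= angle < 225:
        return "move_left"
    if 225 <= angle < 315:
        return "move_down"
    return "move_right"
```

In a real system the four discrete actions could be replaced by proportional cursor velocity, using `size` as the speed and the angle directly as the direction.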