Reference-based image quality assessment techniques use information from an undistorted reference image of the same scene to estimate the quality of a distorted target image. The main challenge in designing quality assessment algorithms is to incorporate the behavior of the human visual system. The advent of deep learning (DL) techniques has attracted considerable interest among researchers in the field of image quality assessment. A common limitation of applying deep learning to image quality assessment is its dependence on a large amount of subjectively annotated training data. Recent patch-based self-supervised vision transformers have achieved remarkable results on tasks such as object segmentation, copy detection, and other downstream computer vision tasks. In this paper, we study how the distance between features extracted by a pretrained self-supervised vision transformer from pristine and distorted images relates to human perception of image quality. Experiments carried out on three publicly available image quality databases (namely KADID-10K, TID2013, and MDID2016) have yielded promising results that can be further exploited to design perceptual reference-based image quality assessment methods.
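
The kind of analysis described above can be illustrated with a minimal sketch. It assumes per-patch features have already been extracted for the pristine and distorted images; the mean cosine distance and the Spearman rank correlation used here are illustrative choices and are not stated in the text above. The random arrays stand in for transformer features, and the opinion scores are hypothetical values, not data from any of the three databases.

```python
import numpy as np


def patch_feature_distance(ref_feats, dist_feats, eps=1e-12):
    """Mean cosine distance between corresponding patch features.

    Both inputs have shape (num_patches, feature_dim). Cosine distance
    is an assumed choice of metric for this sketch.
    """
    ref = np.asarray(ref_feats, dtype=np.float64)
    dist = np.asarray(dist_feats, dtype=np.float64)
    if ref.shape != dist.shape:
        raise ValueError("reference and distorted features must match in shape")
    dot = np.sum(ref * dist, axis=1)
    norms = np.linalg.norm(ref, axis=1) * np.linalg.norm(dist, axis=1) + eps
    return float(np.mean(1.0 - dot / norms))


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=np.float64)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman_correlation(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


# Stand-in data: random "patch features" for one pristine image and four
# increasingly distorted versions of it (additive noise of growing strength).
rng = np.random.default_rng(0)
reference = rng.normal(size=(196, 384))   # placeholder shape, not a real model output
noise = rng.normal(size=reference.shape)
noise_levels = [0.1, 0.5, 1.0, 2.0]
distances = [
    patch_feature_distance(reference, reference + level * noise)
    for level in noise_levels
]

# Hypothetical subjective scores (higher means better perceived quality).
hypothetical_mos = [4.5, 3.8, 2.9, 1.7]

print("feature distances:", distances)
print("rank correlation with hypothetical scores:",
      spearman_correlation(distances, hypothetical_mos))
```

In an actual study the stand-in arrays would be replaced by features from the pretrained transformer, and the hypothetical scores by the subjective ratings shipped with each database. A strongly negative rank correlation would indicate that larger feature distances accompany lower perceived quality.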