Recent studies have brought to light that the semantic labels (e.g., Excellent, Good, Fair, Poor, and Bad) commonly associated with discrete-scale ITU subjective quality evaluation induce a bias in MOS computation, and that this bias can be quantified by reference coefficients that are independent of the observer panel. The present paper reconsiders these results from the perspective of upgrading the standard. First, it theoretically investigates how results obtained on semantically labeled scales can be "cleaned" of this influence and derives the underlying computation formula for the mean opinion score. Second, it proposes a unified evaluation procedure featuring both semantic-free MOS computation and backward compatibility with state-of-the-art solutions. The theoretical and methodological results are supported by subjective experiments involving a total of 440 human observers, who scored either 2D or stereoscopic video content. For each type of content, both high- and low-quality excerpts are considered, and a five-level grading scale (Excellent, Good, Fair, Poor, and Bad) is used in each case.
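For context, the conventional MOS is simply the average of the numerical values attached to the labels; the sketch below contrasts it with one plausible form of a label-bias-corrected score. The symbols n_k, s_k, and c_k are illustrative notation only, not the paper's own, and the corrected formula is a hedged assumption about what "cleaning" the semantic influence could look like, not the formula the paper actually derives.

```latex
% Hedged sketch: conventional MOS on a 5-level labeled scale, and one
% plausible form of a label-bias correction. The coefficients c_k are
% hypothetical stand-ins for the panel-independent reference
% coefficients mentioned in the abstract.
\[
\mathrm{MOS} \;=\; \frac{1}{N}\sum_{k=1}^{5} n_k\, s_k ,
\qquad N \;=\; \sum_{k=1}^{5} n_k ,
\]
\[
\mathrm{MOS}_{\text{corrected}} \;=\; \frac{1}{N}\sum_{k=1}^{5} n_k\,\bigl(s_k - c_k\bigr),
\]
% where n_k is the number of votes cast for label k, s_k is its
% numerical value (1 = Bad, ..., 5 = Excellent), and c_k is an assumed
% panel-independent coefficient quantifying the semantic bias of label k.
```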
Subjective quality assessment is considered a reliable method for assessing the quality of distorted stimuli in many multimedia applications. The experimental methods can be broadly categorized into those that rate stimuli and those that rank them. Although ranking directly provides an ordering of stimuli rather than a continuous measure of quality, the experimental data can be converted, using scaling methods, into an interval scale similar to that provided by rating methods. In this paper, we compare the results of a rating experiment (mean opinion scores) with the scaled results of a pairwise comparison experiment, the most common ranking method. We find a strong linear relationship between the results of the two methods, which, however, differs across contents. To improve this relationship and unify the scale, we extend the experiment to include cross-content comparisons. We find that cross-content comparisons not only reduce the confidence intervals of the pairwise comparison results but also improve their relationship with mean opinion scores.
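To illustrate what such scaling involves, the sketch below implements Thurstone Case V scaling, one common way to convert pairwise-comparison counts into an interval scale. This is an illustrative assumption: the abstract does not state which scaling model the paper uses, and the function name and clipping threshold are choices made here for the example.

```python
# Minimal sketch: Thurstone Case V scaling of pairwise-comparison data.
# One common method for converting ranking (paired-comparison) counts
# into an interval scale; the paper's exact scaling model is not
# specified in the abstract, so treat this as an assumption.
import numpy as np
from scipy.stats import norm

def thurstone_case_v(counts: np.ndarray) -> np.ndarray:
    """counts[i, j] = number of trials in which stimulus i was preferred
    over stimulus j. Returns one scale value per stimulus (zero-mean,
    in units of the assumed decision noise)."""
    trials = counts + counts.T
    # Empirical preference probabilities; pairs never compared get 0.5.
    with np.errstate(divide="ignore", invalid="ignore"):
        p = np.where(trials > 0, counts / trials, 0.5)
    # Clip to avoid infinite z-scores for unanimous comparisons
    # (a standard practical safeguard).
    p = np.clip(p, 0.01, 0.99)
    z = norm.ppf(p)              # probit transform of each pairwise proportion
    np.fill_diagonal(z, 0.0)
    return z.mean(axis=1)        # Case V: average z-score against all others

# Toy usage: 3 stimuli, counts[i, j] = wins of i over j.
counts = np.array([[0, 18, 25],
                   [7,  0, 20],
                   [5, 10,  0]])
print(thurstone_case_v(counts))  # higher value = higher estimated quality
```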