VISION MODEL-BASED ASSESSMENT OF DISTORTION MAGNITUDES IN DIGITAL VIDEO
Jeffrey Lubin, Ph.D., Michael H. Brill, Ph.D., Roger L. Crane, Ph.D.
David Sarnoff Research Center
Princeton, NJ, USA 08543-5300
It is often useful to know the perceptual impact of distortions introduced at critical points in the production, distribution and display of video. Direct measurement using human observers is possible for some applications, but this is a time-consuming operation that must be performed under carefully controlled conditions. A faster and more easily standardized approach is to use a vision model that provides accurate estimates of the visibility of differences between original and distorted image sequences. A model that fulfills this need is the Sarnoff Just-Noticeable Difference (JND) Model. In this paper, the operation and general structure of the JND Model will be described, and its performance in a range of video applications will be discussed.
1. Introduction - The Need for a Vision Model
As a video signal moves from light in to light out, each intervening processing system (e.g., capture, encoding, transmission, decoding, and display) can introduce visible distortions in the final video output. A rigorous experimental technique for evaluating the perceived magnitudes of the distortions is illustrated in Fig. 1.

For example, as shown in Fig. 1a, evaluation of a particular encoder can be achieved by sending the same video signal through two different encoders - the encoder under test and a high quality reference encoder - and then gathering perceptual measurements, e.g., through ITU-R Rec. 500 testing, to determine the magnitude of perceived distortions in the test sequence, relative to the reference. Similar techniques can be used for other component tests, e.g., for transmission fidelity (Fig. 1b) or display performance (Fig. 1c).
A major problem with this approach to system component evaluation is that accurate perceptual measurement is a costly and time-consuming process. Moreover, it is difficult to control, since the performance of any one component often depends strongly on other components in the system. For example, visible differences between two encoders may be apparent on a high quality studio monitor, but not on a typical home television.
A faster and more easily standardized approach is to use a vision model that automatically and accurately assesses the perceptual magnitude of differences between a test and reference sequence. As illustrated in Figure 2, a useful output representation for such a model is a Just Noticeable Difference (JND) Map, on which each point represents an estimate of the magnitude of perceptual differences between the test and reference sequence at that point. For example, the JND Map in Figure 2c shows a relatively bright region in the area corresponding to the numbers on the front of the tram in Figures 2a and 2b, thus indicating a visible distortion on these numbers in the test image (Figure 2b).

Fig. 2 - Typical inputs (a, b) and JND Map output (c) for the Vision Model
The JND Map in Figure 2 is an actual output from the Sarnoff JND Model, which will be described in some detail in the next section. In this model, the JND unit of measure is functionally defined such that 1 JND corresponds to a 75% probability that an observer viewing the two images multiple times would be able to see the difference. JND values above 1 are then calculated incrementally. For example, if Image Y is 1 JND higher in contrast than Image X, and Image Z is 1 JND higher in contrast than Image Y, then Image Z is 2 JNDs higher in contrast than Image X. In probability terms, this 2 JND difference corresponds to a 93.75% probability of discrimination (0.75 + 0.75 × (1 - 0.75)), and a 3 JND difference corresponds to a 98.44% probability. Although probability of discrimination asymptotes quickly as a function of JNDs, the units are useful because they correspond to roughly linear magnitudes of subjective visual difference, as will be shown in some of the results below.
Given a model of this form, the subjective testing illustrated in Figure 1 can be replaced with vision model testing, as shown in Figure 3 below. Here, reference and test sequences are presented to the model, rather than to a human observer. Then, depending on the application, the resulting JND Map sequence can be interpreted either directly, or after statistics have been computed on it to return a summary measure.
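Returning to the definition of the JND unit, the incremental rule behind the 93.75% and 98.44% figures can be written compactly. The sketch below assumes that each additional JND acts as one more 75% chance to see the difference, which reproduces the two values quoted in the text; the function name and the closed form 1 - (1 - p)^n are illustrative and are not taken from the model's implementation.

```python
def discrimination_probability(jnds, p_one_jnd=0.75):
    """Probability of discrimination after a whole number of JND steps.

    Assumption: every extra JND multiplies the remaining miss probability
    by (1 - p_one_jnd), i.e. P(n) = P(n-1) + p_one_jnd * (1 - P(n-1)).
    """
    if jnds < 0:
        raise ValueError("jnds must be non-negative")
    return 1.0 - (1.0 - p_one_jnd) ** jnds


for n in (1, 2, 3):
    print(n, f"{100 * discrimination_probability(n):.2f}%")
# prints:
# 1 75.00%
# 2 93.75%
# 3 98.44%
```

The rapid saturation visible here is the reason the text works in JND units rather than in probabilities.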

For example, an average across the JND Map sequence has proven itself to be a useful summary measure for predicting subjective image quality ratings (see prediction plots in Section 3 below), while a maximum across the maps is useful to determine if any distortions are visible from the system under test. The speed and automatic operation of the vision model approach to system performance evaluation also make it amenable to applications like encoding in which performance evaluations can be fed back in real time to optimize the system operation, as shown for example in Figure 4.
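The two summary measures mentioned above, the average and the maximum over a JND Map sequence, are simple to compute once the maps exist. The following is a minimal sketch that assumes each map is a list of rows of non-negative JND values; the function name and the data layout are hypothetical, and the sample values are made up purely to exercise the code.

```python
def summarize_jnd_maps(jnd_maps):
    """Return (mean, maximum) over a sequence of 2-D JND maps.

    jnd_maps: list of maps, each a list of rows of JND values.
    """
    values = [v for jnd_map in jnd_maps for row in jnd_map for v in row]
    if not values:
        raise ValueError("empty JND map sequence")
    return sum(values) / len(values), max(values)


# Two tiny 2x2 maps with illustrative (not measured) values.
maps = [[[0.0, 0.5], [1.5, 0.0]], [[0.0, 0.0], [2.0, 0.0]]]
mean_jnd, max_jnd = summarize_jnd_maps(maps)
any_visible = max_jnd >= 1.0  # 1 JND is the visibility criterion used in the text
```

The mean would feed a quality-rating prediction, while the maximum answers the yes/no question of whether any distortion is visible.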

In Section 2 below, some background and general principles of the Sarnoff JND Model architecture will be described (see Lubin, 1993, 1995, for a more detailed description). Then, in Section 3, the model's performance in predicting a wide range of perceptual measurements will be shown and evaluated.
2. The Sarnoff JND Model
Figure 5 shows an overview of the JND Model architecture. The inputs are two image sequences of arbitrary length. As shown in the figure, each field of each input sequence is represented as a trio of R', G', B' images, wherein the pixel values represent the modeled electron-gun voltages that would give rise to the displayed pixel. In the first stage of the model, labeled Front End Processing in the figure, the voltage units are transformed to light output units to obtain luminance (Y), and then to the psychophysically defined quantities of the CIE L*u*v* uniform color space to obtain the two channels (u*, v*) of the model's chrominance pathway.
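The front-end conversion described above can be sketched as follows. The paper does not state the display parameters, so this example assumes a pure power-law display with gamma 2.2, Rec. 709 primaries, and a D65 white point; it uses the standard CIE 1976 formulas for L*, u* and v*. All names are illustrative, and Y is expressed relative to display white rather than in absolute units.

```python
# Assumptions (not from the paper): a pure power-law display with gamma 2.2,
# Rec. 709 primaries, and a D65 white point.
GAMMA = 2.2
RGB_TO_XYZ = (
    (0.4124, 0.3576, 0.1805),
    (0.2126, 0.7152, 0.0722),
    (0.0193, 0.1192, 0.9505),
)


def uv_prime(x, y, z):
    """CIE 1976 u', v' chromaticity coordinates; (0, 0) for black."""
    denom = x + 15.0 * y + 3.0 * z
    if denom <= 0.0:
        return 0.0, 0.0
    return 4.0 * x / denom, 9.0 * y / denom


WHITE_XYZ = tuple(sum(row) for row in RGB_TO_XYZ)  # display white, R=G=B=1
WHITE_UV = uv_prime(*WHITE_XYZ)


def front_end(r_volt, g_volt, b_volt):
    """Map normalized R', G', B' voltages in [0, 1] to (Y, u*, v*)."""
    # Voltage -> linear light (assumed power law).
    linear = [min(max(v, 0.0), 1.0) ** GAMMA for v in (r_volt, g_volt, b_volt)]
    x, y, z = (sum(m * c for m, c in zip(row, linear)) for row in RGB_TO_XYZ)
    # CIE 1976 lightness, needed to scale the chrominance coordinates.
    t = y / WHITE_XYZ[1]
    if t > (6.0 / 29.0) ** 3:
        lightness = 116.0 * t ** (1.0 / 3.0) - 16.0
    else:
        lightness = (29.0 / 3.0) ** 3 * t
    u_p, v_p = uv_prime(x, y, z)
    u_star = 13.0 * lightness * (u_p - WHITE_UV[0])
    v_star = 13.0 * lightness * (v_p - WHITE_UV[1])
    return y, u_star, v_star
```

Applied to every pixel of every field, this yields one luminance image and two chrominance images per field, matching the luma and chroma pathways described in the text.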

In the next stage of the model, labeled Pyramid Decomposition, each sequence is filtered and down-sampled using a Gaussian pyramid operation (Burt and Adelson, 1983) to efficiently generate a range of spatial resolutions for subsequent filtering operations. Next, the Normalization stage sets the overall gain with a time-dependent average luminance, to model the visual system's relative insensitivity to overall light level, and to represent such effects as the loss of visual sensitivity after a transition from a bright to a dark scene.
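The pyramid decomposition step can be illustrated with a small pure-Python Gaussian pyramid. This sketch assumes the 5-tap binomial kernel commonly used with the Burt and Adelson construction and mirrored borders; the kernel weights, border handling and function names are choices made for the example rather than details taken from the model, and the normalization stage is not shown.

```python
KERNEL = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)  # 5-tap binomial low-pass


def _mirror(i, n):
    """Reflect an out-of-range index back into [0, n); valid for n >= 3."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * n - 2 - i
    return i


def _blur_rows(img):
    """Low-pass filter every row with KERNEL, mirroring at the borders."""
    width = len(img[0])
    return [
        [sum(k * row[_mirror(x + d - 2, width)] for d, k in enumerate(KERNEL))
         for x in range(width)]
        for row in img
    ]


def _transpose(img):
    return [list(col) for col in zip(*img)]


def reduce_level(img):
    """Blur horizontally and vertically, then keep every second sample."""
    if len(img) < 3 or len(img[0]) < 3:
        raise ValueError("image too small for a 5-tap kernel")
    blurred = _transpose(_blur_rows(_transpose(_blur_rows(img))))
    return [row[::2] for row in blurred[::2]]


def gaussian_pyramid(img, levels=4):
    """Return [level 0 (full resolution), level 1, ...], one octave apart."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(reduce_level(pyramid[-1]))
    return pyramid
```

For a 16 x 16 input and four levels, the pyramid holds 16 x 16, 8 x 8, 4 x 4 and 2 x 2 images, each covering the same field of view at half the resolution of the previous one.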
After normalization, three separate contrast measures are calculated. In each case, the contrast is a local difference of pixel values divided by a local sum, appropriately scaled as a function of pyramid level so that the result is 1 when the image contrast is at the human detection threshold. This establishes the definition of 1 JND, which is passed on to subsequent stages of the model. Figure 6 shows the fit of the model to spatial (top panel) and temporal (bottom panel) contrast sensitivity data from Van Doorn and Koenderink (1979).
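The contrast definition above, a local difference divided by a local sum and scaled so that threshold contrast maps to 1, can be sketched in a few lines. The function name, the choice of a center and a surround value as the two local quantities, and the threshold value in the usage example are all assumptions made for illustration; the model's actual per-level scale factors come from the fits shown in Figure 6.

```python
def local_contrast_jnd(center, surround, threshold_contrast):
    """Threshold-normalized local contrast.

    center, surround:    two local (non-negative) pixel averages
    threshold_contrast:  contrast at the detection threshold for this
                         pyramid level (hypothetical calibration value)
    """
    total = center + surround
    if total <= 0.0:
        return 0.0  # no light, no contrast
    return (center - surround) / total / threshold_contrast


# A 1% contrast pattern at a level whose threshold is assumed to be 1%.
value = local_contrast_jnd(101.0, 99.0, threshold_contrast=0.01)
# value == 1.0, i.e. exactly at threshold
```

Because the result is already expressed relative to threshold, later stages can treat a value of 1 as 1 JND without further unit conversion.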


Fig. 6 - Spatial (top panel) and temporal (bottom panel) contrast sensitivity: Data and model fit
In both cases, deviations of the model fits from the data are the result of specific implementation choices designed to speed model operations. So, for example, the scallops evident in the spatial contrast sensitivity plot (top panel) are the result of using 4 pyramid levels with octave-wide separation. Decreasing the separation of the levels and increasing their number would smooth out the fit, but add to the complexity. Similarly, the excessively narrow temporal frequency tuning in the bottom panel could be removed by using more fields than the current four in the flicker contrast calculation.
In the Contrast Energy Masking stage, each contrast image is subjected to a point non-linearity, the gain of which is controlled by the response across other resolution levels and channels. This gain-setting is included to model visual masking effects such as the decrease in sensitivity to distortions in "busy" image regions. The parameters of the point non-linearity at this stage are fit according to contrast discrimination data (Carlson and Cohen, 1978), in which the contrast increment needed to detect the change in contrast is measured as a function of the contrast from which the change is made. The results of this fit are shown in Figure 7 below.
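The idea of a point non-linearity whose gain is set by activity elsewhere can be illustrated with a generic divisive gain control. The functional form, exponent and gain below are hypothetical stand-ins chosen for clarity; they are not the non-linearity fitted to the contrast discrimination data in the paper.

```python
def masked_response(contrast, masker_energy, exponent=2.0, gain=1.0):
    """Point non-linearity with divisive gain control (hypothetical form).

    contrast:      threshold-normalized contrast at one pixel (1.0 = threshold)
    masker_energy: pooled, non-negative contrast energy from other
                   resolution levels and channels at the same location
    """
    if masker_energy < 0:
        raise ValueError("masker_energy must be non-negative")
    return abs(contrast) ** exponent / (1.0 + gain * masker_energy)


quiet = masked_response(1.0, masker_energy=0.0)  # 1.0 in a flat region
busy = masked_response(1.0, masker_energy=3.0)   # 0.25 in a "busy" region
```

The same threshold-level contrast produces a smaller response when the surrounding energy is high, which is the qualitative masking behavior the stage is meant to capture.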

Fig. 7 - Contrast discrimination: data and model fit
Next, in the Difference Metric stage, outputs from the test and reference sequences are combined via a simple difference operator and then summed across pyramid levels and channels to return the number of JNDs in both luma and chroma. Separate JND maps for luma and chroma can then be combined into one map. If desired, summary statistics can also be obtained at this point.
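A minimal sketch of the difference metric follows. It assumes that the masked responses for every pyramid level and channel have already been resampled to a common grid, and it uses an absolute difference followed by a plain sum, following the description above; the function name and data layout are illustrative.

```python
def difference_map(test_responses, ref_responses):
    """Per-pixel sum of |test - reference| across levels and channels.

    Each argument is a list of response images (one per level/channel),
    every image being a list of rows on the same pixel grid.
    """
    if len(test_responses) != len(ref_responses):
        raise ValueError("test and reference must have the same channels")
    height, width = len(test_responses[0]), len(test_responses[0][0])
    out = [[0.0] * width for _ in range(height)]
    for test_img, ref_img in zip(test_responses, ref_responses):
        for y in range(height):
            for x in range(width):
                out[y][x] += abs(test_img[y][x] - ref_img[y][x])
    return out


ref = [[[1.0, 2.0]], [[0.5, 0.5]]]    # two channels, each 1 x 2
test = [[[1.0, 3.0]], [[0.0, 0.5]]]
jnd_map = difference_map(test, ref)   # [[0.5, 1.0]]
```

Running the same function separately on the luma and chroma responses gives the two maps that the text says can then be merged or summarized.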
3. Model Predictions
After calibration, the JND Model accurately predicts a large amount of human visual performance data, both in detection and discrimination tasks and, perhaps more surprisingly, in rating tasks in which the observer is asked to estimate a magnitude of subjective image quality. In this section, many of the experiments on which the model has been tested to date will be described. For each experiment, a plot showing data vs. model predictions will be presented. It is important to note that for all these predictions, no additional model parameter fitting was performed. In some cases (as in Figures 8, 9, and 10) the predictions were obtained from a fast version of the model with restricted spatial resolution; hence, predictions were not always obtainable against the full range of stimulus conditions.
Figure 8 shows model predictions against disk detection data from Blackwell (1971). In this experiment, threshold contrast ratios between a small disk and its surround were measured as a function of disk diameter. Figure 9 shows data and predictions for a checkerboard detection experiment conducted at Sarnoff. In this experiment, the stimulus was a 5 × 5 black/white checkerboard pattern, which could vary in size and contrast. Contrast detection thresholds were measured in a two-alternative spatial forced-choice staircase paradigm, as a function of checkerboard size, and are plotted (with four replications per data point) as filled diamonds. These data are of special interest because of the similarity of the stimulus patterns to the block-like artifacts associated with DCT-based compression. As seen in the plot, the model predictions are quite good.
Fig. 8 - Model predictions on disk detection data from Blackwell (1971)

Fig. 9 - Data and model predictions on a checkerboard detection experiment
Figure 10 shows data and model predictions for an edge sharpness discrimination task of Carlson and Cohen (1980). In this task, observers were asked to discriminate a change in sharpness of an intensity edge as a function of the base sharpness of that edge. This task is interesting because it quantifies the visibility of losses in image sharpness that can result from degradations at various points in the video chain.

Fig. 10 - Edge sharpness discrimination: data and model fit
Another interesting task for some applications is from a Gille et al. (1994) study designed to assess the visibility of trade-offs between grayscale and pixel resolution in the design and manufacture of displays. In some display systems, it is possible at a fixed manufacturing cost to trade off grayscale resolution (i.e., number of gray levels) against pixel resolution (i.e., number of pixels per unit area). Therefore, it is extremely useful to know the design point along this trade-off function that produces the best possible image quality. One way to answer this question is to plot, as shown in both panels of Figure 11, "iso-JND" contours in a two-dimensional resolution vs. grayscale design space, where each contour represents the set of grayscale and resolution degradations that are equally discriminable from an "ideal" (high resolution, high grayscale) reference image. Because one can also plot the "iso-cost" contour in this same space, the design choice is reduced to the simple task of choosing the point along the iso-cost contour that corresponds to the minimum value from the underlying JND surface.
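The design rule in the last sentence above, picking the point on the iso-cost contour with the smallest JND value, can be sketched as a constrained minimum. The helper below is hypothetical, and the cost and JND functions in the usage example are toy placeholders invented to exercise the code; they are not measurements or model outputs.

```python
def best_design(candidates, jnd_of, cost_of, budget, tol=1e-9):
    """Return the candidate on the iso-cost contour with the fewest JNDs.

    candidates: iterable of design points, e.g. (gray_bits, resolution_steps)
    jnd_of:     function mapping a design point to its JND value
    cost_of:    function mapping a design point to its manufacturing cost
    budget:     cost value that defines the iso-cost contour
    """
    on_contour = [c for c in candidates if abs(cost_of(c) - budget) <= tol]
    if not on_contour:
        raise ValueError("no candidate lies on the iso-cost contour")
    return min(on_contour, key=jnd_of)


# Toy placeholder surfaces (not data): cost is the sum of the two
# coordinates, and JNDs fall as either coordinate increases.
candidates = [(bits, 12 - bits) for bits in range(1, 12)]
choice = best_design(
    candidates,
    jnd_of=lambda c: 8.0 / c[0] + 6.0 / c[1],
    cost_of=lambda c: c[0] + c[1],
    budget=12,
)
# choice == (6, 6) for these placeholder functions
```

In practice `jnd_of` would be evaluated by running the vision model on images degraded to each candidate design point.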
To test the model's ability to produce accurate JND contours in this two-dimensional design space, Gille et al. (1994) collected data in which discrimination thresholds were measured between an original reference image and an image that had been degraded in resolution and/or grayscale, in order to determine empirically the location of the 1 JND contour. As shown by the two panels in Figure 11, this experiment was performed both with and without error diffusion, a dithering technique common in the printer industry.
The plotted points in each panel of Figure 11 show the experimental results for each dithering condition, with different symbol types for the three subjects, as indicated in the legend. In both plots, the grouping of these data points along the 1 JND contour indicates excellent predictive performance by the model. It is also interesting to note the overall improvement in image quality that results from error diffusion, as seen by the shift of all JND contours to the lower resolution/grayscale corner of the error diffusion plot in the bottom panel of Figure 11, compared with the results in the top panel.
The plots in Figure 11 show that the model can accurately predict discrimination performance even among complex images that are undergoing complex sets of distortions. The next set of prediction plots shows that the model's ability to handle complex images is not limited to discrimination predictions, but extends to predictions of subjective image quality ratings as well.


Fig. 11 - Visibility of quantization artifacts in grayscale/resolution tradeoffs, with (bottom panel) and without (top panel) error diffusion
One set of rating data to which the model was applied was from a study by Snyder et al. (1982) in which trained government image analysts were asked to rate the quality of aerial imagery as a function of distortions in both noise and blur. These analysts are extensively trained to rate images according to a functional quality scale called NIIRS (National Image Interpretability Rating Scale), which assigns a number from 1 to 10 to an image, based on the level of detail that they estimate would be visible on objects of interest. For example, by assigning a NIIRS rating of 5 to an image, an analyst is estimating that discriminations would be possible between objects such as different kinds of railroad cars (e.g., gondola vs. flat). With a NIIRS rating of 6, the analyst would be able to identify an automobile as either a sedan or a station wagon, while at a NIIRS 8, the windshield wipers on that vehicle would be visible. For each of several images, 15 trained analysts NIIRS-rated the original image, as well as 25 degraded versions of that image (five noise levels × five blur levels). These ratings for each degraded image can be described in terms of the difference in NIIRS rating between that degraded image and the original image, a difference referred to here as the ΔNIIRS value. In Figure 12, the ΔNIIRS value for each degraded image is plotted against the average of the JND map between that degraded image and the original. If it is the case that this JND value is capturing the difference in rated image quality between the two images, then there should be a strong correlation in this scatter plot. This is in fact the case, as shown by the highly linear, tight clustering between ΔNIIRS and JNDs over most of the plot's range.
Fig. 12 - ΔNIIRS ratings vs. model JNDs for Snyder et al. (1982) data set. Different symbol types correspond to three different source images.
In another rating experiment (Lubin, 1994), images were distorted by the addition of grayscale errors varying along two dimensions - noise range and block size. For each block size, the same randomly chosen grayscale error was added to each square block of pixels, with block sizes set at 1, 2, 4, 8, or 16 pixels on a side. The random error for each block was generated by sampling uniformly from a symmetric grayscale noise range whose size parameter could take on values 3, 4, 5, 6, or 7. The result of these stimulus manipulations is, for each source image, a set of 25 (5 block sizes × 5 noise ranges) distorted images.
Figure 13 shows the results of this rating experiment, as compared against model JNDs and mean squared error. Each plotted symbol represents the rating data for one of the 25 distortion conditions, averaged across two subjects, four images and four replications per condition. Different symbol types correspond to different block sizes, as indicated in the plot legends. The top panel in Figure 13 shows these rating results compared against the average JNDs from the model, while the bottom panel shows the predictions from a simple mean squared error calculation between the two comparison images. Here, the change in rating with block size is obviously not predicted, as indicated by the vertical spread of points for different block sizes at each of five MSE values corresponding to the five different noise ranges. These results give strong evidence that the model JND metric is much more suitable than MSE for assessing subjective image quality for imagery distorted by typical codec artifacts.
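The block-noise stimuli and the MSE baseline can be sketched as follows, which also makes it plain why MSE cannot separate block sizes: every pixel receives an error drawn from the same distribution, so the expected squared error depends on the noise range only. The sketch assumes that the noise range parameter is a half-width in gray levels and that errors are drawn as continuous uniform values; both are assumptions for the example, as are the function and parameter names.

```python
import random


def add_block_noise(image, block, half_range, rng):
    """Add one uniform random error per block x block tile of pixels.

    image:      list of rows of gray levels (not modified)
    block:      tile side length in pixels
    half_range: errors are drawn from [-half_range, +half_range]
                (assumed meaning of the paper's noise range parameter)
    rng:        a random.Random instance, for repeatable stimuli
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for top in range(0, height, block):
        for left in range(0, width, block):
            err = rng.uniform(-half_range, half_range)
            for y in range(top, min(top + block, height)):
                for x in range(left, min(left + block, width)):
                    out[y][x] += err
    return out


def mse(a, b):
    """Mean squared error between two images of equal size."""
    count = len(a) * len(a[0])
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / count


rng = random.Random(0)
flat = [[128.0] * 16 for _ in range(16)]
for block in (1, 2, 4, 8, 16):
    noisy = add_block_noise(flat, block, half_range=5.0, rng=rng)
    print(block, round(mse(flat, noisy), 2))
```

For a uniform error of half-width a, the expected MSE is a²/3 at every block size, whereas the visibility of the blocks changes markedly with their size.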


Fig. 13 - Rating results and JND Model predictions (top panel) vs. MSE predictions (bottom panel) for block size vs. noise level experiment (Lubin, 1994).
4. Summary and Conclusions
The results discussed above have demonstrated that the Sarnoff JND Model, a physiologically based model of human visual discrimination performance, produces useful and robust measures of image quality for practical applications in video system design and optimization. Ongoing subjective studies at Sarnoff and elsewhere continue to extend the tested range of the model; new results will be reported as they arrive.
References
O. M. Blackwell and H. R. Blackwell, "Visual performance data for 156 normal observers of various ages," J. Illum. Engr. Soc. 61 (1971) 3-13.
P. J. Burt and E. H. Adelson, "The Laplacian pyramid as a compact image code," IEEE Transactions on Communications COM-31 (1983) 532-540.
C. Carlson and R. Cohen, "A simple psychophysical model for predicting the visibility of displayed information," Proceedings of the Society for Information Display 21 (1980) 229-245.
J. Gille, R. Martin, and J. Larimer, "Spatial resolution, grayscale, and error diffusion trade-offs: impact on display system design," Conference Record of the 1994 International Display Research Conference, Oct. 10-13, 1994, Monterey, CA.
J. Lubin, "The use of psychophysical data and models in the analysis of display system performance," in Digital Images and Human Vision, ed. A. B. Watson (MIT Press, Cambridge, MA, 1993), pp. 163-178.
J. Lubin, "A visual discrimination model for imaging system design and evaluation," in Vision Models for Target Detection and Recognition, ed. E. Peli (World Scientific Publishers, 1995).
J. Lubin, "A practical vision model for the evaluation and optimization of image compression schemes," Invited talk, 1994 Optical Society of America Annual Meeting, Dallas, TX.
H. L. Snyder, M. E. Maddox, D. I. Shedivy, J. A. Turpin, J. J. Burke, and R. N. Strickland, "Digital image quality and interpretability: database and hardcopy studies," Optical Engineering 21 (1982) 14-22.
A. J. van Doorn and J. J. Koenderink, "Spatiotemporal contrast detection threshold surface is bimodal," Optics Letters 4 (1979) 32-34.