VistaModels: Computational models of Visual Neuroscience

The toolboxes on the VistaModels site are organized into three categories: (a) Empirical-mechanistic Models, tuned to reproduce basic phenomena of color and texture perception; (b) Principled Models, derived from information-theoretic arguments; and (c) Engineering-motivated Models, developed to address applied problems in image and video processing.

The algorithms in VistaModels require the standard building blocks provided by the more basic toolboxes VistaLab and ColorLab. However, the necessary functions from these toolboxes are included in the packages listed below for the user's convenience.

Table of contents

(A) Empirical-mechanistic Models

Cascades of linear transforms and nonlinear saturations have been ubiquitous in neuroscience and artificial intelligence ever since the [McCulloch-Pitts model]. More recently, they have been exemplified by subtractive and divisive models of cortical interaction [Wilson & Cowan, Kybernetik 73; Carandini and Heeger, Nature Rev. Neurosci. 12].
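For concreteness, here is a minimal numerical sketch of the canonical divisive normalization: each linear response is rectified and divided by a pooled measure of the activity of its neighbors. The interaction kernel, exponent, and semisaturation constant are illustrative choices, not the calibrated values of any of the models below.

```python
import numpy as np

def divisive_normalization(z, b=0.1, g=2.0, sigma=2.0):
    """Canonical divisive normalization (Carandini & Heeger style).

    z     : 1-D array of linear responses (e.g. wavelet or DCT coefficients)
    b     : semisaturation constant (illustrative value)
    g     : excitation/inhibition exponent
    sigma : width of the Gaussian interaction kernel among neighbors
    """
    n = len(z)
    idx = np.arange(n)
    # Gaussian interaction kernel: nearby coefficients inhibit each other more
    H = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / sigma) ** 2)
    H /= H.sum(axis=1, keepdims=True)      # normalize the rows of the kernel
    e = np.abs(z) ** g                      # rectified, exponentiated drive
    return np.sign(z) * e / (b + H @ e)     # divisive interaction

# Saturating response of one unit as its input amplitude grows
amps = np.linspace(0, 1, 6)
print([float(divisive_normalization(np.array([a, 0.1, 0.1]))[0]) for a in amps])
```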

Over the years, we have developed progressively better versions of such cascades to be applicable to color images and video sequences. These parametric models were empirically tuned to give a rough description of different color and texture perception phenomena (see the psychophysical test-bed below for model tuning and comparison).

See a visual example of the effect of the local spatial-frequency transforms and the divisive normalization below (illustration of the 2018 model).

1995 - 2008: Linear opponent color channels, local-DCT and Divisive Normalization

This model is invertible and was originally tuned to reproduce contrast response curves obtained from incremental contrast thresholds [Pons PhD Thesis, 1997]. It was applied to reproduce subjective distortion opinion [Im.Vis.Comp.97, Displays 99] and to improve the perceptual quality of JPEG and MPEG through (a) transform coding of the achromatic channel [Elect.Lett.95, Elect.Lett.99, Im.Vis.Comp.00, IEEE TIP 01, Patt.Recog.03, IEEE TNN 05, IEEE TIP 06a, JMLR08], (b) transform coding of the color channels [RPSP12], and (c) improved motion estimation [LNCS97, Elect.Lett.98, Elect.Lett.00a, Elect.Lett.00b, IEEE TIP 01].
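As a rough illustration of the linear front-end of this kind of model, the sketch below applies a YUV-like opponent transform (a stand-in for the calibrated transform in ColorLab, not the actual one) followed by a blockwise DCT, the local-frequency stage that feeds the divisive normalization sketched above.

```python
import numpy as np
from scipy.fft import dctn

# Illustrative YUV-like opponent transform (NOT the calibrated ColorLab transform)
RGB2OPP = np.array([[ 0.299,  0.587,  0.114],   # achromatic
                    [-0.147, -0.289,  0.436],   # blue-yellow
                    [ 0.615, -0.515, -0.100]])  # red-green

def local_dct(channel, block=8):
    """Blockwise 2-D DCT of one opponent channel (local-frequency transform)."""
    h, w = channel.shape
    out = np.zeros_like(channel)
    for i in range(0, h - h % block, block):
        for j in range(0, w - w % block, block):
            out[i:i+block, j:j+block] = dctn(channel[i:i+block, j:j+block],
                                             norm='ortho')
    return out

rgb = np.random.rand(16, 16, 3)        # stand-in for an image
opp = rgb @ RGB2OPP.T                  # pixel-wise opponent transform
coeffs = local_dct(opp[:, :, 0])       # local DCT of the achromatic channel
```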

2009 - 2010: Linear opponent color channels, Orthogonal Wavelet and Divisive Normalization

Even though we developed our own Matlab code for some specific overcomplete wavelets in the mid-90s [MSc Thesis 95, J.Mod.Opt.97], it took some time until we applied the divisive normalization interaction to Simoncelli's wavelets in MatlabPyrTools (which are substantially more efficient). The model was fitted to reproduce subjective image distortion opinion [JOSA A 10] by exhaustive grid search, as in [IEEE ICIP 02]. This model (which relies on the orthogonal wavelets of MatlabPyrTools) was found to have excellent redundancy-reduction properties [LNCS10, Neur.Comp.10].
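A minimal sketch of such an exhaustive grid search: sweep the nonlinearity parameters and keep the pair whose model distances correlate best with the subjective scores. The toy distance, the parameter grids, and the placeholder data below are illustrative; they stand in for the actual model and databases of [JOSA A 10].

```python
import numpy as np
from scipy.stats import spearmanr

def model_distance(img_a, img_b, b, g):
    """Toy perceptual distance: Euclidean norm between DN responses."""
    r = lambda z: np.sign(z) * np.abs(z)**g / (b + np.abs(z)**g)
    return np.linalg.norm(r(img_a) - r(img_b))

# Placeholder data: (original, distorted) pairs and mean opinion scores
pairs = [(np.random.rand(8, 8), np.random.rand(8, 8)) for _ in range(20)]
mos = np.random.rand(20)

# Exhaustive grid search: keep (b, g) maximizing rank correlation with opinion
best = max(((b, g, spearmanr([model_distance(a, d, b, g) for a, d in pairs],
                             mos).correlation)
            for b in np.linspace(0.01, 1, 10)
            for g in np.linspace(1, 3, 10)),
           key=lambda t: t[2])
print("best (b, g) and correlation:", best)
```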

2013 - 2018: Multi-Layer network with nonlinear opponent color, Overcomplete Wavelet and Divisive Normalization

Even though we developed a comprehensive color vision toolbox in the early 2000s (see ColorLab), it took some time until we included a fully adaptive chromatic front-end before the spatial processing models based on overcomplete wavelets. Note that the older toolboxes rely on overly crude linear RGB-to-YUV transforms. This multi-layer model (or biologically-plausible deep network) performs the following chain of perceptually meaningful operations [PLoS 18].
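The skeleton below shows the general structure of such a cascade as composed L+NL stages; every linear part is a trivial placeholder for the calibrated chromatic, contrast, CSF+masking, and wavelet operations of [PLoS 18].

```python
import numpy as np

def layer(linear, b=0.1, g=2.0):
    """Generic L+NL stage: a linear transform followed by divisive saturation."""
    def f(x):
        z = linear(x)
        e = np.abs(z) ** g
        return np.sign(z) * e / (b + e)
    return f

# Placeholder linear parts standing in for the calibrated operations of [PLoS 18]:
# 1) chromatic front-end, 2) luminance-to-contrast, 3) CSF+masking, 4) wavelet+DN
stages = [layer(lambda x: x),
          layer(lambda x: x - x.mean()),
          layer(lambda x: x),
          layer(lambda x: x)]

def respond(x):
    for f in stages:            # the "deep network": composed L+NL layers
        x = f(x)
    return x

print(respond(np.random.rand(8, 8)).shape)
```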

The parameters of the different layers were fitted in different ways: while the 2nd and 3rd layers (contrast and CSF+masking) were determined using Maximum Differentiation [Malo and Simoncelli SPIE 15], the 1st and 4th layers (chromatic front-end and wavelet layer) were fitted to reproduce subjective image distortion data [PLoS 18] and then fine-tuned to reproduce classical masking [Front. Neurosci. 19].

2019 - 2021: Convolutional and differentiable implementations

The matrix formulation developed in [PLoS 18, Front. Neurosci. 19] and implemented in BioMultiLayer_L_NL_color is elegant, but it is neither applicable to large images nor suitable for inclusion in python deep-learning schemes since it is implemented in Matlab. Recently we worked to solve these issues and to confirm the choices of the chromatic part. This led to the deep Percepnet [IEEE ICIP 20] and to the convolutional version of the above multi-layer L+NL cascade [J.Vision, Proc. VSS 2021]. While Percepnet has the advantage of being implemented in python, and hence ready for automatic differentiation (state-of-the-art in image quality), it has the disadvantage of being based on a restricted version of divisive normalization with no explicit interactions in space/scale [ICLR 17]. On the other hand, BioMultiLayer_L_NL_color_convolutional has a more general and interpretable version of divisive normalization (it includes the full range of interactions in space/scale/orientation). Moreover, its color adaptation choices and the scaling of the achromatic and chromatic channels have been confirmed by positive psychophysical and statistical behaviors [J. Neurophysiol.19, J. Math.Neurosci.20]. However, its derivatives are implemented in Matlab, so it is not ready to be included in deep-learning schemes right away. There is a lot of room for improvement of its parameters!
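The sketch below contrasts the two flavors of divisive normalization mentioned above: a restricted, point-wise version with no spatial interactions, and a convolutional version whose denominator pools rectified activity over a spatial neighborhood (a Gaussian kernel here, purely for illustration).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dn_pointwise(z, b=0.1, g=2.0):
    """Restricted DN: each unit is normalized only by itself (no spatial pooling)."""
    e = np.abs(z) ** g
    return np.sign(z) * e / (b + e)

def dn_convolutional(z, b=0.1, g=2.0, sigma=2.0):
    """Convolutional DN: the denominator pools |z|^g over a neighborhood,
    so a high-contrast surround attenuates the response (masking)."""
    e = np.abs(z) ** g
    return np.sign(z) * e / (b + gaussian_filter(e, sigma))

z = np.random.randn(32, 32)
print(dn_pointwise(z).std(), dn_convolutional(z).std())
```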

Psychophysical test-bed for model tuning and comparison

The figure below (computed using VistaLab and ColorLab) illustrates distinctive features of early vision: (a) the bandwidths of the achromatic and the chromatic channels are markedly different; (b) the response to contrast is a saturating nonlinearity whose slope (sensitivity) depends on frequency, and the response attenuates as a function of the properties of the background (note how the test is more salient, highlighted in green, on top of a very different background, while it is masked, highlighted in red, on top of similar backgrounds); and (c) the visibility of i.i.d. noise seen on top of a natural image is not uniform: e.g., visibility is lower in high-contrast regions.
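A minimal parametric sketch of feature (b): a saturating contrast response whose sensitivity would be frequency-dependent in the full model (via the CSF), attenuated by the contrast of a similar background. All parameter values below are illustrative.

```python
import numpy as np

def contrast_response(c, c_bg=0.0, s=10.0, b=0.05, g=2.0, k=1.0):
    """Saturating contrast response with cross-masking from the background.

    c    : test contrast
    c_bg : contrast of a similar background (the masker)
    s    : sensitivity (frequency-dependent in the full model, via the CSF)
    """
    return s * c**g / (b + c**g + k * c_bg**g)

c = np.linspace(0, 1, 5)
print("no masker:     ", contrast_response(c))            # steeper, more salient
print("similar masker:", contrast_response(c, c_bg=0.5))  # attenuated (masking)
```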

These readily visible facts can be used to tune the parameters of the mechanistic models considered above: one can adjust the parameters by hand until the response curves qualitatively reproduce what one actually sees. We suggested this idea to improve model fits on natural image databases [Front.Neurosci.18], and (for the first time!) here are the data and code to perform such tune-it-yourself experiments: experiments_VistaModels.zip (400MB).

The file is huge because it contains thousands of tests to compute detailed contrast response curves and distortion measures on the TID database. Moreover, it also includes the corresponding responses of the three mechanistic models!

The results below suggest that the models are roughly equivalent, but the most recent one displays better behavior (on top of having more plausible receptive fields [PLoS 18]). More importantly, while the results on image quality are far better than the popular Structural Similarity Index (SSIM, see VistaQualityTools), there is still a lot of room for improvement through these tune-it-yourself experiments!

Model Comparison

(B) Principled Models

Efficient coding in mechanistic models

We have shown that models including point-wise Weber-like saturation for brightness lead to a signal-to-noise ratio that decreases with luminance [J.Opt.95]. Moreover, taking into account more general cascades of linear+nonlinear layers (e.g. local-frequency transforms and divisive normalization after Weber brightness), we have seen that the efficiency of such systems (in terms of redundancy reduction) decreases with luminance and contrast, which is consistent with the distribution of natural images in local-frequency domains [PLoS 18]. We have also seen that the discrimination ability of local-DCT+Div.Norm. models is greater in the more populated regions of the frequency-amplitude domain [Im.Vis.Comp.97]. Additionally, we have seen that the mutual information between the coefficients of the image representation progressively decreases from the retina to the normalized representation, both in the local-DCT+DN case [IEEE TIP 06] and in the orthogonal wavelet+DN case [Neur.Comp.10].
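The first claim can be illustrated with a few lines of arithmetic: with a compressive Weber-like brightness nonlinearity and fixed noise after the nonlinearity, a fixed-amplitude luminance increment becomes progressively harder to see as the mean luminance grows. The exponent and noise level below are illustrative.

```python
import numpy as np

gamma, noise_std, dL = 1/3, 0.01, 1.0    # illustrative values
L = np.array([1.0, 10.0, 100.0])         # mean luminance levels
slope = gamma * L**(gamma - 1)           # derivative of the brightness B(L) = L**gamma
# A fixed-amplitude increment dL produces a response signal slope*dL, read against
# fixed noise after the nonlinearity, so the SNR decays as luminance grows:
snr = slope * dL / noise_std
print(snr)                               # decreasing with luminance (Weber-like)
```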

This body of results shows that the mechanistic models considered above display remarkable adaptation to the statistics of natural images.

Along the same line, in collaboration with NYU (Ballé and Simoncelli) we have optimized the described linear+nonlinear architectures for optimal autoencoding. By including both the linear and the nonlinear parts in the optimization we obtain unprecedented rate-distortion performance (see the paper and code in [ICLR 17]), far better than our previous image coders based on V1 models with fixed linear stages (see the VistaCoRe Toolbox).
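A rough sketch of the objective behind such end-to-end optimized autoencoders: rate (entropy of the quantized code) plus lambda times distortion. The toy analysis/synthesis pair below only stands in for the actual architecture of [ICLR 17], and MSE stands in for the perceptual distortion.

```python
import numpy as np

def rate_distortion_loss(x, analysis, synthesis, lam=0.1):
    """Toy evaluation of the rate-distortion objective R + lambda * D."""
    y = np.round(analysis(x))                      # quantized code
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    rate = -(p * np.log2(p)).sum() * y.size        # total-bits estimate (entropy)
    distortion = np.mean((x - synthesis(y)) ** 2)  # MSE (perceptual in practice)
    return rate + lam * distortion

# Illustrative linear+nonlinear analysis transform and its rough inverse
g = 0.5
analysis  = lambda x: 10 * np.sign(x) * np.abs(x) ** g
synthesis = lambda y: np.sign(y) * np.abs(y / 10) ** (1 / g)
print(rate_distortion_loss(np.random.randn(32, 32), analysis, synthesis))
```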

Statistically-based linear receptive fields

Statistical goals such as decorrelation (Principal Component Analysis, PCA) and Independent Component Analysis (ICA) often lead to sensible linear receptive fields when trained on natural scenes. For instance, spatio-spectral PCA leads to compact representations that disentangle reflectance and spectral illumination from retinal irradiance, and to spatial-frequency sensors with smooth spectral response [IEEE TGRS 13] (see VistaSpatioSpectral). In collaboration with Helsinki University (Gutmann and Hyvarinen) we explored ICA-related techniques. Complex ICA led to local and oriented receptive fields in phase quadrature [LNCS11] (download the Complex ICA Toolbox). Higher Order Canonical Correlation Analysis (HOCCA) combines the sparsity goal with optimal correspondence between identified features in domain adaptation problems, leading to biologically plausible spatiochromatic receptive fields that adapt to changes in the illumination [PLoS 14] (see the HOCCA Toolbox).
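As a minimal, self-contained illustration of statistically-derived linear receptive fields, the sketch below runs PCA on patches of synthetic 1/f noise (a crude stand-in for natural scenes): the leading eigenvectors are the linear filters. Real natural images and ICA yield the localized, oriented fields discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_over_f_image(n=64):
    """Synthetic stand-in for a natural image: noise with a 1/f amplitude spectrum."""
    fx, fy = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n))
    f = np.sqrt(fx**2 + fy**2); f[0, 0] = 1.0
    spectrum = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / f
    return np.real(np.fft.ifft2(spectrum))

# Collect 8x8 patches and compute the PCA basis (eigenvectors of the covariance)
patches = []
for _ in range(50):
    img = one_over_f_image()
    i, j = rng.integers(0, 56, size=2)
    patches.append(img[i:i+8, j:j+8].ravel())
patches = np.array(patches)
patches -= patches.mean(axis=0)

_, _, V = np.linalg.svd(patches, full_matrices=False)
receptive_fields = V[:16].reshape(16, 8, 8)   # leading components = linear filters
```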

This analysis of ICA methods concluded with a refutation of a classical result on cortical organization based on Topographic ICA: in fact (as opposed to Hyvarinen & Hoyer, Vis. Res. 2001), it does not lead to orientation domains [PLoS 17]. See the code and results to analyze TICA receptive fields.

Statistically-based nonlinearities

Instead of optimizing the mechanistic models for efficient coding, we tried a stronger approach to test the Efficient Coding Hypothesis: use purely data-driven techniques instead of assuming models that already have the right functional form. We developed a family of invertible techniques for manifold unfolding and for manifold Gaussianization.

The unfolding techniques identify nonlinear sensors that follow curved manifolds. These include Sequential Principal Curves Analysis (SPCA) and its sequels: Principal Polynomial Analysis (PPA) and Dimensionality Reduction based on Regression (DRR).

The Gaussianization technique (Rotation-Based Iterative Gaussianization, RBIG) does not identify sensors, but it allows the computation of the PDF. Therefore it is useful to define discrimination regions according to information maximization or error minimization. See the kind of predictions made by the unfolding techniques (SPCA [Network 06, NeCo12, Front. Human Neurosci.15, ArXiv 16, https://arxiv.org/pdf/1606.00856.pdf] and PPA-DRR [SPIE13, Int.J.Neur.Syst.14, IEEE Sel.Top.Sig.Proc.15]) and by the Gaussianization technique [Talk at LeCun Lab NYU 13, IEEE TNN 11].
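A minimal sketch of the RBIG iteration: alternate marginal Gaussianization (empirical CDF followed by the Gaussian inverse CDF) with a random rotation. The bookkeeping that accumulates the Jacobians of each step, which is what actually yields the PDF, is omitted here for brevity.

```python
import numpy as np
from scipy.stats import norm, rankdata

def rbig(x, n_iters=20, seed=0):
    """Rotation-Based Iterative Gaussianization (minimal sketch, no PDF bookkeeping)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    for _ in range(n_iters):
        # 1) marginal Gaussianization: empirical CDF -> standard normal quantiles
        u = np.apply_along_axis(rankdata, 0, x) / (n + 1)
        x = norm.ppf(u)
        # 2) random rotation to mix the marginals
        q, _ = np.linalg.qr(rng.standard_normal((d, d)))
        x = x @ q
    return x

x = np.random.rand(1000, 2) ** 3       # strongly non-Gaussian data
g = rbig(x)
print(g.mean(axis=0), g.std(axis=0))   # approximately standard normal
```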

Closely related to optimal discrimination (or the optimal metric) for error minimization is the concept of Fisher information. Our lab has a long tradition in the study of Riemannian metrics induced by nonlinear perception systems [J. Malo PhD 99, Displ.99]. Over the years, the ideas about the geometrical transforms induced by the system and their effect on information processing have evolved from distance computation to the consideration of the transformation of neural noise [Displ.99, Patt.Recog.03, IEEE TIP 06, JOSA A 10, SPIE 15, NIPS 17, PLoS 18].
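As a worked example of this geometric view, under the assumption of a deterministic response r(x) with Euclidean discriminability in the response domain, the induced metric in the stimulus domain is M(x) = J(x)^T J(x), with J the Jacobian of the response. The sketch below computes it numerically for a toy divisive-normalization response.

```python
import numpy as np

def response(x, b=0.1, g=2.0):
    """Toy divisive-normalization response with global pooling."""
    e = np.abs(x) ** g
    return np.sign(x) * e / (b + e.sum())

def induced_metric(x, eps=1e-6):
    """M(x) = J(x)^T J(x): Euclidean distances among responses pulled back
    to the stimulus domain (numerical Jacobian by central differences)."""
    d = len(x)
    J = np.zeros((d, d))
    for k in range(d):
        dx = np.zeros(d); dx[k] = eps
        J[:, k] = (response(x + dx) - response(x - dx)) / (2 * eps)
    return J.T @ J

x = np.array([0.5, 0.2, 0.1])
M = induced_metric(x)
# Perceptual length of a small stimulus perturbation dx: sqrt(dx^T M dx)
dx = np.array([1e-2, 0.0, 0.0])
print(np.sqrt(dx @ M @ dx))
```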

(C) Engineering-motivated Models

Perceptually-weighted motion estimation: VistaVideoCoding

What can be predicted is not worth transmitting! This simple idea is the core of the predictive coding used in the most successful video coders (e.g. MPEG). In predictive coding, motion information is the key to predicting the future from the past. MPEG-like coders first compute the optical flow (or displacement field) and then encode the prediction error in a transformed domain which (not surprisingly!) is similar to the V1 mechanistic models described above.

In this video-coding context we improved motion estimation by connecting the optical flow computation with the perceptual relevance of the prediction error: we proposed to increase the resolution of the motion estimate only where the prediction error was hard to encode for our improved V1 models [LNCS97, Electr.Lett.98, J.Vis.01]. This gave rise to smoother motion flows, more appropriate for motion-based segmentation [Electr.Lett.00a], and to better video coders [Electr.Lett.00b, IEEE TIP 01].
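A rough sketch of this idea: standard exhaustive block matching produces a coarse flow, and a block is flagged for refinement only when its prediction error is perceptually expensive to encode. The "perceptual cost" here (energy of DN-transformed error) is a crude stand-in for the V1 models in the references.

```python
import numpy as np

def perceptual_cost(err, b=0.1, g=2.0):
    """Stand-in for 'hard to encode': energy of the DN-transformed residual."""
    e = np.abs(err) ** g
    return np.sum(e / (b + e))

def block_match(prev, curr, i, j, bs=8, search=4):
    """Best displacement of one block by exhaustive search (minimal sketch)."""
    block = curr[i:i+bs, j:j+bs]
    best, best_v = np.inf, (0, 0)
    for di in range(-search, search + 1):
        for dj in range(-search, search + 1):
            ii, jj = i + di, j + dj
            if 0 <= ii <= prev.shape[0]-bs and 0 <= jj <= prev.shape[1]-bs:
                sad = np.abs(block - prev[ii:ii+bs, jj:jj+bs]).sum()
                if sad < best:
                    best, best_v = sad, (di, dj)
    residual = block - prev[i+best_v[0]:i+best_v[0]+bs, j+best_v[1]:j+best_v[1]+bs]
    return best_v, residual

prev = np.random.rand(32, 32)
curr = np.roll(prev, 2, axis=1)        # pure horizontal motion
v, err = block_match(prev, curr, 8, 8)
# Refine the motion estimate (e.g. split the block) only if the residual
# is perceptually expensive to encode:
print("refine" if perceptual_cost(err) > 1.0 else "coarse estimate suffices", v)
```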

Image Coding: VistaCoRe

Image compression requires vision models that rank visual features according to their perceptual relevance so that extra bits can be allocated to encode the subjectively important aspects of the image.

The vision model based on DCT and divisive normalization considered above leads, at the same compression ratio, to better decoded images than JPEG and than variants based on simpler models of masking.
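The sketch below illustrates the bit-allocation idea: quantization steps for the block-DCT coefficients are inversely proportional to a CSF-like frequency weighting, so the frequencies the observer is most sensitive to are quantized more finely. The weighting function and step sizes are illustrative, not the model's calibrated values.

```python
import numpy as np
from scipy.fft import dctn, idctn

def csf_weights(bs=8, peak=3.0):
    """Illustrative CSF-like weighting over an 8x8 DCT block (band-pass)."""
    u, v = np.meshgrid(np.arange(bs), np.arange(bs))
    f = np.sqrt(u**2 + v**2)
    return (0.05 + f / peak) * np.exp(-f / peak)   # rises, peaks, then decays

def encode_block(block, q=0.05):
    w = csf_weights()
    steps = q / np.maximum(w, 1e-3)    # fine steps where sensitivity is high
    return np.round(dctn(block, norm='ortho') / steps), steps

def decode_block(idx, steps):
    return idctn(idx * steps, norm='ortho')

block = np.random.rand(8, 8)
idx, steps = encode_block(block)
print(np.abs(block - decode_block(idx, steps)).max())   # quantization error
```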

See the VistaCoRe (Coding and Restoration) Toolbox and the references [Electr.Lett.95, Electr.Lett.99, Im.Vis.Comp.00, Patt.Recog.03, IEEE TNN 05, IEEE TIP 06a, IEEE TIP 06b, JMLR08].

Image and Video Quality: VistaQualityTools

Computing perceptual distances between images requires vision models that identify relevant and negligible visual features. Distortions in features that are neglected by observers should have no effect on the distance, and the other way around for visually relevant features. The different models can be quantitatively compared by their accuracy in reproducing the opinion of viewers on subjectively rated databases.

The three vision models considered above (based on DCTs, orthonormal wavelets, and overcomplete wavelets) have been used to propose distortion metrics that outperform SSIM. See VistaQualityTools and the references [Im.Vis.Comp.97, Displays99, Patt.Recog.03, IEEE Trans.Im.Proc.06] for the DCT metric, [JOSA A 10, Neur.Comp.10] for the orthogonal wavelet metric, and [PLoS 18, Frontiers Neurosci.18] for the metric based on overcomplete wavelets.
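The common pattern behind these metrics can be sketched in a few lines: transform both images to the model's response domain and take a norm there, so that distortions the model deems invisible contribute nothing to the distance. The toy response below stands in for the calibrated DCT/wavelet models.

```python
import numpy as np

def response(img, b=0.1, g=2.0):
    """Toy model response (stand-in for the calibrated DCT/wavelet models)."""
    c = img - img.mean()                 # crude luminance-to-contrast step
    e = np.abs(c) ** g
    return np.sign(c) * e / (b + e)      # divisive saturation

def perceptual_distance(img_a, img_b):
    """Distance in the response domain: invisible features contribute nothing."""
    return np.linalg.norm(response(img_a) - response(img_b))

a = np.random.rand(16, 16)
print(perceptual_distance(a, a + 0.05 * np.random.rand(16, 16)))
```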

References

Download