Monitor your models with Total Activation Classifiers
Authors: João Monteiro, Pau Rodríguez, Pierre-André Noël, Issam Laradji, David Vázquez
Neural networks have experienced a significant boost in performance across various tasks. However, a nagging issue persists: These networks tend to become excessively confident, particularly when they encounter data that significantly differs from their training sets.
This overconfidence, which often leads to errors, has prompted the research community to explore strategies to fortify neural networks against different types of data perturbations, encompassing both naturally occurring data variations and adversarial attacks.
Our recent research addresses this issue by attaching monitoring modules to the neural network. These modules continuously inspect the model's internal representations as it processes data, aiming to detect anomalies or irregular activation patterns.
This approach promises to reduce or even eliminate overconfidence, improving the model's reliability, especially when it encounters novel or atypical data.
What are monitor modules?
Our recent work introduced the idea of monitor modules: private model components with access to a publicly available model. In the typical threat model, attackers have the same access to a deployed public-facing model as regular users do; they can query the model, but nothing more.
In this situation, we introduce a monitor, a separate model completely hidden from the external world but able to access the internal states of the public model whenever the public model receives and processes a query.
We then use that monitor for two purposes, sketched in code after this list:
- To predict: We can compare predictions between the two models and flag mismatches.
- To estimate confidence: We can use the activations of intermediate layers instead of just the last layer's output.
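To make this setup concrete, here is a minimal PyTorch sketch of a monitored model. The class names, the assumption that the public model returns its intermediate activations, and the monitor's interface are illustrative choices for this post, not our actual implementation.

```
import torch
import torch.nn as nn

class MonitoredModel(nn.Module):
    """Illustrative wrapper: a public classifier plus a hidden monitor that
    reads the public model's intermediate activations on every query."""

    def __init__(self, public_model: nn.Module, monitor: nn.Module, threshold: float = 0.5):
        super().__init__()
        self.public_model = public_model  # what users (and attackers) can query
        self.monitor = monitor            # hidden from the outside world
        self.threshold = threshold        # minimum confidence before deferring

    @torch.no_grad()
    def forward(self, x: torch.Tensor):
        # Assumed interface: the public model returns its logits together with
        # a collection of intermediate activations; only the monitor sees the latter.
        logits, activations = self.public_model(x)
        monitor_logits, confidence = self.monitor(activations)

        public_pred = logits.argmax(dim=-1)
        monitor_pred = monitor_logits.argmax(dim=-1)

        # Flag queries where the two predictions disagree or confidence is low;
        # the flag is consumed internally (e.g., to request human assistance).
        flagged = (public_pred != monitor_pred) | (confidence < self.threshold)
        return logits, flagged
```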
In the following section, we describe Total Activation Classifiers (TAC), our instantiation of the monitored model class discussed above. Our proposal builds on recent findings showing that commonly used neural networks exhibit simple, class-dependent patterns in their internal representations, which raises the question: Can we enforce a simple, class-dependent structure rather than hope it emerges?
We then show that one can indeed constrain learned representations to follow simple, class-dependent, and efficiently verifiable patterns.
Total Activation Classifiers
TAC operate by defining class-dependent activation patterns—or class codes—that must be followed by each layer of a multilayer classifier. Those patterns are decided a priori and designed to be maximally separated. In other words, we turn the label set into a set of hard-coded, class-specific binary codes and define models such that activations obtained from different layers match those patterns.
Class codes then constrain representations and define a discrete set of valid internal configurations by indicating which groups of features should be activated for a given class. At test time, any measure of how well the observed activations match one of the valid patterns can be used to reject potentially erroneous predictions.
In doing so, one can make predictions by checking for the best match between an observed activation pattern and the set of codes used for training. In addition, one can use the goodness of match between observed activation profiles and known codes as a confidence score to define selective classifiers that can reject when uncertain.
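As a rough illustration of this matching step, the NumPy sketch below builds a small code table and predicts by finding the closest code to a pooled activation profile. The Hadamard-based construction, the shapes, and the rejection threshold are assumptions made for the example, not the exact recipe from the paper.

```
import numpy as np
from scipy.linalg import hadamard

# Illustrative code table: rows of a Hadamard matrix give binary codes that are
# pairwise well separated (the paper's exact construction may differ).
num_classes, code_length = 10, 16
codes = (hadamard(code_length)[1:num_classes + 1] > 0).astype(np.float32)  # (10, 16)

def predict_and_score(activation_profile):
    """activation_profile: pooled activations for one input, shape (code_length,)."""
    # Goodness of match: negative distance between the profile and each class code.
    distances = np.linalg.norm(codes - activation_profile[None, :], axis=1)
    prediction = int(distances.argmin())
    confidence = -float(distances.min())  # higher means a tighter match
    return prediction, confidence

# Selective classification: abstain when no code is matched tightly enough.
profile = np.random.rand(code_length)  # stand-in for a real activation profile
pred, conf = predict_and_score(profile)
if conf < -2.0:  # in practice, the threshold is chosen on held-out data
    pred = None  # defer the prediction
```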
An illustration of the model appears below. TAC rely on a base multilayer classifier (pretrained or not), denoted by the red blocks, and extract pooled/lower-dimensional activation profiles via a project-slice-sum sequence of operations. The projection layers are trained, whereas the layers of the base/public classifier may or may not be trained.
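The PyTorch sketch below shows one way the project-slice-sum sequence could look for a single layer; the feature sizes, the linear projection, and the sum pooling are our assumptions for illustration and may differ from the released code.

```
import torch
import torch.nn as nn

class ProjectSliceSum(nn.Module):
    """Illustrative project-slice-sum head attached to one layer of the base classifier."""

    def __init__(self, in_features: int, code_length: int, slice_size: int = 8):
        super().__init__()
        # Project: a trained linear map to code_length * slice_size features.
        self.project = nn.Linear(in_features, code_length * slice_size)
        self.code_length = code_length
        self.slice_size = slice_size

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, in_features) activations from a (possibly frozen) base layer.
        z = self.project(h)
        # Slice: split the projected features into code_length groups...
        z = z.view(-1, self.code_length, self.slice_size)
        # ...and sum: aggregate each group into one entry of the activation profile,
        # which is then compared against the class codes.
        return z.sum(dim=-1)  # (batch, code_length)
```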
Can we train TAC?
We observed that TAC match class codes very well while attaining accuracy on par with the base classifier, all while enabling selective classification and the rejection of likely out-of-distribution samples. We also observed improved adversarial robustness. See below for a goodness-of-match plot of activation profiles and a class code for a test CIFAR-10 image.
Each column is a different layer. The top row shows raw activations. Activation profiles and target/ground-truth codes are then displayed in the middle rows. The bottom row reveals the gaps between the observed patterns and ground-truth codes.
Are confidence scores derived from TAC useful?
We trained a Wide Residual Network with TAC on CIFAR-10 and observed that confidence estimates obtained from clean data and from adversarially perturbed data differ significantly, enabling detection of these perturbations.
To illustrate this, we plot histograms of confidence estimates, based on distances between class codes and activation patterns, for clean data and adversarial perturbations of the CIFAR-10 test set. The attacks correspond to subtle perturbations that are undetectable by visual inspection. We consider the white-box access model, in which the attacker has full access to the target predictor and the table of codes.
Attacks fail to match the codes as tightly as clean data does, so TAC can spot them and defer low-confidence predictions.
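Concretely, the rejection threshold behind these histograms can be calibrated on clean held-out data at a fixed false-positive budget, as in the sketch below. The scores here are random stand-ins, and the 5% budget is an assumption, not a value from the paper.

```
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for real confidence scores (distance-based, higher = tighter match).
clean_conf = rng.normal(loc=-1.0, scale=0.2, size=5000)
attack_conf = rng.normal(loc=-3.0, scale=0.5, size=5000)

# Calibrate: keep 95% of clean data, flag anything scoring below that cutoff.
threshold = np.percentile(clean_conf, 5)

def is_suspicious(confidence):
    # Queries whose activation profiles match no class code tightly are deferred.
    return confidence < threshold

detection_rate = np.mean([is_suspicious(c) for c in attack_conf])
false_positive_rate = np.mean([is_suspicious(c) for c in clean_conf])
```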
We also examined each layer independently and found similar behavior throughout the depth of the network; that is, all layers contribute to discriminating maliciously manipulated data.
Adversarial robustness
While TAC are meant to detect likely prediction errors from a base classifier, we also evaluated them as a means of making that base classifier more robust to attacks. In this setting, a monitored white-box API to a publicly available model behaves, from an outsider's perspective, identically to the white-box model alone, but the monitor may take action when an attack is detected, such as requesting human assistance.
Using TAC as the monitor proves to be a strong baseline for “true” white-box robustness: it raises the performance of the base classifier to the level of the top performers. However, if the intended-to-be-private TAC are exposed, the attacker wins.
Future research
Questions have emerged from the work completed on TAC so far. Here are two areas of future research we’ll explore:
Data-dependent codes: The class codes used by TAC are completely independent of the data and designed to be pairwise maximally separated. Alternatively, one could consider approaches to learn the code table from data and leverage interclass similarities.
For example, it’s reasonable to expect that typical activation patterns of two breeds of dogs would be closer to the typical patterns of images of dogs than those of airplanes. Reflecting that in our table of codes would likely improve performance.
Ensemble classifiers: Each layer in TAC can be treated as an individual predictor, and those per-layer predictors can be combined into an ensemble, likely yielding improved performance and better confidence estimation.
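One simple way such an ensemble could be formed is by combining the per-layer goodness-of-match scores, as in the hypothetical sketch below; averaging is only one of many possible combination rules.

```
import numpy as np

def ensemble_predict(per_layer_scores):
    """per_layer_scores: one (num_classes,) array of match scores per layer,
    where higher means a tighter match to that class's code."""
    # Average the per-layer scores and pick the best-matching class overall.
    combined = np.mean(np.stack(per_layer_scores, axis=0), axis=0)
    return int(combined.argmax())

# Example with three layers and ten classes (random stand-in scores).
scores = [np.random.rand(10) for _ in range(3)]
prediction = ensemble_predict(scores)
```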
Resources
Paper: https://openreview.net/pdf?id=1w_Amtk67X
Code: https://github.com/ServiceNow/tac
Citation
```
@inproceedings{monteiro2023constraining,
  title={Constraining Representations Yields Models That Know What They Don't Know},
  author={Joao Monteiro and Pau Rodriguez and Pierre-Andre Noel and Issam H. Laradji and David Vazquez},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=1w_Amtk67X}
}
```