Retrospective on Interpretability Capture the Flag

Tl;dr

We reverse-engineered the small text classifier (a frozen all-MiniLM-L6-v2 encoder feeding a five-layer ReLU MLP) that predicts eight binary features from short text. At the analysis layer (layer L), seven features are linearly decodable with AUROC ≥ 0.995. The eighth feature - representing “country” - is not.

That does not mean the model doesn't represent "country". It’s present, but only turns linear once you condition on two other features: “food” and “sentiment”. Geometrically, activating the "country" feature contracts the food-by-sentiment parallelogram toward its centre. Because the four local "country" deltas cancel when summed, the global mean-difference direction collapses to near-zero, rendering the binary classifier invisible to any single global linear probe.

Task 1: Identify the non-linear features

Setup

The model architecture is:

Encoder: sentence-transformers/all-MiniLM-L6-v2 → 384-dimensional sentence embedding
MLP Head: Five linear layers with ReLU activations between them
Linear(384→64) → ReLU → Linear(64→64) → ReLU → Linear(64→64) → ReLU → Linear(64→64) → ReLU → Linear(64→8)
Output: 8 independent sigmoid heads for binary classification
Layer L (the analysis layer): the 64-dimensional post-ReLU output after the third linear layer (hidden layer 2)

01 Linear Probing

We cached layer L activations for all 7,000 training and 1,500 test examples, then trained a global linear regression probes for each of the eight labels.

Result

Our takeaway is that "country" is almost certainly the target feature.

The sheer difference in decodability tells us something important: it’s unlikely to be “difficult” to probe. If the feature was merely hard to decode, it wouldn’t necessarily drop linear probing all the way to chance.

Some guesses we had for what the representation might instead be:

XOR-like gating between dimensions
Some form of parity logic
Something completely unexpected

Task 1 Answer: The non-linearly represented feature is country.

Task 2: How "country" is represented

02 Is there any signal at all?

A chance-level global linear probe could mean one of two things:

Country is not represented at layer L.
Country is represented, but not by one global linear direction.

The second one is correct. Nonlinear and local probes recover "country" with high accuracy.

The question then follows: if "country" is not globally represented linearly, how is it represented?

Hypothesis 1: My working hypothesis is that different representations for countries are grouped together. The natural story for "kNN works but linear probes don't" is that each "country" forms its own cluster… ie. all the Japan sentences together, all the Italy sentences together, so the binary "country" label is a union of clusters that no single hyperplane can capture.

This seemed like a good place to start, especially since task 3 implied the representation was “weird”.

03 Country by "country" Grouping Hypothesis

To test it, we extracted country-specific labels from the raw text (regex over a fixed "country" list) and trained per-country MM and LR probes. They failed too. Individual countries were not linearly decodable at layer L either.

Interestingly:

Specific "country" probes are weak and noisy, with tiny positive counts per named country.
The best individual "country" probe (Norway) achieves only ~0.721 AUROC with a tiny test set, although this result in of itself might reflect a P-hacked linear representation over the test set of countries.
Nearest-neighbor analysis shows that nearby examples share most of the label bundle, not necessarily the same "country" identity.

One example of this was: "Yesterday, Henry sold the pathetic coins with a gold cover in Australia near gate fifteen.

labels: ['number', 'color', 'country', 'person']

nearest neighbours:

['number', 'color', 'person'] ['number', 'color', 'person'] ['number', 'color', 'person']

...

Its nearest training neighbors mostly share 'number', 'color', 'person' but not necessarily 'country', and not necessarily Australia.

For each non-country bundle B (the setting of the other features), we split the examples into B-with-country and B-without-country and trained a country probe restricted to that bundle. Within a bundle, country becomes strongly (often perfectly) linearly decodable.

Revised hypothesis #2: Country’s representation is context-dependent on the surrounding labels. Across all test-neighbor pairs, the mean full-label-set Jaccard similarity is 0.942-0.958, and 76.7-83.1% of neighbors are exact full-label-set matches. kNN succeeds because neighborhoods preserve the whole feature bundle and "country" is often the single bit that flips within a bundle.

04 “Bundle” hypothesis

The question we want to test is whether "country" is linear once the rest of the context is fixed.

Feature	Matched Pairs	Dir Cos Mean	Dir Cos Median	Dir Cos Std	Dir Cos Min	Dir Cos Max	Effect Cos Mean	Effect Cos Median	Effect Cos Std
0	food	44	-0.655899	-0.674944	0.144274	-0.876457	-0.282853	0.948873	0.989298	0.109125
1	sentiment	43	0.688248	0.707079	0.141092	0.287980	0.909225	0.833466	0.926625	0.221358
2	number	39	0.916846	0.966862	0.149468	0.355196	0.994951	0.304866	0.286315	0.365349
3	question	43	0.929662	0.971758	0.144661	0.353409	0.998526	0.138825	0.104187	0.403496
4	person	44	0.934508	0.981352	0.142935	0.352636	0.999118	0.085866	0.073157	0.478640
5	color	45	0.945744	0.986443	0.131947	0.345825	0.998031	0.035112	0.027702	0.435061
6	body_part	43	0.966375	0.979970	0.065418	0.557022	0.998510	0.147325	0.182786	0.493980

From within

The bar chart makes the gating structure explicit. For each non-country feature we split the data by that feature and trained a within-split country classifier; the bar is the resulting gain in country decodability. Controlling for food (+0.473) and sentiment (+0.405) recovers the overwhelming majority of the signal, while the next-best splitter (number, +0.170) and the remaining four (color +0.128, question +0.110, body_part +0.075, person +0.052) contribute comparatively little. This is the quantitative version of the point-cloud panels: food and sentiment are the two variables the country direction is conditioned on, and the other five leave it essentially untouched.

05 “Food” and “Sentiment” as important features

The point-cloud view shows what the bar charts are measuring. Each panel below freezes one food/sentiment combination. Inside that panel, blue and red points separate along a clean local direction. The direction is consistent within every combination, but it is not the same across combinations.

Within each food/sentiment cell the country direction is clean and stable, but the four cells disagree. Two numbers from the steering analysis pin this down: the δ₀₀ and δ₁₁ diagonal pairs are essentially anti-parallel (cosine = −0.99). So the four valid local directions are not noisy copies of one global arrow, they are arranged so that same-axis pairs oppose each other. Averaging them is what collapses the global signal to 0.079.

06 What is the geometry?

We can determine the geometry by starting with the linear vectors representing the centroid for “food” and “sentiment” without “country” present. Adding these vectors together, we get a simple parallelogram.

So, what happens when “country” is present?

Interesting! Geometrically, the presence of the "country" scales each corner towards the middle of the food/sentiment parallelogram.

That is the missing geometry from the global probe result. A global linear probe is looking for one arrow that means "country." The model has four arrows, with two sets pointing in anti-parallel directions.

These are all valid local "country" directions. They also cancel when averaged.

The geometry is visible when we match to real examples, which compares examples where “country” flips from present to not present. It shows the same pattern: individual matched arrows are noisy, but their averages point toward the same four inward directions.

And if we reconstruct the vector cube using the real examples:

Are the inside values of “country” contained within the 2D plane containing the “sentiment” and “food” centroids?

Split	Context	Cosine to Centroid Plane	Inward Cosine
train	F0 S0	0.9978	0.9903
train	F1 S0	0.9960	0.9948
train	F0 S1	0.9978	0.9943
train	F1 S1	0.9988	0.9896
test	F0 S0	0.9928	0.9777
test	F1 S0	0.9950	0.9916
test	F0 S1	0.9899	0.9871
test	F1 S1	0.9873	0.9781

The 'country' feature demonstrates a highly structured geometric representation. Across all training and test splits, the 'country' vector is almost perfectly contained within the 2D plane defined by the 'sentiment' and 'food' centroids (cosine to centroid plane ≥ 0.987). Furthermore, the vector consistently points inward toward the center of the food-by-sentiment parallelogram, evidenced by exceptionally high inward cosine values (≥ 0.977).

07 Why Global Direction Disappears

The model predicts "country" correctly (test accuracy: 0.964), so some later layer must unfold the hidden2 contraction. The layer progression shows this happens immediately in hidden layer 3:

Stage	Dimension	Country MM AUROC	Country LR AUROC	Country LR Best F1
hidden2 post-ReLU (layer L)	64	0.488	0.490	0.683
hidden3 post-ReLU (layer L+1)	64	0.979	0.994	0.976
logits	8	0.993	0.994	0.976

Figure 15Figure 2: Linear decodability of country across model stages. The dramatic jump from hidden2 to hidden3 confirms that the next ReLU layer unfolds the context-gated code.

This rules out the story that the final linear head alone does the work. By hidden3, "country" is already linearly available.

Inspecting the "country" logit readout at hidden3 reveals that the strongest contributors are anti-country detectors; hidden3 neurons with negative head weights that fire strongly on non-country examples:

Hidden3 Neuron	Country-Head Weight	Country Mean Act	Non-Country Mean Act	Contribution to "country" Gap
h3_00	−3.714	0.335	0.913	2.144
h3_56	−3.332	0.346	0.953	2.023
h3_05	−3.789	0.281	0.807	1.992
h3_60	−3.187	0.283	0.865	1.853
h3_11	−2.713	0.349	1.025	1.834

Individual hidden3 neurons are weak abstract "country" probes (best single-neuron AUROC ≈ 0.594). The representation is distributed: the next layer creates a readout space where the hidden2 context-gated code becomes linearly separable through multiple weak features acting in concert.

This readout appears to follow from our earlier findings:

The contraction geometry established what the "country" classifier does geometrically

When "country" is present, each food/sentiment corner is pulled inward toward the center of the parallelogram. The effect of "country" is therefore to reduce the activation magnitude along the food and sentiment axes. Country-present examples sit closer to the origin of the food–sentiment subspace; country-absent examples sit out at the corners.

The hidden3 readout establishes how the next layer reads that out.

The dominant hidden3 contributors to the "country" logit are anti-country detectors: neurons with large negative head weights that fire strongly on non-country examples. These neurons appear to be magnitude detectors on the food/sentiment axes.

What this tells us, is that "country" is likely encoded as a low projection on the food/sentiment axes, and the model reads it by detecting the absence of the high-magnitude corner signal.

Contraction-toward-center (the geometry) and detection-by-absence (the readout) unify our hypothesis.

08 Causal Steering Checks

The local "country" directions don’t appear to be simple probe artifacts. This is because steering the model's own hidden2 activations along the correct local direction changes its "country" prediction, while steering along the cancelled global direction barely moves it.

The additive ablation is the most important control here. Treat food and sentiment as a plain additive feature and it does basically nothing (0.504), indistinguishable from the global direction. Only the interaction of terms, which is what lets the contraction happen, recovers “country”. So the encoding is conditional, not additive.

Also noteworthy: the gap between the centroid probe and the interaction linear regression probe. The centroid uses the only “per-bundle” mean-difference direction, whereas the interaction LR fits an optimal per-bundle plane, and thus picks up the directional difference between the bundle with and without the “country” feature.

Example 1: A correctly classified food=0, sentiment=0, country=0 example

Steer Direction	α	Cosine vs. Local	Base P	Steered P	Flip?
Correct local δ₀₀	1.0	1.000	0.000016	0.937112	Yes
Anti-parallel δ₁₁	1.0	−0.995	0.000016	0.000000	No
Global mean-diff	1.0	−0.201	0.000016	0.000002	No

Example 2: A correctly classified food=0, sentiment=0, country=1 example

Steer Direction	α	Cosine vs. Local	Base P	Steered P	Flip?
Correct local δ₀₀	0.8	1.000	0.999507	0.999965	No
Anti-parallel δ₁₁	0.8	−0.995	0.999507	0.066832	Yes
Global mean-diff	0.8	−0.201	0.999507	0.998274	No

In part 2, we build our own interpretability challenge.

Task 3: Our Not-So-Honest Model: TrapXORHead

Published from Brisbane, Australia.