TL;DR

We reverse-engineered a small text classifier (a frozen all-MiniLM-L6-v2 encoder feeding a five-layer ReLU MLP) that predicts eight binary features from short text. At the analysis layer (layer L), seven features are linearly decodable with AUROC ≥ 0.995. The eighth feature, “country”, is not.

That does not mean the model fails to represent "country". The feature is present, but it only becomes linear once you condition on two other features: “food” and “sentiment”. Geometrically, activating the "country" feature contracts the food-by-sentiment parallelogram toward its centre. Because the four local "country" deltas cancel when summed, the global mean-difference direction collapses to near zero, which makes the feature invisible to any single global linear probe.

Task 1: Identify the non-linear features

Setup

The model architecture is:

  • Encoder: sentence-transformers/all-MiniLM-L6-v2 → 384-dimensional sentence embedding
  • MLP Head: Five linear layers with ReLU activations between them
  • Linear(384→64) → ReLU → Linear(64→64) → ReLU → Linear(64→64) → ReLU → Linear(64→64) → ReLU → Linear(64→8)
  • Output: 8 independent sigmoid heads for binary classification
  • Layer L (the analysis layer): the 64-dimensional post-ReLU output after the third linear layer (hidden2, counting hidden layers from zero)
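To make the layer indexing concrete, here is a minimal NumPy sketch of the MLP head described above. The weights are random placeholders and the input is a random stand-in for the MiniLM embeddings, so it only illustrates the shapes and where layer L sits. The function names `init_mlp` and `forward` are ours and hypothetical.

```python
import numpy as np

def init_mlp(rng, sizes=(384, 64, 64, 64, 64, 8)):
    """Random placeholder weights for the five linear layers (not the trained model)."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Return (sigmoid probabilities, list of post-ReLU hidden activations)."""
    hidden = []
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)   # Linear followed by ReLU
        hidden.append(h)
    W, b = params[-1]
    logits = h @ W + b                   # final Linear(64 -> 8), no ReLU
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs, hidden

rng = np.random.default_rng(0)
params = init_mlp(rng)
x = rng.normal(size=(4, 384))            # stand-in for 384-d sentence embeddings
probs, hidden = forward(params, x)
layer_L = hidden[2]                      # hidden2: post-ReLU output of the third linear layer
```

With this indexing, `hidden[3]` is the layer we later call hidden3 (layer L+1), and `probs` holds the eight independent sigmoid outputs.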

01 Linear Probing

We cached layer L activations for all 7,000 training and 1,500 test examples, then trained a global linear regression probe for each of the eight labels.
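As a rough sketch of what a global probe of this kind looks like, the snippet below fits a least-squares regression of a 0/1 label on activations and scores it with AUROC. The activations are synthetic (one label shifts a 64-dimensional Gaussian along a fixed direction), and the helper names are hypothetical, so this shows the procedure and not our actual pipeline.

```python
import numpy as np

def auroc(scores, labels):
    """P(random positive outscores random negative); ties count as half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def fit_linear_probe(X, y):
    """Least-squares regression of the 0/1 label on activations plus a bias column."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y.astype(float), rcond=None)
    return w

def probe_scores(w, X):
    """Apply a fitted probe to new activations."""
    return np.hstack([X, np.ones((len(X), 1))]) @ w

# Synthetic stand-in for cached layer-L activations: the label shifts the
# activation along one fixed direction, so it is linearly decodable.
rng = np.random.default_rng(0)
direction = rng.normal(size=64)
direction /= np.linalg.norm(direction)
y_train, y_test = rng.integers(0, 2, 700), rng.integers(0, 2, 150)
X_train = rng.normal(size=(700, 64)) + 4.0 * np.outer(y_train, direction)
X_test = rng.normal(size=(150, 64)) + 4.0 * np.outer(y_test, direction)

w = fit_linear_probe(X_train, y_train)
test_auroc = auroc(probe_scores(w, X_test), y_test)
```

A feature encoded this way gives a test AUROC close to 1, which is the regime the seven linearly decodable features fall into.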

Result

Figure 02

Our takeaway is that "country" is almost certainly the target feature.

The sheer size of the gap in decodability tells us something important: "country" is unlikely to be merely “difficult” to probe. If the feature were just hard to decode, we would expect a degraded probe, not one that drops all the way to chance.

Some guesses we had for what the representation might instead be:

  • XOR-like gating between dimensions
  • Some form of parity logic
  • Something completely unexpected

Task 1 Answer: The non-linearly represented feature is country.

Task 2: How "country" is represented

02 Is there any signal at all?

A chance-level global linear probe could mean one of two things:

  1. Country is not represented at layer L.
  2. Country is represented, but not by one global linear direction.

The second one is correct. Nonlinear and local probes recover "country" with high accuracy.
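The contrast between a global linear probe and a local probe can be reproduced on a toy problem. The sketch below is a 2-D caricature and not our data: four made-up food/sentiment corners, with "country" pulling each corner toward the centre by an assumed factor `k`. A least-squares linear probe lands near chance AUROC, while a plain k-nearest-neighbour vote recovers the label.

```python
import numpy as np

def make_toy(rng, n_per_cell, k=0.4, noise=0.2):
    """Four food/sentiment corners; 'country' contracts a corner toward the centre."""
    a, b = np.array([4.0, 0.0]), np.array([1.0, 3.0])   # made-up food/sentiment offsets
    centre = (a + b) / 2
    X, y = [], []
    for food in (0, 1):
        for sent in (0, 1):
            corner = food * a + sent * b
            for country in (0, 1):
                mean = centre + k * (corner - centre) if country else corner
                X.append(mean + noise * rng.normal(size=(n_per_cell, 2)))
                y.append(np.full(n_per_cell, country))
    return np.vstack(X), np.concatenate(y)

def auroc(scores, labels):
    """P(random positive outscores random negative); ties count as half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def knn_predict(X_train, y_train, X_query, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

rng = np.random.default_rng(0)
X_train, y_train = make_toy(rng, 100)
X_test, y_test = make_toy(rng, 100)

# Global linear probe: least squares on [x, 1].
A_train = np.hstack([X_train, np.ones((len(X_train), 1))])
w, *_ = np.linalg.lstsq(A_train, y_train.astype(float), rcond=None)
A_test = np.hstack([X_test, np.ones((len(X_test), 1))])
linear_auroc = auroc(A_test @ w, y_test)

knn_accuracy = float((knn_predict(X_train, y_train, X_test) == y_test).mean())
```

In this toy the whole configuration is point-symmetric about the centre, so no single direction can rank country-present above country-absent examples, yet every example's neighbourhood is dominated by its own cell.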

Figure 03

The question then follows: if "country" is not globally represented linearly, how is it represented?

Hypothesis 1: Our working hypothesis was that examples are grouped by the specific country they mention. The natural story for "kNN works but linear probes don't" is that each country forms its own cluster (i.e. all the Japan sentences together, all the Italy sentences together), so the binary "country" label is a union of clusters that no single hyperplane can capture.

This seemed like a good place to start, especially since task 3 implied the representation was “weird”.

03 Country-by-Country Grouping Hypothesis

To test it, we extracted country-specific labels from the raw text (regex over a fixed "country" list) and trained per-country MM and LR probes. They failed too. Individual countries were not linearly decodable at layer L either.

Figure 04

Interestingly:

  • Specific "country" probes are weak and noisy, with tiny positive counts per named country.
  • The best individual "country" probe (Norway) achieves only ~0.721 AUROC on a tiny test set, and even this result may in and of itself be an artifact: picking the best of many per-country probes on the same small test set is a form of p-hacking.
  • Nearest-neighbor analysis shows that nearby examples share most of the label bundle, not necessarily the same "country" identity.

One example of this was: "Yesterday, Henry sold the pathetic coins with a gold cover in Australia near gate fifteen."

labels: ['number', 'color', 'country', 'person']

nearest neighbours:

['number', 'color', 'person'] ['number', 'color', 'person'] ['number', 'color', 'person']

...

Its nearest training neighbors mostly share 'number', 'color', 'person' but not necessarily 'country', and not necessarily Australia.

For each non-country bundle B (the setting of the other features), we split the examples into B-with-country and B-without-country and trained a country probe restricted to that bundle. Within a bundle, country becomes strongly (often perfectly) linearly decodable.

Figure 05

Revised hypothesis #2: Country’s representation is context-dependent on the surrounding labels. Across all test-neighbor pairs, the mean full-label-set Jaccard similarity is 0.942-0.958, and 76.7-83.1% of neighbors are exact full-label-set matches. kNN succeeds because neighborhoods preserve the whole feature bundle and "country" is often the single bit that flips within a bundle.
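For reference, the two neighbourhood statistics quoted here (mean Jaccard similarity and exact-match rate over full label sets) can be computed as in the sketch below. It uses only the single query and three neighbours from the Australia example above, so the numbers it produces describe that one case and not the dataset-wide figures.

```python
def jaccard(a, b):
    """|a ∩ b| / |a ∪ b| for two label sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# The example from the text: the query has 'country', its neighbours do not.
query = ["number", "color", "country", "person"]
neighbours = [["number", "color", "person"]] * 3

similarities = [jaccard(query, n) for n in neighbours]
mean_jaccard = sum(similarities) / len(similarities)   # 3 shared labels out of 4 in the union
exact_match_rate = sum(set(query) == set(n) for n in neighbours) / len(neighbours)
```

Here each neighbour shares three of four labels with the query, which is the "single bit that flips within a bundle" pattern.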

04 “Bundle” hypothesis

The question we want to test is whether "country" is linear once the rest of the context is fixed.

| # | Feature | Matched Pairs | Dir Cos Mean | Dir Cos Median | Dir Cos Std | Dir Cos Min | Dir Cos Max | Effect Cos Mean | Effect Cos Median | Effect Cos Std |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | food | 44 | -0.655899 | -0.674944 | 0.144274 | -0.876457 | -0.282853 | 0.948873 | 0.989298 | 0.109125 |
| 1 | sentiment | 43 | 0.688248 | 0.707079 | 0.141092 | 0.287980 | 0.909225 | 0.833466 | 0.926625 | 0.221358 |
| 2 | number | 39 | 0.916846 | 0.966862 | 0.149468 | 0.355196 | 0.994951 | 0.304866 | 0.286315 | 0.365349 |
| 3 | question | 43 | 0.929662 | 0.971758 | 0.144661 | 0.353409 | 0.998526 | 0.138825 | 0.104187 | 0.403496 |
| 4 | person | 44 | 0.934508 | 0.981352 | 0.142935 | 0.352636 | 0.999118 | 0.085866 | 0.073157 | 0.478640 |
| 5 | color | 45 | 0.945744 | 0.986443 | 0.131947 | 0.345825 | 0.998031 | 0.035112 | 0.027702 | 0.435061 |
| 6 | body_part | 43 | 0.966375 | 0.979970 | 0.065418 | 0.557022 | 0.998510 | 0.147325 | 0.182786 | 0.493980 |

From within

Figure 06

The bar chart makes the gating structure explicit. For each non-country feature we split the data by that feature and trained a within-split country classifier; the bar is the resulting gain in country decodability. Controlling for food (+0.473) and sentiment (+0.405) recovers the overwhelming majority of the signal, while the next-best splitter (number, +0.170) and the remaining four (color +0.128, question +0.110, body_part +0.075, person +0.052) contribute comparatively little. This is the quantitative version of the point-cloud panels: food and sentiment are the two variables the country direction is conditioned on, and the other five leave it essentially untouched.

05 “Food” and “Sentiment” as important features

The point-cloud view shows what the bar charts are measuring. Each panel below freezes one food/sentiment combination. Inside that panel, blue and red points separate along a clean local direction. The direction is consistent within every combination, but it is not the same across combinations.

Figure 07

Within each food/sentiment cell the country direction is clean and stable, but the four cells disagree. Two numbers from the steering analysis pin this down. First, the δ₀₀ and δ₁₁ directions, which sit on opposite corners of the diagonal, are essentially anti-parallel (cosine = −0.99). Second, averaging the four local directions collapses the global signal to 0.079. So the four valid local directions are not noisy copies of one global arrow; they are arranged so that diagonal pairs oppose each other.

Figure 08

06 What is the geometry?

We can recover the geometry by starting with the centroid vectors for “food” and “sentiment” with “country” absent. Adding these vectors together gives a simple parallelogram.

Figure 09
Figure 10

So, what happens when “country” is present?

Figure 11

Interesting! Geometrically, the presence of "country" pulls each corner toward the middle of the food/sentiment parallelogram.

That is the geometry the global probe result was missing. A global linear probe is looking for one arrow that means "country." The model has four arrows, arranged in two anti-parallel pairs.

These are all valid local "country" directions. They also cancel when averaged.
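The cancellation is a direct consequence of contracting a parallelogram toward its own centre, and it can be checked in a few lines. The sketch below uses made-up 2-D offsets `a` and `b` for food and sentiment and an assumed contraction factor `k`; none of these values come from the model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up food and sentiment offsets spanning a parallelogram in 2-D.
a, b = np.array([4.0, 0.0]), np.array([1.0, 3.0])
centre = (a + b) / 2
k = 0.4   # assumed contraction: country-present corners keep 40% of their offset from the centre

deltas = {}
for food in (0, 1):
    for sent in (0, 1):
        corner = food * a + sent * b
        with_country = centre + k * (corner - centre)
        deltas[(food, sent)] = with_country - corner   # local "country" direction for this cell

global_direction = sum(deltas.values()) / 4            # average of the four local arrows
diag_cos = cosine(deltas[(0, 0)], deltas[(1, 1)])
anti_diag_cos = cosine(deltas[(1, 0)], deltas[(0, 1)])
```

Each delta equals (k − 1) times the corner's offset from the centre, and those offsets sum to zero over the four corners, so `global_direction` vanishes while both diagonal pairs have cosine −1. Real activations are noisier, which is why the measured values are near, not exactly at, these limits.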

The same geometry is visible in real examples. We matched pairs of examples in which “country” flips from present to absent, and the pattern holds: individual matched arrows are noisy, but their averages point toward the same four inward directions.

Figure 12

And if we reconstruct the vector cube using the real examples:

Figure 13

Do the inward “country” deltas lie within the 2D plane containing the “sentiment” and “food” centroids?

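| Split | Context | Cosine to Centroid Plane | Inward Cosine |
|---|---|---|---|
| train | F0 S0 | 0.9978 | 0.9903 |
| train | F1 S0 | 0.9960 | 0.9948 |
| train | F0 S1 | 0.9978 | 0.9943 |
| train | F1 S1 | 0.9988 | 0.9896 |
| test | F0 S0 | 0.9928 | 0.9777 |
| test | F1 S0 | 0.9950 | 0.9916 |
| test | F0 S1 | 0.9899 | 0.9871 |
| test | F1 S1 | 0.9873 | 0.9781 |

Figure 14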

The 'country' feature demonstrates a highly structured geometric representation. Across all training and test splits, the 'country' vector is almost perfectly contained within the 2D plane defined by the 'sentiment' and 'food' centroids (cosine to centroid plane ≥ 0.987). Furthermore, the vector consistently points inward toward the center of the food-by-sentiment parallelogram, evidenced by exceptionally high inward cosine values (≥ 0.977).
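One plausible way to compute the two quantities in the table is sketched below. We assume "cosine to centroid plane" means the cosine between a local country delta and its orthogonal projection onto the plane spanned by the food and sentiment offsets, and "inward cosine" means the cosine between the delta and the arrow from its corner to the parallelogram centre. The vectors here are random stand-ins, and the function names are hypothetical.

```python
import numpy as np

def cosine_to_plane(v, basis_vectors):
    """Cosine between v and its orthogonal projection onto span(basis_vectors)."""
    Q, _ = np.linalg.qr(np.stack(basis_vectors, axis=1))   # orthonormal basis, shape (d, 2)
    projection = Q @ (Q.T @ v)
    return float(np.linalg.norm(projection) / np.linalg.norm(v))

def inward_cosine(delta, corner, centre):
    """Cosine between a local country delta and the arrow from its corner to the centre."""
    inward = centre - corner
    return float(delta @ inward / (np.linalg.norm(delta) * np.linalg.norm(inward)))

rng = np.random.default_rng(0)
d = 64
food_vec, sent_vec = rng.normal(size=d), rng.normal(size=d)   # hypothetical centroid offsets
corner = np.zeros(d)                                          # the F0 S0 corner
centre = (food_vec + sent_vec) / 2
# A delta that is mostly inward, plus a small off-plane perturbation.
delta = 0.6 * (centre - corner) + 0.01 * rng.normal(size=d)

plane_cos = cosine_to_plane(delta, [food_vec, sent_vec])
inward_cos = inward_cosine(delta, corner, centre)
```

Under these definitions a value near 1 in the first column says the delta barely leaves the food/sentiment plane, and a value near 1 in the second says it points almost exactly at the centre.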

07 Why Global Direction Disappears

The model predicts "country" correctly (test accuracy: 0.964), so some later layer must unfold the hidden2 contraction. The layer progression shows this happens immediately in hidden layer 3:

| Stage | Dimension | Country MM AUROC | Country LR AUROC | Country LR Best F1 |
|---|---|---|---|---|
| hidden2 post-ReLU (layer L) | 64 | 0.488 | 0.490 | 0.683 |
| hidden3 post-ReLU (layer L+1) | 64 | 0.979 | 0.994 | 0.976 |
| logits | 8 | 0.993 | 0.994 | 0.976 |

Figure 2: Linear decodability of country across model stages. The dramatic jump from hidden2 to hidden3 confirms that the next ReLU layer unfolds the context-gated code.
Figure 15

This rules out the story that the final linear head alone does the work. By hidden3, "country" is already linearly available.

Inspecting the "country" logit readout at hidden3 reveals that the strongest contributors are anti-country detectors: hidden3 neurons with negative head weights that fire strongly on non-country examples.

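| Hidden3 Neuron | Country-Head Weight | Country Mean Act | Non-Country Mean Act | Contribution to "country" Gap |
|---|---|---|---|---|
| h3_00 | −3.714 | 0.335 | 0.913 | 2.144 |
| h3_56 | −3.332 | 0.346 | 0.953 | 2.023 |
| h3_05 | −3.789 | 0.281 | 0.807 | 1.992 |
| h3_60 | −3.187 | 0.283 | 0.865 | 1.853 |
| h3_11 | −2.713 | 0.349 | 1.025 | 1.834 |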

Individual hidden3 neurons are weak abstract "country" probes (best single-neuron AUROC ≈ 0.594). The representation is distributed: the next layer creates a readout space where the hidden2 context-gated code becomes linearly separable through multiple weak features acting in concert.
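The last column of the neuron table is consistent with a simple reading: contribution = head weight × (mean activation on country examples − mean activation on non-country examples). The sketch below recomputes it from the rounded values in the table. The formula is our assumption about how the column is defined, and it matches the reported figures only up to rounding.

```python
# (head weight, mean act on country, mean act on non-country, reported contribution),
# copied from the table of top hidden3 contributors.
neurons = {
    "h3_00": (-3.714, 0.335, 0.913, 2.144),
    "h3_56": (-3.332, 0.346, 0.953, 2.023),
    "h3_05": (-3.789, 0.281, 0.807, 1.992),
    "h3_60": (-3.187, 0.283, 0.865, 1.853),
    "h3_11": (-2.713, 0.349, 1.025, 1.834),
}

def gap_contribution(weight, country_act, non_country_act):
    """Assumed definition: how much this neuron widens the country logit gap."""
    return weight * (country_act - non_country_act)

recomputed = {name: gap_contribution(w, c, n) for name, (w, c, n, _) in neurons.items()}
max_abs_error = max(abs(recomputed[name] - vals[3]) for name, vals in neurons.items())
```

A negative weight times a negative activation gap gives a positive contribution, which is why these "anti-country" neurons end up pushing the country logit in the right direction.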

This readout appears to follow from our earlier findings:

  1. The contraction geometry establishes what the "country" feature does geometrically.

When "country" is present, each food/sentiment corner is pulled inward toward the center of the parallelogram. The effect of "country" is therefore to reduce the activation magnitude along the food and sentiment axes. Country-present examples sit closer to the origin of the food–sentiment subspace; country-absent examples sit out at the corners.

  2. The hidden3 readout establishes how the next layer reads that out.

The dominant hidden3 contributors to the "country" logit are anti-country detectors: neurons with large negative head weights that fire strongly on non-country examples. These neurons appear to be magnitude detectors on the food/sentiment axes.

This tells us that "country" is likely encoded as a low projection on the food/sentiment axes, and that the model reads it by detecting the absence of the high-magnitude corner signal.

Contraction-toward-center (the geometry) and detection-by-absence (the readout) combine into a single account of how "country" is encoded and read.

08 Causal Steering Checks

The local "country" directions do not appear to be mere probe artifacts: steering the model's own hidden2 activations along the correct local direction changes its "country" prediction, while steering along the cancelled global direction barely moves it.

Figure 16

The additive ablation is the most important control here. Treat food and sentiment as plain additive features and the probe does basically nothing (0.504), indistinguishable from the global direction. Only the interaction terms, which are what let the contraction happen, recover “country”. So the encoding is conditional, not additive.

Also noteworthy is the gap between the centroid probe and the interaction linear regression probe. The centroid probe uses only the per-bundle mean-difference direction, whereas the interaction LR fits an optimal per-bundle plane, and thus picks up the directional difference between the bundle with and without the “country” feature.
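The difference between an additive and an interaction probe can be illustrated on a 2-D toy version of the contraction geometry. In the sketch below, the additive probe sees `[x, food, sentiment, 1]`, so context can only shift the score, while the interaction probe gets a separate copy of `[x, 1]` per food/sentiment cell, which amounts to one linear probe per bundle. The data, the contraction factor `k`, and the feature constructions are our assumptions for illustration, not the exact ablation we ran.

```python
import numpy as np

def make_toy(rng, n_per_cell, k=0.4, noise=0.2):
    """Four food/sentiment corners; 'country' contracts a corner toward the centre."""
    a, b = np.array([4.0, 0.0]), np.array([1.0, 3.0])   # made-up offsets
    centre = (a + b) / 2
    X, ctx, y = [], [], []
    for food in (0, 1):
        for sent in (0, 1):
            corner = food * a + sent * b
            for country in (0, 1):
                mean = centre + k * (corner - centre) if country else corner
                X.append(mean + noise * rng.normal(size=(n_per_cell, 2)))
                ctx.append(np.tile([food, sent], (n_per_cell, 1)))
                y.append(np.full(n_per_cell, country))
    return np.vstack(X), np.vstack(ctx), np.concatenate(y)

def auroc(scores, labels):
    """P(random positive outscores random negative); ties count as half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def additive_features(X, ctx):
    """[x, food, sentiment, 1]: context only shifts the score."""
    return np.hstack([X, ctx, np.ones((len(X), 1))])

def interaction_features(X, ctx):
    """One copy of [x, 1] per food/sentiment cell: a separate plane per bundle."""
    onehot = np.eye(4)[2 * ctx[:, 0] + ctx[:, 1]]
    xb = np.hstack([X, np.ones((len(X), 1))])
    return (onehot[:, :, None] * xb[:, None, :]).reshape(len(X), -1)

def fit_and_score(features, train, test):
    """Least-squares probe on the given feature map, scored by test AUROC."""
    (X_tr, c_tr, y_tr), (X_te, c_te, y_te) = train, test
    w, *_ = np.linalg.lstsq(features(X_tr, c_tr), y_tr.astype(float), rcond=None)
    return auroc(features(X_te, c_te) @ w, y_te)

rng = np.random.default_rng(0)
train, test = make_toy(rng, 100), make_toy(rng, 100)
additive_auroc = fit_and_score(additive_features, train, test)
interaction_auroc = fit_and_score(interaction_features, train, test)
```

In this toy, knowing food and sentiment does not help an additive probe, because the country delta changes direction from cell to cell; only a probe whose direction depends on the cell can separate the classes.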

Example 1: A correctly classified food=0, sentiment=0, country=0 example

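| Steer Direction | α | Cosine vs. Local | Base P | Steered P | Flip? |
|---|---|---|---|---|---|
| Correct local δ₀₀ | 1.0 | 1.000 | 0.000016 | 0.937112 | Yes |
| Anti-parallel δ₁₁ | 1.0 | −0.995 | 0.000016 | 0.000000 | No |
| Global mean-diff | 1.0 | −0.201 | 0.000016 | 0.000002 | No |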

Example 2: A correctly classified food=0, sentiment=0, country=1 example

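| Steer Direction | α | Cosine vs. Local | Base P | Steered P | Flip? |
|---|---|---|---|---|---|
| Correct local δ₀₀ | 0.8 | 1.000 | 0.999507 | 0.999965 | No |
| Anti-parallel δ₁₁ | 0.8 | −0.995 | 0.999507 | 0.066832 | Yes |
| Global mean-diff | 0.8 | −0.201 | 0.999507 | 0.998274 | No |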
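The steering procedure in the two examples above amounts to adding α times a direction to the hidden2 activation and re-running the rest of the model. The sketch below shows that mechanic on a 2-D toy. The function `toy_country_prob` is a hypothetical stand-in for the layers after L (it says "country" when the activation is close to the parallelogram centre), and the offsets, `k`, `radius` and `gain` are all made up.

```python
import numpy as np

def steer(h, direction, alpha):
    """Activation steering: shift the hidden state by alpha times a direction."""
    return h + alpha * direction

def toy_country_prob(h, centre, radius=1.6, gain=8.0):
    """Hypothetical downstream readout: high probability when h is near the centre."""
    distance = np.linalg.norm(h - centre)
    return float(1.0 / (1.0 + np.exp(-gain * (radius - distance))))

# Made-up food/sentiment offsets and an assumed contraction factor.
a, b = np.array([4.0, 0.0]), np.array([1.0, 3.0])
centre = (a + b) / 2
k = 0.4
corner_00 = np.zeros(2)
corner_11 = a + b
delta_00 = (centre + k * (corner_00 - centre)) - corner_00   # local direction for food=0, sentiment=0
delta_11 = (centre + k * (corner_11 - centre)) - corner_11   # local direction for food=1, sentiment=1

h = corner_00                                                # a food=0, sentiment=0, country=0 activation
base_p = toy_country_prob(h, centre)
local_p = toy_country_prob(steer(h, delta_00, 1.0), centre)  # steer along the correct local direction
anti_p = toy_country_prob(steer(h, delta_11, 1.0), centre)   # steer along the anti-parallel direction
```

The toy reproduces the qualitative pattern of Example 1: the correct local direction moves the activation toward the centre and flips the prediction, while the anti-parallel direction pushes it further out and leaves the prediction at "no country".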

In part 2, we build our own interpretability challenge.

Task 3: Our Not-So-Honest Model: TrapXORHead

Published from Brisbane, Australia.