Research
4/10/2026 Three Functional Roles of the Per-Layer Embedding Gate in Gemma-4 E2B
I ran polysemy tests, magnitude decompositions, and a full causal ablation battery across all 35 layers of Gemma 4 E2B's Per-Layer Embedding (PLE) gate. The gate contains at least three independent mechanisms with different causal signatures: Layer 6 carries word-sense information correlationally, but its causal contribution is syntactic and lexical; Layers 13-14 inject a massive token-identity signal that is net-harmful on English and German but net-helpful on Chinese; Layer 33 is a late-stage output prior whose removal is catastrophic (+1.59 NLL). The primary evidence for the Layers 13-14 finding is mean-ablation (−0.159 nats, P=1.000 on 500k tokens), not zero-ablation. The Chinese sign flip calls for domain-conditioned analysis, not uniform pruning. Treating PLE as a single mechanism is the wrong unit of analysis.
3/28/2026 Cracking Open Gemma 3 4B Part 2: Transcoders And Generation-Time Behavioral Circuits
SAEs found strong encoding-time features for sycophancy and over-refusal, but missed generation-time behaviors entirely. Transcoders, which decompose MLP computation rather than residual stream state, unlock overconfidence as a guardrail-viable behavior and reveal that some behaviors are states decided at encoding while others are computations that unfold during generation.
3/18/2026 Large Audio Deepfake detection models perform well on academic benchmarks but fail in the real world compared to smaller models
A 2M-parameter model with no pretrained backbone beat my 350M-parameter WavLM pipeline by 24 percentage points on out-of-distribution data. I ran 50 experiments across four architectures, multiple datasets, and different audio codecs. The results inverted every assumption I had.
3/18/2026 Cracking Open Gemma 3 4B Part 1: Finding Behavioral Circuits With Sparse Autoencoders
I ran contrastive feature discovery across six model behaviors, four layers, and hundreds of prompts to find SAE features that reliably detect sycophancy, over-refusal, hallucination, and more. Sycophancy produced features so strong a runtime guardrail is immediately viable. Hallucination produced almost nothing. The difference comes down to where in the forward pass each behavior lives.