Blog
4/10/2026
Three Functional Roles of the Per-Layer Embedding Gate in Gemma 4 E2B
I ran polysemy tests, magnitude decompositions, and a full causal ablation battery across all 35 layers of Gemma 4 E2B's Per-Layer Embedding gate. The gate contains at least three independent mechanisms with different causal signatures: Layer 6 carries word-sense information correlationally, but its causal contribution is syntactic and lexical; Layers 13-14 inject a massive token-identity signal that is net-harmful on English and German but net-helpful on Chinese; Layer 33 is a late-stage output prior whose removal is catastrophic (+1.59 NLL). The primary evidence for the L13/14 finding is mean-ablation (−0.159 nats, P=1.000 on 500k tokens), not zero-ablation. The Chinese sign flip means pruning must be domain-conditioned, not uniform. Treating PLE as a single mechanism is the wrong unit of analysis.
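The distinction above between mean-ablation and zero-ablation can be sketched in a few lines. This is a toy illustration with hypothetical shapes, not Gemma's actual PLE gate: mean-ablation replaces an activation with its average over a calibration set, preserving the layer's mean contribution downstream, while zero-ablation removes the signal entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations standing in for a layer's gate output
# (hypothetical dimensions, not Gemma's real module).
acts = rng.normal(size=(500, 8))      # calibration set: 500 tokens, 8 dims
mean_act = acts.mean(axis=0)          # per-dimension mean activation

def zero_ablate(a):
    # Zero-ablation: remove the signal entirely.
    return np.zeros_like(a)

def mean_ablate(a):
    # Mean-ablation: replace each token's activation with the dataset mean,
    # preserving the layer's average downstream contribution.
    return np.broadcast_to(mean_act, a.shape).copy()

batch = rng.normal(size=(4, 8))
za, ma = zero_ablate(batch), mean_ablate(batch)

assert za.sum() == 0.0                # zero-ablation kills the signal
assert np.allclose(ma, mean_act)      # every row equals the mean vector
```

Because mean-ablation keeps the layer's average contribution in place, a large NLL gap between the two conditions isolates the token-specific part of the signal rather than the layer's bias-like baseline.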
3/28/2026
Cracking Open Gemma 3 4B Part 2: Transcoders and Generation-Time Behavioral Circuits
SAEs found strong encoding-time features for sycophancy and over-refusal, but missed generation-time behaviors entirely. Transcoders, which decompose MLP computation rather than residual stream state, unlock overconfidence as a guardrail-viable behavior and reveal that some behaviors are states decided at encoding while others are computations that unfold during generation.
3/18/2026
Large audio deepfake detection models perform well on academic benchmarks but fail in the real world compared to smaller models
A 2M parameter model with no pretrained backbone beat my 350M parameter WavLM pipeline by 24 percentage points on out-of-distribution data. I ran 50 experiments across four architectures, multiple datasets, and different audio codecs. The results inverted every assumption I had.
3/18/2026
Cracking Open Gemma 3 4B Part 1: Finding Behavioral Circuits With Sparse Autoencoders
I ran contrastive feature discovery across six model behaviors, four layers, and hundreds of prompts to find SAE features that reliably detect sycophancy, over-refusal, hallucination, and more. Sycophancy produced features so strong a runtime guardrail is immediately viable. Hallucination produced almost nothing. The difference comes down to where in the forward pass each behavior lives.
2/13/2026
Building AI text detection that explains itself
Most AI detectors give you a percentage and call it a day. We built one that shows you which sentences triggered the verdict, how much each one mattered, and why. It uses attention-based attribution on a sliding window transformer.
7/13/2025
Are words the best building blocks for AI?
Language tokens are a poor substrate for grounded intelligence. This post argues for structured, world-centric tokens (geometry, dynamics, agency, causality) and outlines bridging mechanisms like cross-attention and V-JEPA to connect language with learned perceptual models.
7/9/2025
Agents that learn in production
Most LLM agents don’t learn from deployment experience, but that’s changing. This post covers the research (DPO, GRPO, continual learning), the tooling (OpenPipe’s ART framework), and the first real production deployment (Cursor’s real-time RL for Composer), plus the hard gaps still left to close.
3/25/2025
The unique risks of audio deepfakes
Human detection of voice deepfakes is unreliable (60–73% accuracy); automated detectors hit 98%+ in the lab but fail to generalize to unseen attacks. Risks are rising; mitigation requires provenance standards, robust field-trained detectors, and on-device voice verification.
1/4/2025
Design Principles for AI-Based Mental Health Tools
Lessons from building Flourish: addressing sycophancy and the risk of echo chambers, adding user-controlled stateful memory, structuring sessions, anchoring to evidence-based techniques, enforcing therapeutic boundaries, and building specialized, auditable systems.