Ian Bigford

AI-SPY Text Detection: AI Writing Detection That Shows Its Work

2/17/2026

Overview

Live demo: app.ai-spy.xyz

AI-SPY started as an audio deepfake detector. The text detection system is the second modality — same platform, same philosophy: don't just classify, explain. You paste text, you get back a per-sentence breakdown showing what the model thinks is AI, what it thinks is human, and which parts drove the verdict. No code is publicly available for this feature right now.

Why I built this

Most AI text detectors hand you a single percentage and call it a day. "87% AI-generated." Okay — which parts? Why? What do I do with that number? If you're a teacher looking at a student's essay, a percentage creates suspicion. Highlighted sentences with confidence levels create a conversation.

The deeper problem is that real-world text is almost never purely AI or purely human anymore. Someone drafts an outline, generates a few paragraphs with ChatGPT, rewrites half of it by hand, and submits the result. Binary detectors can't handle this. They weren't designed for the world we actually live in.

So I wanted to build something that operates at multiple levels of granularity — sentence, paragraph, and document — and that surfaces the model's own reasoning instead of hiding it behind a number. The idea isn't to be a lie detector. It's to be a tool that earns trust by showing its work.

How it works

The system has two detection approaches running behind the same API, each designed for different trade-offs.

The dual-path detector is inspired by the GPTZero architecture. It uses a shared DeBERTa V3 backbone with two parallel analysis paths:

  • Sentence path ("the magnifying glass") — encodes each sentence individually, takes the [CLS] token embedding, combines it with perplexity features computed from GPT-2, and classifies the sentence as AI or human. Perplexity measures how "surprised" a language model is by the text — AI-generated text tends to be more predictable, so it scores lower.

  • Paragraph path ("the wide-angle lens") — encodes paragraphs with mean pooling and combines the embedding with burstiness features. Burstiness is the variation in perplexity across sentences within a paragraph. Human writers naturally vary — some sentences are simple, others complex. AI tends to write at a more uniform difficulty level. Low burstiness is a signal.

Burstiness comparison — human writing shows erratic spikes in sentence complexity while AI writing stays flat and uniform
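
To make those two features concrete, here is a minimal sketch of how per-sentence perplexity and per-paragraph burstiness can be computed with GPT-2. The helper names and the standard-deviation definition of burstiness are illustrative choices, not taken from the actual codebase.

```python
# Sketch only: per-sentence perplexity and per-paragraph burstiness features.
# The helper names and the std-dev definition of burstiness are assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_perplexity(sentence: str) -> float:
    """Perplexity of one sentence under GPT-2 (lower = more predictable)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model return mean token-level cross-entropy.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def paragraph_burstiness(sentences: list[str]) -> float:
    """Burstiness: how much perplexity varies across a paragraph's sentences."""
    ppls = torch.tensor([sentence_perplexity(s) for s in sentences])
    return ppls.std().item() if len(sentences) > 1 else 0.0
```

Flat, uniformly low perplexity across a paragraph is exactly the low-burstiness signature the paragraph path looks for.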

Document-level predictions emerge from aggregating sentence and paragraph predictions using a weighted combination (60/40 by default). The model is trained so that getting sentence and paragraph predictions right automatically produces accurate document predictions — no separate document head that might disagree with the granular analysis.

Dual-Path Detector Architecture — the sentence path acts as a magnifying glass while the paragraph path captures the wide-angle view, merging at a 60/40 weighted aggregation
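
In code, the default aggregation amounts to something like the sketch below. Only the 60/40 split comes from the description above; averaging each level's probabilities with a plain mean is my assumption.

```python
def document_ai_probability(sentence_probs: list[float],
                            paragraph_probs: list[float],
                            sentence_weight: float = 0.6) -> float:
    """Combine per-sentence and per-paragraph AI probabilities (60/40 default).

    The plain means below are an assumption; only the weighting is documented.
    """
    sent_avg = sum(sentence_probs) / len(sentence_probs)
    para_avg = sum(paragraph_probs) / len(paragraph_probs)
    return sentence_weight * sent_avg + (1 - sentence_weight) * para_avg
```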

The chunk-level detector takes a different approach. Instead of encoding sentences independently, it processes full 512-token chunks through DeBERTa where every token can attend to every other token. This gives much richer contextual signal. It classifies each chunk as AI, Human, or Mixed (three classes instead of two), and uses the [CLS] token's attention weights across all layers and heads as importance scores — a built-in attribution mechanism that shows which tokens the model found most diagnostic.
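
A rough sketch of that attribution idea is below, using the public microsoft/deberta-v3-small checkpoint as a stand-in for the fine-tuned three-class model (which isn't released). Averaging the [CLS] attention row over every layer and head follows the description above; the exact pooling in the real system may differ.

```python
# Sketch: [CLS] attention as token importance. The fine-tuned checkpoint is not
# public, so the base model with a fresh 3-class head stands in for it here.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-small", num_labels=3, output_attentions=True
).eval()

def classify_chunk(text: str):
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc)
    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    attn = torch.stack(out.attentions)       # (layers, 1, heads, seq, seq)
    cls_row = attn[:, 0, :, 0, :]            # attention from [CLS] to every token
    importance = cls_row.mean(dim=(0, 1))    # average over layers and heads
    probs = out.logits.softmax(dim=-1)[0]    # AI / Human / Mixed probabilities
    return probs, importance
```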

For documents longer than 512 tokens, a sliding window with 256-token stride creates overlapping chunks. Predictions are aggregated across overlapping regions, and importance scores are mapped back to character positions and aligned to sentence boundaries. The overlap is important — without it, sentences at chunk boundaries get misclassified because neither chunk sees them in full context.
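
Hugging Face's fast tokenizers can produce exactly this kind of overlapping window along with character offsets, so a sketch of the chunking step might look like the following. Everything beyond the 512-token window and 256-token overlap is an assumption.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")

def chunk_with_offsets(text: str, max_len: int = 512, stride: int = 256):
    """Split long text into overlapping chunks, keeping character offsets."""
    enc = tokenizer(
        text,
        max_length=max_len,
        stride=stride,                     # HF semantics: tokens of overlap
        truncation=True,
        return_overflowing_tokens=True,    # one entry per sliding window
        return_offsets_mapping=True,       # (char_start, char_end) per token
    )
    chunks = []
    for ids, offsets in zip(enc["input_ids"], enc["offset_mapping"]):
        # Special tokens map to (0, 0); real tokens carry character spans,
        # which is what lets chunk scores be projected back onto sentences.
        spans = [o for o in offsets if o != (0, 0)]
        chunks.append({"input_ids": ids,
                       "char_span": (spans[0][0], spans[-1][1])})
    return chunks
```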

What you see in the results

Anatomy of a result — labeled mockup showing composition stats, color-coded attribution, calibrated confidence score, and top influential sentences

When you submit text, you get:

  • Document-level verdict — AI, Human, or Mixed, with a confidence level (high, moderate, or low) and a plain-language assessment.
  • Paragraph breakdown — each paragraph classified independently, with AI probability and burstiness scores. Paragraphs flagged as suspicious are broken down further at the sentence level.
  • Sentence-level detail — per-sentence AI probability, confidence, and attention weight (how much that sentence influenced the overall verdict).
  • Three ranked lists — the top 5 most AI-likely sentences, the top 5 most influential sentences (by attention weight), and the top 5 most human-likely sentences. The most influential sentence isn't always the most AI-like — sometimes a single very human sentence in an otherwise AI document is what pulls the confidence down.
  • Composition stats — both character-level (what percentage of the text by volume reads as AI) and sentence-level (what percentage of sentences by count are classified as AI). These can diverge meaningfully.
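
Put together, the frontend renders a result shaped roughly like the sketch below. Every field name here is hypothetical, chosen to mirror the list above rather than the actual API schema.

```python
# Hypothetical result shape; field names are illustrative, not the real API.
example_result = {
    "verdict": "Mixed",                       # AI / Human / Mixed
    "confidence": "moderate",                 # high / moderate / low
    "assessment": "Parts of this document read as AI-generated.",
    "composition": {"ai_chars_pct": 46.0, "ai_sentences_pct": 38.0},
    "paragraphs": [
        {"ai_probability": 0.83, "burstiness": 0.12,
         "sentences": [
             {"text": "…", "ai_probability": 0.91,
              "confidence": 0.74, "attention_weight": 0.19},
         ]},
    ],
    "top_ai_sentences": ["…"],                # 5 most AI-likely
    "top_influential_sentences": ["…"],       # 5 highest attention weight
    "top_human_sentences": ["…"],             # 5 most human-likely
}
```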

Calibration

One of the training decisions I'm most happy with is pairing standard cross-entropy with a Brier score loss. Cross-entropy optimizes for getting the right answer. Brier score optimizes for the model's confidence reflecting its actual accuracy: when it says "80% AI," it should be right about 80% of the time.
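
In PyTorch terms, the combined objective looks something like this minimal sketch; the 0.5 mixing weight is a placeholder, not the value used in training.

```python
import torch
import torch.nn.functional as F

def calibrated_loss(logits: torch.Tensor, labels: torch.Tensor,
                    brier_weight: float = 0.5) -> torch.Tensor:
    """Cross-entropy for accuracy plus a Brier score term for calibration.

    The 0.5 mixing weight is illustrative; the real value is a training detail.
    """
    ce = F.cross_entropy(logits, labels)
    probs = logits.softmax(dim=-1)
    one_hot = F.one_hot(labels, num_classes=logits.size(-1)).float()
    brier = ((probs - one_hot) ** 2).sum(dim=-1).mean()
    return ce + brier_weight * brier
```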

The practical effect is that the model produces genuinely uncertain predictions when text is ambiguous. Mixed documents, heavily edited AI text, or human text that happens to be stylistically flat — these get moderate confidence instead of false certainty. A detector that's confidently wrong is worse than no detector at all.

The stack

  • API: FastAPI with rate limiting (10 req/min), daily usage caps for free users, JWT auth, and subscription gating via Clerk + Stripe
  • Models: DeBERTa V3-Small for classification, GPT-2 for perplexity features — both run on CPU, no GPU required
  • Inference: Sliding window chunking with tokenizer offset mapping for character-level alignment
  • Frontend: Next.js with interactive result rendering — donut charts, color-coded text highlights, confidence badges
  • Infrastructure: Cloud Run (containerized), lazy model loading on first request

What this doesn't solve

Attention-based attribution is indicative, not causal. The importance scores show where the model looked, not necessarily why. Two different reasoning paths could produce the same attention pattern.

The three-class framing (AI/Human/Mixed) is better than binary, but it's still a simplification. In reality there's a spectrum — from "AI wrote every word" to "AI suggested a synonym the human accepted." The model doesn't capture that granularity.

And like all detectors, this is in an arms race. As language models improve and their output becomes more stylistically diverse, the statistical signatures the detector relies on get narrower. The edge comes from patterns that are perceptually invisible but statistically real — subtle distributional signatures in token frequency, sentence structure, and lexical choice.