Ian Bigford

AI Spy (Web): AI Speech Detection

6/2/2025

Overview

AI Spy is the web counterpart to the mobile app. The mobile version is built for quick, on-the-go checks — the web version is built for teams, higher-throughput workflows, and people who want to drag-and-drop files in a browser. Same detection pipeline, same safety goals, just a different surface.

I keep the core checks free for everyone. If you want to know whether a clip is AI-generated, you shouldn't need a subscription to find out. Pro users get the full transcript, word-level risk coloring, and chat — but the detection itself is open.

AI Spy Web Interface — displaying word-level risk analysis with green, yellow, and red transcript coloring alongside a per-chunk confidence timeline and aggregate AI probability curve

Why I built this

In short: I tried ElevenLabs' API in the summer of 2023 and realized this technology was likely to cross the uncanny valley in a matter of months. That holds not just for speech but for images, text, and video too. Our visual acuity, however, far exceeds our auditory acuity (we can see better than we can hear), so, wearing my black hat for a second, a deepfake-based attack should put as little signal as possible into the medium we are least attuned to scrutinize, and that is clearly audio. So I started there.

In digging further into the space, I realized that a ton of critical systems leveraged audio-only verification to authenticate customers: banks, utility companies, doctors' offices, schools, and on and on. Without updated biometric systems able to distinguish real voices from deliberate AI clones, there would be little stopping bad actors from escalating their privileges.

Not only that — research on humans' ability to detect audio deepfakes has exposed a concerning reality: people miss roughly a quarter to a half of all audio deepfakes (UF News, Systematic review and meta‑analysis).

The web platform came after the mobile app because I wanted to reach teams and enterprises who needed this in a browser — drag-and-drop a file, get a result, share it with your team.

The Uncanny Valley of Audio — side-by-side waveform comparison showing an organic, irregular human voice signal on the left versus a suspiciously perfect, uniformly periodic AI-synthesized waveform on the right

Platform and stack

  • Frontend (this repo): Next.js App Router, Tailwind, modern React patterns; responsive, accessible UI for drag‑and‑drop upload and instant results.
  • Backend (companion service): FastAPI + PyTorch inference pipeline, strict file validation, rate limiting, and storage integration.
  • Payments & auth: Stripe for subscriptions; Clerk for auth (when enabled).

Architecture overview

  • Auth bootstrap (client → FastAPI via Next proxy)

    • Client requests a short‑lived, anonymous token via POST /api/auth/token (Next route proxies to FastAPI POST /auth/token).
    • FastAPI returns a signed, HMAC token with expiry; client uses it as Authorization: Bearer <token> for all FastAPI calls.
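As a sketch of what that token flow might look like, using only the standard library. The payload fields, TTL, and secret here are assumptions for illustration, not the service's actual schema:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"change-me"  # assumption: shared HMAC secret held by FastAPI

def issue_token(ttl_seconds: int = 900) -> str:
    """Mint a short-lived anonymous token: base64(payload).base64(signature)."""
    payload = json.dumps({"exp": int(time.time()) + ttl_seconds}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(sig).decode())

def verify_token(token: str) -> bool:
    """Check the signature and expiry; reject anything tampered or stale."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except (ValueError, TypeError):
        return False
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return False
    return json.loads(payload)["exp"] > time.time()
```

The client then sends the opaque string as `Authorization: Bearer <token>`; FastAPI only needs the secret, not a database, to validate it.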
  • Upload using signed URLs (client → GCS)

    • Client asks FastAPI for a V4 GCS signed URL: POST /generate-upload-url { file_name, file_type }.
    • FastAPI sanitizes the filename, validates extension/MIME/magic bytes, generates a unique object name, and returns a PUT URL with ~10s TTL.
    • Client uploads the audio directly to GCS via HTTP PUT to the signed URL.
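A minimal sketch of the validation behind that endpoint, assuming a small illustrative allow-list of extensions and magic-byte signatures (the real service's list may differ):

```python
import os
import re
import uuid

# Assumption: illustrative subset of accepted formats and their signatures
MAGIC = {
    ".wav": b"RIFF",
    ".mp3": (b"ID3", b"\xff\xfb"),
    ".flac": b"fLaC",
}

def sanitize_filename(name: str) -> str:
    """Strip path components and anything outside a safe character set."""
    base = os.path.basename(name)
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    return base or "upload"

def unique_object_name(name: str) -> str:
    """Prefix a UUID so concurrent uploads can never collide."""
    return f"{uuid.uuid4().hex}_{sanitize_filename(name)}"

def magic_bytes_ok(header: bytes, ext: str) -> bool:
    """Check the first bytes of the file against the expected signature."""
    sigs = MAGIC.get(ext.lower())
    if sigs is None:
        return False
    if isinstance(sigs, bytes):
        sigs = (sigs,)
    return any(header.startswith(s) for s in sigs)
```

Checking magic bytes in addition to extension and MIME type matters because the latter two are entirely client-controlled.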
  • Queue background analysis (client → FastAPI → Cloud Tasks → Cloud Run worker)

    • Client starts a report: POST /report { bucket_name, file_name }.
    • FastAPI enqueues a Cloud Task (parent from GOOGLE_CLOUD_PROJECT, CLOUD_TASKS_LOCATION, CLOUD_TASKS_QUEUE) targeting WORKER_URL/process-report with a 300s deadline.
    • Cloud Tasks delivers an authenticated HTTP POST (by shared secret in this repo) to the same FastAPI service’s /process-report handler running on Cloud Run. The worker verifies X-Tasks-Secret (prefer OIDC in prod).
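The enqueued task could be shaped roughly like this. The dict mirrors what google-cloud-tasks' `create_task` accepts, but the URL, secret, and field names below are placeholders, not the service's exact payload:

```python
import json

# Assumption: env-derived values; names mirror the variables mentioned above
WORKER_URL = "https://ai-spy-api-xxxx.a.run.app"
TASKS_SHARED_SECRET = "change-me"

def build_process_report_task(bucket_name: str, file_name: str, task_id: str) -> dict:
    """Shape of the HTTP task FastAPI enqueues; google-cloud-tasks accepts
    this dict via client.create_task(parent=queue_path, task=task)."""
    body = json.dumps({"bucket_name": bucket_name,
                       "file_name": file_name,
                       "task_id": task_id}).encode()
    return {
        "http_request": {
            "http_method": "POST",  # tasks_v2.HttpMethod.POST in the real client
            "url": f"{WORKER_URL}/process-report",
            "headers": {"Content-Type": "application/json",
                        "X-Tasks-Secret": TASKS_SHARED_SECRET},
            "body": body,
        },
        "dispatch_deadline": {"seconds": 300},  # the 300s deadline noted above
    }
```

Cloud Tasks retries on non-2xx responses, which is why the worker endpoint should be idempotent per task_id.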
  • Worker processing (Cloud Run)

    • Download: Worker downloads the uploaded object from GCS to /tmp.
    • Transcription (Deepgram): Calls Deepgram “nova-2” for transcript, word‑level timestamps, sentiment, and summary (with robust fallback if SDK typing issues occur).
    • ML inference (PyTorch):
      • Resample to 16 kHz, window into non‑overlapping 3‑second chunks (deliberately no 50% overlap), compute log‑mel spectrograms, and run a CNN (DeepfakeDetectorCNN) whose sigmoid output is the AI probability.
      • For each chunk: label "AI" if prob > 0.5 else "Human"; confidence is max(prob, 1‑prob).
      • Aggregate: compute the percentage of AI/Human chunks and set the overall verdict via thresholds (AI if >60% of chunks are AI, Human if >60% are Human; otherwise Mixed/Uncertain depending on the confidence band).
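The per-chunk labeling and aggregation rules above are simple enough to sketch directly. Thresholds are taken from the description; the function names are mine:

```python
def label_chunk(prob: float) -> tuple:
    """Per-chunk decision: sigmoid output > 0.5 means AI; confidence is the
    distance from the decision boundary folded into [0.5, 1.0]."""
    label = "AI" if prob > 0.5 else "Human"
    return label, max(prob, 1.0 - prob)

def aggregate_verdict(chunk_probs: list) -> dict:
    """Combine per-3s-chunk AI probabilities into an overall verdict."""
    labels = [label_chunk(p)[0] for p in chunk_probs]
    pct_ai = 100.0 * labels.count("AI") / len(labels)
    pct_human = 100.0 - pct_ai
    if pct_ai > 60.0:
        verdict = "AI"
    elif pct_human > 60.0:
        verdict = "Human"
    else:
        verdict = "Mixed/Uncertain"
    return {"percent_ai": pct_ai, "percent_human": pct_human, "verdict": verdict}
```

Folding confidence to max(prob, 1 − prob) means a 0.51 chunk reports 51% confidence, which keeps borderline chunks visibly uncertain in the timeline.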

Log-Mel Spectrogram — a 9-second audio clip decomposed into 128 Mel frequency bins across three non-overlapping 3-second windows, with each window fed independently into the CNN for per-chunk AI probability scoring

  • Results assembly:

    • Summary stats + per‑3s timeline + overall verdict + transcription payload.
    • Persisted in‑memory under task_id for GET /report-status/{task_id}.
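A minimal stand-in for that in-memory store (illustrative; the real service likely tracks more fields per job):

```python
import time

# Assumption: simple module-level store keyed by task_id, matching
# "persisted in-memory" above; state is lost if the instance restarts.
REPORTS = {}

def start_report(task_id: str) -> None:
    REPORTS[task_id] = {"status": "processing", "created_at": time.time()}

def complete_report(task_id: str, result: dict) -> None:
    REPORTS[task_id].update(status="completed", result=result)

def report_status(task_id: str) -> dict:
    """Roughly what GET /report-status/{task_id} would serve."""
    return REPORTS.get(task_id, {"status": "not_found"})
```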
  • Client UX (poll + render)

    • Client polls GET /report-status/{task_id} until status = completed. If the user lacks an active subscription, the API returns a limited view (summary + first few timeline entries, truncated transcription).
    • The UI renders:
      • Timeline of 3‑second segments with per‑chunk confidence.
      • Transcript with word‑level risk colorization (green/yellow/red) mapped from the associated timeline confidence at each word’s timestamp.
      • Summary statistics and overall verdict.
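The word-level colorization reduces to a timestamp-to-chunk lookup. The green/yellow/red cut-points below are illustrative, not the UI's exact thresholds:

```python
def chunk_index(t: float, chunk_seconds: float = 3.0) -> int:
    """Map a word's start timestamp to its 3-second analysis window."""
    return int(t // chunk_seconds)

def word_color(word_start: float, chunk_ai_probs: list) -> str:
    """Color a transcript word from its chunk's AI probability.
    Cut-points are assumptions for illustration."""
    i = min(chunk_index(word_start), len(chunk_ai_probs) - 1)  # clamp to last chunk
    p = chunk_ai_probs[i]
    if p < 0.4:
        return "green"
    if p < 0.7:
        return "yellow"
    return "red"
```

Clamping to the last chunk handles words whose timestamps fall in a trailing partial window the model never scored.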
  • Subscription gating and chat

    • Subscription checks use a Next API route GET /api/check-subscription that verifies active Stripe subscriptions (customer search by userId metadata).
    • Pro features (full transcript, full timeline, chat) are gated for subscribers. Chat uses Gemini and enforces a per‑report quota (10 messages).
  • Security and limits

    • Strict file validation (sanitized name, MIME, and magic bytes), 40 MB max, slowapi rate‑limits per endpoint, and strict response security headers.
    • Cloud Tasks worker access is gated by X-Tasks-Secret in OSS mode; use OIDC in production.
    • GCS object age check (~60s) helps detect unauthorized uploads.
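The object-age check is a one-liner once you have the object's creation time; the ~60-second window below comes from the note above:

```python
from datetime import datetime, timedelta, timezone

MAX_OBJECT_AGE = timedelta(seconds=60)  # assumption: the ~60s window noted above

def object_age_ok(time_created: datetime, now=None) -> bool:
    """Reject GCS objects older than the expected upload window: a stale
    object suggests the report request didn't follow a fresh signed upload."""
    now = now or datetime.now(timezone.utc)
    return (now - time_created) <= MAX_OBJECT_AGE
```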

Local/dev mode (no Cloud Tasks/GCS)

  • You can call FastAPI directly:
    • POST /analyze accepts multipart file upload and returns per‑chunk predictions and aggregate verdict synchronously.
    • POST /transcribe accepts multipart file and returns transcript/sentiment. In the app, transcription is kicked off client‑side in parallel to the background report for snappy UX.

Where Cloud Run & Containerization fit

  • FastAPI runs in a container on Cloud Run (Uvicorn). Cloud Tasks targets the Cloud Run URL (WORKER_URL) to invoke /process-report.
  • Next.js can run on your preferred host (e.g., Vercel), using its API routes as a thin proxy (token issuance, Stripe interactions).
  • Containerization: package fast_api/ into an image (Python 3.9+, uvicorn app:app). Provide env vars (GOOGLE_CLOUD_PROJECT, GCS_BUCKET_NAME, CLOUD_TASKS_QUEUE, CLOUD_TASKS_LOCATION, WORKER_URL, TASKS_SHARED_SECRET, DEEPGRAM_API_KEY, GOOGLE_AI_API_KEY, Stripe keys). Deploy to Cloud Run; set min instances as needed for cold‑start mitigation.
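A containerfile for that setup might look like the following; the Python version, paths, and requirements filename are assumptions:

```dockerfile
# Illustrative Cloud Run container for the FastAPI service
FROM python:3.11-slim
WORKDIR /app
COPY fast_api/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY fast_api/ .
# Cloud Run injects $PORT; fall back to 8080 for local runs
CMD exec uvicorn app:app --host 0.0.0.0 --port ${PORT:-8080}
```

Env vars (GOOGLE_CLOUD_PROJECT, GCS_BUCKET_NAME, etc.) are then supplied at deploy time rather than baked into the image.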

Diagrams

Architecture

graph LR
  subgraph "Browser + Next.js"
    NUI["Upload UI + Timeline + Chat"]
    NXAPI["Next API routes (/api/*)"]
  end
 
  subgraph "FastAPI (Cloud Run container)"
    FAPI["FastAPI REST API"]
    WORKER["/process-report worker (same service)"]
  end
 
  subgraph "GCP"
    GCS(("Cloud Storage (GCS)"))
    TASKS(("Cloud Tasks queue"))
  end
 
  subgraph "Third‑party APIs"
    DG["Deepgram"]
    GEM["Gemini"]
    STR["Stripe"]
  end
 
  NUI --> NXAPI
  NXAPI -->|" /api/auth/token → proxies "| FAPI
  NUI -->|" Bearer token "| FAPI
  FAPI -->|" v4 signed URL "| NUI
  NUI -->|" PUT (signed URL) "| GCS
 
  NUI -->|" POST /report {bucket,file} "| FAPI
  FAPI -->|" create "| TASKS
  TASKS -->|" HTTP POST "| WORKER
  WORKER -->|" download "| GCS
  WORKER -->|" transcribe "| DG
  WORKER -->|" ML inference "| WORKER
  WORKER -->|" store status/result "| FAPI
  NUI -->|" poll /report-status/{task} "| FAPI
 
  NXAPI <--> STR
  FAPI --> GEM

Sequence

sequenceDiagram
  autonumber
  participant U as User (Browser)
  participant FE as Next.js App (Client)
  participant NX as Next API Routes (/api/*)
  participant BE as FastAPI (Cloud Run)
  participant GCS as Cloud Storage
  participant CT as Cloud Tasks
  participant WK as Worker (/process-report)
  participant DG as Deepgram
  participant STR as Stripe
 
  U->>FE: Select/drag-drop audio file
  FE->>NX: POST /api/auth/token
  NX->>BE: POST /auth/token
  BE-->>NX: { token, expires_in }
  NX-->>FE: { token }
  FE->>BE: POST /generate-upload-url (Bearer token)
  BE-->>FE: { signed_url, file_name, bucket }
  FE->>GCS: PUT file to signed_url
  par In parallel
    FE->>BE: POST /report { bucket, file_name }
    FE->>BE: POST /transcribe (multipart) [optional gating]
  and
    FE->>NX: GET /api/check-subscription
    NX->>STR: Verify customer/subscription
    STR-->>NX: Active? details
    NX-->>FE: { hasSubscription }
  end
  BE->>CT: Enqueue task → /process-report
  CT->>WK: HTTP POST (task payload)
  WK->>GCS: Download object
  WK->>WK: Normalize + 3s windows + PyTorch inference
  WK->>DG: Transcribe + sentiment + summary
  WK-->>BE: Persist in-memory job status/results
  loop Polling
    FE->>BE: GET /report-status/{taskId}
    BE-->>FE: { status | result | limited-by-plan }
  end
  FE-->>U: Timeline grid, colored transcript, summary stats, overall verdict


The generalization crisis (and what we're doing about it)

This is the hard part. Deepfake‑Eval‑2024 — a large in‑the‑wild benchmark with 56.5 hours of audio across 88 websites and 52 languages — showed that state‑of‑the‑art open‑source detectors drop sharply on real‑world data. AUC decreases by ~48% for audio compared to the academic benchmarks these models were trained on. That's a massive distribution‑shift problem. See: arXiv:2503.02857.

AUC Performance comparison — bar chart showing 95% AUC on academic benchmarks collapsing to 47% for state-of-the-art models on real-world in-the-wild audio, versus AI Spy's model recovering to 79% on the same benchmark with an EER of 0.05

This mirrors what I see in practice: performance degrades across new generators, new speakers, different microphones, different rooms, different codecs. Commercial systems do better than off‑the‑shelf open‑source models but still trail expert forensic analysts.

I'm working to close this gap by:

  • Broadening domain coverage (languages, speakers, microphones, acoustics, codecs)
  • Tracking generator churn with frequent sampling of new TTS/voice‑cloning models and prompts
  • Hard‑negative mining and adversarial augmentations
  • OOD‑first evaluation with calibrated thresholds and cohort‑aware analysis
  • Continual validation and versioned releases to catch drift quickly

The ~48% drop is a threat to this entire category. But I've managed to get access to the eval set and brought the degradation down from -48% in the original study to -17%, with an EER of 0.05. I'm working on an article outlining this research; stay tuned.
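For reference, EER (equal error rate) is the operating point where the miss rate on AI clips equals the false-alarm rate on human clips. A brute-force sketch, assuming higher scores mean "more likely AI" (fine for evaluation-sized score lists):

```python
def eer(human_scores: list, ai_scores: list) -> float:
    """Sweep candidate thresholds and return the rate where the two error
    rates cross (false accepts of AI vs. false rejects of humans)."""
    best_gap, best_rate = float("inf"), 1.0
    for thr in sorted(set(human_scores + ai_scores)):
        far = sum(s <= thr for s in ai_scores) / len(ai_scores)       # AI passed as human
        frr = sum(s > thr for s in human_scores) / len(human_scores)  # human flagged as AI
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate
```

An EER of 0.05 means that at the balanced threshold, both error rates sit around 5%.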