Ian Bigford

AI Spy (Web): AI Speech Detection

6/2/2025

Overview

AI Spy is the web counterpart to the mobile app. The mobile version is built for quick, on-the-go checks — the web version is built for teams, higher-throughput workflows, and people who want to drag-and-drop files in a browser. Same detection pipeline, same safety goals, just a different surface.

I keep the core checks free for everyone. If you want to know whether a clip is AI-generated, you shouldn't need a subscription to find out. Pro users get the full transcript, word-level risk coloring, and chat — but the detection itself is open.

AI Spy Web Interface — displaying word-level risk analysis with green, yellow, and red transcript coloring alongside a per-chunk confidence timeline and aggregate AI probability curve

Why I built this

In short: I tried ElevenLabs' API in the summer of 2023 and realized this technology was likely to cross the uncanny valley in a matter of months. That holds not just for speech but for images, text, and video too. Our visual acuity, however, far exceeds our auditory acuity (we can see better than we can hear), so, wearing my black hat for a second, a deepfake-based attack should put as little signal as possible into the medium we are least attuned to scrutinize, and that is clearly audio. So I started there.

In digging further into the space, I realized that a ton of critical systems leveraged audio-only verification to authenticate customers: banks, utility companies, doctors' offices, schools, and on and on. Without updated biometric systems able to distinguish real voices from deliberate AI clones, there would be little stopping bad actors from escalating their privileges.

Not only that — research on humans' ability to detect audio deepfakes has exposed a concerning reality: people miss roughly a quarter to a half of all audio deepfakes (UF News, Systematic review and meta‑analysis).

The web platform came after the mobile app because I wanted to reach teams and enterprises who needed this in a browser — drag-and-drop a file, get a result, share it with your team.

The Uncanny Valley of Audio — side-by-side waveform comparison showing an organic, irregular human voice signal on the left versus a suspiciously perfect, uniformly periodic AI-synthesized waveform on the right

Platform and stack

  • Frontend (this repo): Next.js App Router, Tailwind, modern React patterns; responsive, accessible UI for drag‑and‑drop upload and instant results.
  • Backend (companion service): FastAPI + PyTorch inference pipeline, strict file validation, rate limiting, and storage integration.
  • Payments & auth: Stripe for subscriptions; Clerk for auth (when enabled).

Architecture overview

  • Auth bootstrap (client → FastAPI via Next proxy)

    • Client requests a short‑lived, anonymous token via POST /api/auth/token (Next route proxies to FastAPI POST /auth/token).
    • FastAPI returns a signed, HMAC token with expiry; client uses it as Authorization: Bearer <token> for all FastAPI calls.
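As a sketch of what that token flow might look like, using only the standard library. The payload fields, TTL, and secret here are assumptions for illustration, not the service's actual schema:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"change-me"  # assumption: shared HMAC secret held by FastAPI

def issue_token(ttl_seconds: int = 900) -> str:
    """Mint a short-lived anonymous token: base64(payload).base64(signature)."""
    payload = json.dumps({"exp": int(time.time()) + ttl_seconds}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(sig).decode())

def verify_token(token: str) -> bool:
    """Check the signature and expiry; reject anything tampered or stale."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except (ValueError, TypeError):
        return False
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return False
    return json.loads(payload)["exp"] > time.time()
```

The client then sends the opaque string as `Authorization: Bearer <token>`; FastAPI only needs the secret, not a database, to validate it.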
  • Upload using signed URLs (client → GCS)

    • Client asks FastAPI for a V4 GCS signed URL: POST /generate-upload-url { file_name, file_type }.
    • FastAPI sanitizes the filename, validates extension/MIME/magic bytes, generates a unique object name, and returns a PUT URL with ~10s TTL.
    • Client uploads the audio directly to GCS via HTTP PUT to the signed URL.
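A minimal sketch of the validation behind that endpoint, assuming a small illustrative allow-list of extensions and magic-byte signatures (the real service's list may differ):

```python
import os
import re
import uuid

# Assumption: illustrative subset of accepted formats and their signatures
MAGIC = {
    ".wav": b"RIFF",
    ".mp3": (b"ID3", b"\xff\xfb"),
    ".flac": b"fLaC",
}

def sanitize_filename(name: str) -> str:
    """Strip path components and anything outside a safe character set."""
    base = os.path.basename(name)
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    return base or "upload"

def unique_object_name(name: str) -> str:
    """Prefix a UUID so concurrent uploads can never collide."""
    return f"{uuid.uuid4().hex}_{sanitize_filename(name)}"

def magic_bytes_ok(header: bytes, ext: str) -> bool:
    """Check the first bytes of the file against the expected signature."""
    sigs = MAGIC.get(ext.lower())
    if sigs is None:
        return False
    if isinstance(sigs, bytes):
        sigs = (sigs,)
    return any(header.startswith(s) for s in sigs)
```

Checking magic bytes in addition to extension and MIME type matters because the latter two are entirely client-controlled.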
  • Queue background analysis (client → FastAPI → Cloud Tasks → Cloud Run worker)

    • Client starts a report: POST /report { bucket_name, file_name }.
    • FastAPI enqueues a Cloud Task (parent from GOOGLE_CLOUD_PROJECT, CLOUD_TASKS_LOCATION, CLOUD_TASKS_QUEUE) targeting WORKER_URL/process-report with a 300s deadline.
    • Cloud Tasks delivers an authenticated HTTP POST (by shared secret in this repo) to the same FastAPI service’s /process-report handler running on Cloud Run. The worker verifies X-Tasks-Secret (prefer OIDC in prod).
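The enqueued task could be shaped roughly like this. The dict mirrors what google-cloud-tasks' `create_task` accepts, but the URL, secret, and field names below are placeholders, not the service's exact payload:

```python
import json

# Assumption: env-derived values; names mirror the variables mentioned above
WORKER_URL = "https://ai-spy-api-xxxx.a.run.app"
TASKS_SHARED_SECRET = "change-me"

def build_process_report_task(bucket_name: str, file_name: str, task_id: str) -> dict:
    """Shape of the HTTP task FastAPI enqueues; google-cloud-tasks accepts
    this dict via client.create_task(parent=queue_path, task=task)."""
    body = json.dumps({"bucket_name": bucket_name,
                       "file_name": file_name,
                       "task_id": task_id}).encode()
    return {
        "http_request": {
            "http_method": "POST",  # tasks_v2.HttpMethod.POST in the real client
            "url": f"{WORKER_URL}/process-report",
            "headers": {"Content-Type": "application/json",
                        "X-Tasks-Secret": TASKS_SHARED_SECRET},
            "body": body,
        },
        "dispatch_deadline": {"seconds": 300},  # the 300s deadline noted above
    }
```

Cloud Tasks retries on non-2xx responses, which is why the worker endpoint should be idempotent per task_id.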
  • Worker processing (Cloud Run)

    • Download: Worker downloads the uploaded object from GCS to /tmp.
    • Transcription (Deepgram): Calls Deepgram “nova-2” for transcript, word‑level timestamps, sentiment, and summary (with robust fallback if SDK typing issues occur).
    • ML inference (PyTorch):
      • Resample to 16 kHz, window into non‑overlapping 3‑second chunks (deliberately no 50% overlap), compute log‑mel spectrograms, and run a CNN (DeepfakeDetectorCNN) whose sigmoid output is the AI probability.
      • For each chunk: label "AI" if prob > 0.5 else "Human"; confidence is max(prob, 1‑prob).
      • Aggregate: compute the percentage of AI/Human chunks and set the overall verdict via thresholds (AI if >60% of chunks are AI, Human if >60% are Human; otherwise Mixed/Uncertain depending on the confidence band).
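The per-chunk labeling and aggregation rules above are simple enough to sketch directly. Thresholds are taken from the description; the function names are mine:

```python
def label_chunk(prob: float) -> tuple:
    """Per-chunk decision: sigmoid output > 0.5 means AI; confidence is the
    distance from the decision boundary folded into [0.5, 1.0]."""
    label = "AI" if prob > 0.5 else "Human"
    return label, max(prob, 1.0 - prob)

def aggregate_verdict(chunk_probs: list) -> dict:
    """Combine per-3s-chunk AI probabilities into an overall verdict."""
    labels = [label_chunk(p)[0] for p in chunk_probs]
    pct_ai = 100.0 * labels.count("AI") / len(labels)
    pct_human = 100.0 - pct_ai
    if pct_ai > 60.0:
        verdict = "AI"
    elif pct_human > 60.0:
        verdict = "Human"
    else:
        verdict = "Mixed/Uncertain"
    return {"percent_ai": pct_ai, "percent_human": pct_human, "verdict": verdict}
```

Folding confidence to max(prob, 1 − prob) means a 0.51 chunk reports 51% confidence, which keeps borderline chunks visibly uncertain in the timeline.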

Log-Mel Spectrogram — a 9-second audio clip decomposed into 128 Mel frequency bins across three non-overlapping 3-second windows, with each window fed independently into the CNN for per-chunk AI probability scoring

  • Results assembly:

    • Summary stats + per‑3s timeline + overall verdict + transcription payload.
    • Persisted in‑memory under task_id for GET /report-status/{task_id}.
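A minimal stand-in for that in-memory store (illustrative; the real service likely tracks more fields per job):

```python
import time

# Assumption: simple module-level store keyed by task_id, matching
# "persisted in-memory" above; state is lost if the instance restarts.
REPORTS = {}

def start_report(task_id: str) -> None:
    REPORTS[task_id] = {"status": "processing", "created_at": time.time()}

def complete_report(task_id: str, result: dict) -> None:
    REPORTS[task_id].update(status="completed", result=result)

def report_status(task_id: str) -> dict:
    """Roughly what GET /report-status/{task_id} would serve."""
    return REPORTS.get(task_id, {"status": "not_found"})
```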
  • Client UX (poll + render)

    • Client polls GET /report-status/{task_id} until status = completed. If the user lacks an active subscription, the API returns a limited view (summary + first few timeline entries, truncated transcription).
    • The UI renders:
      • Timeline of 3‑second segments with per‑chunk confidence.
      • Transcript with word‑level risk colorization (green/yellow/red) mapped from the associated timeline confidence at each word’s timestamp.
      • Summary statistics and overall verdict.
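The word-level colorization reduces to a timestamp-to-chunk lookup. The green/yellow/red cut-points below are illustrative, not the UI's exact thresholds:

```python
def chunk_index(t: float, chunk_seconds: float = 3.0) -> int:
    """Map a word's start timestamp to its 3-second analysis window."""
    return int(t // chunk_seconds)

def word_color(word_start: float, chunk_ai_probs: list) -> str:
    """Color a transcript word from its chunk's AI probability.
    Cut-points are assumptions for illustration."""
    i = min(chunk_index(word_start), len(chunk_ai_probs) - 1)  # clamp to last chunk
    p = chunk_ai_probs[i]
    if p < 0.4:
        return "green"
    if p < 0.7:
        return "yellow"
    return "red"
```

Clamping to the last chunk handles words whose timestamps fall in a trailing partial window the model never scored.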
  • Subscription gating and chat

    • Subscription checks use a Next API route GET /api/check-subscription that verifies active Stripe subscriptions (customer search by userId metadata).
    • Pro features (full transcript, full timeline, chat) are gated for subscribers. Chat uses Gemini and enforces a per‑report quota (10 messages).
  • Security and limits

    • Strict file validation (sanitized name, MIME, and magic bytes), 40 MB max, slowapi rate‑limits per endpoint, and strict response security headers.
    • Cloud Tasks worker access is gated by X-Tasks-Secret in OSS mode; use OIDC in production.
    • GCS object age check (~60s) helps detect unauthorized uploads.
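The object-age check is a one-liner once you have the object's creation time; the ~60-second window below comes from the note above:

```python
from datetime import datetime, timedelta, timezone

MAX_OBJECT_AGE = timedelta(seconds=60)  # assumption: the ~60s window noted above

def object_age_ok(time_created: datetime, now=None) -> bool:
    """Reject GCS objects older than the expected upload window: a stale
    object suggests the report request didn't follow a fresh signed upload."""
    now = now or datetime.now(timezone.utc)
    return (now - time_created) <= MAX_OBJECT_AGE
```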

Local/dev mode (no Cloud Tasks/GCS)

  • You can call FastAPI directly:
    • POST /analyze accepts multipart file upload and returns per‑chunk predictions and aggregate verdict synchronously.
    • POST /transcribe accepts multipart file and returns transcript/sentiment. In the app, transcription is kicked off client‑side in parallel to the background report for snappy UX.

Where Cloud Run & Containerization fit

  • FastAPI runs in a container on Cloud Run (Uvicorn). Cloud Tasks targets the Cloud Run URL (WORKER_URL) to invoke /process-report.
  • Next.js can run on your preferred host (e.g., Vercel), using its API routes as a thin proxy (token issuance, Stripe interactions).
  • Containerization: package fast_api/ into an image (Python 3.9+, uvicorn app:app). Provide env vars (GOOGLE_CLOUD_PROJECT, GCS_BUCKET_NAME, CLOUD_TASKS_QUEUE, CLOUD_TASKS_LOCATION, WORKER_URL, TASKS_SHARED_SECRET, DEEPGRAM_API_KEY, GOOGLE_AI_API_KEY, Stripe keys). Deploy to Cloud Run; set min instances as needed for cold‑start mitigation.
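A containerfile for that setup might look like the following; the Python version, paths, and requirements filename are assumptions:

```dockerfile
# Illustrative Cloud Run container for the FastAPI service
FROM python:3.11-slim
WORKDIR /app
COPY fast_api/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY fast_api/ .
# Cloud Run injects $PORT; fall back to 8080 for local runs
CMD exec uvicorn app:app --host 0.0.0.0 --port ${PORT:-8080}
```

Env vars (GOOGLE_CLOUD_PROJECT, GCS_BUCKET_NAME, etc.) are then supplied at deploy time rather than baked into the image.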

Diagrams

Architecture

graph LR
  subgraph "Browser + Next.js"
    NUI["Upload UI + Timeline + Chat"]
    NXAPI["Next API routes (/api/*)"]
  end
 
  subgraph "FastAPI (Cloud Run container)"
    FAPI["FastAPI REST API"]
    WORKER["/process-report worker (same service)"]
  end
 
  subgraph "GCP"
    GCS(("Cloud Storage (GCS)"))
    TASKS(("Cloud Tasks queue"))
  end
 
  subgraph "Third‑party APIs"
    DG["Deepgram"]
    GEM["Gemini"]
    STR["Stripe"]
  end
 
  NUI --> NXAPI
  NXAPI -->|" /api/auth/token → proxies "| FAPI
  NUI -->|" Bearer token "| FAPI
  FAPI -->|" v4 signed URL "| NUI
  NUI -->|" PUT (signed URL) "| GCS
 
  NUI -->|" POST /report {bucket,file} "| FAPI
  FAPI -->|" create "| TASKS
  TASKS -->|" HTTP POST "| WORKER
  WORKER -->|" download "| GCS
  WORKER -->|" transcribe "| DG
  WORKER -->|" ML inference "| WORKER
  WORKER -->|" store status/result "| FAPI
  NUI -->|" poll /report-status/{task} "| FAPI
 
  NXAPI <--> STR
  FAPI --> GEM

Sequence

sequenceDiagram
  autonumber
  participant U as User (Browser)
  participant FE as Next.js App (Client)
  participant NX as Next API Routes (/api/*)
  participant BE as FastAPI (Cloud Run)
  participant GCS as Cloud Storage
  participant CT as Cloud Tasks
  participant WK as Worker (/process-report)
  participant DG as Deepgram
  participant STR as Stripe
 
  U->>FE: Select/drag-drop audio file
  FE->>NX: POST /api/auth/token
  NX->>BE: POST /auth/token
  BE-->>NX: { token, expires_in }
  NX-->>FE: { token }
  FE->>BE: POST /generate-upload-url (Bearer token)
  BE-->>FE: { signed_url, file_name, bucket }
  FE->>GCS: PUT file to signed_url
  par In parallel
    FE->>BE: POST /report { bucket, file_name }
    FE->>BE: POST /transcribe (multipart) [optional gating]
  and
    FE->>NX: GET /api/check-subscription
    NX->>STR: Verify customer/subscription
    STR-->>NX: Active? details
    NX-->>FE: { hasSubscription }
  end
  BE->>CT: Enqueue task → /process-report
  CT->>WK: HTTP POST (task payload)
  WK->>GCS: Download object
  WK->>WK: Normalize + 3s windows + PyTorch inference
  WK->>DG: Transcribe + sentiment + summary
  WK-->>BE: Persist in-memory job status/results
  loop Polling
    FE->>BE: GET /report-status/{taskId}
    BE-->>FE: { status | result | limited-by-plan }
  end
  FE-->>U: Timeline grid, colored transcript, summary stats, overall verdict


The generalization crisis (and what we're doing about it)

This is the hard part. Deepfake‑Eval‑2024 — a large in‑the‑wild benchmark with 56.5 hours of audio across 88 websites and 52 languages — showed that state‑of‑the‑art open‑source detectors drop sharply on real‑world data. AUC decreases by ~48% for audio compared to the academic benchmarks these models were trained on. That's a massive distribution‑shift problem. See: arXiv:2503.02857.

AUC Performance comparison — bar chart showing 95% AUC on academic benchmarks collapsing to 47% for state-of-the-art models on real-world in-the-wild audio, versus AI Spy's model recovering to 79% on the same benchmark with an EER of 0.05

This mirrors what I see in practice: performance degrades across new generators, new speakers, different microphones, different rooms, different codecs. Commercial systems do better than off‑the‑shelf open‑source models but still trail expert forensic analysts.

I'm working to close this gap by:

  • Broadening domain coverage (languages, speakers, microphones, acoustics, codecs)
  • Tracking generator churn with frequent sampling of new TTS/voice‑cloning models and prompts
  • Hard‑negative mining and adversarial augmentations
  • OOD‑first evaluation with calibrated thresholds and cohort‑aware analysis
  • Continual validation and versioned releases to catch drift quickly

The ~48% drop is a threat to this entire category. But I've managed to get access to the eval set and brought the degradation down from -48% in the original study to -17%, with an EER of 0.05. I'm working on an article outlining this research; stay tuned.
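For reference, EER (equal error rate) is the operating point where the miss rate on AI clips equals the false-alarm rate on human clips. A brute-force sketch, assuming higher scores mean "more likely AI" (fine for evaluation-sized score lists):

```python
def eer(human_scores: list, ai_scores: list) -> float:
    """Sweep candidate thresholds and return the rate where the two error
    rates cross (false accepts of AI vs. false rejects of humans)."""
    best_gap, best_rate = float("inf"), 1.0
    for thr in sorted(set(human_scores + ai_scores)):
        far = sum(s <= thr for s in ai_scores) / len(ai_scores)       # AI passed as human
        frr = sum(s > thr for s in human_scores) / len(human_scores)  # human flagged as AI
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate
```

An EER of 0.05 means that at the balanced threshold, both error rates sit around 5%.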