# AI Spy (Web): AI Speech Detection

## Overview
AI Spy is a production-grade AI safety platform focused on detecting AI-generated speech. Among deepfakes, synthetic audio is particularly pernicious: human auditory acuity is far less precise than visual acuity, while many critical services (banking, customer support, contact centers, and high‑risk approvals) are mediated by voice. That asymmetry makes voice‑based impersonation both harder for people to spot and easier for attackers to exploit.
To help counter this risk, AI Spy provides free checks for everyone. The goal is simple: broaden access to trustworthy detection so we can collectively reduce the harm from an increasingly capable wave of AI speech deepfakes.
## Why did I build this?
In short, I tried ElevenLabs' API in the summer of 2023 and realized that this technology was likely to cross the uncanny valley in a matter of months. Of course, this isn't just true for speech; it applies to images, text, and video as well. However, our visual acuity far exceeds our auditory acuity (we can see better than we can hear), so if I were to wear my black hat for a second, a deepfake‑based attack should provide as little signal as possible in the medium we are least attuned to scrutinize, and that is clearly audio‑only. So I started there.
In digging further into the space, I realized that a ton of critical systems leveraged audio‑only verification to authenticate customers. Banks, utility companies, doctors' offices, schools — the list went on. Without updated biometric systems that could distinguish real voices from the ones AI systems deliberately clone, there would be little stopping bad actors from escalating their privileges.
Not only that — research on humans' ability to detect audio deepfakes has exposed a concerning reality: people miss roughly a quarter to a half of all audio deepfakes (UF News, Systematic review and meta‑analysis).
The web platform exists to make this capability broadly accessible in the browser for teams and enterprises, while still offering free checks for the public. It shares the same core safety goals as mobile, but is designed for higher‑throughput workflows and collaboration.
## Platform and stack
- Frontend (this repo): Next.js App Router, Tailwind, modern React patterns; responsive, accessible UI for drag‑and‑drop upload and instant results.
- Backend (companion service): FastAPI + PyTorch inference pipeline, strict file validation, rate limiting, and storage integration.
- Payments & auth: Stripe for subscriptions; Clerk for auth (when enabled).
## Architecture overview
- **Auth bootstrap (client → FastAPI via Next proxy)**
  - Client requests a short‑lived, anonymous token via `POST /api/auth/token` (the Next route proxies to FastAPI `POST /auth/token`).
  - FastAPI returns a signed HMAC token with an expiry; the client sends it as `Authorization: Bearer <token>` on all FastAPI calls.
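The HMAC token flow can be sketched roughly as follows. This is a minimal stdlib sketch, not the repo's actual code; the secret name, payload shape, and TTL are assumptions:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"change-me"  # hypothetical; the real service would load this from env

def issue_token(ttl_seconds: int = 900) -> str:
    """Create a short-lived anonymous token: base64(payload).hexsig."""
    payload = json.dumps({"exp": int(time.time()) + ttl_seconds}).encode()
    body = base64.urlsafe_b64encode(payload).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"

def verify_token(token: str) -> bool:
    """Check the signature first, then the expiry; reject anything malformed."""
    try:
        body, sig = token.rsplit(".", 1)
        expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            return False
        payload = json.loads(base64.urlsafe_b64decode(body))
        return payload["exp"] > time.time()
    except (ValueError, KeyError):
        return False
```

`hmac.compare_digest` is used so signature checks run in constant time.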
- **Upload using signed URLs (client → GCS)**
  - Client asks FastAPI for a V4 GCS signed URL: `POST /generate-upload-url { file_name, file_type }`.
  - FastAPI sanitizes the filename, validates extension/MIME/magic bytes, generates a unique object name, and returns a PUT URL with a ~10 s TTL.
  - Client uploads the audio directly to GCS via HTTP PUT to the signed URL.
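The sanitize-and-uniquify step can be sketched like this (the allowed-extension list and length cap are assumptions, not the service's exact values):

```python
import re
import uuid

ALLOWED_EXTENSIONS = {"mp3", "wav", "m4a", "flac", "ogg"}  # assumed list

def sanitize_object_name(file_name: str) -> str:
    """Strip path components and unsafe characters, enforce an allowed
    extension, and prefix a UUID so object names never collide."""
    base = file_name.replace("\\", "/").rsplit("/", 1)[-1]  # drop any path
    stem, _, ext = base.rpartition(".")
    if not stem or ext.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {file_name!r}")
    safe_stem = re.sub(r"[^A-Za-z0-9._-]", "_", stem)[:100]
    return f"{uuid.uuid4().hex}_{safe_stem}.{ext.lower()}"
```

Prefixing a UUID means two users uploading `song.mp3` never overwrite each other's objects.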
- **Queue background analysis (client → FastAPI → Cloud Tasks → Cloud Run worker)**
  - Client starts a report: `POST /report { bucket_name, file_name }`.
  - FastAPI enqueues a Cloud Task (queue parent built from `GOOGLE_CLOUD_PROJECT`, `CLOUD_TASKS_LOCATION`, `CLOUD_TASKS_QUEUE`) targeting `WORKER_URL/process-report` with a 300 s deadline.
  - Cloud Tasks delivers an authenticated HTTP POST (by shared secret in this repo) to the same FastAPI service's `/process-report` handler running on Cloud Run. The worker verifies `X-Tasks-Secret` (prefer OIDC in production).
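The worker-side shared-secret check amounts to a constant-time header comparison. A minimal sketch, assuming the secret arrives via the `TASKS_SHARED_SECRET` env var mentioned later (as noted above, OIDC is the better choice in production):

```python
import hmac
import os

# Falls back to a dev value when the env var is unset (local mode only)
SHARED_SECRET = os.environ.get("TASKS_SHARED_SECRET", "dev-secret")

def is_authorized_task(headers: dict) -> bool:
    """Accept the request only if X-Tasks-Secret matches the shared secret.
    hmac.compare_digest avoids leaking the match length via timing."""
    supplied = headers.get("X-Tasks-Secret", "")
    return hmac.compare_digest(supplied, SHARED_SECRET)
```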
- **Worker processing (Cloud Run)**
  - Download: the worker downloads the uploaded object from GCS to `/tmp`.
  - Transcription (Deepgram): calls Deepgram "nova-2" for the transcript, word‑level timestamps, sentiment, and summary (with a robust fallback if SDK typing issues occur).
  - ML inference (PyTorch): resample to 16 kHz, window into 3‑second non‑overlapping chunks (no 50% overlap), compute log‑mel spectrograms, and run a CNN (`DeepfakeDetectorCNN`) with a sigmoid output as the AI probability. For each chunk: label "AI" if prob > 0.5, else "Human"; confidence is max(prob, 1 − prob).
  - Aggregate: compute percent AI/Human; set the overall verdict via thresholds (AI if > 60% AI chunks, Human if > 60% Human chunks; otherwise Mixed/Uncertain by confidence band).
  - Results assembly: summary stats + per‑3 s timeline + overall verdict + transcription payload, persisted in memory under `task_id` for `GET /report-status/{task_id}`.
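The labeling and aggregation logic can be sketched without the model itself (here `probs` stands in for the CNN's per-chunk sigmoid outputs; the 0.7 confidence-band cutoff for Mixed vs. Uncertain is an illustrative assumption):

```python
def label_chunks(probs):
    """Per 3-second chunk: 'AI' if prob > 0.5 else 'Human';
    confidence is max(prob, 1 - prob)."""
    return [
        {"label": "AI" if p > 0.5 else "Human", "confidence": max(p, 1 - p)}
        for p in probs
    ]

def overall_verdict(chunks):
    """Aggregate thresholds as described above: >60% of chunks decides;
    otherwise fall back to a confidence band for Mixed/Uncertain."""
    n = len(chunks)
    ai_share = sum(c["label"] == "AI" for c in chunks) / n
    if ai_share > 0.6:
        return "AI"
    if (1 - ai_share) > 0.6:
        return "Human"
    mean_conf = sum(c["confidence"] for c in chunks) / n
    return "Mixed" if mean_conf >= 0.7 else "Uncertain"
```

For example, sigmoid outputs of `[0.9, 0.8, 0.7, 0.2]` give 75% AI chunks and therefore an overall "AI" verdict.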
- **Client UX (poll + render)**
  - Client polls `GET /report-status/{task_id}` until status = completed. If the user lacks an active subscription, the API returns a limited view (summary + the first few timeline entries, truncated transcription).
  - The UI renders:
    - a timeline of 3‑second segments with per‑chunk confidence,
    - a transcript with word‑level risk colorization (green/yellow/red) mapped from the associated timeline confidence at each word's timestamp, and
    - summary statistics with the overall verdict.
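Mapping word timestamps onto the per-chunk timeline works out to an index lookup. A sketch (the color band boundaries are illustrative, not the app's exact values; `timeline` holds per-chunk AI probabilities):

```python
def risk_color(ai_prob: float) -> str:
    """Traffic-light color for a chunk's AI probability."""
    if ai_prob >= 0.7:
        return "red"
    if ai_prob >= 0.4:
        return "yellow"
    return "green"

def colorize_words(words, timeline, chunk_seconds=3.0):
    """Attach a color to each (word, start_seconds) pair using the AI
    probability of the 3-second chunk the word's timestamp falls into."""
    out = []
    for word, start in words:
        idx = min(int(start // chunk_seconds), len(timeline) - 1)
        out.append((word, risk_color(timeline[idx])))
    return out
```

Clamping the index handles words whose timestamps run slightly past the last analyzed chunk.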
- **Subscription gating and chat**
  - Subscription checks use a Next API route, `GET /api/check-subscription`, that verifies active Stripe subscriptions (customer search by `userId` metadata).
  - Pro features (full transcript, full timeline, chat) are gated to subscribers. Chat uses Gemini and enforces a per‑report quota (10 messages).
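Shaping the limited free-tier view of a report is plain dictionary surgery. A sketch, with placeholder limits (the actual free-tier entry and character counts are not specified here):

```python
def limit_report(report: dict, has_subscription: bool,
                 free_timeline_entries: int = 3,
                 free_transcript_chars: int = 200) -> dict:
    """Return the full report for subscribers; otherwise a truncated view
    with only the summary, a few timeline entries, and a clipped transcript."""
    if has_subscription:
        return report
    return {
        "summary": report["summary"],
        "timeline": report["timeline"][:free_timeline_entries],
        "transcription": report["transcription"][:free_transcript_chars],
        "limited": True,
    }
```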
- **Security and limits**
  - Strict file validation (sanitized name, MIME, and magic bytes), a 40 MB max, slowapi rate limits per endpoint, and strict response security headers.
  - Cloud Tasks worker access is gated by `X-Tasks-Secret` in OSS mode; use OIDC in production.
  - A GCS object age check (~60 s) helps detect unauthorized uploads.
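A magic-byte check inspects the file's first bytes rather than trusting its extension. A minimal sketch covering a few common containers (the real validator may check more formats and cross-check the MIME type):

```python
def looks_like_audio(first_bytes: bytes) -> bool:
    """Cheap magic-byte sniff: WAV (RIFF), MP3 (ID3 tag or frame sync),
    FLAC, and OGG signatures."""
    if first_bytes.startswith(b"RIFF"):   # WAV container
        return True
    if first_bytes.startswith(b"ID3"):    # MP3 with ID3v2 tag
        return True
    if first_bytes[:2] in (b"\xff\xfb", b"\xff\xf3", b"\xff\xf2"):  # MP3 frame sync
        return True
    if first_bytes.startswith(b"fLaC"):   # FLAC
        return True
    if first_bytes.startswith(b"OggS"):   # OGG
        return True
    return False
```

An attacker renaming `malware.exe` to `clip.mp3` fails this check because PE executables start with `MZ`.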
## Local/dev mode (no Cloud Tasks/GCS)

You can call FastAPI directly:

- `POST /analyze` accepts a multipart file upload and returns per‑chunk predictions and an aggregate verdict synchronously.
- `POST /transcribe` accepts a multipart file and returns the transcript/sentiment. In the app, transcription is kicked off client‑side in parallel with the background report for a snappy UX.
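For exercising these endpoints from a script without extra dependencies, a multipart body can be built by hand with the stdlib. A sketch (the field name `file`, content type, and localhost port are assumptions):

```python
import io
import uuid

def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "audio/mpeg"):
    """Build a multipart/form-data body and matching Content-Type header,
    suitable for POSTing to /analyze or /transcribe with urllib.request."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    body.write(f"--{boundary}\r\n".encode())
    body.write(
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    body.write(f"Content-Type: {content_type}\r\n\r\n".encode())
    body.write(data)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return body.getvalue(), headers
```

Usage would look like `urllib.request.Request("http://localhost:8000/analyze", data=body, headers=headers, method="POST")`, assuming the default Uvicorn port.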
## Where Cloud Run & Containerization fit

- FastAPI runs in a container on Cloud Run (Uvicorn). Cloud Tasks targets the Cloud Run URL (`WORKER_URL`) to invoke `/process-report`.
- Next.js can run on your preferred host (e.g., Vercel), using its API routes as a thin proxy (token issuance, Stripe interactions).
- Containerization: package `fast_api/` into an image (Python 3.9+, `uvicorn app:app`). Provide env vars (`GOOGLE_CLOUD_PROJECT`, `GCS_BUCKET_NAME`, `CLOUD_TASKS_QUEUE`, `CLOUD_TASKS_LOCATION`, `WORKER_URL`, `TASKS_SHARED_SECRET`, `DEEPGRAM_API_KEY`, `GOOGLE_AI_API_KEY`, Stripe keys). Deploy to Cloud Run; set min instances as needed for cold‑start mitigation.
## Diagrams

Architecture:

```mermaid
graph LR
  subgraph "Browser + Next.js"
    NUI["Upload UI + Timeline + Chat"]
    NXAPI["Next API routes (/api/*)"]
  end
  subgraph "FastAPI (Cloud Run container)"
    FAPI["FastAPI REST API"]
    WORKER["/process-report worker (same service)"]
  end
  subgraph "GCP"
    GCS(("Cloud Storage (GCS)"))
    TASKS(("Cloud Tasks queue"))
  end
  subgraph "Third‑party APIs"
    DG["Deepgram"]
    GEM["Gemini"]
    STR["Stripe"]
  end
  NUI --> NXAPI
  NXAPI -->|" /api/auth/token → proxies "| FAPI
  NUI -->|" Bearer token "| FAPI
  FAPI -->|" v4 signed URL "| NUI
  NUI -->|" PUT (signed URL) "| GCS
  NUI -->|" POST /report {bucket,file} "| FAPI
  FAPI -->|" create "| TASKS
  TASKS -->|" HTTP POST "| WORKER
  WORKER -->|" download "| GCS
  WORKER -->|" transcribe "| DG
  WORKER -->|" ML inference "| WORKER
  WORKER -->|" store status/result "| FAPI
  NUI -->|" poll /report-status/{task} "| FAPI
  NXAPI <--> STR
  FAPI --> GEM
```
Sequence:

```mermaid
sequenceDiagram
  autonumber
  participant U as User (Browser)
  participant FE as Next.js App (Client)
  participant NX as Next API Routes (/api/*)
  participant BE as FastAPI (Cloud Run)
  participant GCS as Cloud Storage
  participant CT as Cloud Tasks
  participant WK as Worker (/process-report)
  participant DG as Deepgram
  participant STR as Stripe
  U->>FE: Select/drag-drop audio file
  FE->>NX: POST /api/auth/token
  NX->>BE: POST /auth/token
  BE-->>NX: { token, expires_in }
  NX-->>FE: { token }
  FE->>BE: POST /generate-upload-url (Bearer token)
  BE-->>FE: { signed_url, file_name, bucket }
  FE->>GCS: PUT file to signed_url
  par In parallel
    FE->>BE: POST /report { bucket, file_name }
    FE->>BE: POST /transcribe (multipart) [optional gating]
  and
    FE->>NX: GET /api/check-subscription
    NX->>STR: Verify customer/subscription
    STR-->>NX: Active? details
    NX-->>FE: { hasSubscription }
  end
  BE->>CT: Enqueue task → /process-report
  CT->>WK: HTTP POST (task payload)
  WK->>GCS: Download object
  WK->>WK: Normalize + 3s windows + PyTorch inference
  WK->>DG: Transcribe + sentiment + summary
  WK-->>BE: Persist in-memory job status/results
  loop Polling
    FE->>BE: GET /report-status/{taskId}
    BE-->>FE: { status | result | limited-by-plan }
  end
  FE-->>U: Timeline grid, colored transcript, summary stats, overall verdict
```
See the codebases:

- Web app: `ibiggy9/ai-spy-web` — github.com/ibiggy9/ai-spy-web
- Mobile app: `ibiggy9/ai-spy-mobile-public` — github.com/ibiggy9/ai-spy-mobile-public
## The generalization crisis
Deepfake‑Eval‑2024 — a large in‑the‑wild benchmark with 56.5 hours of audio, 45 hours of video, and 1,975 images spanning 88 websites and 52 languages — shows that state‑of‑the‑art open‑source detectors drop sharply on real‑world data. Compared to prior academic benchmarks, AUC decreases by ~48% for audio (50% for video, 45% for images), underscoring a substantial distribution‑shift problem. See: arXiv:2503.02857.
This mirrors what we see in practice: performance can degrade across new generators, prompts, speakers, microphones, rooms, and codecs. Commercial systems and models fine‑tuned on Deepfake‑Eval‑2024 perform better than off‑the‑shelf open‑source models but still trail expert forensic analysts [arXiv:2503.02857].
I'm working to close this gap by:
- Broadening domain coverage (languages, speakers, microphones, acoustics, codecs)
- Tracking generator churn with frequent sampling of new TTS/voice‑cloning models and prompts
- Hard‑negative mining and adversarial augmentations
- OOD‑first evaluation with calibrated thresholds and cohort‑aware analysis
- Continual validation and versioned releases to catch drift quickly
The ~48% decrease in performance is a threat to this entire category.
## Where Ai-SPY Shines
With Ai-SPY, we trained on our custom curated dataset to a competitive EER of 0.05, and on the same dataset used in the 2024 eval study we reduced the degradation from the −48% reported there to −17%, demonstrating improved generalization. I am working on an article outlining this research. Stay tuned.
## Links
- Web repository: ibiggy9/ai-spy-web
- Mobile repository: ibiggy9/ai-spy-mobile-public
- Deepfake‑Eval‑2024 in‑the‑wild benchmark: arXiv:2503.02857
- Study on generalization degradation (2024): arXiv:2308.04177
- Human ability to detect audio deepfakes (news): UF News
- Human deepfake detection performance (systematic review): ResearchGate