Ian Bigford

Ai-SPY (Mobile) — AI Speech Detection on iOS

4/10/2024

Overview

While the web version is an enterprise‑grade product, Ai‑SPY Mobile is a consumer‑grade iOS app that lets people verify audio on the go. It connects to the same detection pipeline and is optimized for quick capture, upload, and instant results. The underlying model is trained on a custom dataset targeting "in‑the‑wild" audio from many languages and sources, hardened across codecs and strengthened with augmentation and corruption during training.

Why I built this

In short, I tried ElevenLabs' API in the summer of 2023 and realized that this technology was likely to cross the uncanny valley in a matter of months. Of course, this isn't just true for speech, but also for images, text, and video. However, our visual acuity far exceeds our auditory acuity: we can see better than we can hear. Wearing my black hat for a second, then, any deepfake-based attack should aim to provide as little signal as possible in the medium we are least attuned to scrutinize, and that is clearly audio alone. So I started there.

In digging further into the space, I realized that there were a ton of critical systems that leveraged audio-only verification to authenticate customers. Banks, utility companies, doctors' offices, schools - the list went on. Without updated biometric systems that could distinguish genuine voices from the clones AI systems deliberately produce, there would be little stopping bad actors from escalating their privileges.

Not only that — research on humans' ability to detect audio deepfakes has exposed a concerning reality: people miss roughly a quarter to a half of all audio deepfakes (UF News, Systematic review and meta-analysis).

Why speech deepfakes matter for AI safety

  • Lower human perceptual defenses: We are much better at catching visual artifacts than subtle audio manipulations.
  • High‑stakes, voice‑mediated workflows: Banking, healthcare triage, enterprise approvals, and emergency dispatch still rely on voice.
  • Economic and social damage: Impersonation and social engineering via synthetic speech have rapidly growing real‑world impact.

Platform and stack

  • Frontend (this app): Expo React Native (iOS) with a fast, accessible UI for recording or uploading audio and viewing results.
  • Backend (companion service): FastAPI + PyTorch inference pipeline with strict validation, rate limiting, and storage integration.
  • Distribution: App Store listing with regular updates.

Architecture overview (code-accurate)

  • Backend: FastAPI (api-mobile-app/app.py) with:

    • Auth: POST /auth/token issues HMAC-signed bearer tokens (base64 of client_id|expiry|timestamp|signature) using JWT_SECRET; a token sketch appears after this list.
    • File analysis: POST /analyze accepts multipart/form-data file, validates, runs PyTorch model (AudioInference.analyze_file), returns immediate per-3s results.
    • Signed upload: POST /generate-upload-url returns a v4 GCS signed PUT URL (10s expiry) after filename/content-type validation.
    • Background job: POST /report enqueues Cloud Tasks → POST /process-report does download + inference + optional Deepgram transcription, stores results in jobs dict.
    • Polling: GET /report-status/{task_id}?has_subscription=… returns full or limited data based on tier.
    • Transcription direct: POST /transcribe uploads a file, Deepgram ASR; free tier limited to first 50 words.
    • Chat: POST /chat?has_subscription=…&task_id=… with optional analysis_data from client; free users receive gating message; pro users tracked per-report (10 message limit).
  • Inference pipeline (api-mobile-app/audio_processor.py):

    • Load audio with librosa.load(sr=None) then resample to 16 kHz.
    • Windowing: fixed non-overlapping 3-second chunks (target_sr 16000 × 3 s).
    • Features: log-mel spectrogram via torchaudio.transforms.MelSpectrogram (n_fft=512, hop=160, n_mels=128, f_min=20, f_max=8000, Slaney scale), then torch.log(mel + 1e-9).
    • Model: DeepfakeDetectorCNN (3 conv blocks + FC; sigmoid output yields AI probability).
    • Per-chunk classification: probability > 0.5 → AI else Human; confidence is max(prob_ai, 1 - prob_ai).
    • Aggregation: counts, percent_ai, percent_human, aggregate_confidence = mean(chunk confidences as defined above). Overall label:
      • AI if percent_ai > 60
      • Human if percent_human > 60
      • Uncertain if 40 ≤ aggregate_confidence ≤ 60
      • Mixed otherwise
    • Output: status, overall_prediction, aggregate_confidence, arrays of predictions[] and confidences[], counts and percentages. A sketch of this per-chunk pipeline appears after this list.
  • Transcription (transcribe_audio_file in app.py):

    • Primary: Deepgram SDK nova-2 with smart_format, diarize, summarize v2, topics, sentiment.
    • Fallback: Raw HTTP if SDK union type issues; parses transcripts, words with timestamps, average sentiment, summary.
    • Returns uniform structure: text, words[{word,start,end,confidence}], average_sentiment, summary.
  • Security and ops:

    • Strict file validation: extension allowlist, MIME check, magic-byte heuristics (MP3/WAV), filename sanitization, size cap (40 MB).
    • Security headers, CORS from env (ALLOWED_ORIGINS), CSP restricting external connects to Deepgram and Google Generative AI.
    • Rate limiting via slowapi.
    • jobs in-memory store holds job lifecycle and chat usage counters.
  • Mobile app (Expo RN, ai-spy-mobile-app):

    • Auth and requests: Components/enhancedApiService.js
      • Generates/validates auth token; stores in expo-secure-store.
      • All requests require HTTPS and Bearer token.
      • Analyze direct: POST /analyze (form-data file).
      • Signed URL workflow: /generate-upload-url → PUT to GCS → /report → poll /report-status/{id}.
      • Tier-aware: free users see full timeline but no transcription; pro users get transcription and chat.
      • Chat: includes full analysis context optionally (analysis_data) to improve responses.
    • UI:
      • Screens/Home.js: upload audio; uses signed URL flow; shows Results on completion.
      • Screens/EnterLink.js: link-based processing via enhanced service; uses same results view.
      • Components/Results.js: computes timeline view model from various backend formats; renders:
        • SummaryStats: pie chart percentages computed from timeline.
        • Timeline grid of 3s segments; risk color coding uses thresholds on AI probability (>75 red, 40–75 yellow, <40 green).
        • Transcription: words colored by AI-probability per overlapping 3s interval; free users see prompt to upgrade, pro users see full text.
        • Chat button to ChatScreen with analysis context.
      • Components/Transcription.js: aligns word timestamps to 3s buckets, colors text accordingly; uses prediction+confidence to derive AI probability if not present.
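
Below is a minimal sketch of the per-chunk inference path described in the pipeline bullets above: load and resample, cut into non-overlapping 3 s chunks, compute log-mel features with the documented parameters, run the CNN, and aggregate per-chunk outputs into an overall label. Function and variable names are my own, and the trailing-chunk handling and confidence scaling are assumptions; the real audio_processor.py may differ in detail.

import librosa
import torch
import torchaudio

TARGET_SR = 16_000
CHUNK_SAMPLES = TARGET_SR * 3  # fixed non-overlapping 3-second windows

# Log-mel front end using the documented parameters.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=TARGET_SR, n_fft=512, hop_length=160,
    n_mels=128, f_min=20, f_max=8000, mel_scale="slaney",
)

def analyze_file(path, model):
    """Illustrative stand-in for AudioInference.analyze_file."""
    audio, sr = librosa.load(path, sr=None)  # load at native sample rate
    if sr != TARGET_SR:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)

    preds, confs = [], []
    for start in range(0, len(audio) - CHUNK_SAMPLES + 1, CHUNK_SAMPLES):
        chunk = torch.tensor(audio[start:start + CHUNK_SAMPLES]).unsqueeze(0)
        features = torch.log(mel(chunk) + 1e-9)  # log-mel spectrogram
        with torch.no_grad():
            prob_ai = model(features.unsqueeze(0)).item()  # sigmoid output = AI probability
        preds.append("AI" if prob_ai > 0.5 else "Human")
        confs.append(max(prob_ai, 1 - prob_ai))

    percent_ai = 100 * preds.count("AI") / max(len(preds), 1)
    percent_human = 100 - percent_ai
    aggregate_confidence = 100 * sum(confs) / max(len(confs), 1)  # assumed 0-100 scale

    if percent_ai > 60:
        overall = "AI"
    elif percent_human > 60:
        overall = "Human"
    elif 40 <= aggregate_confidence <= 60:
        overall = "Uncertain"
    else:
        overall = "Mixed"

    return {
        "status": "success",
        "overall_prediction": overall,
        "aggregate_confidence": aggregate_confidence,
        "predictions": preds,
        "confidences": confs,
        "percent_ai": percent_ai,
        "percent_human": percent_human,
    }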

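The bearer tokens themselves are simple: the description above says they are HMAC-signed and base64-encoded as client_id|expiry|timestamp|signature, keyed by JWT_SECRET. Here is a sketch of issuing and verifying a token of that shape; the hash choice, TTL, and field encoding are assumptions rather than what app.py necessarily does.

import base64
import hashlib
import hmac
import os
import time

JWT_SECRET = os.environ["JWT_SECRET"]

def issue_token(client_id: str, ttl_seconds: int = 3600) -> str:
    """Build a base64 token of the form client_id|expiry|timestamp|signature."""
    now = int(time.time())
    payload = f"{client_id}|{now + ttl_seconds}|{now}"
    signature = hmac.new(JWT_SECRET.encode(), payload.encode(), hashlib.sha256).hexdigest()
    return base64.b64encode(f"{payload}|{signature}".encode()).decode()

def verify_token(token: str) -> bool:
    """Check the signature and expiry embedded in a bearer token."""
    try:
        client_id, expiry, issued_at, signature = base64.b64decode(token).decode().split("|")
    except Exception:
        return False
    expected = hmac.new(JWT_SECRET.encode(),
                        f"{client_id}|{expiry}|{issued_at}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected) and int(expiry) > time.time()
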
Request flow (expanded, code-accurate)

  • Record or upload audio in the app

    • For files: Home.js uses expo-document-picker, validates client-side (fileValidator.js), then calls enhancedApiService.submitAndMonitor(..., isFile=true).
    • For links: EnterLink.js uses enhancedApiService.submitAndMonitor(..., isFile=false) and falls back appropriately.
  • Upload to inference service

    • Preferred flow:
      1. POST /auth/token → bearer token.
      2. POST /generate-upload-url with {file_name, file_type} using Authorization: Bearer {token}.
      3. PUT the audio bytes to the returned signed URL (10s expiry) in GCS.
      4. POST /report { bucket_name, file_name }.
      5. Client polls GET /report-status/{task_id}?has_subscription={bool} until status: completed. (A client-side sketch of steps 1-5 appears after this list.)
    • Direct analyze (fallback): POST /analyze multipart with file, returns immediate result (no transcription).
  • Backend processing

    • /process-report (called by Cloud Tasks):
      • Download object from GCS to temp path.
      • Try transcription via Deepgram (best-effort, non-fatal if it fails).
      • Run AudioInference.analyze_file(temp_path):
        • Load and resample to 16 kHz.
        • Split into non-overlapping 3 s chunks.
        • For each chunk: compute 128-mel log spectrogram and run CNN to get AI probability.
        • Build per-chunk prediction/confidence; compute summary and overall label.
      • Assemble result array:
        • First item: summary_statistics with totals and percentages.
        • Timeline items: [ { timestamp: i*3, prediction, confidence } ] for each chunk.
        • Attach overall_prediction, aggregate_confidence, and transcription_data (if available).
      • Save to jobs[task_id].
  • Response shaping by tier

    • GET /report-status/{task_id}:
      • Free: returns full timeline plus summary; transcription_data: null, is_limited: true.
      • Pro: returns full job payload including transcription_data.
  • App rendering

    • Results.js normalizes backend formats (results vs result vs Results.chunk_results), computes AI probability per chunk for display:
      • Color thresholds: AI probability >75 red, 40–75 yellow, <40 green.
    • Transcription.js colors words by cross-referencing word.start with the 3-second chunk that contains it (see the coloring sketch after this list).
  • Chat

    • App sends POST /chat?has_subscription=…&task_id=… with message, context, and optionally analysis_data built from the local results so the LLM can cite structured findings.
    • Backend gates free users; for pro, it uses Gemini generativeai to generate a response, inlining analysis/transcription snippets.
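
The preferred upload flow above is plain HTTPS, so it can be sketched outside the app. The client is actually Expo/React Native (enhancedApiService.js); the Python below simply mirrors steps 1-5 for clarity. The base URL, the audio content type, and the task_id field name in the /report response are assumptions.

import time
import requests

API = "https://example-api-host"  # placeholder; the real base URL is configured in the app

def analyze_via_signed_upload(path: str, app_user_id: str, has_subscription: bool) -> dict:
    # 1. Obtain an HMAC-signed bearer token.
    token = requests.post(f"{API}/auth/token",
                          json={"app_user_id": app_user_id}).json()["token"]
    auth = {"Authorization": f"Bearer {token}"}

    # 2. Request a short-lived v4 signed PUT URL for this file.
    upload = requests.post(f"{API}/generate-upload-url", headers=auth,
                           json={"file_name": path, "file_type": "audio/mpeg"}).json()

    # 3. PUT the raw audio bytes straight to GCS before the URL expires.
    with open(path, "rb") as f:
        requests.put(upload["signed_url"], data=f,
                     headers={"Content-Type": "audio/mpeg"}).raise_for_status()

    # 4. Enqueue the background report job (Cloud Tasks -> /process-report).
    task = requests.post(f"{API}/report", headers=auth,
                         json={"bucket_name": upload["bucket"],
                               "file_name": upload["file_name"]}).json()

    # 5. Poll until the worker has stored results in the jobs dict.
    while True:
        status = requests.get(f"{API}/report-status/{task['task_id']}", headers=auth,
                              params={"has_subscription": str(has_subscription).lower()}).json()
        if status.get("status") == "completed":
            return status
        time.sleep(2)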

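The risk coloring and word alignment are small mappings: a chunk's AI probability drives the timeline color (>75 red, 40-75 yellow, <40 green), and each transcript word is colored by the 3-second chunk that contains its start time. A sketch, with helper names of my own and the assumption that confidence is stored on a 0-1 scale:

def risk_color(prob_ai_percent: float) -> str:
    """Map a chunk's AI probability (0-100) to the timeline color."""
    if prob_ai_percent > 75:
        return "red"
    if prob_ai_percent >= 40:
        return "yellow"
    return "green"

def color_for_word(word_start_s: float, chunks: list) -> str:
    """Color a transcript word by the 3-second chunk containing word.start."""
    chunk = chunks[min(int(word_start_s // 3), len(chunks) - 1)]
    prob_ai = chunk.get("prob_ai")
    if prob_ai is None:
        # Derive AI probability from prediction + confidence when not stored directly.
        conf = chunk["confidence"]
        prob_ai = conf if chunk["prediction"] == "AI" else 1 - conf
    return risk_color(100 * prob_ai)
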
Diagrams

flowchart LR
  A["Mobile App (Expo React Native)"]
  B["FastAPI API (app.py)"]
  C["AudioInference (PyTorch, audio_processor.py)"]
  D["Deepgram ASR (Transcription)"]
  E["Google Cloud Storage (GCS)"]
  F["Cloud Tasks (enqueue → HTTP POST /process-report)"]
  G["Gemini (Chat)"]
  H["In-memory jobs store (app.py: jobs dict)"]

  A -->|"POST /auth/token"| B
  A -->|"POST /generate-upload-url {file_name,type}"| B
  B -->|"V4 signed PUT URL (10s)"| A
  A -->|"PUT object bytes"| E
  A -->|"POST /report {bucket,file_name}"| B
  B -->|"create_task"| F
  F -->|"POST /process-report (authenticated by Cloud Tasks headers)"| B
  B -->|"download file"| E
  B -->|"analyze_file()"| C
  B -->|"transcribe_audio_file() (best‑effort)"| D
  B -->|"store formatted results"| H
  A -->|"GET /report-status/{task_id}?has_subscription=…"| B
  A -->|"POST /chat?has_subscription=…&task_id=…"| B
  B -->|"prompt with analysis context"| G

sequenceDiagram
  autonumber
  participant App as Mobile App
  participant API as FastAPI
  participant GCS as Google Cloud Storage
  participant Tasks as Cloud Tasks
  participant Proc as AudioInference (PyTorch)
  participant ASR as Deepgram ASR
  participant Store as jobs dict

  App->>API: POST /auth/token { app_user_id }
  API-->>App: { token, expires_in }
  App->>API: POST /generate-upload-url (Bearer token)
  API-->>App: { signed_url, file_name, bucket }
  App->>GCS: PUT object bytes (signed_url)
  App->>API: POST /report { bucket_name, file_name } (Bearer)
  API->>Tasks: create_task → /process-report
  Tasks-->>API: POST /process-report (task headers)
  API->>GCS: download file
  par Inference
    API->>Proc: analyze_file(temp_path)
    Proc-->>API: { per-chunk predictions/confidences, aggregates }
  and Transcription (best-effort)
    API->>ASR: transcribe_audio_file(temp_path)
    ASR-->>API: { text, words[timestamps], sentiment, summary }
  end
  API->>Store: jobs[task_id] = { status: completed, result[], overall_prediction, aggregate_confidence, transcription_data }
  App->>API: GET /report-status/{task_id}?has_subscription=… (Bearer)
  alt free tier
    API-->>App: { status, result: [summary + ALL timeline], transcription_data: null, is_limited: true }
  else pro
    API-->>App: Full job payload
  end
  App->>API: POST /chat?has_subscription=…&task_id=… { message, context, analysis_data? }
  API-->>App: { response, context }

What users get today

  • Free tier: Full 3‑second timeline with per‑chunk prediction and confidence; no transcription; chat disabled.
  • Pro tier: Adds full transcription with word‑level risk coloring and chat (10 messages per report).
  • Clear verdicts: Overall AI/Human/Mixed/Uncertain with confidence summary.
  • Robust pipeline: Works across codecs, compression, and background noise.
  • Fast, consistent UX across iOS and web.

See the codebases:

The generalization crisis (and what we’re doing about it)

Deepfake‑Eval‑2024 — a large in‑the‑wild benchmark with 56.5 hours of audio, 45 hours of video, and 1,975 images spanning 88 websites and 52 languages — shows that state‑of‑the‑art open‑source detectors drop sharply on real‑world data. Compared to prior academic benchmarks, AUC decreases by ~48% for audio (50% for video, 45% for images), underscoring a substantial distribution‑shift problem. See: arXiv:2503.02857.

This mirrors what we see in practice: performance can degrade across new generators, prompts, speakers, microphones, rooms, and codecs. Commercial systems and models fine‑tuned on Deepfake‑Eval‑2024 perform better than off‑the‑shelf open‑source models but still trail expert forensic analysts [arXiv:2503.02857].

I'm working to close this gap by:

  • Broadening domain coverage (languages, speakers, microphones, acoustics, codecs)
  • Tracking generator churn with frequent sampling of new TTS/voice‑cloning models and prompts
  • Hard‑negative mining and adversarial augmentations
  • OOD‑first evaluation with calibrated thresholds and cohort‑aware analysis (a generic calibration sketch follows this list)
  • Continual validation and versioned releases to catch drift quickly
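
For the calibrated-thresholds point, here is a generic sketch of what calibrating a decision threshold on an out-of-distribution development set can look like; it is not the project's actual evaluation code, just an illustration of the idea.

import numpy as np

def calibrate_threshold(dev_scores: np.ndarray, dev_labels: np.ndarray,
                        target_fpr: float = 0.05) -> float:
    """Choose the AI-score threshold whose false-positive rate on genuine (human)
    audio from an out-of-distribution dev set stays near target_fpr."""
    human_scores = np.sort(dev_scores[dev_labels == 0])  # scores for genuine audio
    # The (1 - target_fpr) quantile of human scores caps the FPR at roughly target_fpr.
    return float(np.quantile(human_scores, 1 - target_fpr))

# Usage: flag a clip as AI only when its score clears the calibrated threshold.
# threshold = calibrate_threshold(dev_scores, dev_labels, target_fpr=0.05)
# is_ai = clip_score > threshold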

The ~48% decrease in performance is a threat to this entire category. However, I have obtained access to this evaluation set and have brought the performance degradation down to roughly 8%, with test accuracy of 88% and evaluation accuracy around 80%.

This exceeds top human performance, and while there is a long way to go, I am constantly finding ways to improve it.

Links