Ai-SPY (Mobile) — AI Speech Detection on iOS

Overview
While the web version is an enterprise‑grade product, Ai‑SPY Mobile is a consumer‑grade iOS app that lets people verify audio on the go. It connects to the same detection pipeline and is optimized for quick capture, upload, and instant results. The underlying model is trained on a custom dataset targeting "in‑the‑wild" audio from many languages and sources, hardened across codecs, and strengthened with augmentation and corruption during training.
Why I built this
In short, I tried ElevenLabs' API in the summer of 2023 and realized that this technology was likely to cross the uncanny valley in a matter of months. Of course, this isn't just true for speech, but also for images, text, and video. However, our visual acuity far exceeds our auditory acuity: we can see better than we can hear. So, to wear my black hat for a second, a deepfake-based attack should aim to provide as little signal as possible in the medium we are least attuned to scrutinize, and that is clearly audio-only. So I started there.
In digging further into the space, I realized that there were a ton of critical systems that leveraged audio-only verification to authenticate customers. Banks, utility companies, doctors' offices, schools: the list went on. Without updated biometric systems that can distinguish genuine voices from deliberate AI clones, there is little stopping bad actors from escalating their privileges.
Not only that — research on humans' ability to detect audio deepfakes has exposed a concerning reality: people miss roughly a quarter to a half of all audio deepfakes (UF News, Systematic review and meta-analysis).
Why speech deepfakes matter for AI safety
- Lower human perceptual defenses: We are much better at catching visual artifacts than subtle audio manipulations.
- High‑stakes, voice‑mediated workflows: Banking, healthcare triage, enterprise approvals, and emergency dispatch still rely on voice.
- Economic and social damage: Impersonation and social engineering via synthetic speech have rapidly growing real‑world impact.
Platform and stack
- Frontend (this app): Expo React Native (iOS) with a fast, accessible UI for recording or uploading audio and viewing results.
- Backend (companion service): FastAPI + PyTorch inference pipeline with strict validation, rate limiting, and storage integration.
- Distribution: App Store listing with regular updates.
Architecture overview (code-accurate)
- Backend: FastAPI (api-mobile-app/app.py) with:
  - Auth: POST /auth/token issues HMAC-signed bearer tokens (base64 of client_id|expiry|timestamp|signature) using JWT_SECRET; a sketch of the scheme follows this list.
  - File analysis: POST /analyze accepts a multipart/form-data file, validates it, runs the PyTorch model (AudioInference.analyze_file), and returns immediate per-3s results.
  - Signed upload: POST /generate-upload-url returns a v4 GCS signed PUT URL (10 s expiry) after filename/content-type validation.
  - Background job: POST /report enqueues Cloud Tasks → POST /process-report, which downloads the file, runs inference and optional Deepgram transcription, and stores results in the jobs dict.
  - Polling: GET /report-status/{task_id}?has_subscription=… returns full or limited data based on tier.
  - Transcription direct: POST /transcribe uploads a file for Deepgram ASR; the free tier is limited to the first 50 words.
  - Chat: POST /chat?has_subscription=…&task_id=… with optional analysis_data from the client; free users receive a gating message, pro users are tracked per report (10-message limit).
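For illustration, a minimal sketch of how such an HMAC-signed bearer token could be issued and verified. The payload field order matches the description above, but the TTL, helper names, and hash choice (SHA-256) are assumptions rather than copies of app.py:

```python
import base64, hashlib, hmac, os, time

JWT_SECRET = os.environ["JWT_SECRET"]  # shared signing secret (name taken from the list above)

def issue_token(client_id: str, ttl_seconds: int = 3600) -> str:
    """Issue a bearer token: base64(client_id|expiry|timestamp|signature)."""
    now = int(time.time())
    payload = f"{client_id}|{now + ttl_seconds}|{now}"
    signature = hmac.new(JWT_SECRET.encode(), payload.encode(), hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(f"{payload}|{signature}".encode()).decode()

def verify_token(token: str) -> bool:
    """Recompute the signature and check expiry; return False on any malformed input."""
    try:
        client_id, expiry, ts, signature = base64.urlsafe_b64decode(token).decode().split("|")
    except Exception:
        return False
    expected = hmac.new(JWT_SECRET.encode(), f"{client_id}|{expiry}|{ts}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected) and int(expiry) > time.time()
```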
- Inference pipeline (api-mobile-app/audio_processor.py); a condensed sketch follows this list:
  - Load audio with librosa.load(sr=None), then resample to 16 kHz.
  - Windowing: fixed non-overlapping 3-second chunks (target_sr 16000 × 3 s).
  - Features: log-mel spectrogram via torchaudio.transforms.MelSpectrogram (n_fft=512, hop=160, n_mels=128, f_min=20, f_max=8000, Slaney scale), then torch.log(mel + 1e-9).
  - Model: DeepfakeDetectorCNN (3 conv blocks + FC; sigmoid output yields AI probability).
  - Per-chunk classification: probability > 0.5 → AI, else Human; confidence is max(prob_ai, 1 - prob_ai).
  - Aggregation: counts, percent_ai, percent_human, aggregate_confidence = mean of the chunk confidences defined above. Overall label:
    - AI if percent_ai > 60
    - Human if percent_human > 60
    - Uncertain if 40 ≤ aggregate_confidence ≤ 60
    - Mixed otherwise
  - Output: status, overall_prediction, aggregate_confidence, arrays of predictions[] and confidences[], counts, and percentages.
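The same pipeline condensed into a sketch. Only the preprocessing parameters and thresholds come from the list above; the model is passed in as a placeholder (the real DeepfakeDetectorCNN's architecture and padding of short final chunks are not reproduced here):

```python
import librosa
import torch
import torchaudio

TARGET_SR = 16_000
CHUNK_SAMPLES = TARGET_SR * 3  # fixed, non-overlapping 3-second windows

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=TARGET_SR, n_fft=512, hop_length=160,
    n_mels=128, f_min=20, f_max=8000, mel_scale="slaney",
)

def analyze_file(path: str, model: torch.nn.Module) -> dict:
    audio, sr = librosa.load(path, sr=None)                          # load at native rate
    audio = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)  # resample to 16 kHz
    wave = torch.tensor(audio, dtype=torch.float32)

    predictions, confidences = [], []
    # The real code may pad or drop a trailing partial chunk; here it is dropped.
    for start in range(0, len(wave) - CHUNK_SAMPLES + 1, CHUNK_SAMPLES):
        chunk = wave[start:start + CHUNK_SAMPLES]
        features = torch.log(mel(chunk) + 1e-9).unsqueeze(0).unsqueeze(0)  # [1, 1, 128, T]
        with torch.no_grad():
            prob_ai = float(model(features).squeeze())  # model ends in a sigmoid, so this is P(AI)
        predictions.append("AI" if prob_ai > 0.5 else "Human")
        confidences.append(max(prob_ai, 1 - prob_ai))

    n = max(len(predictions), 1)
    percent_ai = 100 * predictions.count("AI") / n
    percent_human = 100 - percent_ai
    aggregate_confidence = 100 * sum(confidences) / n

    if percent_ai > 60:
        overall = "AI"
    elif percent_human > 60:
        overall = "Human"
    elif 40 <= aggregate_confidence <= 60:
        overall = "Uncertain"
    else:
        overall = "Mixed"

    return {
        "status": "completed",
        "overall_prediction": overall,
        "aggregate_confidence": aggregate_confidence,
        "predictions": predictions,
        "confidences": confidences,
        "percent_ai": percent_ai,
        "percent_human": percent_human,
    }
```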
- Transcription (transcribe_audio_file in app.py):
  - Primary: Deepgram SDK, nova-2, with smart_format, diarize, summarize v2, topics, and sentiment enabled.
  - Fallback: raw HTTP if the SDK hits union-type issues; parses transcripts, words with timestamps, average sentiment, and summary (a sketch of this fallback follows this list).
  - Returns a uniform structure: text, words[{word,start,end,confidence}], average_sentiment, summary.
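A hedged sketch of what the raw-HTTP fallback can look like against Deepgram's prerecorded /v1/listen endpoint. The query parameters mirror the options listed above; the response paths for sentiment and summary are approximations of Deepgram's schema, not copies of transcribe_audio_file:

```python
import os
import requests

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

def transcribe_fallback(path: str, content_type: str = "audio/wav") -> dict:
    params = {
        "model": "nova-2", "smart_format": "true", "diarize": "true",
        "summarize": "v2", "topics": "true", "sentiment": "true",
    }
    headers = {
        "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
        "Content-Type": content_type,
    }
    with open(path, "rb") as f:
        resp = requests.post(DEEPGRAM_URL, params=params, headers=headers, data=f, timeout=120)
    resp.raise_for_status()
    body = resp.json()

    alt = body["results"]["channels"][0]["alternatives"][0]
    return {
        "text": alt["transcript"],
        "words": [
            {"word": w["word"], "start": w["start"], "end": w["end"], "confidence": w["confidence"]}
            for w in alt.get("words", [])
        ],
        # The two paths below are schema assumptions; adjust to the actual Deepgram response.
        "average_sentiment": body["results"].get("sentiments", {}).get("average", {}).get("sentiment"),
        "summary": body["results"].get("summary", {}).get("short"),
    }
```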
- Security and ops:
  - Strict file validation: extension allowlist, MIME check, magic-byte heuristics (MP3/WAV), filename sanitization, size cap (40 MB); a validation sketch follows this list.
  - Security headers, CORS from env (ALLOWED_ORIGINS), CSP restricting external connects to Deepgram and Google Generative AI.
  - Rate limiting via slowapi.
  - An in-memory jobs store holds the job lifecycle and chat usage counters.
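A minimal sketch of that validation layer. The 40 MB cap and the MP3/WAV magic-byte checks come from the list above; the exact extension allowlist and error handling are assumptions:

```python
import re

ALLOWED_EXTENSIONS = {".mp3", ".wav"}    # assumed allowlist
MAX_BYTES = 40 * 1024 * 1024             # 40 MB size cap

def sanitize_filename(name: str) -> str:
    """Strip path components and anything outside a conservative character set."""
    name = name.rsplit("/", 1)[-1].rsplit("\\", 1)[-1]
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)

def looks_like_audio(header: bytes) -> bool:
    """Magic-byte heuristics: MP3 (ID3 tag or MPEG frame sync) or WAV (RIFF....WAVE)."""
    is_mp3 = header.startswith(b"ID3") or (len(header) > 1 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0)
    is_wav = header.startswith(b"RIFF") and header[8:12] == b"WAVE"
    return is_mp3 or is_wav

def validate_upload(filename: str, data: bytes) -> str:
    safe_name = sanitize_filename(filename)
    ext = "." + safe_name.rsplit(".", 1)[-1].lower() if "." in safe_name else ""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported extension: {ext}")
    if len(data) > MAX_BYTES:
        raise ValueError("File exceeds the 40 MB limit")
    if not looks_like_audio(data[:12]):
        raise ValueError("File content does not look like MP3/WAV audio")
    return safe_name
```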
- Mobile app (Expo RN, ai-spy-mobile-app):
  - Auth and requests: Components/enhancedApiService.js
    - Generates/validates the auth token; stores it in expo-secure-store.
    - All requests require HTTPS and a Bearer token.
    - Analyze direct: POST /analyze (form-data file).
    - Signed URL workflow: /generate-upload-url → PUT to GCS → /report → poll /report-status/{id} (the end-to-end flow is sketched after this list).
    - Tier-aware: free users see the full timeline but no transcription; pro users get transcription and chat.
    - Chat: optionally includes the full analysis context (analysis_data) to improve responses.
  - UI:
    - Screens/Home.js: upload audio; uses the signed URL flow; shows Results on completion.
    - Screens/EnterLink.js: link-based processing via the enhanced service; uses the same results view.
    - Components/Results.js: computes a timeline view model from the various backend formats; renders:
      - SummaryStats: pie-chart percentages computed from the timeline.
      - Timeline grid of 3 s segments; risk color coding uses thresholds on AI probability (>75 red, 40–75 yellow, <40 green).
      - Transcription: words colored by the AI probability of the 3 s interval they overlap; free users see a prompt to upgrade, pro users see the full text.
      - Chat button to ChatScreen with analysis context.
    - Components/Transcription.js: aligns word timestamps to 3 s buckets and colors text accordingly; uses prediction + confidence to derive AI probability if it is not present.
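The signed-URL workflow written out end to end. The real client is JavaScript (enhancedApiService.js); this Python/requests version is only a language-neutral illustration, the base URL is a placeholder, and the task_id response key is an assumption:

```python
import time
import requests

API = "https://api.example.com"  # placeholder base URL

def submit_and_monitor(path: str, app_user_id: str, has_subscription: bool) -> dict:
    # 0. Obtain a short-lived bearer token.
    token = requests.post(f"{API}/auth/token", json={"app_user_id": app_user_id}).json()["token"]
    auth = {"Authorization": f"Bearer {token}"}

    # 1. Ask the backend for a signed PUT URL (expires in ~10 s, so upload immediately).
    upload = requests.post(
        f"{API}/generate-upload-url",
        json={"file_name": path.rsplit("/", 1)[-1], "file_type": "audio/mpeg"},
        headers=auth,
    ).json()

    # 2. PUT the audio bytes straight to GCS.
    with open(path, "rb") as f:
        requests.put(upload["signed_url"], data=f, headers={"Content-Type": "audio/mpeg"}).raise_for_status()

    # 3. Enqueue the background analysis job.
    task_id = requests.post(
        f"{API}/report",
        json={"bucket_name": upload["bucket"], "file_name": upload["file_name"]},
        headers=auth,
    ).json()["task_id"]

    # 4. Poll until the job completes; the tier flag shapes the response.
    while True:
        status = requests.get(
            f"{API}/report-status/{task_id}",
            params={"has_subscription": str(has_subscription).lower()},
            headers=auth,
        ).json()
        if status.get("status") == "completed":
            return status
        time.sleep(2)
```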
Request flow (expanded, code-accurate)
- Record or upload audio in the app
  - For files: Home.js uses expo-document-picker, validates client-side (fileValidator.js), then calls enhancedApiService.submitAndMonitor(..., isFile=true).
  - For links: EnterLink.js uses enhancedApiService.submitAndMonitor(..., isFile=false) and falls back appropriately.
- Upload to inference service
  - Preferred flow:
    - POST /auth/token → bearer token.
    - POST /generate-upload-url with {file_name, file_type} using Authorization: Bearer {token} (signed-URL generation is sketched after this list).
    - PUT the audio bytes to the returned signed URL (10 s expiry) in GCS.
    - POST /report { bucket_name, file_name }.
    - Client polls GET /report-status/{task_id}?has_subscription={bool} until status: completed.
  - Direct analyze (fallback): POST /analyze multipart with file returns an immediate result (no transcription).
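On the backend side, the v4 signed PUT URL is a thin wrapper over google-cloud-storage. A sketch under the assumption that the bucket name comes from configuration and that credentials are available in the environment:

```python
from datetime import timedelta

from google.cloud import storage

def generate_upload_url(bucket_name: str, file_name: str, content_type: str) -> str:
    """Return a v4 signed URL allowing a single PUT of the object for ~10 seconds."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(file_name)
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(seconds=10),  # matches the 10 s expiry noted above
        method="PUT",
        content_type=content_type,
    )
```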
- Backend processing
  - /process-report (called by Cloud Tasks):
    - Download the object from GCS to a temp path.
    - Try transcription via Deepgram (best-effort, non-fatal if it fails).
    - Run AudioInference.analyze_file(temp_path):
      - Load and resample to 16 kHz.
      - Split into non-overlapping 3 s chunks.
      - For each chunk: compute a 128-mel log spectrogram and run the CNN to get an AI probability.
      - Build per-chunk prediction/confidence; compute the summary and overall label.
    - Assemble the result array (sketched after this list):
      - First item: summary_statistics with totals and percentages.
      - Timeline items: [ { timestamp: i*3, prediction, confidence } ] for each chunk.
      - Attach overall_prediction, aggregate_confidence, and transcription_data (if available).
    - Save to jobs[task_id].
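A sketch of how /process-report could assemble and store that payload, using the output shape of the analyze_file sketch shown earlier. jobs is the in-memory dict from app.py; the exact keys inside summary_statistics are illustrative:

```python
jobs: dict[str, dict] = {}  # in-memory job store (as in app.py)

def store_report(task_id: str, analysis: dict, transcription: dict | None) -> None:
    # First item: summary statistics for the whole file.
    result = [{
        "summary_statistics": {
            "total_chunks": len(analysis["predictions"]),
            "percent_ai": analysis["percent_ai"],
            "percent_human": analysis["percent_human"],
        }
    }]
    # Then one timeline item per non-overlapping 3 s chunk.
    result += [
        {"timestamp": i * 3, "prediction": pred, "confidence": conf}
        for i, (pred, conf) in enumerate(zip(analysis["predictions"], analysis["confidences"]))
    ]
    jobs[task_id] = {
        "status": "completed",
        "result": result,
        "overall_prediction": analysis["overall_prediction"],
        "aggregate_confidence": analysis["aggregate_confidence"],
        "transcription_data": transcription,  # None if Deepgram failed (best-effort step)
    }
```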
- Response shaping by tier
  - GET /report-status/{task_id} (gating is sketched after this list):
    - Free: returns the full timeline plus summary; transcription_data: null, is_limited: true.
    - Pro: returns the full job payload including transcription_data.
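A minimal gating sketch for that endpoint; the real handler in app.py also does auth and rate limiting, which are omitted here:

```python
def shape_report(job: dict, has_subscription: bool) -> dict:
    """Free tier keeps the full timeline but drops transcription; pro gets everything."""
    if has_subscription:
        return job
    limited = dict(job)
    limited["transcription_data"] = None
    limited["is_limited"] = True
    return limited
```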
- App rendering
  - Results.js normalizes the backend formats (results vs result vs Results.chunk_results) and computes an AI probability per chunk for display:
    - Color thresholds: AI probability >75 red, 40–75 yellow, <40 green (mapped in the sketch after this list).
  - Transcription.js colors words by cross-referencing word.start with the 3-second chunk that contains it.
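The threshold mapping expressed as a tiny helper. The app does this in Results.js and Transcription.js; Python is used here only for consistency with the other sketches:

```python
def risk_color(prob_ai_percent: float) -> str:
    """Map a chunk's AI probability (0-100) to the timeline color buckets."""
    if prob_ai_percent > 75:
        return "red"
    if prob_ai_percent >= 40:
        return "yellow"
    return "green"
```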
- Chat
  - The app sends POST /chat?has_subscription=…&task_id=… with message, context, and optionally analysis_data built from the local results so the LLM can cite structured findings.
  - The backend gates free users; for pro users it uses Gemini (generativeai) to generate a response, inlining analysis/transcription snippets (a hedged call sketch follows).
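A hedged sketch of the pro-tier chat path using the google-generativeai client. The model name, prompt shape, and truncation limit are assumptions; only the idea of inlining analysis and transcription context comes from the description above:

```python
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

def chat_reply(message: str, analysis_data: dict | None, transcription_text: str | None) -> str:
    # Inline structured findings so the model can cite them instead of guessing.
    context_lines = []
    if analysis_data:
        context_lines.append(f"Analysis summary: {analysis_data}")
    if transcription_text:
        context_lines.append(f"Transcript excerpt: {transcription_text[:1000]}")
    prompt = "\n".join(context_lines + [f"User question: {message}"])

    model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name
    return model.generate_content(prompt).text
```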
Diagrams
flowchart LR
A["Mobile App (Expo React Native)"]
B["FastAPI API (app.py)"]
C["AudioInference (PyTorch, audio_processor.py)"]
D["Deepgram ASR (Transcription)"]
E["Google Cloud Storage (GCS)"]
F["Cloud Tasks (enqueue → HTTP POST /process-report)"]
G["Gemini (Chat)"]
H["In-memory jobs store (app.py: jobs dict)"]
A -->|"POST /auth/token"| B
A -->|"POST /generate-upload-url {file_name,type}"| B
B -->|"V4 signed PUT URL (10s)"| A
A -->|"PUT object bytes"| E
A -->|"POST /report {bucket,file_name}"| B
B -->|"create_task"| F
F -->|"POST /process-report (authenticated by Cloud Tasks headers)"| B
B -->|"download file"| E
B -->|"analyze_file()"| C
B -->|"transcribe_audio_file() (best‑effort)"| D
B -->|"store formatted results"| H
A -->|"GET /report-status/{task_id}?has_subscription=…"| B
A -->|"POST /chat?has_subscription=…&task_id=…"| B
B -->|"prompt with analysis context"| G
sequenceDiagram
autonumber
participant App as Mobile App
participant API as FastAPI
participant GCS as Google Cloud Storage
participant Tasks as Cloud Tasks
participant Proc as AudioInference (PyTorch)
participant ASR as Deepgram ASR
participant Store as jobs dict
App->>API: POST /auth/token { app_user_id }
API-->>App: { token, expires_in }
App->>API: POST /generate-upload-url (Bearer token)
API-->>App: { signed_url, file_name, bucket }
App->>GCS: PUT object bytes (signed_url)
App->>API: POST /report { bucket_name, file_name } (Bearer)
API->>Tasks: create_task → /process-report
Tasks-->>API: POST /process-report (task headers)
API->>GCS: download file
par Inference
API->>Proc: analyze_file(temp_path)
Proc-->>API: { per-chunk predictions/confidences, aggregates }
and Transcription (best-effort)
API->>ASR: transcribe_audio_file(temp_path)
ASR-->>API: { text, words[timestamps], sentiment, summary }
end
API->>Store: jobs[task_id] = { status: completed, result[], overall_prediction, aggregate_confidence, transcription_data }
App->>API: GET /report-status/{task_id}?has_subscription=… (Bearer)
alt free tier
API-->>App: { status, result: [summary + ALL timeline], transcription_data: null, is_limited: true }
else pro
API-->>App: Full job payload
end
App->>API: POST /chat?has_subscription=…&task_id=… { message, context, analysis_data? }
API-->>App: { response, context }
What users get today
- Free tier: Full 3‑second timeline with per‑chunk prediction and confidence; no transcription; chat disabled.
- Pro tier: Adds full transcription with word‑level risk coloring and chat (10 messages per report).
- Clear verdicts: Overall AI/Human/Mixed/Uncertain with confidence summary.
- Robust pipeline: Works across codecs, compression, and background noise.
- Fast, consistent UX across iOS and web.
See the codebases:
- Mobile app: ibiggy9/ai-spy-mobile-public — github.com/ibiggy9/ai-spy-mobile-public
- Web app: ibiggy9/ai-spy-web — github.com/ibiggy9/ai-spy-web
The generalization crisis (and what we’re doing about it)
Deepfake‑Eval‑2024 — a large in‑the‑wild benchmark with 56.5 hours of audio, 45 hours of video, and 1,975 images spanning 88 websites and 52 languages — shows that state‑of‑the‑art open‑source detectors drop sharply on real‑world data. Compared to prior academic benchmarks, AUC decreases by ~48% for audio (50% for video, 45% for images), underscoring a substantial distribution‑shift problem. See: arXiv:2503.02857.
This mirrors what we see in practice: performance can degrade across new generators, prompts, speakers, microphones, rooms, and codecs. Commercial systems and models fine‑tuned on Deepfake‑Eval‑2024 perform better than off‑the‑shelf open‑source models but still trail expert forensic analysts [arXiv:2503.02857].
I'm working to close this gap by:
- Broadening domain coverage (languages, speakers, microphones, acoustics, codecs)
- Tracking generator churn with frequent sampling of new TTS/voice‑cloning models and prompts
- Hard‑negative mining and adversarial augmentations
- OOD‑first evaluation with calibrated thresholds and cohort‑aware analysis
- Continual validation and versioned releases to catch drift quickly
The ~48% decrease in performance is a threat to this entire category. However, I have obtained access to this evaluation set and reduced the degradation to roughly 8%, with test accuracy of 88% and evaluation accuracy of around 80%.
This exceeds top human performance, and while there is a long way to go, I am constantly finding ways to improve it.
Links
- Mobile repository: ibiggy9/ai-spy-mobile-public
- Web repository: ibiggy9/ai-spy-web
- Deepfake‑Eval‑2024 in‑the‑wild benchmark: arXiv:2503.02857
- Study on generalization degradation (2024): arXiv:2308.04177
- Human ability to detect audio deepfakes (news): UF News
- Human deepfake detection performance (systematic review): ResearchGate