Table of Contents
Some products come from market research. Some come from personal pain points. This one came from a fan message. A tech YouTuber with a significant following received a comment: “I’d love to ask you questions on YouTube, but I have social anxiety — even typing feels stressful, let alone a real call.” He decided to do something about it — not write an advice post, but actually build a product.
This article traces the technical architecture of an “AI-assisted video calling” product, and the engineering decisions an indie developer makes with limited resources.
TL;DR
- Target user: People with social anxiety who want to practice real conversation but are afraid of real human interaction
- Core feature: AI plays a conversation partner, providing real-time voice responses in a video call interface simulating a real conversation
- Tech stack: WebRTC video streaming + real-time ASR (speech recognition) + LLM + TTS (speech synthesis) + AI avatar (Tavus or HeyGen-type service)
- Biggest challenge: End-to-end latency must stay under 800ms for conversation to feel natural
Background and Challenge
Social Anxiety Disorder affects approximately 7% of adults globally, with excessive fear of judgment as its core symptom. This makes many people feel extreme stress during in-person communication or video calls.
Traditional solutions involve cognitive behavioral therapy (CBT), with exposure therapy as a core method — gradually exposing patients to anxiety-inducing social situations. The theory is sound, but there are real-world problems:
- Therapist appointments are expensive with long waitlists
- Practice opportunities are limited (you can’t find a real person to practice with every day)
- Failure cost is high (failing with a real person may reinforce anxiety)
The AI video calling hypothesis: provide a “low-risk practice space” — the conversation partner is AI, so you can restart if you mess up, with no one judging you.
Solution Design
Core UX Assumptions
A successful AI social anxiety practice tool needs to satisfy:
- Visual realism: It should look like a real person talking, not a text box or disembodied voice
- Low enough latency: Response delay over 1 second breaks the rhythm of conversation
- Natural conversation: AI must understand context and not forget what was said earlier
- Sense of safety: Clearly tell users this is AI — don’t make them feel deceived
Technology Stack Choices
Video Streaming: WebRTC
Browser-native WebRTC support, no app install required. P2P connection latency is low — under 50ms in the same region under optimal conditions.
For an indie developer, self-hosting a WebRTC signaling server + TURN server costs significant effort. More practical choices are managed services:
- Daily.co: $0.004/participant-minute, has SDK, fastest to ship
- Livekit: Open source, self-hostable, higher free tier
- Agora: Enterprise-grade, stable but more complex pricing
AI Avatar: Giving AI a Face
Options for making AI visually present break into tiers:
Option A (simplest): Static avatar + audio. AI is a still image that doesn’t move its mouth, only has sound. Easiest to implement, but poor user experience.
Option B (medium): 2D avatar + lip sync. Use Ready Player Me or TalkingHead.js to map audio waveforms to mouth animations. Low cost, but looks like a virtual YouTuber rather than a real person.
Option C (most realistic): AI-generated real-time video stream. Tavus and HeyGen provide this type of API — upload a real person video as a base, and the API generates a speaking video stream in real time with speech matching your text or voice input. Latency runs 300–800ms, and results are closest to a real person. Tavus’s Conversational Video Interface (CVI) is purpose-built for this scenario.
For indie developers: validate with Option A/B first, upgrade to Option C once you have enough users.
Real-Time Speech Recognition (ASR)
User speaks → converts to text → sends to LLM. Speed and accuracy are both critical.
- Whisper (OpenAI): High accuracy, but real-time streaming latency is 500ms+
- Deepgram: Optimized for real-time streaming, latency under 200ms, API pricing at $0.0059/minute
- AssemblyAI: Middle option with speaker diarization capability
Indie developer first choice: Deepgram or AssemblyAI’s real-time API.
LLM Response Generation
Once ASR has the text, send it to an LLM for response generation. Latency-sensitive, so you want fast models:
- GPT-4.1 (OpenAI): Fast time to first token, good for streaming output
- Claude Haiku 3.5: Low latency, low cost, smart enough for conversation handling
- Gemini Flash: Google’s low-latency option
The LLM layer also needs carefully engineered system prompts to simulate specific conversation scenarios (job interview practice, asking for directions, calling someone on the phone) and control response pacing and tone.
Text-to-Speech (TTS)
LLM outputs text → convert to speech → play for user.
- ElevenLabs: Best audio quality, supports real-time streaming, many voice clone options
- OpenAI TTS: $15/million characters, decent quality, simple API
- Play.ai: Designed for conversation, supports emotional tone adjustment
Implementation Details
End-to-End Latency Budget
User finishes speaking (VAD detects silence)
→ ASR recognition: ~200ms (Deepgram)
→ LLM first token: ~300ms (Claude Haiku)
→ TTS first audio chunk: ~200ms (ElevenLabs streaming)
─────────────────────────────────────────
Total: ~700ms
700ms is near the acceptable edge. When any link encounters latency (network fluctuation, LLM load), it exceeds the ~1 second “natural feel” threshold. This is the hardest engineering problem in this type of product.
VAD (Voice Activity Detection)
Must accurately detect “user finished talking” before sending to ASR. Triggering too early (user is still talking) causes the AI to interrupt; triggering too late increases latency. WebRTC has built-in VAD; you can also use Silero VAD (open source, high accuracy).
Conversation Memory Management
Multi-turn conversation requires remembering context, but LLM APIs are stateless — you need to manage conversation history yourself. Strategy:
- Keep the last N turns in full
- Summarize earlier turns (compress token usage)
- Store important facts in a vector database (Pinecone, Supabase pgvector)
Market Context
AI companion apps globally crossed 220 million downloads in 2025, with the number of AI companion apps growing 700% in two years. Born (the company behind virtual pet Pengu) raised a $15 million Series A in 2025 specifically for social AI companions.
For indie developers, there’s sufficient user demand in this market, but competition is also increasing rapidly. Differentiation comes from going deep on specific scenarios (focus only on interview practice, or only on phone call anxiety) rather than trying to build a general-purpose AI chat product.
Lessons Learned
- Latency is the first engineering problem: Conversation feel is subjective, but response latency over 800ms is perceptible to nearly all users
- Visual realism > feature richness: Users care about “does this feel like a real conversation,” not how long the feature list is
- Validate with the cheap version first: An ElevenLabs + Deepgram + Claude MVP is low-cost and quickly validates whether users actually use it
- System prompt is the core product differentiator: Prompt engineering that makes AI behave naturally in conversation matters more than any infrastructure optimization
References
🇺🇸 English
There's a category of product that doesn't come from a pitch deck or a market gap analysis. It comes from a single message. A YouTuber got a comment from a fan that said, essentially: "I'd love to ask you questions, but I have social anxiety — even typing feels stressful, let alone getting on a real call." And instead of writing a blog post about it, he built something.
What he built is an AI-powered video call practice platform. And the engineering decisions behind it are genuinely interesting.
So let's talk about the problem first. Social anxiety affects roughly 7% of adults globally. The gold standard treatment is exposure therapy — you gradually face the situations that make you anxious. Makes sense in theory. But in practice, therapist appointments are expensive, waitlists are long, and you can't exactly summon a real human every time you want to practice saying hello to a stranger. And when practice does go badly with a real person? That can actually reinforce the anxiety. The stakes feel too high.
The hypothesis here: what if the practice partner is AI? Low risk, no judgment, restart anytime. The failure cost drops to zero.
But here's where it gets technically interesting, because "just use ChatGPT" doesn't cut it. For this to actually help someone with social anxiety, it has to *feel* like a real conversation. And that means solving three hard problems simultaneously: visual realism, latency, and conversational coherence.
Let's go layer by layer.
**Video transport** is WebRTC — the browser-native peer-to-peer protocol. For an indie developer, self-hosting the full WebRTC infrastructure is a rabbit hole. The practical move is a managed service like Daily.co or Livekit, where you trade some control for the ability to actually ship.
**Giving the AI a face** is where the architecture gets interesting. You have a spectrum of options. On the simple end: a static avatar image with audio playing over it. Works, but it doesn't feel like a real conversation. In the middle: a 2D avatar with lip-sync animations, matching mouth movement to audio waveforms. Looks more like a virtual YouTuber. Then at the high end: services like Tavus, which take a base video of a real person and generate a live speaking video stream that matches whatever text or audio you feed it in real time. Latency on those is 300 to 800 milliseconds, but the result is closest to actually talking to someone. The indie developer's strategy here is smart: validate with the cheap version first, upgrade to the realistic video stream once you know people are actually using it.
**Speech recognition** — getting the user's words into text fast — is where Deepgram stands out. OpenAI's Whisper is accurate but adds 500 milliseconds or more in real-time streaming. Deepgram is purpose-built for real-time, hits under 200ms, and the pricing is reasonable. For this use case, speed matters more than marginal accuracy gains.
**The language model** layer takes that transcribed text and generates a response. You want fast time-to-first-token here, so models like Claude Haiku or GPT-4.1 in streaming mode. And the system prompt is doing enormous work — it's defining the scenario, the persona, the pacing. Job interview practice feels completely different from practicing a phone call to make a doctor's appointment, and that distinction lives entirely in the prompt.
**Text-to-speech** converts the LLM's response back to audio. ElevenLabs is the quality benchmark, supports real-time streaming so you don't wait for the full response to generate before audio starts playing.
Now here's the engineering crux of the whole thing: **end-to-end latency**. Stack it up: 200ms for speech recognition, 300ms for the model's first token, 200ms for the first audio chunk from TTS. You're at 700ms before anything goes wrong. And 700ms is right at the edge of what feels natural. Any network hiccup, any model load spike, and you cross 1 second — which is perceptible to basically everyone and breaks the rhythm of conversation.
There's also a subtle but critical piece called **Voice Activity Detection** — knowing when the user has actually finished speaking before you fire off the recognition. Too sensitive and the AI interrupts mid-sentence. Too slow and you've added unnecessary latency. WebRTC has this built in, and there's also an open-source model called Silero VAD that's highly accurate.
And conversation memory — the LLM has no state between calls, so the application has to manage the conversation history manually. The practical approach: keep the last several turns verbatim, summarize older context to reduce token count, and use a vector database for anything truly important to remember.
Let me leave you with the three things that actually matter here.
First: **latency is the product**. Not in a buzzword sense — literally, if response delay crosses 800 milliseconds, users feel it, and the illusion of conversation collapses. Every architectural decision in this stack traces back to that constraint.
Second: **visual realism beats feature count**. Users don't care how many settings there are. They care whether it feels like talking to someone. That's the bar.
Third: **the system prompt is the real differentiator**. The infrastructure is table stakes — anyone can assemble WebRTC plus Deepgram plus Claude plus ElevenLabs. What makes one product better than another is whether the AI actually behaves naturally in conversation. That lives in the prompt engineering, and it's harder than it looks.
🇹🇼 中文
有時候一個產品的起點,不是市場調查報告,而是一封粉絲信。
一個技術 YouTuber 收到留言:「我很想問你問題,但我有社交焦慮,連打字都緊張,更不用說真人通話。」這句話讓他決定不寫一篇建議文,而是直接動手做一個產品——AI 驅動的視訊通話練習平台。
---
先講問題本身。社交焦慮影響全球大概 7% 的成人,核心是對「被評判」的過度恐懼。傳統解法是認知行為治療裡的「曝露療法」,概念是對的——讓人逐漸習慣焦慮的情境——但現實很骨感:治療師貴、等待期長、而且你不可能每天找真人來練習說話。
AI 視訊通話的假設很直接:給一個低風險的練習場。說錯了可以重來,對面沒有人在評判你。
---
這類產品要成立,有四個技術問題要解。
**第一,視訊串流。** 用 WebRTC,瀏覽器原生支援,不用裝 app。但自己架信令伺服器和 TURN 伺服器成本不低,indie developer 比較實際的選擇是用 Daily.co 或 Livekit 這類託管服務,前者按用量計費,後者開源可自架。
**第二,AI 要有臉。** 這是最有趣的部分,解法分三層。最簡單是靜態頭像加聲音,實作容易但體驗很差。中間層是 2D 虛擬形象加口型同步,成本低,但看起來比較像虛擬 YouTuber。最接近真人的方案是 Tavus 或 HeyGen 這類服務——你上傳一段真人影片作為基底,API 即時生成對應文字的說話視訊流,延遲大概 300 到 800 毫秒。對 indie developer 的建議是:先用最簡單的方案驗證用戶是否買單,有人用了再升級。
**第三,即時語音識別。** 用戶說話要轉成文字才能送給 AI 處理。Whisper 準確率高但即時串流延遲偏高,Deepgram 專為串流優化,延遲可以壓到 200 毫秒以下,是這個場景的首選。
**第四,語音合成。** AI 回應文字要轉成自然的語音。ElevenLabs 音質最好,支援即時串流;OpenAI 自己的 TTS 也夠用,API 也簡單。
---
這四塊串在一起之後,最難的工程問題浮現了——端到端延遲。
算一下:語音識別大概 200 毫秒,LLM 生成首字 300 毫秒,TTS 首段語音 200 毫秒,加起來大概 700 毫秒。這已經接近可接受的邊界了。一旦任何環節出現波動,就會超過 1 秒,而超過 1 秒之後對話節奏就斷了——用戶感受得到,而且很明顯。
除了延遲,還有兩個細節值得提。
一個是 VAD,也就是靜音偵測。系統要能判斷「用戶說完了」才觸發識別,太早觸發會打斷用戶,太晚又讓延遲上升。WebRTC 內建 VAD,或者用 Silero VAD 這個開源方案,準確率更高。
另一個是對話記憶。LLM API 是無狀態的,你要自己把對話歷史帶進去。策略是前幾輪完整保留,更早的對話用摘要壓縮,重要資訊可以存進向量資料庫。
---
最後講市場現況。2025 年的數據顯示,AI 陪伴應用全球下載量已超過 2.2 億次,相關 app 數量兩年內成長了 700%,有公司靠虛擬寵物社交陪伴拿到千萬美元 A 輪。這個需求是真實的,但競爭也在快速加劇。
對 indie developer 來說,差異化的關鍵不是做通用 AI 聊天,而是聚焦特定場景——只做面試練習,或只解決電話焦慮——深度優於廣度。
---
總結三件事:
第一,延遲是對話類 AI 產品的第一工程問題,800 毫秒是心理臨界值,過了就很難挽回。
第二,視覺真實感比功能清單重要,用戶問的是「這看起來像真實對話嗎」,不是「你有哪些功能」。
第三,system prompt 是這類產品真正的競爭力所在——讓 AI 在對話中表現自然,比任何基礎設施優化都更關鍵,也更難被複製。
Tags
Related Articles
AlphaProof: DeepMind's Neurosymbolic AI That Solved Olympic Math Problems
DeepMind's AlphaProof combines a language model with AlphaZero-style reinforcement learning to produce fully machine-verifiable mathematical proofs — achieving silver-medal level at the 2024 International Mathematical Olympiad.
MCP in Claude Code: How Model Context Protocol Connects AI to Your Tool Ecosystem
MCP (Model Context Protocol) is an open protocol designed by Anthropic that lets Claude Code call external tools and data sources through a standardized interface. Since its November 2024 release, it has rapidly become the de facto standard for AI agent tool integration, adopted by Cursor, Windsurf, and 40+ other editors.
How AI Reshapes How You Think: The Cognitive Shift Beyond the Tool
AI tools change more than your speed — they change how you think. The shift from 'how to do it' to 'what to do' and 'is this right?' has real long-term implications for engineers.