Language as Digital Infrastructure: Enabling Inclusive AI Across Communities
Executive Summary
This panel discussion explores the multi-layered architecture required to deploy voice AI at production scale in India, covering infrastructure, orchestration, policy, and adoption challenges. The speakers emphasize that voice AI success in India requires addressing not just model quality, but also telecom infrastructure, regulatory compliance, multilingual support, and user trust—creating a holistic ecosystem rather than isolated technological solutions.
Key Takeaways
- Voice AI in India requires thinking beyond the model. Infrastructure, telecom integration, edge computing, policy compliance, and user trust design are as important as LLM quality. Success is an ecosystem play.
- Declare that you're AI; don't try to fake humanity. Users care more about problem-solving consistency and transparency than perfect human mimicry. Building trust through trustworthy behavior (solving problems, being responsive, being transparent) is the unlock for adoption at scale.
- Data responsibility is not a compliance checkbox; it's a competitive advantage. Organizations that adopt proactive duty-of-care frameworks (auditability, deletion pipelines, purpose limitation) will build user trust faster and face fewer regulatory surprises than those treating policy as a burden.
- Multilingual India demands platform-level orchestration, not model parity. No single off-the-shelf model handles code-switching, dialect sensitivity, and cost optimization simultaneously. Purpose-built orchestration platforms that route requests intelligently across multiple models are the missing layer.
- Success metrics must tie to business impact and user retention. Leading indicators (e.g., customer effort score (CES), repeat call rates, problem-solving consistency, latency) are easier to track than long-term ROI. Start with high-volume, low-stakes use cases; measure obsessively; iterate continuously.
Key Topics Covered
- Infrastructure & Telephony Layer: GPU compute sizing, edge data centers, cloud scalability, PSTN vs. data network integration, redundancy requirements
- Orchestration & Multi-Model Coordination: Language-aware routing, code-switching, multilingual prompt engineering, human-in-the-loop workflows
- Policy, Governance & Data Protection: DPDP compliance, consent frameworks, voice as biometric data, sectoral regulations (BFSI/RBI), liability attribution
- Adoption Barriers & User Trust: Thinking vs. typing gap, conversational design, behavioral retention, ROI measurement, observability/feedback loops
- Multilingual Challenges in India: Code-switching, dialect diversity, language bias blind spots, cost sensitivity per minute
- Edge Computing & Sovereignty: Sovereign cloud requirements, data residency, compliance with national data protection laws
- Failure Modes & Testing: Detailed logging/tracing, regression testing, impact of prompt changes, graceful degradation
Key Points & Insights
- Voice AI isn't just a model problem; it's an infrastructure ecosystem problem. Sunil Gupta emphasizes that telecom network quality, latency, concurrency handling, and GPU compute sizing are as critical as LLM performance. At scale, a single network failure or dropped call pattern can cascade to millions of users.
- Telecom network architecture in India requires dual-path solutions. PSTN networks (for feature phones and landlines) and IP/data networks (for smartphones) operate under different constraints. Voice AI must support both seamlessly, with intelligent call routing via voice gateways.
- Edge data centers are no longer theoretical; they're essential for real-time voice AI. Conversational systems are latency-sensitive enough that routing every call to a central data center becomes impractical; processing must happen at the edge to deliver conversation-like responsiveness within milliseconds.
- Multilingual orchestration is fundamentally different from translation. Code-switching (mixing languages within a single sentence), language-specific weights in speech patterns, and context-aware model selection require platform-level intelligence, not just tokenization of translated text.
- Consent is the beginning, not the end, of data responsibility. Deepika Malletti argues that voice is biometric data; consent alone doesn't satisfy duty of care. Deletion pipelines, purpose limitation, collection minimization, and auditability are non-negotiable.
- A policy-first mindset beats a compliance-checkbox mentality. The framing should shift from "I must take consent because required" to "I must create a safe experience because I value my user." This mindset change unlocks more meaningful and sustainable compliance.
- Trust barriers in voice AI adoption are multifaceted. Users have unclear expectations about bot capabilities, distrust unnatural-sounding speech, and harbor anxiety about talking to systems. Building trust requires humane conversation design, empathy, and transparent problem-solving.
- Cost per minute is a critical metric for Indian enterprises. Unlike Western enterprises, Indian businesses operate with extreme cost sensitivity (e.g., 3 rupees/min vs. 2.8 rupees/min is significant). Model efficiency, token optimization, and cheap model alternatives are strategic priorities.
- Observability and continuous feedback loops drive retention. Voice AI products fail silently without deep tracing across all system layers (speech-to-text, LLM, text-to-speech, telephony). Identifying leading indicators (e.g., repeat call rates) and discovering problems at scale both require sophisticated monitoring.
- The next 12–18 months are critical. DPDP enforcement begins May 2025, sectoral regulations (BFSI) are emerging, and model diversity is accelerating. Organizations must prepare now, not after problems surface.
Notable Quotes or Statements
- Sunil Gupta: "Once you move from prototype to scale, the stack starts to matter in ways you didn't anticipate. Latency, multilingual support, infra, evals, governance frameworks: all of that keeps what's being deployed."
- Sunil Gupta: "The telecom part plays as much a bigger role as the model part... voice is so much latency-sensitive... you may not have time for your call to go from Gujhati to a data center in Bombay and back."
- Matria Vag: "We don't believe there will be one model to rule them all situation ever in the future... each call needs to run on different models for all these layers."
- Deepika Malletti: "Voice is biometric. It's very different from any other form of data... consent is only the beginning. It is not by any means the end of the responsibility."
- Deepika Malletti: "If we were to approach this whole problem not as 'I need to take consent because I'm required to take consent' but 'I need to create a safe experience for my user because that's how I'm creating trust and valuing my user,' we may actually do a lot of things much more meaningfully."
- Subhash Mukharji: "The thinking versus the typing gap... you're thinking in an Indic language in your mother tongue and you're trying to type using an alien English keyboard in Latin... voice is important to provide that freedom of expression."
- Subhash Mukharji: "Retention is defined when what you expect the app to solve, it solves that. When that happens, retention happens... it really boils down to consistency in problem solving."
- Matria Vag: "Once you go beyond 'why is this voice AI calling me,' people are happy to have the conversation... people are now understanding that yes, they can have a nice conversation with an AI if it is solving the problem for them."
Speakers & Organizations Mentioned
| Speaker | Role / Organization | Focus Area |
|---|---|---|
| Sunil Gupta | Co-founder & CEO, Yota | Infrastructure, sovereign cloud, GPU compute, telecom integration |
| Matria Vag | Founder, Bolna | Voice AI orchestration, multilingual platforms, enterprise workflows |
| Deepika Malletti | Chief of Policy & Partnership, AXA Foundation | Policy, governance, DPDP compliance, data protection, sectoral frameworks |
| Subhash Mukharji (Shiddhut) | Head of Demand Engineering, Misho | Adoption barriers, user trust, behavioral retention, observability, ROI measurement |
Institutions/Frameworks Mentioned:
- Digital Personal Data Protection (DPDP) Regime – enforcement May 2025
- AI Governance Framework – released, non-binding guidance
- BFSI/RBI Sectoral Guidelines – banking & financial services regulations
- PSTN (Public Switched Telephone Network) – legacy telecom infrastructure
Technical Concepts & Resources
Infrastructure & Systems Architecture
- PSTN (Public Switched Telephone Network) – legacy phone system for feature phones and landlines
- Data networks – IP-based networks carrying voice as packets (WhatsApp, Signal, Telegram calls)
- Voice gateway – identifies caller origin (PSTN vs. data network) and routes appropriately
- GPU compute sizing – pre-calculation of concurrent capacity needed for production scale
- Cloud autoscaling – dynamic capacity adjustment (burst on demand, scale down at off-peak)
- Active-active redundancy – all system components duplicated; failure of one doesn't break service
- Edge data centers – distributed compute at geographic edges to minimize latency in real-time conversations
- Sovereign cloud – cloud infrastructure physically and legally within national boundaries; compliant with local data residency laws
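The GPU compute sizing item above can be made concrete with a rough back-of-the-envelope calculation. The sketch below applies Little's law (average concurrency = arrival rate × average call duration) and then adds headroom. The function names, the traffic figures, and the `streams_per_gpu` value are illustrative assumptions, not numbers from the panel; real per-GPU capacity depends on the models and hardware in use.

```python
import math


def concurrent_calls(calls_per_hour: float, avg_call_seconds: float) -> float:
    """Little's law: average concurrency = arrival rate x average duration."""
    return calls_per_hour / 3600.0 * avg_call_seconds


def gpus_needed(peak_concurrency: float, streams_per_gpu: int, headroom: float = 0.25) -> int:
    """GPUs required to serve the peak with a safety margin on top.

    streams_per_gpu is an assumed benchmark result: how many simultaneous
    calls one GPU can serve within the latency target.
    """
    if streams_per_gpu <= 0:
        raise ValueError("streams_per_gpu must be positive")
    return math.ceil(peak_concurrency * (1.0 + headroom) / streams_per_gpu)


# Hypothetical peak hour: 36,000 calls averaging two minutes each.
peak = concurrent_calls(calls_per_hour=36_000, avg_call_seconds=120)
print(f"peak concurrency: {peak:.0f} calls")
print(f"GPUs to provision: {gpus_needed(peak, streams_per_gpu=40)}")
```

With active-active redundancy, the provisioned count would typically be applied per site rather than split across sites, so that either site can carry the full load alone.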
Multilingual & NLP Processing
- Code-switching – seamless mixing of multiple languages in a single utterance (Hindi + English, e.g., "tinchar patch")
- Speech-to-text (STT) → LLM → Text-to-speech (TTS) – full pipeline
- Language detection – identifying speaker's preferred language from first few sentences
- Language-specific weights – understanding how speakers weight multiple languages (e.g., numbers in English even in Hindi conversation)
- Dialect diversity – 700+ active dialects in India; differences between Hindi in Bihar vs. Rajasthan, Tamil in Madurai vs. Madras
- Language bias blind spots – systems trained on one language variant may systematically deny loans/services to speakers of other variants
- Dynamic model routing – orchestration layer selects optimal STT, LLM, and TTS model per call based on language, cost, quality trade-offs
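One way to picture dynamic model routing is as a filter-then-minimize step per pipeline layer. The sketch below is a minimal illustration under stated assumptions: the model names, prices, and quality scores are hypothetical placeholders, and the rule "cheapest model that covers every detected language and clears a quality floor" is one plausible policy, not the routing logic of any platform named in the discussion.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ModelOption:
    name: str              # hypothetical identifier, not a real product
    layer: str             # "stt", "llm", or "tts"
    languages: frozenset   # language codes this model is assumed to cover
    cost_per_min: float    # rupees per minute, illustrative only
    quality: float         # 0..1 score from your own evaluations


def pick_model(
    options: Iterable[ModelOption],
    layer: str,
    languages_needed: Iterable[str],
    min_quality: float = 0.0,
) -> Optional[ModelOption]:
    """Return the cheapest model for a layer that covers every needed language."""
    needed = frozenset(languages_needed)
    candidates = [
        m for m in options
        if m.layer == layer and needed <= m.languages and m.quality >= min_quality
    ]
    if not candidates:
        return None  # caller decides: relax the floor or escalate to a human
    return min(candidates, key=lambda m: m.cost_per_min)


CATALOG = [
    ModelOption("stt-hinglish", "stt", frozenset({"hi", "en"}), 0.40, 0.85),
    ModelOption("stt-english-only", "stt", frozenset({"en"}), 0.20, 0.90),
    ModelOption("stt-premium-multi", "stt", frozenset({"hi", "en", "ta"}), 0.90, 0.95),
]

# A code-switched Hindi + English call needs a model that covers both languages.
choice = pick_model(CATALOG, "stt", {"hi", "en"}, min_quality=0.8)
print(choice.name if choice else "no model available")
```

Requiring coverage of the full language set, rather than only the dominant language, is what keeps a code-switched call from being routed to a cheaper single-language model.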
Models & Inference
- Eleven Labs – high-quality TTS (modern English)
- Cartesia – TTS alternative
- Serv – cheaper TTS option
- Token optimization – minimizing input/output tokens to reduce cost-per-call (critical in cost-sensitive Indian market)
- Model diversity – newer, smaller indie models from universities and labs competing on specific metrics (cheapest, fastest, best-for-specific-domain)
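To see why a difference as small as 3 rupees versus 2.8 rupees per minute matters, it helps to write the cost-per-minute arithmetic down. The sketch below is illustrative only: the function names are hypothetical, and the monthly volume of one million minutes is an assumption chosen to make the arithmetic easy to follow, not a figure from the discussion.

```python
def llm_cost_per_min(tokens_in: float, tokens_out: float,
                     price_in_per_1k: float, price_out_per_1k: float) -> float:
    """LLM cost for one minute of conversation from token counts and per-1k prices."""
    return tokens_in / 1000 * price_in_per_1k + tokens_out / 1000 * price_out_per_1k


def total_cost_per_min(stt: float, llm: float, tts: float, telephony: float) -> float:
    """Per-minute cost of a call is the sum of the per-minute cost of each layer."""
    return stt + llm + tts + telephony


def monthly_saving(rate_before: float, rate_after: float, minutes_per_month: float) -> float:
    """Rupees saved per month when the blended per-minute rate drops."""
    return round((rate_before - rate_after) * minutes_per_month, 2)


# 3.0 vs 2.8 rupees/min is the example from the discussion; the volume is assumed.
print(monthly_saving(3.0, 2.8, 1_000_000))  # 200000.0
```

Because the LLM term scales with token counts, trimming prompt and response length lowers the blended rate directly, which is the practical meaning of token optimization here.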
Operational & Testing
- Detailed log tracing – capturing decision path at each layer (STT, LLM, TTS, telephony) to diagnose failures
- Regression testing – running 1,000–100,000+ test calls internally before production to catch unintended consequences of prompt/config changes
- Observability platforms – tools enabling real-time monitoring and continuous problem discovery
- Leading indicators – proxy metrics easier to measure than long-term ROI (e.g., CES, repeat call rate, problem-solving consistency, latency)
- A/B testing – comparing voice AI enabled vs. disabled (or multiple approaches) to measure business lift (conversion, customer effort score, etc.)
- Human-in-the-loop escalation – intelligent transfer to human agents when AI detects task failure, without explicit user request
- Deep fake liability gap – risk of fraud/impersonation via synthetic voice; liability chains unclear
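Detailed log tracing across layers can be sketched as one trace object per call that records a timed span for each stage. This is a minimal standard-library illustration; `CallTrace` and its fields are hypothetical names, and a production system would more likely use a dedicated tracing framework.

```python
import time
from contextlib import contextmanager


class CallTrace:
    """Collects one timed span per pipeline layer for a single call."""

    def __init__(self, call_id: str):
        self.call_id = call_id
        self.spans = []

    @contextmanager
    def span(self, layer: str, **attrs):
        start = time.perf_counter()
        status = "ok"
        try:
            yield attrs  # the caller may add attributes while the stage runs
        except Exception as exc:
            status = f"error: {type(exc).__name__}"
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.spans.append({**attrs, "layer": layer, "status": status, "ms": elapsed_ms})

    def failed_layers(self):
        """Layers whose span ended in an error, in execution order."""
        return [s["layer"] for s in self.spans if s["status"] != "ok"]


trace = CallTrace("call-001")
with trace.span("stt", model="stt-hinglish") as attrs:
    attrs["detected_language"] = "hi-en"  # decision recorded alongside timing
try:
    with trace.span("llm"):
        raise TimeoutError("upstream model timed out")  # simulated failure
except TimeoutError:
    pass
print(trace.failed_layers())
```

Recording the span in a `finally` clause means a failing stage still leaves a record, which is what makes silent failures diagnosable after the fact.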
Policy, Governance & Compliance
- DPDP (Digital Personal Data Protection) Act – consent, biometric data handling, purpose limitation, deletion rights; full enforcement May 2025
- AI Governance Framework – India's voluntary (non-binding) guidance for responsible AI; sectoral frameworks (BFSI, healthcare, etc.) expected to emerge
- Graded liability chain – distributed responsibility across multiple parties (cloud provider, model provider, platform orchestrator, enterprise deploying); emphasis on auditability and duty-of-care demonstration rather than punishment
- Data flow mapping – documenting where voice/biometric data flows, who has access, how long retained
- Deletion pipeline – process for purging biometric data post-transaction; as critical as collection pipeline
- Purpose limitation, collection minimization, time limitation – principles for responsible voice data handling
- Localization – storing/processing data locally (sovereign cloud) to give users maximum control and minimize enterprise risk
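A deletion pipeline that respects purpose limitation and time limitation can be sketched as a retention check plus an audit trail. The retention windows, purpose names, and record fields below are hypothetical placeholders; actual retention periods are a legal and policy decision, not something this sketch prescribes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical retention windows per declared purpose.
RETENTION = {
    "support_call": timedelta(days=30),
    "quality_audit": timedelta(days=90),
}


@dataclass(frozen=True)
class VoiceRecord:
    record_id: str
    purpose: str
    collected_at: datetime


def due_for_deletion(record: VoiceRecord, now: datetime) -> bool:
    """A record is purged if its purpose is undeclared or its window has elapsed."""
    limit = RETENTION.get(record.purpose)
    if limit is None:
        return True  # no declared purpose means no basis to keep the data
    return now - record.collected_at >= limit


def run_deletion(records, now: datetime):
    """Return (records kept, audit log entries for records purged)."""
    kept, audit_log = [], []
    for record in records:
        if due_for_deletion(record, now):
            audit_log.append({
                "record_id": record.record_id,
                "purpose": record.purpose,
                "deleted_at": now.isoformat(),
            })
        else:
            kept.append(record)
    return kept, audit_log


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
records = [
    VoiceRecord("r1", "support_call", now - timedelta(days=45)),
    VoiceRecord("r2", "support_call", now - timedelta(days=5)),
    VoiceRecord("r3", "marketing", now - timedelta(days=1)),
]
kept, audit = run_deletion(records, now)
print([r.record_id for r in kept], [a["record_id"] for a in audit])
```

The audit log keeps only identifiers and timestamps, so the evidence of deletion does not itself retain the voice data.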
User Experience & Adoption
- Thinking vs. typing gap – friction users face typing in Indic languages on Latin keyboards; voice removes this friction
- Trust barriers – unclear expectations, unnatural speech, scripted/repetitive conversations, anxiety about bot competence
- Conversational design – humane, empathetic, non-scripted interaction structure (clarify before acting, confirm understanding, probe appropriately, escalate gracefully)
- Call drop patterns – drop-offs are concentrated in the first 10 seconds (users hang up on realizing it's AI), stay low between 10 and 30 seconds (once users have accepted it's AI), and rise gradually thereafter
- Behavioral retention – user continues using voice AI when it consistently solves their problem; consistency > naturalness
- Feed UX friction – voice avoids screen real estate conflict in mobile apps; allows simultaneous browsing + voice interaction (unlike chat UI)
- High-volume, low-stakes use cases – recommended starting point before tackling sensitive/complex problems
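The call drop pattern described above can be checked against your own call logs by bucketing hang-up times. The sketch below is a minimal illustration; the bucket boundaries follow the 10-second and 30-second marks mentioned in the discussion, while the function names and the sample durations are made up for the example.

```python
from collections import Counter

BUCKETS = ("0-10s", "10-30s", "30s+")


def drop_bucket(duration_s: float) -> str:
    """Map the time at which a call ended to one of the three buckets."""
    if duration_s < 10:
        return "0-10s"
    if duration_s < 30:
        return "10-30s"
    return "30s+"


def drop_profile(durations):
    """Share of calls ending in each bucket; empty dict when there is no data."""
    durations = list(durations)
    if not durations:
        return {}
    counts = Counter(drop_bucket(d) for d in durations)
    return {bucket: counts[bucket] / len(durations) for bucket in BUCKETS}


# Made-up call durations in seconds.
print(drop_profile([4, 8, 15, 45, 90, 120, 6, 200]))
```

A large share in the first bucket points at the opening of the call (disclosure wording, voice quality) rather than at problem-solving quality later in the conversation.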
Summary Table: Layer-by-Layer Voice AI Stack
| Layer | Key Owner(s) | Key Challenges | Key Unlocks |
|---|---|---|---|
| Infrastructure & Telecom | Cloud operators, telecom providers, Yota | Dual-network support (PSTN + IP), latency, concurrency, redundancy, cost optimization | Sovereign cloud, edge data centers, auto-scaling contracts |
| Orchestration | Bolna (and similar platforms) | Multilingual routing, code-switching, model diversity, cost-per-call efficiency, human escalation | Multi-model selection framework, intelligent code-switch detection, purpose-built India orchestration |
| Policy & Governance | Regulators, enterprises, AXA Foundation | DPDP compliance, sectoral frameworks, biometric data liability, consent > checkbox mindset | Auditability, deletion pipelines, graded liability chains, duty-of-care demonstration |
| Adoption & User Experience | Enterprises, Misho (and adopters) | Trust barriers, conversational design, behavioral retention, ROI measurement, observability | Problem-solving consistency, transparent AI positioning, leading indicators, feedback loops |
Recommended Next Steps for Stakeholders
For Infrastructure Providers:
- Invest in edge data center rollout across India's geography
- Negotiate auto-scaling SLAs with cloud operators
- Build sovereign cloud offerings compliant with DPDP
For Orchestration/Platform Builders:
- Implement multi-model selection logic with language detection at call start
- Build deletion pipelines and data lineage tracking
- Develop regression testing suites for prompt/config changes
- Create detailed observability dashboards capturing all pipeline layers
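A regression suite for prompt or configuration changes can be reduced to a simple comparison: run the same test scenarios against the current and the candidate configuration, then block the release if any scenario that used to pass now fails. The sketch below assumes each run yields a mapping from a scenario ID to a pass/fail flag; the scenario names and the zero-regression default are illustrative assumptions.

```python
def find_regressions(baseline: dict, candidate: dict) -> list:
    """Scenarios that passed under the baseline but fail (or are missing) in the candidate."""
    return sorted(
        scenario for scenario, passed in baseline.items()
        if passed and not candidate.get(scenario, False)
    )


def release_gate(baseline: dict, candidate: dict, max_regressions: int = 0):
    """Return (ok_to_release, regressions) for a prompt or config change."""
    regressions = find_regressions(baseline, candidate)
    return len(regressions) <= max_regressions, regressions


# Hypothetical results from replaying the same test calls under two prompts.
baseline = {"hindi_greeting": True, "hinglish_numbers": True, "escalation": False}
candidate = {"hindi_greeting": True, "hinglish_numbers": False, "escalation": True}

ok, regressions = release_gate(baseline, candidate)
print(ok, regressions)
```

Comparing per scenario rather than by overall pass rate matters here: in the example both runs pass two of three scenarios, yet the candidate still breaks something that previously worked.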
For Enterprises/Adopters:
- Start with high-volume, low-stakes use cases (customer support, recruitment screening, etc.)
- Define leading indicators early (CES, repeat call rate, problem-solving consistency)
- Implement detailed logging and continuous feedback loop infrastructure
- Design conversational workflows around user clarification, confirmation, and graceful escalation
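As one example of a leading indicator, repeat call rate can be computed directly from call logs. The sketch below uses one possible definition: the share of calls followed by another call from the same caller within a time window, read as a sign that the first call did not solve the problem. The two-day window and the log format are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def repeat_call_rate(calls, window: timedelta = timedelta(days=2)) -> float:
    """Share of calls followed by another call from the same caller within the window.

    calls is an iterable of (caller_id, timestamp) pairs.
    """
    by_caller = defaultdict(list)
    for caller_id, timestamp in calls:
        by_caller[caller_id].append(timestamp)

    total = repeats = 0
    for stamps in by_caller.values():
        stamps.sort()
        for i, timestamp in enumerate(stamps):
            total += 1
            if i + 1 < len(stamps) and stamps[i + 1] - timestamp <= window:
                repeats += 1
    return repeats / total if total else 0.0


# Made-up call log: caller A calls back the same day, then again much later.
log = [
    ("A", datetime(2025, 3, 1, 10, 0)),
    ("A", datetime(2025, 3, 1, 18, 0)),
    ("A", datetime(2025, 3, 10, 9, 0)),
    ("B", datetime(2025, 3, 1, 11, 0)),
]
print(repeat_call_rate(log))  # 0.25
```

A falling repeat call rate alongside a stable call volume is the kind of early signal that problem-solving consistency is improving, well before long-term ROI can be measured.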
For Policymakers & Regulators:
- Finalize sectoral guidance (BFSI, healthcare, education) before May 2025 DPDP enforcement
- Clarify liability allocation in multi-party voice AI ecosystems
- Encourage proactive duty-of-care culture over retroactive punishment
- Support sovereign cloud infrastructure development
Critical Timeline
- May 2025 – DPDP Act fully enforced; all voice data handling must comply
- Next 12–18 months – Sectoral regulations emerge; edge data center deployments accelerate; model diversity increases; adoption at scale becomes feasible
- Beyond 2025 – Winner-takes-most consolidation likely among orchestration platforms; voice AI becomes primary channel for citizen-government, customer-business, and peer-to-peer interaction in India
