How Multilingual AI Bridges the Gap to Inclusive Access
Executive Summary
This session at India AI Summit 2026 emphasized that AI can only serve the public good if it serves all languages and cultures, framing multilingual and culturally contextual AI as both a technical challenge and a democratic imperative. Multiple international speakers demonstrated how academic institutions, government partnerships, and collaborative networks (like ICAN) are building open, equitable multilingual AI systems as alternatives to Big Tech dominance, with concrete applications in agriculture, healthcare, and digital public services.
Key Takeaways
- Multilingual AI is infrastructure, not a feature: Treating language inclusion as a checklist item or marketing claim (without real domain expertise and user validation) actively harms marginalized communities. Language diversity requires data, talent, and iterative community feedback.
- Open models with regional customization > centralized Big Tech solutions: Aperture, Sea Lion, and Bhashini show that publicly funded, open-source models can achieve performance comparable to Meta's and Google's while enabling local control, cultural preservation, and accountability.
- Test in reality or don't deploy: AI models must be validated in the high-stakes contexts (healthcare, governance) where they will actually be used, with real users and feedback loops. "Speaking a language" ≠ "being accurate in that language."
- Academia has a unique, irreplaceable role: Only neutral, publicly funded institutions can invest in unglamorous but essential work: low-resource languages, high-stakes validation, and serving communities with no commercial value.
- Sovereignty and control are the same thing: For nations, communities, and individuals, AI sovereignty means understanding how tools work locally, controlling updates, preserving culture, and refusing to be treated as a data source, not remaining a dependent of Big Tech.
Key Topics Covered
- Linguistic exclusion as a barrier to digital participation and democratic access to AI
- Multilingual model development at scale (open-source models like Aperture)
- India's Bhashini initiative: multilingual AI for 22+ Indian constitutional languages
- Current AI initiative: public-private partnerships for public interest AI with €400M+ funding
- Data collection challenges for low-resource and non-English languages
- Cultural preservation beyond mere language translation (dialects, norms, artifacts)
- Academic responsibility in developing and testing models for high-stakes applications (healthcare, governance)
- Global collaboration frameworks (ICAN, India-Switzerland partnerships, regional models)
- Sovereignty and control over AI tools and datasets
- Implementation science: testing models in real-world contexts with vulnerable populations
Key Points & Insights
- Linguistic exclusion is a persistent democratic barrier: Despite global AI development, 60% of training data is English; 119 of 193 UN member states are represented in none of the major multilingual AI initiatives, creating a "dinner guest vs. menu" problem.
- Data collection remains the foundational bottleneck: Bhashini employed 200+ field workers speaking 22 Indian languages to manually collect monolingual and bilingual corpora: a "brute force" but necessary approach when digital data doesn't exist. This is labor-intensive but ensures community input.
- Performance disparities create life-or-death risks: Dr. Annie's testing of medical AI in Ethiopian Amharic produced biblically influenced nonsense ("thou shalt not eat insulin on a Tuesday") because training data was limited to Bible translations. Mere language inclusion without domain-specific accuracy is dangerous.
- Big Tech's approach is fundamentally flawed for marginalized languages: Large companies rely on scraped and unlicensed data and treat communities as "data" rather than people. Open-source and academic alternatives enable community-led preservation and contextually appropriate training.
- Talent scarcity, not just compute, is a critical bottleneck: Only ~100 people globally have experience building foundation models at scale. Academia must be empowered with supercomputing access to develop talent outside Big Tech, not just create endpoints for commercial models.
- Culture is multi-faceted and requires preservation beyond language: Current AI broadened its focus from "multilingual diversity" to "cultural diversity," including behaviors, norms, digital/physical artifacts, and dialects. Code-switching and dialectal variation must be modeled, not erased.
- Sovereignty requires control over tools and continuous feedback loops: Academic neutrality enables real-world validation (clinical trials, user feedback) that commercial entities cannot provide; sovereignty means understanding how models work in local contexts and controlling updates.
- "Narrow use cases" enable value delivery at scale: Bhashini deliberately launched limited-scope applications (farmer advisories, manuscript digitization) with controlled vocabularies rather than waiting for perfect general models, allowing data contribution loops to improve them over time.
- Regional collaboration (Singapore's Sea Lion, Switzerland's Aperture) demonstrates viability: Multilingual models built on open foundations can be adapted regionally, leveraging economies of scale while preserving local context.
- Implementation science and high-stakes testing are underfunded: Testing AI in war zones, rural clinics, and with vulnerable populations is unglamorous but essential; academia must be ambitious enough to fund these validation efforts.
Notable Quotes or Statements
"AI can only serve the public good if it serves all languages and all cultures. Ensuring multilingual access is therefore not only a technical challenge, it's a democratic imperative." — Swiss delegation statement
"We must fight scale with scale." — Aya Patel, CEO of Current AI (on why public-private partnerships and billions in funding are necessary to compete with Big Tech)
"Big tech won't solve it. There's a big role to play for academia." — Dr. Alex Ilich, AI Center ETH Zurich (on why universities with supercomputing must develop multilingual models)
"They treat individuals and communities as data, whereas they are people and they are not data." — Aya Patel, on the ethical problem with Big Tech's data scraping for multilingual diversity
"The Bible isn't very accurate in medicine." — Dr. Annie Swanson Reid, highlighting that language-only inclusion produces dangerous models when training data is limited to biblical translations
"If any tool comes out, any new model, we can test it... when it breaks we don't just say 'oh this model is bad in this setting,' we really try to get that information and put it back into the model to continuously improve it." — Dr. Annie Swanson Reid, on the Move (Massive Open Online Validation and Evaluation) project
"Sovereignty is control of tools and control of the environment and to understand how these models work in reality." — Dr. Annie Swanson Reid
"When you are at a patient's bedside and you ask questions that are high stakes... you can't rely on these models to make these decisions because they are inequitably inaccurate in the places that need it most." — Dr. Annie Swanson Reid
"We launched [narrow use cases] and built them... without waiting for something to become perfect." — Amitab (Bhashini CEO), on iterative deployment rather than perfectionism
"Access to language and culture is a human right." — Petri Toiviainen, from UN HL Lab advisory experience
Speakers & Organizations Mentioned
Government & Policy Institutions
- Swiss Government (including President's office, Swiss delegation)
- Indian Government (collaborating on Bhashini, funding Current AI)
- Kenyan, Moroccan, and other governments (Current AI partners)
- UN Office of the Secretary-General (HL Lab advisory body)
- Indian Ministry of Earth Sciences
- Indian Council of Social Science Research
- Indian Department of Biotechnology
- Indian Council of Medical Research
Research Institutions & Universities
- ETH Zurich — Aperture model development
- EPFL (École Polytechnique Fédérale de Lausanne) — Aperture co-developer
- NTU Singapore — Sea Lion model (newest ICAN member)
- University of Helsinki / Finnish Supercomputing Center
- Yale University — Dr. Annie Swanson Reid affiliation
- National Taiwan University
- Ellis Network (European Laboratory for Learning and Intelligent Systems)
Initiatives & Consortia
- ICAN (International Computation and AI Network) — linking academic partners across Europe, Africa, Singapore
- Bhashini — India's national language initiative (22 constitutional languages + expansion to 36)
- Current AI — Public-private partnership emerging from French AI summit (€400M initial commitment, €2.5B target)
- Move (Massive Open Online Validation and Evaluation) — Real-world AI validation project
- Light Lab (Laboratory for Intelligent Global Health and Humanitarian Response Technology) — Dr. Annie's lab
Companies/Private Sector
- Google DeepMind, Salesforce (Current AI partners)
- Meta (open-weight models compared to Aperture)
Foundations/Philanthropy
- MacArthur Foundation
- Ford Foundation
- McGovern Foundation
Individual Speakers
- Amitab — CEO, Bhashini initiative
- Aya Patel — CEO, Current AI (joined ~1.5 months before summit)
- Dr. Alex Ilich — Founder and Executive Director, AI Center ETH Zurich; co-founder ICAN
- Petri Toiviainen — Founding member ICAN, Ellis Network, Finnish representative; UN HL Lab experience
- Professor (name partially unclear) — historian at NTU Singapore, Dean of the College of Humanities, Arts and Social Sciences
- Dr. Annie Swanson Reid — EPFL/Yale, leads Light Lab and Move project
- Nina Frey/Katarina Frey — Executive Director, ICAN
- Professor Torsten Schwede — President, Swiss National Science Foundation (SNSF)
Technical Concepts & Resources
AI Models & Systems
- Aperture (70 billion parameters) — the world's first fully open multilingual foundation model (ETH Zurich/EPFL); trained with multilingual data from inception, not English-first
- Sea Lion — Southeast Asian large language model (13 languages, Singapore-funded; versions built on Aperture; aspirations to expand coverage beyond Southeast Asia, including Tamil)
- Bhashini — Multilingual AI framework supporting:
  - Automatic Speech Recognition (ASR) in 22 languages
  - Bidirectional text-to-text translation (22 languages, expanded to 36 for text)
  - Text-to-Speech (TTS) in 22 languages
  - Optical Character Recognition (OCR) in 22 languages
  - Digital vocabulary/dictionary (including proper nouns: places, people, companies)
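The component list above suggests how Bhashini-style building blocks compose into end-to-end services such as spoken farmer advisories. Below is a minimal Python sketch of an ASR → translation → TTS chain; every function name and lookup table is an illustrative stand-in, not Bhashini's actual API.

```python
# Hypothetical sketch of chaining Bhashini-style components
# (ASR -> text-to-text translation -> TTS). All names and the
# tiny lookup tables are illustrative stand-ins, not a real API.

def asr(audio: bytes, lang: str) -> str:
    """Stand-in ASR: pretend the audio decodes to a fixed phrase."""
    transcripts = {"hi": "मौसम कैसा है"}  # "How is the weather"
    return transcripts.get(lang, "")

def translate(text: str, src: str, tgt: str) -> str:
    """Stand-in text translation backed by a toy lookup table."""
    table = {("मौसम कैसा है", "hi", "en"): "How is the weather"}
    return table.get((text, src, tgt), text)

def tts(text: str, lang: str) -> bytes:
    """Stand-in TTS: return placeholder audio bytes."""
    return f"<audio:{lang}:{text}>".encode("utf-8")

def speech_to_speech(audio: bytes, src: str, tgt: str) -> bytes:
    """Compose ASR, translation, and TTS into one pipeline."""
    text = asr(audio, src)
    translated = translate(text, src, tgt)
    return tts(translated, tgt)

result = speech_to_speech(b"...", src="hi", tgt="en")
```

Composing narrow, separately validated components like this is one reason the "narrow use cases first" strategy described elsewhere in the session remains workable at scale.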
Key Technical Challenges Identified
- Data scarcity: Low-resource languages lack digitized text, speech, and domain-specific corpora
- Compute bottleneck: Newest-generation GPUs remain scarce (e.g., ~11,000 at the Swiss National Supercomputing Center); broader distribution needed
- Talent scarcity: ~100 people globally with foundation model training experience
- Benchmark bias: Major AI companies define the benchmarks and typically excel on their own, often cherry-picked, metrics
- Language representation in training data: 60% English, 40% non-English in typical internet-scraped datasets; inadequate for multilingual performance parity
Methodological Approaches
- Field-based data collection: Bhashini employed 200+ workers to collect spoken and written data in-person (monolingual and bilingual corpora)
- Community-centered training: Partnered with 70 research institutes across India; involved communities in data collection and use cases
- Narrow use cases first: Bhashini deployed limited-scope applications (farmer advisory, manuscript digitization) to deliver value while building data feedback loops
- Implementation science & validation: Move project uses clinical trials and real-world deployment to test accuracy in high-stakes contexts (healthcare, governance)
- Frugality-first design: Singapore and Current AI emphasize effective resource use (small amounts of language data, operating from scarcity)
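The data-contribution and validation loops described above (Bhashini's feedback loops, the Move project's continuous evaluation) can be sketched as a simple evaluate-and-requeue cycle. The class, the toy model, and the labelled cases below are all hypothetical.

```python
# Minimal sketch of a validation-and-feedback loop in the spirit of
# the Move project: deploy, evaluate against expert-labelled
# references, and route failures back as new training examples.
# All names and data are illustrative.

from dataclasses import dataclass, field

@dataclass
class FeedbackLoop:
    retrain_queue: list = field(default_factory=list)

    def evaluate(self, model, cases):
        """Run labelled cases through the model; queue every miss."""
        correct = 0
        for prompt, expected in cases:
            answer = model(prompt)
            if answer == expected:
                correct += 1
            else:
                # A miss is data: store the corrected pair for retraining.
                self.retrain_queue.append((prompt, expected))
        return correct / len(cases)

# Toy "model" that only knows one answer.
model = {"dose of drug X?": "10 mg"}.get
cases = [("dose of drug X?", "10 mg"), ("dose of drug Y?", "5 mg")]

loop = FeedbackLoop()
accuracy = loop.evaluate(model, cases)  # 0.5: one of two cases correct
```

The point of the sketch is the queue, not the model: "when it breaks we don't just say this model is bad," the failure becomes the next training example.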
Data & Corpora
- Swiss data: 4 languages, multiple dialects
- Indian constitutional languages: 22 languages (Bhashini); expanding to 36+ including tribal languages without traditional scripts
- South African context: 11 official languages
- Southeast Asian: 13 languages (Sea Lion)
- Critical observation: Most digitized data in non-English languages is Bible translations, limiting domain accuracy
Benchmarks & Metrics
- Traditionally defined by Big Tech; need to be driven by regional/cultural priorities
- Performance parity with English-trained models remains the goal for the next 100+ languages
- Accuracy testing in domain-specific, high-stakes contexts (not just language fluency) is essential
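The benchmark caveats above (vendor-defined aggregate scores can mask exactly the per-language failures that matter most) can be illustrated with a small per-language accuracy breakdown. All numbers below are invented.

```python
# Illustration of why aggregate benchmark scores hide per-language
# disparities: the same model can look strong overall while failing
# in the languages that need it most. All numbers are invented.

from collections import defaultdict

# (language, was_the_answer_correct) pairs from an imagined eval set:
# 100 English items, only 10 Amharic items.
results = ([("en", True)] * 90 + [("en", False)] * 10
           + [("am", True)] * 3 + [("am", False)] * 7)

def per_language_accuracy(results):
    """Break accuracy out per language instead of averaging it away."""
    totals, hits = defaultdict(int), defaultdict(int)
    for lang, ok in results:
        totals[lang] += 1
        hits[lang] += ok
    return {lang: hits[lang] / totals[lang] for lang in totals}

overall = sum(ok for _, ok in results) / len(results)  # 93/110, ~0.85
by_lang = per_language_accuracy(results)               # en: 0.9, am: 0.3
```

An aggregate score near 0.85 would pass most leaderboards while the low-resource language fails seven of ten cases, which is the "inequitably inaccurate in the places that need it most" failure mode quoted above.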
Infrastructure & Governance
- Swiss National Supercomputing Center: 11,000 newest-generation GPUs
- Shared compute access: Academia programs enabling researchers globally to access supercomputing infrastructure
- Indo-Swiss Joint Research Program (JRP): Bilateral funding with three new calls:
  - Geosciences (natural hazards in mountain regions)
  - Social Sciences (societal questions)
  - One Health (interconnected human-animal-environment health)
- Explore, Experiment, Expand grants: SNSF new funding scheme for novel collaborations, blue-sky research, and expansion of established consortia
Policy & Governance Insights
- Geneva AI Summit 2027: Upcoming international forum; Switzerland positioning multilingual AI as foundation for inclusive digital public services
- Paris 2025 AI Summit: Origin point of Current AI and public interest AI working group
- India AI Summit 2026: Current summit; framed as continuation of Paris-to-Geneva arc
- UN Human Rights Framework: Access to language and culture recognized as a human right under international law
- Non-aligned principles: Indian approach to AI governance emphasizes multi-polar world, dispersed sovereignty, and resistance to US/China hegemony
- Data collection urgency in high-stakes domains: Healthcare, conflict zones, and marginalized populations require academic prioritization because commercial entities will not invest
Limitations & Caveats
- Some speaker names and institutional affiliations were unclear or partially inaudible in transcript
- Specific metrics/performance comparisons between Aperture, Sea Lion, and proprietary models not detailed in this session
- Timeline for reaching performance parity across 100+ languages not specified
- Budget/cost figures for Bhashini data collection and other initiatives not fully itemized
- Details on the "device" launching at 3:30 PM withheld for dramatic reveal (not covered in transcript)
