How Multilingual AI Bridges the Gap to Inclusive Access
Executive Summary
This session at India AI Summit 2026 emphasized that AI can only serve the public good if it serves all languages and cultures, framing multilingual and culturally contextual AI as both a technical challenge and a democratic imperative. Multiple international speakers demonstrated how academic institutions, government partnerships, and collaborative networks (like ICAN) are building open, equitable multilingual AI systems as alternatives to Big Tech dominance, with concrete applications in agriculture, healthcare, and digital public services.
Key Takeaways
- Multilingual AI is infrastructure, not a feature: Treating language inclusion as a checklist item or marketing claim (without real domain expertise and user validation) actively harms marginalized communities. Language diversity requires data, talent, and iterative community feedback.
- Open models with regional customization > centralized Big Tech solutions: Aperture, Sea Lion, and Bhashini show that publicly funded, open-source models can achieve performance comparable to Meta's and Google's while enabling local control, cultural preservation, and accountability.
- Test in reality or don't deploy: AI models must be validated in the high-stakes contexts (healthcare, governance) where they will actually be used, with real users and feedback loops. "Speaking a language" ≠ "being accurate in that language."
- Academia has a unique, irreplaceable role: Only neutral, publicly funded institutions can invest in unglamorous but essential work: low-resource languages, high-stakes validation, and serving communities with no commercial value.
- Sovereignty and control are the same thing: For nations, communities, and individuals, AI sovereignty means understanding how tools work locally, controlling updates, preserving culture, and refusing to be treated as a data source, not remaining a dependent of Big Tech.
Key Topics Covered
- Linguistic exclusion as a barrier to digital participation and democratic access to AI
- Multilingual model development at scale (open-source models like Aperture)
- India's Bhashini initiative: multilingual AI for 22+ Indian constitutional languages
- Current AI initiative: public-private partnerships for public interest AI with €400M+ funding
- Data collection challenges for low-resource and non-English languages
- Cultural preservation beyond mere language translation (dialects, norms, artifacts)
- Academic responsibility in developing and testing models for high-stakes applications (healthcare, governance)
- Global collaboration frameworks (ICAN, India-Switzerland partnerships, regional models)
- Sovereignty and control over AI tools and datasets
- Implementation science: testing models in real-world contexts with vulnerable populations
Key Points & Insights
- Linguistic exclusion is a persistent democratic barrier: Despite global AI development, 60% of training data is English; 119 of 193 UN member states are represented in none of the major multilingual AI initiatives, creating a "dinner guest vs. menu" problem.
- Data collection remains the foundational bottleneck: Bhashini employed 200+ field workers speaking 22 Indian languages to manually collect monolingual and bilingual corpora: a "brute force" but necessary approach when digital data doesn't exist. This is labor-intensive but ensures community input.
- Performance disparities create life-or-death risks: Dr. Annie's testing of medical AI in Ethiopian Amharic produced biblically influenced nonsense ("thou shalt not eat insulin on a Tuesday") because training data was limited to Bible translations. Mere language inclusion without domain-specific accuracy is dangerous.
- Big Tech's approach is fundamentally flawed for marginalized languages: Large companies rely on scraped and unlicensed data and treat communities as "data" rather than people. Open-source and academic alternatives enable community-led preservation and contextually appropriate training.
- Talent scarcity, not just compute, is a critical bottleneck: Only ~100 people globally have experience building foundation models at scale. Academia must be empowered with supercomputing access to develop talent outside Big Tech, not just create endpoints for commercial models.
- Culture is multi-faceted and requires preservation beyond language: Current AI broadened its focus from "multilingual diversity" to "cultural diversity," including behaviors, norms, digital/physical artifacts, and dialects. Code-switching and dialectal variation must be modeled, not erased.
- Sovereignty requires control over tools and continuous feedback loops: Academic neutrality enables real-world validation (clinical trials, user feedback) that commercial entities cannot provide; sovereignty means understanding how models work in local contexts and controlling updates.
- "Narrow use cases" enable value delivery at scale: Bhashini deliberately launched limited-scope applications (farmer advisories, manuscript digitization) with controlled vocabularies rather than waiting for perfect general models, allowing data contribution loops to improve them over time.
- Regional collaboration (Singapore's Sea Lion, Switzerland's Aperture) demonstrates viability: Multilingual models built on open foundations can be adapted regionally, leveraging economies of scale while preserving local context.
- Implementation science and high-stakes testing are underfunded: Testing AI in war zones, rural clinics, and with vulnerable populations is unglamorous but essential; academia must be ambitious enough to fund these validation efforts.
Notable Quotes or Statements
"AI can only serve the public good if it serves all languages and all cultures. Ensuring multilingual access is therefore not only a technical challenge, it's a democratic imperative." — Swiss delegation statement
"We must fight scale with scale." — Aya Patel, CEO of Current AI (on why public-private partnerships and billions in funding are necessary to compete with Big Tech)
"Big tech won't solve it. There's a big role to play for academia." — Dr. Alex Ilich, AI Center ETH Zurich (on why universities with supercomputing must develop multilingual models)
"They treat individuals and communities as data, whereas they are people and they are not data." — Aya Patel, on the ethical problem with Big Tech's data scraping for multilingual diversity
"The Bible isn't very accurate in medicine." — Dr. Annie Swanson Reid, highlighting that language-only inclusion produces dangerous models when training data is limited to biblical translations
"If any tool comes out, any new model, we can test it... when it breaks we don't just say 'oh this model is bad in this setting,' we really try to get that information and put it back into the model to continuously improve it." — Dr. Annie Swanson Reid, on the Move (Massive Open Online Validation and Evaluation) project
"Sovereignty is control of tools and control of the environment and to understand how these models work in reality." — Dr. Annie Swanson Reid
"When you are at a patient's bedside and you ask questions that are high stakes... you can't rely on these models to make these decisions because they are inequitably inaccurate in the places that need it most." — Dr. Annie Swanson Reid
"We launched [narrow use cases] and built them... without waiting for something to become perfect." — Amitab (Bhashini CEO), on iterative deployment rather than perfectionism
"Access to language and culture is a human right." — Petri Toiviainen, from UN HL Lab advisory experience
Speakers & Organizations Mentioned
Government & Policy Institutions
- Swiss Government (including President's office, Swiss delegation)
- Indian Government (collaborating on Bhashini, funding Current AI)
- Kenyan, Moroccan, and other governments (Current AI partners)
- UN Office of the Secretary-General (HL Lab advisory body)
- Indian Ministry of Earth Sciences
- Indian Council of Social Science Research
- Indian Department of Biotechnology
- Indian Council of Medical Research
Research Institutions & Universities
- ETH Zurich — Aperture model development
- EPFL (École Polytechnique Fédérale de Lausanne) — Aperture co-developer
- NTU Singapore — Sea Lion model (newest ICAN member)
- University of Helsinki / Finnish Supercomputing Center
- Yale University — Dr. Annie Swanson Reid affiliation
- National Taiwan University
- Ellis Network (European Laboratory for Learning and Intelligent Systems)
Initiatives & Consortia
- ICAN (International Computation and AI Network) — linking academic partners across Europe, Africa, Singapore
- Bhashini — India's national language initiative (22 constitutional languages + expansion to 36)
- Current AI — Public-private partnership emerging from French AI summit (€400M initial commitment, €2.5B target)
- Move (Massive Open Online Validation and Evaluation) — Real-world AI validation project
- Light Lab (Laboratory for Intelligent Global Health and Humanitarian Response Technology) — Dr. Annie's lab
Companies/Private Sector
- Google DeepMind, Salesforce (Current AI partners)
- Meta (open-weight models compared to Aperture)
Foundations/Philanthropy
- MacArthur Foundation
- Ford Foundation
- McGovern Foundation
Individual Speakers
- Amitab — CEO, Bhashini initiative
- Aya Patel — CEO, Current AI (joined ~1.5 months before summit)
- Dr. Alex Ilich — Founder and Executive Director, AI Center ETH Zurich; co-founder ICAN
- Petri Toiviainen — Founding member ICAN, Ellis Network, Finnish representative; UN HL Lab experience
- Professor (name partially unclear) — historian at NTU Singapore, Dean of the College of Humanities, Arts and Social Sciences
- Dr. Annie Swanson Reid — EPFL/Yale, leads Light Lab and Move project
- Nina Frey/Katarina Frey — Executive Director, ICAN
- Professor Torsten Schwede — President, Swiss National Science Foundation (SNSF)
Technical Concepts & Resources
AI Models & Systems
- Aperture (70 billion parameters) — the world's first fully open multilingual foundation model (ETH Zurich/EPFL); trained with multilingual data from inception, not English-first
- Sea Lion — Southeast Asian large language model (13 languages, Singapore-funded; versions built on Aperture; aspirations to expand coverage beyond Southeast Asia, including Tamil)
- Bhashini — Multilingual AI framework supporting:
  - Automatic Speech Recognition (ASR) in 22 languages
  - Bidirectional text-to-text translation (22 languages, expanded to 36 for text)
  - Text-to-Speech (TTS) in 22 languages
  - Optical Character Recognition (OCR) in 22 languages
  - Digital vocabulary/dictionary (including proper nouns: places, people, companies)
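The component list above suggests how Bhashini-style building blocks compose into end-to-end services such as spoken farmer advisories. Below is a minimal Python sketch of an ASR → translation → TTS chain; every function name and lookup table is an illustrative stand-in, not Bhashini's actual API.

```python
# Hypothetical sketch of chaining Bhashini-style components
# (ASR -> text-to-text translation -> TTS). All names and the
# tiny lookup tables are illustrative stand-ins, not a real API.

def asr(audio: bytes, lang: str) -> str:
    """Stand-in ASR: pretend the audio decodes to a fixed phrase."""
    transcripts = {"hi": "मौसम कैसा है"}  # "How is the weather"
    return transcripts.get(lang, "")

def translate(text: str, src: str, tgt: str) -> str:
    """Stand-in text translation backed by a toy lookup table."""
    table = {("मौसम कैसा है", "hi", "en"): "How is the weather"}
    return table.get((text, src, tgt), text)

def tts(text: str, lang: str) -> bytes:
    """Stand-in TTS: return placeholder audio bytes."""
    return f"<audio:{lang}:{text}>".encode("utf-8")

def speech_to_speech(audio: bytes, src: str, tgt: str) -> bytes:
    """Compose ASR, translation, and TTS into one pipeline."""
    text = asr(audio, src)
    translated = translate(text, src, tgt)
    return tts(translated, tgt)

result = speech_to_speech(b"...", src="hi", tgt="en")
```

Composing narrow, separately validated components like this is one reason the "narrow use cases first" strategy described elsewhere in the session remains workable at scale.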
Key Technical Challenges Identified
- Data scarcity: Low-resource languages lack digitized text, speech, and domain-specific corpora
- Compute bottleneck: Newest-generation GPUs remain scarce (e.g., ~11,000 at the Swiss National Supercomputing Center); broader distribution needed
- Talent scarcity: ~100 people globally with foundation model training experience
- Benchmark bias: Major AI companies define the benchmarks and typically excel on their own, often cherry-picked, metrics
- Language representation in training data: 60% English, 40% non-English in typical internet-scraped datasets; inadequate for multilingual performance parity
Methodological Approaches
- Field-based data collection: Bhashini employed 200+ workers to collect spoken and written data in-person (monolingual and bilingual corpora)
- Community-centered training: Partnered with 70 research institutes across India; involved communities in data collection and use cases
- Narrow use cases first: Bhashini deployed limited-scope applications (farmer advisory, manuscript digitization) to deliver value while building data feedback loops
- Implementation science & validation: Move project uses clinical trials and real-world deployment to test accuracy in high-stakes contexts (healthcare, governance)
- Frugality-first design: Singapore and Current AI emphasize effective resource use (small amounts of language data, operating from scarcity)
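The data-contribution and validation loops described above (Bhashini's feedback loops, the Move project's continuous evaluation) can be sketched as a simple evaluate-and-requeue cycle. The class, the toy model, and the labelled cases below are all hypothetical.

```python
# Minimal sketch of a validation-and-feedback loop in the spirit of
# the Move project: deploy, evaluate against expert-labelled
# references, and route failures back as new training examples.
# All names and data are illustrative.

from dataclasses import dataclass, field

@dataclass
class FeedbackLoop:
    retrain_queue: list = field(default_factory=list)

    def evaluate(self, model, cases):
        """Run labelled cases through the model; queue every miss."""
        correct = 0
        for prompt, expected in cases:
            answer = model(prompt)
            if answer == expected:
                correct += 1
            else:
                # A miss is data: store the corrected pair for retraining.
                self.retrain_queue.append((prompt, expected))
        return correct / len(cases)

# Toy "model" that only knows one answer.
model = {"dose of drug X?": "10 mg"}.get
cases = [("dose of drug X?", "10 mg"), ("dose of drug Y?", "5 mg")]

loop = FeedbackLoop()
accuracy = loop.evaluate(model, cases)  # 0.5: one of two cases correct
```

The point of the sketch is the queue, not the model: "when it breaks we don't just say this model is bad," the failure becomes the next training example.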
Data & Corpora
- Swiss data: 4 languages, multiple dialects
- Indian constitutional languages: 22 languages (Bhashini); expanding to 36+ including tribal languages without traditional scripts
- South African context: 11 official languages
- Southeast Asian: 13 languages (Sea Lion)
- Critical observation: Most digitized data in non-English languages is Bible translations, limiting domain accuracy
Benchmarks & Metrics
- Traditionally defined by Big Tech; need to be driven by regional/cultural priorities
- Performance parity with English-trained models remains the goal for the next 100+ languages
- Accuracy testing in domain-specific, high-stakes contexts (not just language fluency) is essential
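The benchmark caveats above (vendor-defined aggregate scores can mask exactly the per-language failures that matter most) can be illustrated with a small per-language accuracy breakdown. All numbers below are invented.

```python
# Illustration of why aggregate benchmark scores hide per-language
# disparities: the same model can look strong overall while failing
# in the languages that need it most. All numbers are invented.

from collections import defaultdict

# (language, was_the_answer_correct) pairs from an imagined eval set:
# 100 English items, only 10 Amharic items.
results = ([("en", True)] * 90 + [("en", False)] * 10
           + [("am", True)] * 3 + [("am", False)] * 7)

def per_language_accuracy(results):
    """Break accuracy out per language instead of averaging it away."""
    totals, hits = defaultdict(int), defaultdict(int)
    for lang, ok in results:
        totals[lang] += 1
        hits[lang] += ok
    return {lang: hits[lang] / totals[lang] for lang in totals}

overall = sum(ok for _, ok in results) / len(results)  # 93/110, ~0.85
by_lang = per_language_accuracy(results)               # en: 0.9, am: 0.3
```

An aggregate score near 0.85 would pass most leaderboards while the low-resource language fails seven of ten cases, which is the "inequitably inaccurate in the places that need it most" failure mode quoted above.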
Infrastructure & Governance
- Swiss National Supercomputing Center: 11,000 newest-generation GPUs
- Shared compute access: Academia programs enabling researchers globally to access supercomputing infrastructure
- Indo-Swiss Joint Research Program (JRP): Bilateral funding with three new calls:
  - Geosciences (natural hazards in mountain regions)
  - Social Sciences (societal questions)
  - One Health (interconnected human-animal-environment health)
- Explore, Experiment, Expand grants: SNSF new funding scheme for novel collaborations, blue-sky research, and expansion of established consortia
Policy & Governance Insights
- Geneva AI Summit 2027: Upcoming international forum; Switzerland positioning multilingual AI as foundation for inclusive digital public services
- Paris 2025 AI Summit: Origin point of Current AI and public interest AI working group
- India AI Summit 2026: Current summit; framed as continuation of Paris-to-Geneva arc
- UN Human Rights Framework: Access to language and culture recognized as a human right under international law
- Non-aligned principles: Indian approach to AI governance emphasizes multi-polar world, dispersed sovereignty, and resistance to US/China hegemony
- Data collection urgency in high-stakes domains: Healthcare, conflict zones, and marginalized populations require academic prioritization because commercial entities will not invest
Limitations & Caveats
- Some speaker names and institutional affiliations were unclear or partially inaudible in transcript
- Specific metrics/performance comparisons between Aperture, Sea Lion, and proprietary models not detailed in this session
- Timeline for reaching performance parity across 100+ languages not specified
- Budget/cost figures for Bhashini data collection and other initiatives not fully itemized
- Details on the "device" launching at 3:30 PM withheld for dramatic reveal (not covered in transcript)
