Whose Language, Whose Model? Public-Interest Multilingual LLMs

Executive Summary

This AI summit panel addresses the critical gap between AI development (dominated by industry and the Global North) and meaningful participation by affected communities, civil society, and experts from the Global South. The speakers argue that meaningful multistakeholder engagement must occur across the entire AI lifecycle—from data collection and model design through deployment and post-deployment evaluation—with particular emphasis on preserving linguistic and cultural identity in low-resource language contexts.

Key Takeaways

  1. The participation framework matters more than the participation act itself: Meaningful engagement requires clear governance mechanisms specifying which stakeholders engage at which lifecycle stages, with defined power to shape outcomes—not blanket "stakeholder consultations."

  2. Low-resource languages expose systemic injustice: Quechua speakers, indigenous communities, and other marginalized-language speakers face accelerated cultural erasure through LLMs. This is not a niche issue—it reveals how participatory AI development is inseparable from human rights and cultural sovereignty.

  3. Regulatory timelines are collapsing: The window to move from voluntary commitments to binding standards is 2-3 years. After that, path dependency and market consolidation will make meaningful participation much harder to enforce, repeating social media's regulatory failures.

  4. Data ownership and community control are non-negotiable: Communities must transition from data providers to data owners and controllers. This structural change—not just consultation—determines whether AI systems serve or harm affected populations.

  5. Technical architecture is political: Choices between frontier models vs. regional models, human-in-the-loop vs. machine-in-the-loop, and centralized vs. distributed governance are not neutral technical decisions. They enable or constrain participation.

Key Topics Covered

  • Regulatory trajectories and governance gaps: How AI governance has stalled at voluntary commitments, ignoring the lessons of social media regulation
  • Meaningful participation vs. "box-checking": Defining authentic stakeholder engagement versus performative consultation
  • Power distribution and community-centered design: Shifting from "human-in-the-loop" to "machine-in-the-loop" approaches
  • Language as identity and cultural preservation: Risks to indigenous and low-resource languages in LLM development
  • Data governance and ownership: Who controls, curates, and owns datasets used for model training
  • Evaluation and accountability mechanisms: Who determines success metrics and has authority to reject models
  • Incentive structures for inclusive AI: Regulatory, financial, and market-based approaches
  • Intersectional risks: Caste-based hate speech, misrepresentation of marginalized communities, colonial power dynamics
  • Technical architecture alternatives: Smaller/regional models as participatory alternatives to frontier models
  • Trust and safety in low-resource contexts: Mechanisms for content moderation in underserved languages

Key Points & Insights

  1. Regulatory capture and timeline urgency: The field has spent too long in voluntary commitment phases (as with social media). Hard regulation—whether standards, laws, or mandatory processes—is needed within the next 2-3 years to prevent repeating the platform accountability failures of the past two decades.

  2. Power as the central question: Meaningful participation is fundamentally about power distribution—who makes decisions, whose expertise is valued, and who bears the consequences. Without shifting power dynamics, participation remains extraction.

  3. Communities as insiders, not external stakeholders: Effective engagement requires positioning affected communities as co-creators and decision-makers from data collection onwards, not as consultants reviewing finished systems. This is a structural change, not a procedural one.

  4. Language preservation as an urgent justice issue: The Quechua case study (10 million speakers across South America) reveals that LLMs can accelerate linguistic extinction by mediating expression through dominant-language models, raising fundamental questions about cultural sovereignty and identity.

  5. Data is where bias and fairness originate: Multistakeholder participation must prioritize the data phase of the AI lifecycle. Representatives (students, civil society, local government, community groups) need coordination and ongoing involvement as languages and contexts evolve—not one-off consultations.

  6. Self-determination and refusal rights: Some communities may not want to be represented in AI datasets at all. Informed choice and the right to refuse participation is as important as inclusion. Data ownership, not just data provision, should be community-controlled.

  7. Context and cultural norms cannot be retrofitted: The "broccoli problem" (culturally irrelevant recommendations) and broader failures of one-size-fits-all models reveal that context-awareness requires participatory design from inception, not post-hoc evaluation.

  8. Evaluation metrics are political: Standard-setting bodies (ISO/IEC 42001 was cited) and benchmarks must include marginalized communities in defining what "success" means. Who asks the evaluation questions determines whose interests are served.

  9. Coordination costs are real but manageable: Engaging multiple stakeholders (students, local governments, advocacy groups) at multiple lifecycle stages creates administrative burden, but this has been successfully managed in other domains and is the cost of legitimate AI development.

  10. Economic incentives must align with inclusivity: Profit-maximizing models alone cannot fund development for rare diseases or small-language speakers. Funding from nonprofits, impact funds, and value-aligned hiring are necessary to sustain participatory development.


Notable Quotes or Statements

  • Aliya Bhatia (Center for Democracy and Technology): "Conversations are being led and centralized in the Global North, in industry or government-to-government spaces, losing out on valuable expertise from civil society and those most affected by these technologies."

  • Jhalak Kakkar (Centre for Communication Governance, NLU Delhi): "If we don't make that shift over the next couple of years from voluntary commitments into harder, more structured forms of regulation, we will see a repetition of what we've seen with social media where platforms ran free with no mechanism to hold them truly accountable."

  • Dhanaraj Thakur: "When we talk about participation and collaboration, we're in effect talking about how power is distributed and what kinds of decisions are made by who." (Referencing the principle of "putting first those considered last," from development literature.)

  • On Quechua indigenous language communities: "Communities were concerned about how large language models could potentially enable or continue oppression of their language compared to other languages that were more dominant in the region... This comes down to questions about identity, particularly for cultures not within high-resource European languages."

  • Participant insight: "Community involvement should not be a checkbox activity but a meaningful activity. Communities should not just be providers of datasets but owners and controllers of datasets, giving them space as insiders in the creation of LLMs."


Speakers & Organizations Mentioned

Primary Panelists:

  • Aliya Bhatia — Senior Policy Analyst, Center for Democracy and Technology (Washington, D.C.; nonprofit, nonpartisan)
  • Jhalak Kakkar — Executive Director, Centre for Communication Governance, National Law University Delhi
  • Dhanaraj Thakur — Inaugural Professor and Director, Emerging Technology Initiative, George Washington University Law School

Referenced Organizations & Initiatives:

  • European Center for Not-for-Profit Law — Marlena Wisniak's framework for meaningful engagement (released as a resource)
  • Persistent Systems — AI governance work (represented by Praashal)
  • UK Foreign Office — Policy perspective (represented by Richard Brown)
  • Anthropic — Frontier model company; referenced "broccoli problem" and Bangalore event example
  • Wikipedia — "Machine-in-the-loop" participatory design approach (contrasted with human-in-the-loop)
  • Center for Communication Governance (CCG) — Report on AI safety across lifecycle stages (released next day)

Technical Concepts & Resources

Standards & Frameworks:

  • ISO/IEC 42001 — AI management system standard with 13 broad dimensions for governance
  • Framework for Meaningful Engagement (European Center for Not-for-Profit Law) — Specific steps to foster participation and inclusivity

Research Studies & Reports:

  • Trust and Safety Systems in Low-Resource Languages — Four-report collection by Dhanaraj Thakur et al. examining content moderation in underserved language contexts
  • Quechua Language Study — Empirical investigation of indigenous language community concerns regarding LLM development and cultural preservation

Technical Approaches:

  • Low-resource languages — Languages with limited training data and few high-quality resources; models for them are often trained on synthetic data or poor-quality scanned text
  • High-resource languages — Languages, predominantly European, with abundant training data and well-developed NLP infrastructure
  • Regional/small language models — Alternatives to frontier models that can better accommodate participatory design
  • Human-in-the-loop vs. Machine-in-the-loop — Design paradigm contrasting oversight-after-deployment with community-embedded iterative development
  • Data annotation and curation — Critical lifecycle phase where multistakeholder participation can prevent bias and misrepresentation
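To make the human-in-the-loop vs. machine-in-the-loop contrast above concrete, here is a minimal Python sketch of a machine-in-the-loop annotation workflow: the model proposes labels, but community annotators hold final authority and their decisions become the record. All names here (`MachineInTheLoopQueue`, `Annotation`) are hypothetical illustrations, not tools discussed by the panel.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Annotation:
    """One item in the annotation queue: model suggests, community decides."""
    text: str
    model_suggestion: str
    community_label: Optional[str] = None  # final, authoritative label
    accepted: bool = False                 # did the community accept the suggestion?

@dataclass
class MachineInTheLoopQueue:
    """Machine-in-the-loop: the model assists annotators rather than
    annotators merely auditing a deployed model (hypothetical sketch)."""
    items: List[Annotation] = field(default_factory=list)

    def suggest(self, text: str, model: Callable[[str], str]) -> None:
        # The model only proposes a label; nothing is final yet.
        self.items.append(Annotation(text=text, model_suggestion=model(text)))

    def review(self, idx: int, label: str) -> Annotation:
        # A community annotator's label always wins; we only record
        # whether it happened to match the model's proposal.
        item = self.items[idx]
        item.community_label = label
        item.accepted = (label == item.model_suggestion)
        return item

# Usage: a trivial stand-in "model" that labels everything a greeting.
queue = MachineInTheLoopQueue()
queue.suggest("Allin p'unchay", model=lambda t: "greeting")
item = queue.review(0, label="greeting")
```

The design point is that the community label field, not the model suggestion, is what downstream training would consume; disagreement rates between the two columns are themselves a useful evaluation signal.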

Key Concepts:

  • Multistakeholder governance — Engagement across government, industry, civil society, academia, and affected communities
  • Soft law vs. hard law — Voluntary commitments and standards vs. binding regulation
  • Data sovereignty and self-determination — Community rights to control, refuse, and govern data representation
  • Participatory design lifecycle — Integration of stakeholders from data collection → model design → development → evaluation → deployment → post-deployment iteration

Document Version: Full transcript analysis from AI Summit panel session
Primary Focus: Governance, participation, and equity in multilingual LLM development