Reimagining Public Broadcasting in the AI Era

Executive Summary

This panel discussion explores how public broadcasting institutions across the Global South—particularly Africa, India, and Latin America—can become data providers for local language AI model development while maintaining public value, community ownership, and ethical governance. The conversation centers on a proposed governance framework in South Africa that treats broadcast archives as critical national data infrastructure, while addressing the fundamental tension between enabling AI innovation and preventing exploitative extraction of community data.

Key Takeaways

Broadcast archives are underutilized public assets: Well-structured, ethically collected, and already digitized broadcast data can serve as foundational infrastructure for local language AI, but legal frameworks must clarify governance, access rights, and public benefit.
Equity requires explicit mechanisms, not aspirational principles: Revenue sharing, dataset ownership, participatory design, and fair wages must be operationalized through contracts, platforms, and institutions—not left to goodwill. Karya and similar organizations are proving this is technically and economically feasible.
Power over data matters more than access to data: Whether communities participate in AI development depends not on inclusion rhetoric but on who controls problem definition, data curation decisions, benefit distribution, and deployment choices.
Legal and technical work must advance in parallel: Information regulators cannot regulate what laws don't clarify; developers cannot implement what policies haven't defined. South Africa's multi-stakeholder governance model shows how to coordinate this.
Continental cooperation on data transfer is necessary but under-implemented: African Union frameworks exist to enable cross-border data sharing for innovation, but individual countries must domesticate these policies and resolve bilateral negotiations.

Key Topics Covered

Public Broadcasting as Data Infrastructure: Leveraging existing broadcast archives and recordings as foundational datasets for AI training
Governance Frameworks for Data Access: Legal and regulatory mechanisms needed to enable data sharing while protecting privacy and ensuring public benefit
Local Language AI Models: Why multilingual/low-resource language models are critical for equitable AI development in the Global South
Community Ownership & Revenue Sharing: Moving beyond extraction to ensure communities benefit financially and maintain decision-making power over their data
Continental Policy Alignment: How African Union frameworks (data policy, continental free trade strategy) support cross-border data sharing
Technical Challenges in Data Collection: Issues specific to voice data, including anonymization, speaker balance, bias, and multimodal datasets
Participatory AI Development: Centering community involvement across the entire AI value chain, not just as passive beneficiaries
Extractivism vs. Equity: Critiques of how AI development can reproduce colonial data relationships, particularly in Latin America

Key Points & Insights

Data as the Core Constraint: Speakers consistently identified data scarcity, not computational power, as the primary bottleneck for developing AI systems for African and South Asian languages. Broadcasting data represents an underutilized gold mine: structured, high-quality, ethically collected content that already exists at scale.
Legal Gaps Are Acute: Existing data protection laws and freedom-of-information frameworks were designed for paper-based records and are outdated for the AI era. South Africa's Information Regulator specifically called for legislative amendments to clarify algorithmic transparency, proactive data disclosure, and the legal status of automated data access.
Governance Framework Model (South Africa): The proposed framework involves multiple stakeholders working together:
- Public broadcasters (data holders)
- Information Regulator (oversight/compliance)
- Universities (technical implementation)
- Innovation funds (financial support)
- Private/commercial broadcasters (potential partners)
This creates accountability through "public value" and "public purse," ensuring data access serves public interest rather than private profit.
The Compensation Problem: Organizations that are unaware of their data's value often give it away freely; once that value is demonstrated, they demand payment, which creates barriers for research-focused teams. The Masakani hub's experience in Kenya (a one-year negotiation) versus Nigeria (immediate agreement) illustrates how inconsistent access policies are across the continent.
Multimodal Data Complexity: Voice data from broadcasts raises additional challenges beyond text:
- Speaker balance (gender, age, demographics)
- Personal identifier removal (privacy)
- Advertisement/noise filtering
- Bias detection and mitigation
These challenges require both technical and governance solutions.
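The speaker-balance check listed above can be illustrated with a minimal Python sketch. It assumes each broadcast segment already carries metadata, and the field names (`gender`, `duration_s`) and the 20% threshold (`min_share`) are hypothetical choices for illustration, not anything the panel specified.

```python
from collections import defaultdict


def balance_report(segments, attribute, min_share=0.2):
    """Summarise speech time per group for one demographic attribute.

    Returns {group: (share_of_total_speech_time, is_underrepresented)}.
    `min_share` is an illustrative threshold, not a recommended standard.
    """
    totals = defaultdict(float)
    for seg in segments:
        totals[seg[attribute]] += seg["duration_s"]

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}

    return {
        group: (seconds / grand_total, seconds / grand_total < min_share)
        for group, seconds in totals.items()
    }


# Hypothetical segment metadata for three broadcast clips.
segments = [
    {"gender": "female", "duration_s": 30.0},
    {"gender": "male", "duration_s": 90.0},
    {"gender": "male", "duration_s": 80.0},
]

report = balance_report(segments, "gender")
for group, (share, flagged) in sorted(report.items()):
    print(f"{group}: {share:.0%} of speech time, underrepresented={flagged}")
```

A report like this only measures imbalance. Deciding which attributes to record, and who may see them, remains a governance question.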
Community as Active Builders, Not Passive Beneficiaries: Karya's work in India demonstrates that communities can meaningfully participate in complex AI tasks (500K+ human evaluations conducted). The belief that evaluation and annotation are "too complicated for local communities" is unfounded and patronizing. True equity requires:
- Ownership of datasets when possible
- Royalty/revenue sharing mechanisms
- Agency in problem definition and solution design
- Wages reflecting fair value, not poverty-rate compensation
Language ≠ Data: Local language models must be culturally grounded, not just translations or linguistic data. Asking "what's a breakfast idea?" of an LLM trained on culturally generic data will return English-centric suggestions. Masakani's multimodal, culturally grounded approach (50 African languages) addresses this gap.
Power, Not Just Inclusion: Diana Mascara's critical point from Latin America: representation in data alone perpetuates extractivism in new forms. The real question is who defines what data matters, for what purpose, and under what institutional arrangements. Without decision-making power over data collection, curation, and use, "inclusion" becomes another form of appropriation.
Continental Frameworks Exist: The African Union Data Policy Framework, the Convention on Cyber Security and Personal Data Protection (2014), and Resolution 620 on Access to Data (African Commission on Human and Peoples' Rights) already provide principles and cross-border harmonization. The gap is operationalization and domestication at the national level.
Data Sharing Policies Are Fragmented: In India and across the Global South, departmental data-sharing policies require lengthy approvals. Platforms like AI Kosh anonymize and gatekeep sensitive data while making research datasets publicly available. This is a practical middle ground, but access still requires case-by-case navigation.

Notable Quotes or Statements

"Data, data, data—as people say. So this is really what we are hoping will come to fruition in South Africa in the very near term." — Dr. Gilwald (on why data access, not computational power, is the bottleneck for local language AI)

"AI governance requires that there is protection of personal information but there's also accessing of critical databases that are needed to develop AI." — Mukalani Dima, Information Regulator of South Africa (on the necessary balance between privacy and innovation)

"Governance is not merely regulatory or administrative. It is a political topic because every data set, every classification, every optimization decision reflects power relations." — Diana Mascara, Diversa (on why data governance is fundamentally about who holds power)

"The relationship with technology is fundamentally broken because for all of us in the global south, it is always a foreign object. Technology always happens to us instead of happening from us." — Manu (Karya), paraphrasing Fei-Fei Li's critique, on the need for communities to be active builders, not passive beneficiaries

"There's no difference in capability between our communities and the people who currently get to decide how AI is built." — Manu (Karya), on centering community agency in AI development

"If local languages are treated merely as raw material for model training, and without community participation in how data is curated, interpreted, and used, we reproduce extractive dynamics in a new form." — Diana Mascara, on why cultural groundedness + community control are both necessary

Speakers & Organizations Mentioned

Speakers:

Dr. Gilwald (moderator, initiating the South African governance framework proposal)
Mukalani Dima – Information Regulator of South Africa
Priya Chetti – Executive Director, Research ICT Africa (on African Union data policy frameworks)
Tajin Guadab – Masakani Hub (on NLP for African languages, broadcaster collaboration challenges)
Gita Raju – Center for Responsible AI, IIT Madras (on India's AI initiatives)
Manu – Karya (on community ownership and participatory AI development)
Diana Mascara – Diversa (on Latin American critiques of data extractivism)
Ashwin – Karya (on cross-border data transfer and regional language modeling)

Organizations & Institutions:

South Africa: Information Regulator of South Africa, University of Pretoria, Department of Science and Technology (Innovation Fund)
Africa: Research ICT Africa, Masakani/Masakani African Languages Hub, Leu Bethlehem/Leu (partners), Digital Green (Kenya & India), Lepa (South African partner)
India: IIT Madras (Center for Responsible AI), Karya, AI Kosh (India AI Mission platform), AI for Bharat initiative
Governance: African Union, African Commission on Human and Peoples' Rights, G20 (endorsed the South African concept note)
Global: Google (partners on Indian datasets)

Technical Concepts & Resources

Key Datasets & Platforms:

AI Kosh (India AI Mission) – Platform hosting AI-ready, publicly available datasets with anonymization standards for Indian languages
Vani Phase 2 – One of the largest multimodal datasets yet assembled for Indian languages (text, audio, video)
Shunā – Platform for crowdsourced data annotation across Indian languages
AI for Bharat – Initiative hosting models, datasets, and frameworks for adoption across 22 official Indian languages
Masakani Multimodal Datasets – Culturally grounded text, image, and audio data in 50+ African languages

Technical Challenges Addressed:

Anonymization standards: Removing personal identifiers while preserving data utility for LLM training
Speaker balance: Ensuring gender, age, and demographic diversity in voice datasets
Cultural groundedness: Moving beyond translation-based datasets to ensure linguistic and cultural authenticity
Multimodal data collection: Expanding beyond text to egocentric video, images, and contextual metadata
Bias detection in voice models: Identifying and mitigating speaker bias in speech recognition systems
Data preprocessing for broadcast: Filtering advertisements, removing noise, extracting usable speech segments
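As a rough illustration of the broadcast preprocessing and anonymization steps above, the following Python sketch keeps only segments labelled as speech and masks obvious identifiers in their transcripts. It assumes an upstream step has already labelled each segment. The `kind`, `start_s`, `end_s`, and `transcript` fields are hypothetical, and the two regular expressions are deliberately simple examples that would not be sufficient for real anonymization.

```python
import re

# Illustrative patterns only: real identifier removal needs far broader coverage
# (names, addresses, ID numbers) and human review.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def redact(text):
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def usable_speech(segments, min_duration_s=1.0):
    """Drop adverts, noise, and very short clips; redact what remains."""
    kept = []
    for seg in segments:
        duration = seg["end_s"] - seg["start_s"]
        if seg["kind"] != "speech" or duration < min_duration_s:
            continue
        kept.append({**seg, "transcript": redact(seg["transcript"])})
    return kept


# Hypothetical labelled segments from one broadcast recording.
segments = [
    {"kind": "advert", "start_s": 0.0, "end_s": 30.0, "transcript": "Buy now"},
    {"kind": "speech", "start_s": 30.0, "end_s": 30.4, "transcript": "Um"},
    {"kind": "speech", "start_s": 31.0, "end_s": 45.0,
     "transcript": "Call me on 082 555 0100 or mail thandi@example.org"},
]

clean = usable_speech(segments)
print(clean[0]["transcript"])
```

Pattern-based redaction catches only written-out identifiers in transcripts. Voices themselves can identify speakers, which is one reason the technical steps need accompanying governance rules.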

Governance & Policy Frameworks Referenced:

South African Promotion of Access to Information Act (requires amendment for AI era)
African Union Data Policy Framework (2021) – Principles for harmonizing data governance across Africa
African Union Continental Strategy on AI – Identifies data access, compute, and talent as core constraints
Convention on Cyber Security and Personal Data Protection (African Union, 2014)
African Commission Resolution 620 – Human rights approach to access to data
AfCFTA (African Continental Free Trade Area) – Digital single market objectives supported by AU Data Policy Framework
India's Data Sharing Policies – Departmental gating of government datasets with approval workflows
GDPR and comparable data protection frameworks – Global South countries are adapting them or developing their own equivalents

Organizational Models:

Multi-stakeholder governance (South Africa model): Public broadcasters + Information Regulator + Universities + Innovation funds
Mutual collaboration model (referenced from Robert Mali partners): Data access in exchange for AI training provided to media organizations
Preferential access model: Prioritizing local/African developers and ensuring data remains within continent for competitive advantage
Royalty/revenue-sharing mechanisms (Karya): Data ownership by communities with benefit-sharing on downstream applications
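
One way to make a revenue-sharing mechanism concrete is a pro-rata split of a royalty pool. The Python sketch below is an assumption for illustration only and does not describe Karya's actual scheme: the function name, the contributor keys, and the choice to weight by contributed minutes are all hypothetical.

```python
def split_royalty(pool_cents, contributed_minutes):
    """Split a royalty pool in proportion to each contributor's minutes.

    Amounts are integer cents so that nothing is lost to rounding; any
    leftover cents go to the contributors with the largest remainders.
    """
    total = sum(contributed_minutes.values())
    if total <= 0:
        raise ValueError("at least one contributor must have minutes > 0")

    # Floor share for everyone first.
    shares = {
        name: pool_cents * minutes // total
        for name, minutes in contributed_minutes.items()
    }

    # Distribute the cents lost to flooring, largest remainder first.
    leftover = pool_cents - sum(shares.values())
    by_remainder = sorted(
        contributed_minutes,
        key=lambda name: (pool_cents * contributed_minutes[name]) % total,
        reverse=True,
    )
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


# Hypothetical contributors and recorded minutes.
payout = split_royalty(1000, {"worker_a": 50, "worker_b": 30, "worker_c": 20})
print(payout)
```

The arithmetic is the easy part. The weighting basis (minutes, tasks, or dataset ownership) is a decision the communities themselves would need a say in.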