Data Sharing for AI: Building Trust, Purpose, and Public Value
Executive Summary
This panel discussion explores how countries and communities can build data exchange systems that unlock AI innovation while protecting sovereignty, privacy, and the interests of data contributors. The panelists argue that effective data governance requires navigating the tensions between data access and protection, individual rights and collective interests, and market innovation and regulatory safeguards—through new legal frameworks, technical architectures, and negotiation models centered on equitable value exchange.
Key Takeaways
- Build decentralized, tiered data marketplaces that give contributors control over pricing, licensing, and access terms—not a one-size-fits-all platform, but flexible infrastructure that enables variable arrangements based on contributor preferences.
- Embed rights protection into technical architecture (clean rooms, federated learning, local model deployment) rather than relying solely on legal frameworks—law is slow, expensive, and often inadequate. Design systems that make rights violations technically impossible.
- Coordinate at scale: countries, communities, and regions must act collectively to negotiate favorable terms with frontier AI companies, because individual actors have no leverage. The moment to coordinate is now, before model capabilities make external data less essential.
- Recognize and resource all forms of data contribution: not just raw datasets, but labor (annotation, cleaning), domain expertise, cultural knowledge, and community governance—a truly equitable model accounts for the full value chain.
- Center community agency and self-determination, not just state sovereignty: governance frameworks should enable communities to decide what data to share, with whom, under what terms, and how to capture value, rather than imposing top-down paternalism.
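The tiered-marketplace takeaway above can be sketched as a small data structure: contributors, not the platform, set the tiers. This is a minimal illustration under assumptions, not any panelist's actual design; the `AccessTier` and `DatasetListing` names and fields are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class AccessTier:
    name: str                    # e.g. "research" or "commercial"
    license_terms: str           # human-readable license reference
    price_usd: float             # contributor-chosen price (0 for open tiers)
    allows_model_training: bool  # contributor-set usage restriction

@dataclass
class DatasetListing:
    contributor: str
    dataset_id: str
    tiers: list[AccessTier] = field(default_factory=list)

    def cheapest_tier_allowing_training(self):
        """Return the lowest-priced tier permitting model training, or None."""
        eligible = [t for t in self.tiers if t.allows_model_training]
        return min(eligible, key=lambda t: t.price_usd) if eligible else None

# One listing with contributor-defined, tiered terms.
listing = DatasetListing(
    contributor="community-coop",
    dataset_id="lang-corpus-001",
    tiers=[
        AccessTier("research", "CC BY-NC 4.0", 0.0, False),
        AccessTier("commercial", "negotiated license", 5000.0, True),
    ],
)
print(listing.cheapest_tier_allowing_training().name)  # -> commercial
```

The point of the structure is that every pricing and licensing decision lives on the listing the contributor owns, so a marketplace can host many incompatible arrangements side by side.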
Key Topics Covered
- Data sovereignty and governance frameworks — How countries can assert control over data resources while enabling beneficial AI development
- Tensions between data access and protection — The friction between enabling AI (which requires abundant data) and data protection laws (which enforce data minimization)
- Value exchange and business models — Different approaches to compensating data contributors (monetary, attribution, model localization, licensing)
- Community-driven data initiatives — Grassroots models like Masakhane and their approach to building African language datasets
- Decentralized data marketplaces — Alternative architectures for data sharing that enable variable pricing, licensing terms, and access controls
- Copyright and intellectual property in the AI era — Rethinking copyright frameworks to account for how large language models consume training data
- Data localization and model weights — The strategic use of localizing AI model weights as leverage for data-rich regions
- Legal and technical design — Embedding rights protection into technical architecture (e.g., clean rooms, federated learning) rather than relying solely on laws
- Labor and data annotation — The overlooked role of data labeling workers and emerging labor organizing efforts
- Africa's position in AI — Continental coordination strategies and ongoing legal battles between African nations and major tech companies
Key Points & Insights
- Tension is productive, not problematic. The friction between data access and data protection is not a bug but a feature—it forces deeper consideration of governance trade-offs. The problem arises when stakeholders fail to recognize that the tension exists.
- Current data protection and copyright frameworks are misaligned with AI needs. GDPR-style data minimization (consent, narrow purpose, deletion after use) conflicts with AI's requirement for abundant, long-term data retention. Similarly, copyright law's assumptions about copying don't map cleanly onto how LLMs process training data.
- Value exchange is foundational. Data sharing only works sustainably when all stakeholders—data contributors, governments, AI companies, researchers—perceive genuine value flowing back to them. Without this, systems fail or become extractive.
- Community agency differs from state paternalism. Data sovereignty is often framed as state control "on behalf of" people, but communities may assert sovereignty directly through collective action (e.g., Masakhane's bootstrapped language datasets). Agency matters as much as formal rights.
- Negotiating power requires scale and coordination. Individual data contributors have no leverage against large AI companies. Collective action—countries banding together, communities pooling resources, regional coalitions—is necessary to achieve favorable terms.
- The data contributor must be centered in marketplace design. Tiered pricing, variable licensing terms, and access controls should be decided by contributors, not imposed on them. A decentralized marketplace architecture enables this flexibility.
- Model localization is a viable bargaining strategy. Data-rich regions (Africa, India) can negotiate for local copies of improved model weights as compensation, enabling them to build sovereign AI capabilities and prevent future dependency.
- Technical design can embed rights protection better than law alone. Clean rooms (data stays in place; models come to the data) and federated learning architectures enforce data rights without relying on litigation, which is expensive and often yields inadequate compensation.
- The window for negotiation is closing. As foundation models mature and AGI approaches, the need for external data may diminish. The current moment—when frontier models are hungry for diverse training data—is the optimal negotiation point.
- Data labeling labor is underrecognized in this conversation. The physical work of cleaning, annotating, and preparing data—often outsourced to low-wage workers—is where much value is created, but it is rarely discussed in policy forums.
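The clean-room idea above (data stays in place; only vetted outputs leave) can be illustrated with a toy access gate. This is a hypothetical sketch: `clean_room_query` and `MIN_GROUP_SIZE` are invented names, and real clean rooms add sandboxing, auditing, and stronger disclosure controls.

```python
# Toy clean-room gate: analysts submit an aggregate query; row-level
# records never leave, and results covering too few records are refused.
# All names here are illustrative, not a real clean-room API.

MIN_GROUP_SIZE = 5  # assumed disclosure-control threshold

def clean_room_query(private_rows, aggregate_fn):
    """Run aggregate_fn inside the data holder's environment and release
    the result only if it covers enough records to resist re-identification."""
    if len(private_rows) < MIN_GROUP_SIZE:
        raise PermissionError("result would cover too few records")
    return aggregate_fn(private_rows)

records = [10, 8, 12, 9, 11]  # private data; stays inside the clean room
mean_value = clean_room_query(records, lambda rows: sum(rows) / len(rows))
print(mean_value)  # -> 10.0
```

The design choice this illustrates is the panel's point about architecture versus litigation: the refusal happens at query time, inside the holder's infrastructure, rather than in a courtroom after the fact.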
Notable Quotes or Statements
- On tension as productive: "I think it's actually great that there's tension... When there is tension, we are forced to think hard about the decisions we make. If there's no tension, we stop thinking." — Rahul (speaker on data rights framework)
- On data protection misalignment: "Data protection is a framework where we have enforced data scarcity... This framework is at odds with what artificial intelligence needs because AI needs as much information as possible for as long as possible."
- On negotiating leverage: "I'm a lawyer. Ultimately, I'm just looking in a negotiation for leverage. And I don't think a single contributor of data has leverage... So if essentially we want to negotiate a deal, we're going to need to make a critical mass of people."
- On the closing window: "We have a small window where the models need to be refined by this additional cultural data. I think if we don't use our collective ability to negotiate, we're not going to be able to achieve this."
- On data localization as leverage: "If you're a sovereign AI, if you're a large frontier AI company, you have trained with my data that you know has some value, I want you to commit to localizing the weights of the model that has been enriched by my data for the benefit of the community."
- On African language datasets (Masakhane): "They were laughed out of rooms because it was 'what do you mean Africans can actually do NLP and machine learning?'" — Chennai (Masakhane contributor), describing barriers to recognition of African AI work
- On AI's transformative potential: "If a lawyer who has never written a line of code can now actually have an app on the app store [using Claude], I don't know... We all will have superpowers." — Rahul, on AI democratizing technical capability
Speakers & Organizations Mentioned
- Vijay — Gates Foundation (speaker on ecosystem-building and data governance)
- Chennai (Asakane Sarana) — Masakhane (community-driven African NLP initiative)
- Rahul — Trial / data rights researcher and lawyer; author of "New Deal for Data" paper
- Sarinia — Speaker on decentralized data marketplaces and data sovereignty
- Google — Mentioned as funder of Masakhane research center and broader data extraction practices
- Gates Foundation — Active in healthcare data ecosystem building, benchmarking, and governance
- Masakhane — Community of African researchers building NLP resources and datasets for African languages
- Partnership on AI (PAI) — Mentioned as organizational collaborator
- African governments/courts — South Africa, Kenya, Uganda referenced in relation to ongoing legal cases against Silicon Valley tech companies over data usage
- State broadcasters in African nations — Referenced as data gatekeepers in language dataset collection
Technical Concepts & Resources
Key Papers & Frameworks
- "New Deal for Data" — Rahul's paper proposing governance framework rethinking data protection, copyright, and data sovereignty
- Decentralized data marketplace paper — Sarinia's work on tiered, variable-access marketplace architecture
Licensing & Legal Frameworks
- Esetu and Noodle licenses — Community-created licensing frameworks by Masakhane allowing two-year exclusivity for African innovators before broader access/payment
Technical Architectures
- Clean rooms / Federated learning — Data stays in place; models come to data; prevents unauthorized copying or reuse
- Model localization / Weights distribution — Negotiated strategy where data contributors receive copies of improved model weights as compensation
- Decentralized marketplace architecture — Infrastructure enabling variable pricing, licensing terms, and access controls managed by contributors
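The federated-learning entry above can be sketched in a few lines: each holder computes a model update on data that never leaves its site, and only the updates are averaged by a coordinator (the federated-averaging pattern). A minimal sketch under simplifying assumptions: 1-D linear regression on synthetic data, with illustrative function names.

```python
# Minimal federated-averaging sketch: raw data stays with each holder;
# only model updates travel to the coordinator.

def local_update(w, data, lr=0.1):
    """One gradient-descent step of 1-D linear regression y ~ w*x,
    computed entirely on the holder's private data."""
    grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
    return w - lr * grad

# Three data holders, each with a private dataset drawn from y = 2x.
holders = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (0.5, 1.0)],
    [(1.5, 3.0), (2.5, 5.0)],
]

w = 0.0
for _ in range(50):                                  # federated rounds
    updates = [local_update(w, d) for d in holders]  # data stays in place
    w = sum(updates) / len(updates)                  # only updates are pooled

print(round(w, 2))  # -> 2.0
```

The model converges to the true slope even though no party ever sees another's records, which is the rights-by-architecture property the panel highlights.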
AI/ML Models & Systems
- Foundation models — Large language models (Claude, ChatGPT, Gemini) mentioned as primary AI systems consuming training data
- Claude/Claude Code — Anthropic's LLM used as example of AI-enabled capability democratization
- State-of-the-art narrow vertical models — Indian models (mentioned: language dubbing, OCR for Indian scripts) as examples of specialized sovereign AI
- MMLU benchmarks — Standard AI evaluation metrics referenced as insufficient for non-Western contexts
Datasets & Initiatives
- Masakhane — Largest community-driven African language NLP initiative; datasets hosted on GitHub; developing licensing frameworks
- Deep Learning Indaba — African machine learning conference and community mentioned as related initiative
- Equalize AI — African AI initiative mentioned alongside Masakhane
Concepts
- Digital Public Infrastructure (DPI) — Mentioned in relation to ID/payment systems; proposed as model for data infrastructure
- Data minimization — Core principle of GDPR/data protection laws; enforced through consent, narrow purpose, deletion requirements
- Epistemic injustice — Social injustice in knowledge production (referenced in context of whose knowledge counts in AI training)
- Ubuntu/Unhu philosophy — African value system emphasizing relational humanity ("my humanity is centered on your humanity") as a design principle for data governance
Labor & Rights Issues
- Data labeling/annotation labor — Underrecognized workers who clean and prepare training datasets; increasingly organizing via trade unions
- Court cases — Ongoing litigation in South Africa, Kenya, Uganda against tech companies over data usage without proper compensation
Questions for Further Exploration
The transcript suggests but does not fully resolve:
- How to operationalize "variable terms" in a decentralized marketplace while maintaining discoverability and preventing fragmentation
- Whether model localization strategies actually remain viable as foundation models mature and require less retraining
- How to involve data labeling workers and informal data contributors in governance frameworks designed around formal "data contributors"
- Specific mechanics of embedding rights into technical architecture (clean rooms, federated learning) at scale, given infrastructure costs
- How to coordinate across African nations given existing geopolitical tensions and different regulatory priorities
