Public Data and AI Training: Safeguards for Responsible Reuse

Executive Summary

The Open Data Charter, in partnership with the Global Center on AI Governance, is conducting a multi-country research project examining the legal and ethical boundaries of using publicly accessible data for AI training. The project focuses on identifying governance gaps and safeguards across data protection, copyright, and intellectual property frameworks—revealing that publicly available data is not legally free to use and requires responsible reuse mechanisms.

Key Takeaways

  1. Public data requires responsible guardrails: Data being "publicly available" does not exempt it from data protection, copyright, or emerging AI governance laws. Organizations must assess legal boundaries carefully.

  2. Regulatory frameworks are country-specific and often lack clarity: The research reveals significant variation across jurisdictions (Brazil, Japan, Australia), and most lack explicit guidance on text and data mining for AI, creating both risks and opportunities for policy improvement.

  3. Safeguards must address data quality and bias, not just access: Responsible reuse of public data requires attention to hidden limitations, selective collection practices, and potential harms—beyond simple legal compliance.

  4. Government has both an opportunity and responsibility: As governments serve as major data providers for AI, especially in the Global South, they must establish governance regimes that enable innovation while protecting rights, privacy, and trust.

  5. This is an emerging, evolving field: The project is in its early stages; desktop research is complete, but surveys and interviews are ongoing. The conversation is only beginning, and results will inform future policy and practice.

Key Topics Covered

  • Regulatory Frameworks: Data protection laws, copyright regimes, and emerging AI governance regulations and their application to publicly accessible data
  • Governance Gaps: Unclear rules around whether access to publicly available data grants the right to analyze or reuse it for AI training
  • Data Reuse Boundaries: Legal and ethical limitations on text and data mining, even when data is anonymized or openly published
  • Geographic Variation: Different regulatory approaches across Brazil, Japan, and Australia
  • Stakeholder Perspectives: Understanding how private sector, civil society, and government actors interpret exceptions and safeguards
  • Government as Data Provider: The role of governments in supplying diverse datasets for AI, particularly in the Global South
  • Data Quality & Bias: Hidden consequences of publicly available data, including inherent bias and selective collection practices
  • Personal Data Protection: How data protection laws apply to personal data even when publicly accessible on social media or web platforms

Key Points & Insights

  1. Governance Gap Exists: Copyright regimes and data protection laws create regulatory uncertainty for innovators and public sector users when applied to publicly available data used for AI training. Rules are "unclear" about whether access implies analytical rights.

  2. Public Availability ≠ Free Use: The fact that data is publicly accessible on the web or social media does not automatically grant the right to use it for AI training; data protection laws still apply to personal data regardless of public visibility.

  3. Three-Framework Focus: The research examines three core regulatory areas simultaneously—data protection, copyright, and intellectual property regulations—recognizing they intersect and sometimes conflict in AI contexts.

  4. Text and Data Mining Exceptions Vary: Organizations interpret exceptions for text and data mining differently across jurisdictions; some frameworks (e.g., Japan) have broader permissions, while others (e.g., Brazil) lack explicit exceptions.

  5. Brazil's Legal Landscape: Brazil has strong data protection laws requiring "purpose, good faith, and public interest" in personal data use, but lacks explicit text and data mining exceptions. An emerging AI governance law presents an opportunity to clarify these rules.

  6. Japan's Broader Interpretation: Japan's copyright law permits information analysis of publicly available data provided it is not used for unjust purposes—a more permissive approach than some other jurisdictions.

  7. Data Quality & Hidden Bias: Even anonymized or openly published data carries inherent limitations—undisclosed bias, selective inclusion/exclusion decisions, and cumulative consequences when combined with other datasets. Public availability does not eliminate these risks.

  8. Government Data Complexity: Governments often hold the most diverse datasets, particularly in the Global South, and are increasingly positioned as AI data providers—yet the governance regimes protecting such data reuse remain unclear.

  9. Mixed-Method Research Approach: The project combines desk research (law, case law, industry practice), surveys, and stakeholder interviews to capture both legal frameworks and real-world perceptions and safeguards.

  10. Intersection with Digital Public Infrastructure: The Open Data Charter has expanded its focus beyond open government data to include the intersection of data governance, AI infrastructure, access rights, and personal data protection.

Notable Quotes or Statements

"The fact that you can publicly access people's personal data on social media platforms doesn't necessarily mean that you can use that data for training models." — Fa Adele, Global Center on AI Governance

"We recognize that AI training is increasingly relying on data that is available on the web. So it's focusing on open data sets, but in the utilization of this data the rules are really unclear about whether folks have the right to analyze data that they're accessing publicly." — Fa Adele

"When the copyright regime doesn't really speak directly to data already available, does it mean that it is free to use?" — Fa Adele (core research question)

"How do we enable legitimate development while protecting rights, privacy and trust?" — Fa Adele (main research objective)

"The fact that data is anonymized doesn't mean that it won't contain any sort of bias, any sort of different aspects of what they chose to include in the data and what they chose not to include." — Fa Adele (on data quality limitations)

Speakers & Organizations Mentioned

  • Open Data Charter (Renato, Research Manager): Global civil society organization; co-organizer of the research
  • Global Center on AI Governance (Fa Adele, Executive Director & Co-founder): Research partner; leads a knowledge hub and an African observatory on responsible AI
  • API (organization not fully specified) (Bini Narayan): Policy role
  • ILA (organization not fully specified) (Betta Bel): Communications manager
  • Microsoft: Research funder; mentioned as supporting the project

Geographic Contexts: Brazil (Latin America), Japan (Asia), Australia; organizations and perspectives also sought from Asia and Latin America more broadly.

Academic Affiliations Mentioned: University of Waterloo, Columbia University (Center on Sustainable Investments), Harvard University (Human Rights Program) — associated with Fa Adele's background.

Technical Concepts & Resources

  • Text and Data Mining (TDM): A core analytical capability for AI training; treatment under copyright law varies by jurisdiction and is a focus area for the research.
  • Data Protection Laws: Applied even to publicly available personal data; major regulatory constraint examined in Brazil, Japan, and Australia.
  • Copyright Frameworks: Traditional intellectual property regimes that may or may not explicitly address data analysis and reuse in AI contexts.
  • Intellectual Property (IP) Regulations: The third pillar of legal analysis alongside data protection and copyright.
  • Anonymization: Referenced as a data protection technique, but the research notes it does not eliminate bias or hidden quality issues.
  • Menti (Interactive Tool): Used during the session for real-time audience polling on sector (private, civil society, academia, government) and data use practices.
  • Mixed-Method Research Design: Combines qualitative (interviews, case law analysis) and quantitative (surveys) approaches.
  • Open Data Policy & Open by Default Approach: Brazil's policy framework requiring government data to be openly published unless restricted.
  • Digital Public Infrastructure: Framed as an emerging intersection point with AI governance and data protection.

Status of Research: Initial desktop research phase complete; survey and interview phases ongoing across three countries. Results will be shared as the project advances over several months.