AI and Open Data | Unlocking Public Value and Impact at Scale

Executive Summary

This panel discussion explores how AI can enhance the value and impact of open public data while maintaining trust, quality, and equity. The speakers emphasize that National Statistical Offices (NSOs) remain essential custodians of high-quality, trustworthy data, and outline how AI—particularly generative AI—creates both opportunities (democratized data access, improved data production) and risks (synthetic data flooding, erosion of statistical knowledge, data extraction by private entities) that require careful governance and investment in data literacy.

Key Takeaways

  1. NSOs are not obsolete in the AI era—they're essential and evolving: Their role expands from data production to ecosystem orchestration, validation, standard-setting, and trust-building. Investment in modern NSO capacity is critical.

  2. Three technical prerequisites for AI-ready data: Enriched metadata + interoperable standards + provenance documentation. Without these, data remains siloed and underutilized.

  3. Trust is not a legal problem but a data architecture problem: When official statistics are AI-accessible and high quality, they naturally become the preferred reference point over generic LLM outputs.

  4. The biggest near-term risk isn't AI itself—it's the absence of foundational infrastructure and governance: Without addressing connectivity, electricity, local data production, and data sovereignty, AI benefits will further entrench global inequalities.

  5. Transparency about who benefits from data is essential: Explicit data contribution agreements, cost models, and dissemination policies prevent quiet extraction and ensure public funding generates public value.

Key Topics Covered

  • Role of National Statistical Offices in the AI era: Quality assurance, standardization, governance, and institutional trustworthiness
  • Four waves of open data: Freedom of information → government data access → private sector participation → AI-data intersection (fourth wave)
  • AI-readiness requirements for public data: Metadata enrichment, standardization, provenance documentation
  • Technical infrastructure: Data Commons, schema standards (SDMX, schema.org), MCP servers, natural language interfaces
  • Data governance and sovereignty: Balancing open access with protection against exploitation, particularly in the Global South
  • Statistical literacy and trust: Public education on data quality, distinguishing trustworthy data from synthetic/manipulated data
  • Institutional adaptation: How NSOs evolve from passive data producers to orchestrators of entire data ecosystems
  • Equity and power asymmetries: Concerns about wealthy nations and tech companies extracting value from data produced in lower-income countries
  • Innovation and citizen-generated data: Complementary data sources and collaborative approaches
  • Local context and infrastructure gaps: Electricity, connectivity, capacity-building in developing regions

Key Points & Insights

  1. NSOs bring three critical contributions to AI systems: (1) High-quality, validated, transparently produced data; (2) Standards, classifications, and structured metadata that make data machine-readable; (3) Strong legal and ethical frameworks protecting confidentiality and serving public interest rather than commercial interests.

  2. The "fourth wave" of open data is characterized by AI-data intersection, where AI can both improve data (through quality detection, synthetic data generation, conversational interfaces) and be improved by high-quality open data (reducing hallucinations and bias).

  3. NSOs cannot be purely data producers or purely regulators—they must evolve into orchestrators of entire data ecosystems, validating administrative data from other sources, setting standards, and ensuring interoperability across the ecosystem.

  4. AI-ready data requires three foundational elements: enriched metadata (describing what data is and how it can be used), interoperable standards (not competing standards), and provenance documentation (origin, usage rights, modification history—like a "nutrition label" for data; a minimal sketch follows this list).

  5. Trust in official statistics is not achieved through legal compliance alone, but through AI-readiness: if data is properly structured and accessible via AI interfaces, users will naturally prefer authoritative NSO sources over generic LLM outputs.

  6. Fundamental gaps remain unaddressed in many countries: connectivity, electricity, infrastructure capacity, local language support, local data production, multistakeholder governance, data sovereignty, and accountability mechanisms—these must be addressed before AI can be meaningful.

  7. The "both/and" nature of AI's impact on open data: AI can democratize access and generate insights at scale, but without robust data governance, can introduce synthetic data flooding, erode institutional statistical knowledge, and create perverse incentives to defund surveys/censuses.

  8. Power asymmetries are real and growing: Wealthy nations and tech companies benefit disproportionately from data produced in lower-income countries; Kenya's health data agreements and Worldcoin's iris-scanning activities illustrate unethical extraction patterns.

  9. Citizen-generated data and participatory approaches offer a counterbalance—when citizens collect and validate their own data with rigor, they can augment official statistics and build accountability from the ground up.

  10. "Open data is not free data"—public data requires investment, and defining clear data contribution agreements, cost structures, and dissemination policies ensures value flows are transparent and equitable.


Notable Quotes or Statements

"AI doesn't emerge in a vacuum. It really depends fundamentally on data and with that in particular on high quality trusted well-governed public data." — Mercedes Fugarasi (Moderator, Paris 21)

"This is the worst of times and the best of times when it comes to open data. Data has become more closed than ever before. But on the other hand we also can be optimistic where AI can really become a force that enables us to democratize data in ways that we've never had before." — Stefan Beerhost (co-founder, Grad Lab & Data Tank, NYU)

"NSOs produce data that are validated, quality controlled, transparent and comparable across time and countries. This data flood—this is where the NSO plays a unique role." — [Panelist, Paris 21] (describing NSO quality advantage)

"If data is AI ready, believe me it's going to be used by any AI system... Trust comes not from legality but from AI readiness." — Rohit Bardage (Deputy Director General, India's National Statistical Office)

"Open data is not free data... We should have a very contextual agreement of data contribution and a data dissemination policy. We should put a cost to our data." — Rohit Bardage

"You don't need a PhD to sort through data and statistics anymore—that's great. But synthetic data flooding the ecosystem, platform substitution, and defunding of surveys can create a death spiral of knowledge." — Christopher Maloney (Huelet Foundation)

"AI turbocharging data access in environments with power asymmetries—we're seeing unethical things. Kenyan health data being handed over, Worldcoin scanning farmers' irises. We have to be quite concerned." — Christopher Maloney

"Aggressive piloting and wise deployment"—the statistical community is not chasing hype; they're serious about improving quality. — Franjo [surname unclear] (OECD/Paris 21)


Speakers & Organizations Mentioned

Panelists & Moderator:

  • Mercedes Fugarasi — Moderator, PARIS21 (Partnership in Statistics for Development in the 21st Century)
  • [First name unclear, possibly Volswis] — Senior Policy Adviser, PARIS21; Adviser on AI and Data to the Chief Statistician of the OECD (20+ years' experience in statistical innovation and capacity building, with a Global South focus)
  • Rohit Bardage — Deputy Director General, India's National Statistical Office (NSO India); leads innovation and emerging technology initiatives
  • Christopher Maloney — Program Officer, Hewlett Foundation; previously held senior roles at USAID and the Millennium Challenge Corporation
  • Randip [surname unclear] — Senior Technical Leader, Google; leads infrastructure strategy for Data Commons

Video Presenter:

  • Stefaan Verhulst — Co-founder, The GovLab and The Data Tank; Research Professor, Tandon School of Engineering, NYU; developer of the "four waves of open data" framework

Questioners (identified):

  • Premi — Final-year student, TA University
  • Shahed — DataKind (data science nonprofit)
  • [Name unclear, Ganach resident] — Aspiring researcher; asked about metadata granularity
  • Dolly Musim — Question on open innovation and distributed data governance

Other Organizations/Initiatives Mentioned:

  • National Statistical Offices (NSOs) globally, including NSO India, NSO Mongolia
  • Google Data Commons
  • World Bank (WDI database)
  • UN Statistical Commission
  • OECD
  • UN agencies
  • UNESCO
  • Worldcoin (cited as example of unethical data extraction)
  • US government (Kenya health data agreement mentioned)

Technical Concepts & Resources

Data Infrastructure & Standards:

  • Data Commons — Open-source initiative making global public data accessible and AI-ready via open APIs and MCP servers
  • MCP (Model Context Protocol) servers — Interfaces that let AI agents query NSO databases in place, securely and without requiring bulk downloads
  • SDMX (Statistical Data and Metadata eXchange) — International standard for statistical data exchange
  • Schema.org — Shared vocabulary for structured data markup
  • NMDS (National Metadata Structure) — India's national metadata standard (referenced as an interoperable standard)
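
As a concrete illustration of the "open APIs" point above, the sketch below queries Data Commons with its public Python client. This is a minimal sketch assuming the legacy `datacommons` client; client versions, API-key requirements, and variable names may differ:

```python
# pip install datacommons
# Minimal sketch using the Data Commons Python client. The statistical
# variable "Count_Person" and place ID "country/IND" follow Data Commons
# conventions; key requirements may vary by client version.
import datacommons as dc

population = dc.get_stat_value("country/IND", "Count_Person")
print(f"Latest population of India per Data Commons: {population}")
```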

Data Architecture Principles:

  • Provenance documentation — Origin, usage rights, modification history of data (the "nutrition label" analogy)
  • Metadata enrichment — Structured descriptions making data understandable to both humans and AI agents
  • Interoperability — Standards must "talk to each other" via adapters; no single dominant standard
  • Data contracts (automated) — AI-driven quality assurance to ensure ingested data meets quality criteria
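
The "automated data contract" bullet above can be as simple as a schema-plus-rules check run at ingestion time. Below is a minimal sketch; the contract fields and thresholds are illustrative assumptions, not a published specification:

```python
# Minimal sketch of an automated data contract: validate incoming records
# against required fields, types, ranges, and provenance metadata before
# ingestion. The contract itself is illustrative, not a real specification.

CONTRACT = {
    "required": {"country": str, "year": int, "value": float},
    "ranges": {"year": (1950, 2030)},
    "provenance": {"source", "license"},  # metadata keys that must be present
}

def validate(record: dict, metadata: dict) -> list[str]:
    """Return a list of contract violations; an empty list means 'pass'."""
    errors = []
    for name, expected_type in CONTRACT["required"].items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            errors.append(f"bad type for {name}: {type(record[name]).__name__}")
    for name, (low, high) in CONTRACT["ranges"].items():
        if name in record and not low <= record[name] <= high:
            errors.append(f"{name} out of range [{low}, {high}]")
    missing = CONTRACT["provenance"] - metadata.keys()
    if missing:
        errors.append(f"missing provenance metadata: {sorted(missing)}")
    return errors

result = validate({"country": "IND", "year": 2023, "value": 7.2},
                  {"source": "NSO India", "license": "CC-BY-4.0"})
print(result)  # [] -> record passes the contract
```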

AI Approaches & Use Cases:

  • Generative AI for conversational data interfaces — Natural language querying without requiring data science expertise
  • Synthetic data generation — Filling gaps in sensitive or underrepresented data
  • AI-driven data quality detection — Identifying challenges in data structure and content
  • Large Language Models (LLMs) — Mentioned: OpenAI's ChatGPT, Google's Gemini, others
  • Chatbots and AI agents — Making data discoverable and queryable by general public
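
To illustrate the conversational-interface idea, the toy sketch below maps a natural-language question to a structured lookup. Keyword matching stands in for the LLM parsing step, and the indicator table is a made-up stand-in for an NSO database or Data Commons backend:

```python
# Toy sketch of a natural-language data interface. In a real system an LLM
# would parse the question and an NSO API would supply the numbers; here
# both are stubbed, and all values are illustrative.

INDICATORS = {
    "population": {"IND": 1_428_000_000, "MNG": 3_450_000},  # illustrative
    "gdp growth": {"IND": 7.2, "MNG": 7.0},                  # illustrative
}
COUNTRIES = {"india": "IND", "mongolia": "MNG"}

def answer(question: str) -> str:
    q = question.lower()
    country = next((code for name, code in COUNTRIES.items() if name in q), None)
    indicator = next((name for name in INDICATORS if name in q), None)
    if not (country and indicator):
        return "Could not map the question to a known indicator and country."
    return f"{indicator} for {country}: {INDICATORS[indicator][country]}"

print(answer("What is the population of India?"))
```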

Methodologies Referenced:

  • Statistical literacy/numeracy education — Training citizens to distinguish quality data from hallucinated/synthetic data
  • Citizen-generated data collection — Participatory, bottom-up data with rigor and validation
  • National data strategies — Localized frameworks defining governance, priorities, and interoperability
  • Data dissemination policies — Clear rules on cost, access, attribution, and beneficial use
  • Data governance frameworks — Legal, ethical, and operational rules for data stewardship

Platforms & Portals Referenced:

  • AIKosh (India) — Government of India platform providing accessible, AI-ready datasets via APIs
  • National Data Portal (India) — Centralized access to government datasets
  • Microdata Portal (India) — Granular, unit-level statistical data
  • WDI (World Development Indicators) — World Bank's primary statistical database
  • Comtrade — UN's trade statistics database
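
For the WDI entry above, here is a minimal sketch of pulling one indicator from the World Bank's public v2 API; SP.POP.TOTL is the WDI code for total population, and the response follows the documented [paging, observations] layout:

```python
# Minimal sketch: fetch a WDI indicator from the World Bank's public v2 API.
import requests

url = "https://api.worldbank.org/v2/country/IND/indicator/SP.POP.TOTL"
resp = requests.get(url, params={"format": "json", "date": "2022"}, timeout=30)
resp.raise_for_status()

# The v2 API returns [paging_metadata, [observations]].
paging, observations = resp.json()
for obs in observations:
    print(obs["date"], obs["value"])
```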

Quality & Trust Frameworks:

  • UN Fundamental Principles of Official Statistics — Defining NSO mandates
  • Data quality frameworks — NSO-defined standards (referenced: NSO India's own frameworks)
  • Audit trails and legal compliance — Accountability mechanisms (noted as insufficient alone for trust)

Implicit Assumptions & Limitations

  • Connectivity and electricity are assumed problems in Global South, but solutions are not discussed in detail
  • Tech company participation in data governance (Google's Data Commons) is presented as beneficial, but concerns about corporate capture are not deeply explored
  • Language access and localization are mentioned as critical but not detailed
  • Survey and census costs are implied but not quantified
  • The "death spiral" risk (defunding surveys due to AI capability) is theoretically posed but not grounded in observed practice

Conference Context: This talk was part of an AI summit focused on AI for economic development and social good, held as part of a larger conference week covering AI governance, infrastructure, and inclusive development.