The Power of Open Data: Unlocking Global Insights with Data Commons

Executive Summary

This panel discussion explores how open data infrastructure and AI can democratize access to government statistics and public data across India. The speakers introduce Data Commons—an open-source platform designed to break down data silos, eliminate barriers for non-technical users, and enable trustworthy decision-making at scale through natural language interfaces and standardized schemas.

Key Takeaways

  1. Open data infrastructure is a prerequisite for democratizing AI benefits: Data Commons demonstrates that with proper platform design (open APIs, standard schemas, natural language interfaces), non-technical users can access complex data without hiring data scientists. This is essential for inclusive AI deployment across 1.4 billion people.

  2. Trust in data systems depends on visible stewardship and local validation: Big Tech should prioritize participatory AI auditing with domain experts and affected communities before deploying models. Transparency about data sources, algorithmic methods, and limitations builds trust that technical sophistication alone cannot achieve.

  3. India's digital public goods strategy (UPI → mobile → AI) requires solving data interoperability at the foundation: Just as India leapfrogged legacy banking and landline infrastructure, solving standardization and accessibility of government data now can position India to lead in trustworthy, inclusive AI globally.

  4. Multilingual and voice-based interfaces are non-negotiable for equity: AI systems built on English corpora and high-bandwidth assumptions will exclude rural and non-English-speaking populations. Project Bhashini and related localization efforts are necessary, not optional, for genuine last-mile access.

  5. Data as a public utility requires legal, governance, and cultural shifts: Moving from siloed institutional data to interoperable public data requires removing organizational incentives to hoard, establishing feedback loops between data usage and institutional value, and building cultural acceptance that data quality improves through transparent use and feedback.

Key Topics Covered

  • Data Accessibility Problem: Government data fragmented across independent ministry databases with incompatible formats; requires data scientists to clean and normalize data before use
  • Data Commons Platform: Open-source infrastructure using open APIs, schema.org schemas, and AI-powered natural language search to democratize data access
  • Trust & Transparency in AI: Risk of hallucination in LLM-based data queries; importance of source attribution and provenance in data systems
  • India's Data Boarding Pass Initiative: Pilot project integrating government data sources to enable non-technical users to query policy-relevant data
  • Last-Mile Data Accessibility: Challenges of reaching non-English speakers, low-bandwidth users, and rural populations; Project Bhashini's multilingual AI efforts
  • Data Standardization & Governance: Role of National Statistics Offices (NSOs) in orchestrating data ecosystems; need for shared metadata and interoperability standards
  • Trust Barriers to AI Adoption: Concerns about data privacy, algorithmic transparency, autonomy loss, and institutional distrust in India's context
  • Participatory AI Auditing: Testing AI recommendations against local expertise before deployment (e.g., agricultural guidance for farmers)
  • Data as Public Utility: Vision of open, transparent data ecosystems supporting shared decision-making between governments and citizens
  • Organizational Commitments: Next-year goals from NSO India, IIIT Bangalore, Civic Data Lab, and others to advance the vision

Key Points & Insights

  1. The Hidden Tax of Data Work: Non-technical decision-makers (government ministers, India's 74 million MSMEs, startup founders) currently depend on specialists with six years of training just to rename column headers and integrate fragmented datasets. This is a massive economic and social opportunity cost.

  2. AI Cannot Substitute for Poor Data Foundations: LLM hallucination on data queries is a critical failure mode. Example: three AI systems gave three different figures for India's fruit exports (₹83,000 crore, ₹9,500 crore, and ₹1,44,000 crore), none of them correct. Trustworthy data requires source attribution and provenance, not language-model confidence.

  3. Schema-First vs. Meaning-First Approaches: Current data systems over-value technical schema at the expense of semantic meaning. Moving toward human-centered interfaces requires deprioritizing SQL complexity in favor of natural language query semantics.

  4. "Voice is the New UI": Natural language and voice interfaces are critical for AI literacy and adoption, especially in linguistically diverse countries. Existing AI systems are English/high-bandwidth centric, excluding billions of speakers of local languages and dialects.

  5. Trust Requires Stewardship, Not Just Technology: Deep institutional distrust in India means data systems must be designed with visible, accountable stewardship. Institutional data is often trusted less than informal "grapevine" alternatives. Aadhaar's success model shows secure-by-design architecture builds trust.

  6. Disruption Has Real Costs: Interoperability and data sharing threaten established practices and livelihoods (e.g., 70+ year-old crop-cutting survey infrastructure; community identity concerns). Resistance to data silos often reflects legitimate fears of livelihood disruption and loss of autonomy.

  7. Participatory AI Auditing is Essential for Localization: Big Tech recommendations without local validation cause harm (e.g., AI-recommended urea application to farmers diverging 4x from local agricultural best practices). Auditing with domain experts, local practitioners, and affected communities prevents dangerous hallucination.

  8. Dual Innovation Strategy: Organizations should deliver both "minimum viable products" (replicable, accessible baseline solutions) and "aspirational products" (advanced offerings). Avoiding over-specialized solutions ensures ecosystem-wide benefit rather than siloed innovation.

  9. Data Producers ≠ Data Orchestrators: NSOs must evolve from being sole data producers to orchestrating entire data ecosystems—ensuring standardization, metadata linking, and quality across all data-producing organizations and agencies.

  10. Metadata Absence is a Crisis: A major pain point is "not finding data about data." Without standardized, accessible metadata describing datasets' provenance, collection methods, and limitations, even open data cannot be effectively used or integrated.
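The "data about data" gap in point 10 can be made concrete with a small sketch: a minimal metadata catalog that lets a user discover a dataset's provenance, method, and limitations before touching the data itself. This is hypothetical, not any real NSO or Data Commons schema; all field names and entries are invented for illustration.

```python
# Hypothetical sketch of "data about data": a minimal metadata catalog
# and a lookup over it. Field names loosely follow common catalog
# conventions and are illustrative only.
CATALOG = [
    {
        "id": "survey-001",
        "title": "Household Consumption Survey",
        "publisher": "Example NSO",
        "collection_method": "stratified household sampling",
        "coverage": "national",
        "limitations": "excludes institutional households",
        "keywords": ["consumption", "household", "survey"],
    },
]

def find_datasets(keyword: str) -> list[dict]:
    """Return catalog entries whose title or keywords mention the term."""
    kw = keyword.lower()
    return [
        entry for entry in CATALOG
        if kw in entry["title"].lower() or kw in entry["keywords"]
    ]

hits = find_datasets("household")
```

Even a catalog this small answers the questions the panel says open data currently cannot: who published it, how it was collected, and what it leaves out.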


Notable Quotes or Statements

  • Prem Shankar Mali (Data Commons lead): "We want to avoid the tax that's often paid by folks trying to play with data... go straight to the question you want to ask in your own natural language."

  • Rohit Badwaj (NSO India): "We value too much the schema and too less the meaning... that's when we're going to talk about data and everybody else other than data scientists."

  • Gorov Godwani (Civic Data Lab): "If you have to make AI work for public good, we need to make sure data works for public good and that cannot happen if we don't have good open government data."

  • Dr. Sinátsa (IIIT Bangalore): "The value of data is recognized but the trust with the data is not really there... with the right stewardship, data would get used and would actually benefit people."

  • Aparajita Chowri (People+AI Foundation): "It is our generation's responsibility to make our previous generation AI literate... that's our failure, not a problem of LLM."

  • Rohit Badwaj (NSO India): "Voice is the new UI. We have to ensure that in a country like India, the voice enables AI users, AI literacy."

  • Gorov Godwani (Civic Data Lab): "The problem is that we have not been able to standardize the foundation of the data. No amount of AI can compensate for a bad foundation."

  • Prem Shankar Mali: "We know that LLMs are great with language, but they're not yet great with data... [Three AI systems gave wildly different India fruit export figures]. Which number is real? Which one do I trust?"


Speakers & Organizations Mentioned

| Name | Title/Role | Organization |
| --- | --- | --- |
| Prem Shankar Mali | Founder/Lead | Data Commons |
| Rohit Badwaj | Deputy Director General, Data, Informatics & Innovation Division | NSO (National Statistics Office) India |
| Aparajita Chowri | Leading Data & AI Solutions | People+AI Foundation (previously Evestnet Yodlee) |
| Dr. Sinátsa | Dean of R&D; Heads Web Science Lab | IIIT-B (International Institute of Information Technology, Bangalore) |
| Gorov Godwani | Co-founder & Executive Director | Civic Data Lab |
| ASTEP Foundation | Policy & governance partner | Data Boarding Pass initiative |
| United Nations | Partner organization | UN Stats group; SDG data integration |
| WHO, ILO | Data partners | Integrated into Data Commons via UN backend |
| Project Bhashini | Government of India multilingual AI initiative | Gates Foundation collaboration |

Technical Concepts & Resources

Platforms & Infrastructure

  • Data Commons – Open-source platform with open APIs, schema.org integration, and natural language query interface
  • Data Boarding Pass (databingpass.ai) – Prototype service integrating government ministry data; AI-powered policy brief generation
  • MCP Server (Mossp) – NSO India's open-source tool for accessing national data portal without schema knowledge
  • Namayatri – Example of DPI (Digital Public Infrastructure) application using open data; public mobility platform
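The natural-language layer these platforms share can be sketched in miniature: map a plain-language question onto a (place, variable) pair that a structured backend could answer. The phrase table and parsing logic below are invented for illustration; a real system would use an LLM over a standard schema. The identifiers `Count_Person` and `country/IND` follow Data Commons naming conventions, but nothing here is the actual Data Commons API.

```python
# Hypothetical sketch: resolve a natural-language question to a
# structured (place, variable) query. The keyword tables are toy
# stand-ins for an LLM-plus-schema pipeline.
PHRASE_TO_VARIABLE = {
    "population": "Count_Person",
    "literacy rate": "LiteracyRate",
}
PLACE_ALIASES = {
    "india": "country/IND",
    "karnataka": "wikidataId/Q1185",
}

def parse_question(question: str):
    """Return (place_id, variable_id); None for anything unrecognized."""
    q = question.lower()
    variable = next((v for p, v in PHRASE_TO_VARIABLE.items() if p in q), None)
    place = next((dcid for name, dcid in PLACE_ALIASES.items() if name in q), None)
    return place, variable
```

The point of the sketch is the contract, not the parsing: once every dataset sits behind stable place and variable identifiers, any front end (text, voice, another language) can target the same backend.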

Data Standards & Methodologies

  • schema.org – Semantic web markup standard used by 40+ million websites; chosen for Data Commons standardization
  • Multiverse – Formalism for purpose-agnostic data integration (IIIT-B research)
  • Intervention Science – IIIT-B methodology for using data to design policy interventions and nudge outcomes
  • Participatory AI Auditing – Testing AI recommendations against domain expert and community knowledge before deployment
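As a concrete illustration of the schema.org choice above, a dataset description in that vocabulary looks like the following. The property names (`name`, `description`, `creator`, `measurementTechnique`, `license`) are real schema.org Dataset terms; the values are invented examples, not an actual published dataset.

```python
import json

# Hedged sketch: a dataset described with schema.org's Dataset type,
# the vocabulary the panel cites for standardization. Values are
# illustrative placeholders.
dataset_jsonld = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example Crop Yield Estimates",
    "description": "District-level crop yield estimates from sample surveys.",
    "creator": {"@type": "Organization", "name": "Example Statistics Office"},
    "measurementTechnique": "crop-cutting experiments",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# Serialize as JSON-LD, the form search engines and Data Commons ingest.
print(json.dumps(dataset_jsonld, indent=2))
```

Because the same vocabulary is already used by tens of millions of websites, a dataset marked up this way is discoverable by generic tooling rather than locked to one ministry's portal.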

Data Sources

  • Ministry of Statistics (India) – Government administrative data integrated into Data Boarding Pass
  • Kisan Call Centers – Agricultural knowledge base; benchmark for farm advice validation
  • Crop-Cutting Surveys – 70+ year historical practice for crop yield measurement (standardization challenge)
  • Micro Data Portals – NSO India's individual-level survey records (1M+ records per survey)
  • Sustainable Development Goals (SDGs) – UN data integrated via Data Commons backend

Languages & Accessibility

  • Project Bhashini – Government of India initiative for multilingual AI; focus on local languages, dialects, and voice-only languages
  • Sindhi, Hindi, regional languages – Use cases for low-literacy, oral-knowledge communities

Governance & Quality Frameworks

  • Aadhaar – Model for trust-by-design in digital systems; secure authentication architecture
  • Chief Data Officers – Role being developed for data standardization and AI-readiness across government agencies

Additional Context & Recommendations

Gaps Identified

  • Metadata scarcity: Difficulty finding and describing dataset provenance, collection methods, and limitations
  • Skill mismatch: Non-technical decision-makers cannot independently access or integrate fragmented data sources
  • Hallucination risk: Large language models generate confident-sounding but incorrect answers on numerical/factual data queries
  • Localization failures: AI systems trained on English, high-bandwidth assumptions fail for 90% of India's population
  • Trust deficits: Institutional data less trusted than informal community knowledge; requires visible stewardship to overcome
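The hallucination and trust-deficit gaps above point to the same remedy: never circulate a figure without its provenance. A minimal sketch, assuming nothing about any real platform's schema (all field names and values here are invented):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttributedStat:
    """A statistic that carries its provenance, so it is never quoted bare."""
    variable: str          # what is measured, e.g. an export value
    value: float           # the figure itself
    unit: str              # stated units, e.g. "INR crore"
    source: str            # publishing agency (illustrative name below)
    reference_period: str  # the period the figure describes

    def citation(self) -> str:
        # Render the value together with its source and period.
        return (f"{self.variable}: {self.value:,.0f} {self.unit} "
                f"({self.source}, {self.reference_period})")

stat = AttributedStat(
    variable="ExampleExportValue",
    value=12345.0,
    unit="INR crore",
    source="Example Ministry Portal",
    reference_period="FY 2022-23",
)
print(stat.citation())
```

A data API that returns only `AttributedStat`-shaped records, rather than bare numbers, makes the "which figure do I trust?" failure mode from the panel's fruit-export example structurally impossible to hide.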

Next 6–12 Month Goals (Per Panelists)

  1. ASTEP Foundation: Publish "data boarding pass" conceptual framework; expand AI-ready database standardization across agencies
  2. NSO India: Launch "data as a service" AI stack with natural language access; enhance MCP server and micro data portal accessibility
  3. IIIT-B: Train ecosystem personnel; publish research on data integration formalisms and intervention science applications
  4. Civic Data Lab: Build Chief Data Officer capacity; enable change makers to consume standardized, near-real-time, AI-ready government data

Document Type: AI Conference Panel Discussion
Focus Areas: Open Data Infrastructure, Trustworthy AI, Public Goods, India-Specific Implementation
Relevance: Policy makers, AI practitioners, data scientists, government statisticians, international development organizations