Using AI to Strengthen Public Service Delivery: Evidence & Impact | India AI Impact Summit 2026
Executive Summary
This talk presents evidence from a cash transfer program in Togo using AI and mobile phone data for beneficiary targeting during COVID-19, demonstrating both the promise and significant limitations of AI in governance. While AI successfully identified poor households for rapid program deployment, it failed to measure program impact—a critical lesson that AI is powerful in specific contexts but neither universal nor magical, requiring rigorous evaluation and careful deployment in government systems.
Key Takeaways
- AI is powerful for targeted tasks with rich historical data but cannot be applied universally. Success requires matching AI to specific problems with appropriate data quality and domain complexity. Rigorous evaluation before scaling is non-negotiable, especially in heterogeneous contexts (education, health, agriculture).
- Start with acute pain points, not aspirational systems. Solutions addressing specific, felt problems (judge burnout from manual transcription, targeting vulnerable populations in crises) generate organic demand and scale naturally. System-wide improvements require trust first, built by solving visible problems.
- Governments need measurement infrastructure and evidence capacity, not just AI tools. The bottleneck is not technology but the ability to monitor performance (telemetry), design valid experiments, and interpret results. Only 45% of government leaders plan to evaluate their pilots; this must increase.
- AI and human agency must not be zero-sum. The risk of automation is that it erodes human judgment and accountability. Augmentation (AI assisting humans) should be prioritized over agency transfer (AI replacing human decisions), with clear governance frameworks in place before deploying decision-making AI.
- Scaling is a governance and organizational challenge, not a technical one. Model drift, data availability, third-party audits, workforce training, and task-level workflow integration matter more than algorithmic sophistication. Success requires partnership among technologists, policymakers, and evidence experts.
Key Topics Covered
- AI for poverty targeting and welfare delivery: Using mobile phone data and machine learning to identify beneficiaries for cash transfers in contexts with limited administrative data
- Evaluation challenges in AI deployments: Speed vs. rigor, iteration, model drift, and the limitations of AI relative to survey data
- Government readiness for AI: Global and Indian perspectives on preparedness, capacity gaps, and implementation challenges
- Bureaucratic automation vs. decision-making automation: Different risk profiles and government capacity for different AI applications
- Scaling pilots to national programs: Bottlenecks in moving from proof-of-concept to implementation, data quality requirements, and design flaws
- Measurement and evidence in government AI: Importance of telemetry, counterfactual thinking, intermediate outputs, and third-party audits
- AI for justice systems: Automating stenography and workflow rather than judicial decision-making
- Top-down vs. task-focused AI deployment: The importance of addressing actual pain points rather than imagined systemic improvements
Key Points & Insights
- Mobile phone data can rapidly identify poverty but cannot measure program impact: In the Togo case study, AI successfully predicted consumption levels and identified poor households using call frequency, travel patterns, and texting behavior. However, when used to measure the impact of cash transfers, the AI-based impact estimates had confidence intervals that included zero, failing to detect treatment effects that survey data clearly showed. Likely reasons: the model measured different constructs (short-term vulnerability vs. long-term assets), drifted during COVID, and could not capture subtle changes.
- AI evaluation timelines conflict with government needs: The tension between rigorous evaluation (requiring years) and crisis response (requiring immediate action) is real but solvable. The solution lies in identifying valid short-run proxies for long-run outcomes and intermediate outputs (e.g., cases processed per day) that can approximate final impacts without waiting years.
- Governments are unprepared for AI's societal implications despite optimism: 90% of public servants are optimistic about AI, but only 30% have examined automation's impact on their own roles, and only 26% of AI implementers understand their government's ethical frameworks. This gap between enthusiasm and preparedness creates risk.
- Data quality and heterogeneity determine AI suitability: AI works well in high-data, low-heterogeneity domains (banking, GST, tax analysis). It becomes problematic in high-heterogeneity domains (education, agriculture, health) where outcomes depend on complex, context-dependent factors. The solution is not to avoid AI but to design rigorous, context-specific experiments before scaling.
- Painkillers scale better than multivitamins: AI solutions addressing acute pain points (e.g., manual stenography in courts causing judge burnout) generate immediate demand and pull systems toward adoption. Aspirational improvements (multivitamins) lack organic demand and require sustained push, reducing scalability. Strategic sequencing, starting with painkillers to build credibility before tackling systemic improvements, is more effective.
- Automation of tasks differs fundamentally from automation of decisions: Automating transcription, case flow management, or paperless filing carries lower risk of harm than automating judicial or loan decisions. Governments should prioritize task automation while building capacity for decision-making automation through regulation, skill development, and infrastructure.
- Third-party audits and trust are as important as statistical accuracy: Scaling AI solutions requires not just statistical validation but institutional trust and credible third-party evaluation (sensitivity, specificity, false positives/negatives). This is particularly critical for civil society and nonprofits entering government spaces without precedent.
- Pilots fail to scale due to poor experimental design and missing telemetry: 70% of government leaders have pilots, but only 50% plan to scale them and only 33% have made their data AI-ready. Failures stem from lack of task-level workflow analysis, missing monitoring data (telemetry), and inadequate third-party audit before rollout.
- The counterfactual is essential but often forgotten: Intuitive metrics (e.g., "they got a job after training") mislead without counterfactual analysis. A training program could reduce the likelihood of employment by wasting participants' time, yet still show positive raw numbers if most people would have found jobs anyway. Rigorous evaluation requires asking "what would have happened otherwise?"
- Foreign aid and development work benefit from government capacity-building rather than direct service provision: Instead of funding specific programs, a high-leverage approach partners evidence organizations with governments to build their capacity to evaluate, procure, and scale AI solutions. This amplifies impact by leveraging governments' far larger budgets and reach.
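The counterfactual point above can be made concrete with a toy calculation. The numbers below are hypothetical, not from the talk; they simply show how a program that looks successful on raw outcomes can have a negative true effect.

```python
# Toy illustration of why the counterfactual matters (hypothetical numbers).
# Suppose 80% of trainees get a job after a training program. That sounds
# like success -- until we ask what would have happened without training.

trained_job_rate = 0.80   # observed: share of trainees who found a job
control_job_rate = 0.85   # counterfactual: comparable people without training

naive_reading = trained_job_rate                    # "80% got jobs, it works!"
true_effect = trained_job_rate - control_job_rate   # negative: the program hurt

print(f"Naive success rate: {naive_reading:.0%}")
print(f"Effect vs. counterfactual: {true_effect:+.0%}")
```

The naive reading celebrates the 80%, while the comparison against the counterfactual group reveals a 5-percentage-point reduction in employment. This is the gap a randomized control trial is designed to measure.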
Notable Quotes or Statements
- Dean Karlan (J-PAL): "AI has incredible promise, as we all know; this is why this room is full. But it doesn't mean it's magic, and it doesn't mean it's going to solve all things."
- Dean Karlan: "If you're not as stressed about where you're going to get your next meal from, then your mental health goes up." (on the mechanisms through which cash transfers work)
- Dean Karlan: "The challenge that I think we face is that that's a tough message. Okay, what do I do with that? ... Sometimes it works and sometimes it doesn't? I've got to make a decision."
- Robin Scott (apolitical): "AI loves bureaucracy" (on where AI's promise is strongest: automating administrative tasks)
- Robin Scott: "We talk a lot now about agentic AI in government. We don't talk about the consequences for people, and I am really worried that unless we talk about agentic humans, and don't make it a zero-sum dynamic where the AI gets more and more agency and the humans get less and less, you get real problems."
- Utkash (Adalat AI): "Painkillers do better than multivitamins... scaling is not so much a push from your side as being pulled in by the system, because you've identified a painkiller."
- Muhammad Safirah (India AI Mission): "Before asking whether we actually need AI, ask whether you need an IT intervention at all. Step back and see what can be eliminated from the current process."
Speakers & Organizations Mentioned
Primary Speaker:
- Dean Karlan – J-PAL (Abdul Latif Jameel Poverty Action Lab), economist, leading researcher on cash transfers and AI in development
Panelists:
- Robin Scott – Co-founder & CEO, apolitical (global government platform with 200,000+ public servants)
- Muhammad Safirah – Director, India AI Mission, Ministry of Electronics and Information Technology (Government of India)
- Utkash – Co-founder & CEO, Adalat AI (legal tech nonprofit automating court processes)
- Kapil Vishwanathan – President, IFMR; Vice Chairman, Krea University; host of the discussion
Organizations & Programs Referenced:
- J-PAL South Asia – Research partner on Togo cash transfer program
- Novissi Program – Cash transfer program in Togo (COVID-era)
- Adalat AI – Nonprofit automating transcription, case management, and digitization in Indian courts
- apolitical – Platform for government learning and best practice sharing
- India AI Mission – Government of India initiative for AI adoption
- Graduation Program – Household development program tested in West Bengal (by Bandhan) and globally
- IFMR (Institute for Financial Management and Research)
- Krea University – Education institution advancing India's AI ecosystem
Other Entities:
- Government of Togo
- World Bank (implied context)
- Ministry of Social Protection (Peru)
- Various state governments in India (Kerala, etc.)
- Global South justice systems (Nairobi, Colombo, referenced)
Technical Concepts & Resources
AI/ML Methods & Approaches:
- Mobile phone data analysis – Using call frequency, call direction (inbound/outbound), travel patterns, text messages to predict poverty
- Machine learning pipeline – Training models on survey consumption data ("truth data") to predict poverty from phone metadata
- Proxy means test (PMT) – Traditional regression-based method for identifying key variables that predict poverty; the baseline against which the AI approach was compared
- Randomized control trials (RCT) – Gold-standard evaluation design used to measure cash transfer impact
- Unsupervised learning/clustering – COVID example: clustering co-morbidities to identify high-risk patients
- Model drift – Key limitation identified: models trained on pre-COVID data performed poorly on COVID-era phone behavior
- Telemetry – Real-time monitoring data needed to track AI performance in deployment (often missing)
- Natural language processing – Court transcription (speech-to-text) and document digitization (image-to-text)
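The targeting pipeline described above is, at its core, supervised learning: fit a model mapping phone-metadata features to survey-measured consumption ("truth data") for a surveyed subsample, then predict consumption for all subscribers and enroll the poorest. A minimal sketch follows, with least squares standing in for the actual ML model; the feature names and all numbers are hypothetical, not from the Togo dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical phone-metadata features per subscriber:
# [calls_per_day, share_outbound, distinct_towers_visited]
n_surveyed, n_all = 200, 1000
X_survey = rng.uniform(0, 1, size=(n_surveyed, 3))

# Survey "truth data": consumption loosely related to phone behavior, plus noise
true_w = np.array([0.5, 0.3, 0.8])
y_survey = X_survey @ true_w + rng.normal(0, 0.05, n_surveyed)

# Fit consumption ~ phone features on the surveyed subsample
A = np.c_[X_survey, np.ones(n_surveyed)]           # add intercept column
w, *_ = np.linalg.lstsq(A, y_survey, rcond=None)

# Predict consumption for everyone with metadata; target the poorest 10%
X_all = rng.uniform(0, 1, size=(n_all, 3))
pred = np.c_[X_all, np.ones(n_all)] @ w
cutoff = np.quantile(pred, 0.10)
beneficiaries = np.flatnonzero(pred <= cutoff)
print(f"{len(beneficiaries)} of {n_all} subscribers selected for transfers")
```

Note the structural risk this sketch makes visible: the model is trained once on pre-crisis survey data, so if phone behavior shifts (model drift, as during COVID), predictions degrade silently unless telemetry is in place to detect it.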
Evaluation Frameworks & Concepts:
- Counterfactual reasoning – Asking "what would have happened without the intervention?" (emphasized as critical but often forgotten)
- Surrogate outcomes – Short-run proxies for long-run impacts (e.g., cases processed per day as proxy for case resolution time)
- Intermediate outputs – Measurable process metrics (judgments per day, witnesses examined, bail orders) used to estimate long-run impact
- Sensitivity, specificity, false positives/negatives – Granular metrics for model performance validation
- Third-party audit – External validation to build institutional trust
- Telemetry & monitoring – Data infrastructure to track real-world AI performance
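The audit metrics listed above are all simple functions of a confusion matrix. A minimal sketch with hypothetical counts (not from the talk) shows how a third-party auditor would compute them for a targeting model:

```python
# Confusion-matrix audit metrics for a targeting model (hypothetical counts).
# "Positive" = model flags a household as poor / eligible for transfers.
tp, fn = 420, 80    # truly poor households: correctly flagged vs. missed
tn, fp = 450, 50    # non-poor households: correctly excluded vs. wrongly flagged

sensitivity = tp / (tp + fn)          # share of poor households reached
specificity = tn / (tn + fp)          # share of non-poor correctly excluded
false_positive_rate = fp / (fp + tn)  # leakage: transfers going to non-poor
false_negative_rate = fn / (fn + tp)  # exclusion error: poor left out

print(f"sensitivity={sensitivity:.2f}  specificity={specificity:.2f}")
print(f"FPR={false_positive_rate:.2f}  FNR={false_negative_rate:.2f}")
```

In welfare targeting, the false negative rate (exclusion error) and false positive rate (leakage) carry different political and humanitarian costs, which is why the talk stresses reporting these metrics separately rather than a single accuracy number.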
Data & Context:
- Mobile phone metadata – 5.83 million subscribers, 1.3 billion calls in Togo dataset
- Survey consumption data – Assets and food security as "truth labels" for poverty
- Census data – Problem identified: outdated (1943–1986 in some countries); inadequate for targeting vulnerability
- Administrative data – Tax records, employment records (richer in wealthy countries; limited in informal economies)
- Voter registration data – Used as proxy for urban informal worker identification in Togo
Policy & Implementation Concepts:
- Cash transfer programs – Food security, household enterprise investment, labor supply effects
- Targeting – Identifying beneficiaries for welfare programs (key challenge in low-data contexts)
- Task automation vs. decision automation – Different risk profiles; task automation (stenography, transcription) lower-risk than decision automation (loan approval, judicial decisions)
- Pilot scalability – Distinction between painkillers (acute problems with organic demand) and multivitamins (aspirational improvements)
- Procurement of AI systems – Need for evidence-based selection of vendors and tools by government
- Government capacity building – High-leverage approach vs. direct service provision in development work
Governance & Readiness Metrics (from apolitical data):
- 90% of public servants optimistic about AI
- 30% of government leaders examined automation's impact on their own roles
- 26% of AI implementers understand their government's ethical framework
- 70% of leaders have AI pilots
- 50% of pilots have scaling plans
- 45% of leaders plan to evaluate pilots
- 33% have data ready for AI
- 20% of public servants comfortable understanding AI skills needed
Document Type: Conference talk transcript with panel discussion
Event: India AI Impact Summit 2026
Duration: ~60 minutes (lecture + 25-minute panel)
Key Audience: Policy makers, AI practitioners, development professionals, government officials, researchers
