Why AI Evaluation Matters | Building Trust and Impact in the Social Sector
Contents
Executive Summary
This AI Impact Summit panel addressed why rigorous evaluation frameworks are critical for AI systems deployed in social sectors—healthcare, agriculture, education, and workforce development. Speakers presented a practical four-level evaluation methodology, human evaluation protocols, and real-world case studies from nonprofits implementing AI at scale, emphasizing that evaluation must be embedded from day one, not retrofitted after deployment. The core insight: AI's societal impact depends not on technology capability alone but on institutional design, human oversight, and commitment to equitable outcomes.
Key Takeaways
- Embed evaluation from day one. Don't wait for production. Use minimal viable evaluation frameworks at each confidence level (model, product, user, impact) cyclically. This is cheaper and less risky than retrospective assessment.
- Human expertise is irreplaceable in social-sector AI. Combine automated and human evaluation. Domain experts, cultural insiders, and end-users catch errors and edge cases that pure algorithmic assessment misses—especially across languages and geographies.
- AI's societal impact hinges on institutions and leadership, not just algorithms. Technical capability is necessary but insufficient. The question is not "will AI eliminate jobs?" but "will we shape AI to expand opportunity or concentrate it?"
- Track outcomes beyond model metrics. Measure turnaround time, user satisfaction, actual behavior change (e.g., caregiver implementation of advice), and welfare impact (morbidity reduction, income gains, access expansion). Each level validates the previous one.
- Democratization requires deliberate infrastructure. Connect labor market data (EPFO, NCS platforms), skill signals, and reskilling pathways. Vernacular AI, multilingual support, and bias auditing are not luxuries—they're structural necessities for equitable AI.
Conference Talk Summary
Key Topics Covered
- Evaluation Frameworks for AI: Four-level evaluation model (safety/accuracy → user engagement → behavioral change → human welfare impact)
- Human vs. Automated Evaluation: Methodology for grounded, culturally-informed AI assessment in multilingual contexts
- Healthcare AI: Real-world implementation of AI-augmented nursing response systems in maternal/child health
- Job Market Transformation: AI's role in skills matching, reskilling, and equitable access to opportunities in India
- Agricultural AI: Multi-turn, multilingual AI advisor for smallholder farmers with domain-specific evaluation
- Education & AI Literacy: Curriculum redesign to prepare students for AI-augmented work
- Government Policy: Ministry of Labour initiatives for AI-enabled job matching and skill development
- Responsible AI & Bias Detection: Guardrails, runtime evaluation, and toxicity/gender bias screening
Key Points & Insights
- Evaluation is not optional—it's risk management. Only 5% of enterprise AI pilots reach successful production deployment; the gap is not model capability but integration friction and lack of rigorous evaluation. In healthcare especially, a 1-2% error rate can mean preventable deaths.
- Four-level evaluation provides successive confidence-building, not linear progression. Level 1 (model safety/accuracy) → Level 2 (user engagement) → Level 3 (behavioral/attitudinal change) → Level 4 (welfare impact/RCTs) work cyclically. Organizations should invest in minimal viable evaluation at each level rather than rushing to expensive impact studies.
- Human evaluation is essential in multicultural, multilingual contexts. Automated evaluation misses nuance, edge cases, and culturally specific errors. A grounded approach (developing labels from data rather than predefined taxonomies) captures domain-specific failure modes—e.g., PII exposure through caste categories, out-of-scope responses in reproductive health bots.
- AI reallocates value; it doesn't necessarily eliminate jobs. Historical evidence (ATMs, computers, automation) shows employment adjusts over time. The real danger is stagnation from resisting AI. The question is: will AI concentrate opportunity (serving only the top 10%) or democratize it (reaching warehouse workers, nurses, and salespeople in Tier 2/3 cities)?
- Skills must become dynamic capabilities, not static credentials. Single careers are obsolete; individuals need modular, continuously refreshed capabilities. AI augments human judgment rather than replacing it—the value lies in combining human empathy, creativity, and decision-making with AI's analytical speed.
- Evaluation must track multiple outcomes simultaneously: model performance (precision, recall, accuracy), product metrics (turnaround time, nurse acceptance rates), user behavior (engagement, retention, application of advice), and impact (morbidity/mortality reduction, income improvement, access expansion).
- "Hallucination" and scope creep are critical failure modes in social-sector AI. Bots trained on general knowledge bases respond to out-of-scope queries (e.g., nutrition advice when nutrition is not in the knowledge base). Tracking and filtering edge cases is essential; don't assume scope from training data alone.
- Runtime evaluation beats post-hoc analysis. Small evaluations during query processing (topic relevance, toxicity, safety checks) prevent harmful responses before they reach users. This is superior to analyzing traces two months later.
- Data quality and infrastructure matter as much as models. Conversational IDs, proper data labeling, PII scrubbing, and synthetic data generation are prerequisites. Technical debt in data engineering undermines even excellent models.
- Inclusive AI requires intentional design across three pillars: job matching (NCS platform in India connecting 3.64 crore job seekers to employers), skilling/reskilling (real-time labor market data informing course design), and inclusion (multilingual interfaces, bias screening, rural access parity).
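The runtime-evaluation pattern described above (small checks during query processing, before a response reaches the user) can be sketched as follows. This is a minimal illustration, not the panelists' implementation: the topic list, blocklist phrases, and fallback messages are invented stand-ins for real classifiers and escalation flows.

```python
# Hedged sketch of runtime guardrails: cheap checks applied to each
# query/response at inference time. In production these would be
# classifier or LLM calls, not keyword lookups.

ALLOWED_TOPICS = {"maternal health", "breastfeeding", "vaccination"}  # illustrative scope
BLOCKLIST = {"guaranteed cure", "stop medication"}  # stand-in for a safety/toxicity model

def topic_in_scope(query_topic: str) -> bool:
    """Topic-relevance check: is this query within the bot's knowledge base?"""
    return query_topic in ALLOWED_TOPICS

def passes_safety(response: str) -> bool:
    """Safety check: reject drafts containing blocklisted phrases."""
    text = response.lower()
    return not any(phrase in text for phrase in BLOCKLIST)

def guarded_reply(query_topic: str, draft_response: str) -> str:
    """Return the draft only if all runtime checks pass; otherwise
    fall back to a safe refusal or human escalation."""
    if not topic_in_scope(query_topic):
        return "I can only help with maternal and child health questions."
    if not passes_safety(draft_response):
        return "Let me connect you with a nurse for this one."
    return draft_response
```

The key design point is that each check is cheap enough to run on every query, so harmful or out-of-scope responses are filtered at the moment of generation rather than discovered in trace analysis months later.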
Notable Quotes or Statements
- Kartik Narayan (recruitment industry): "AI does not necessarily eliminate work; it reallocates value. It destroys routine tasks but increases the value of judgment."
- Kartik Narayan: "The choice is not between AI or not AI. The choice is between shaping AI or being shaped by AI."
- AJ Sharma (Ministry of Labour & Employment): "The real question is not whether AI will do something, but whether we will use AI to expand opportunities rather than restrict them."
- Kusumgarwal (academia): "Do not fear AI; just embrace AI."
- Edmund Ki (Agency Fund): "With generative AI, tools are changing and improving so rapidly. With increased capability comes increased risk. Investing in evaluations to make sure your AI intervention is doing what it's supposed to do is more important than ever."
- Purva (Nura Health): "Healthcare doesn't happen in hospitals. Healthcare happens at home." (On why AI-augmented caregiver support is critical.)
- Purva: "In a setting like healthcare, evaluation needs to be built in from day one. It is non-negotiable."
- Closing statement (moderator): "Technology must serve human potential, and human potential must define our technological future."
Speakers & Organizations Mentioned
Private Sector & Industry:
- Kartik Narayan – CEO, Jobs Marketplace / former CEO, staffing company
- Purva – Head of Engineering, Nura Health (maternal/child health AI, 50M people impacted across 4 countries, 11 languages)
- Venod – Tech for Dev (AI evaluation platform for nonprofits)
- Digital Green – Agricultural tech nonprofit (Farmer Chat, 1M+ users, 15+ languages, India & North Africa)
Government:
- AJ Sharma – Additional Secretary & Director General, Ministry of Labour & Employment, Government of India
- National Career Service (NCS) Platform – 3.64 crore registered job seekers, 6.43 crore vacancies posted
- PM TV (skill development) – 3.2 lakh trained in AI & big data
- EPFO (Employees' Provident Fund Organization) – 28 lakh organizations, 34 crore formal sector workers' data
Academic & Research:
- Kusumgarwal (identified as educator/academia leader on curriculum design)
- Tattle Civic Technologies – Vish (human evaluation methodology research)
- Edmund Ki – Agency Fund (four-level evaluation framework, $4M accelerator for 8 nonprofits)
Private Platforms & Partnerships:
- APNA, Monster (job matching platforms)
- Microsoft (MOU with NCS for 15,000+ partner vacancies & AI courses)
- Bhashini (multilingual AI translation)
Technical Concepts & Resources
Evaluation Frameworks & Methodologies:
- Four-Level Evaluation Framework (Agency Fund): (1) Safe & accurate responses → (2) User engagement/value → (3) Behavioral/attitudinal change → (4) Human welfare impact
- Grounded Annotation Approach (Tattle Civic Technologies): Develop labels inductively from data rather than predefined taxonomies; adapt to use case, context, and edge cases
- Minimal Viable Evaluation (MVE): Bare minimum rigor at each level before escalating; cyclical, not linear progression
- Golden Set of Questions & Answers: Create reference correct Q&A pairs for model testing; can be generated by AI then adapted by humans
- Runtime Evaluation / Guardrails: Real-time checks during inference (topic relevance, toxicity, safety, content relevance) rather than post-hoc traces
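A minimal sketch of the golden-set idea listed above: reference Q&A pairs scored against model answers. The harness, the token-overlap F1 scorer, and the sample pair below are illustrative assumptions; real deployments would pair this with richer scoring such as LLM-as-judge.

```python
# Hedged sketch of a golden-set evaluation harness. A "golden set" is a
# list of reference Q&A pairs; each model answer is scored here by
# token-overlap F1 against the reference answer.

def token_f1(predicted: str, reference: str) -> float:
    """F1 overlap between predicted and reference answer tokens."""
    pred = predicted.lower().split()
    ref = reference.lower().split()
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if not common:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

# Illustrative golden pair (invented, not from the talk).
GOLDEN_SET = [
    {"q": "How often should a newborn be breastfed?",
     "a": "Newborns should be breastfed on demand, roughly 8 to 12 times in 24 hours."},
]

def evaluate(model_fn, golden_set, threshold=0.5):
    """Run model_fn over every golden question; report mean score and
    the fraction of answers clearing the overlap threshold."""
    scores = [token_f1(model_fn(item["q"]), item["a"]) for item in golden_set]
    passed = sum(s >= threshold for s in scores)
    return {"mean_score": sum(scores) / len(scores),
            "pass_rate": passed / len(scores)}
```

As the framework notes, the golden set itself can be drafted by an AI and then corrected by domain experts, which keeps the minimal-viable-evaluation cost low.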
Evaluation Metrics & Models:
Healthcare (Nura Health):
- Intent Recognition: 98% precision (cross-validation on labeled dataset)
- Emergency Detection: 98% precision (manual validation by doctors/nurses)
- Summary & Retrieval Accuracy: 90–95%
- Response Generation: Rubric-based (medical accuracy, context awareness, communication quality)
- Product Metrics: 36% turnaround time improvement (Indonesia), 21% (Bangladesh); 80% message filtering; 95% nurse acceptance of AI-drafted responses
Agriculture (Digital Green):
- Factuality Metrics: Overlap of model-predicted facts with golden answers
- Gender-Specific Performance: 3–3.5x variation across crop types
- Speech-to-Text Error Analysis: Domain-specific error trees for Hindi/vernacular
- User Survey (in-app): 70% of active users applied advice within 30 days
- Complexity Levels: Expert, intermediate, basic skill requirements
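The factuality metric above (overlap of model-predicted facts with golden answers) can be sketched as set overlap between extracted facts. The sentence-splitting "extractor" here is a deliberately naive stand-in for the publicly available fact-extraction packages referenced in the talk, which use LLM-based extraction.

```python
# Hedged sketch of fact-overlap factuality scoring: extract atomic facts
# from both answers, then measure what fraction of golden facts the
# model answer covers.

def extract_facts(answer: str) -> set:
    """Stub extractor: one 'fact' per sentence, lowercased.
    A real pipeline would use an LLM-based fact-extraction package."""
    return {s.strip().lower() for s in answer.split(".") if s.strip()}

def factuality(model_answer: str, golden_answer: str) -> float:
    """Fraction of golden facts supported by the model answer (recall)."""
    golden = extract_facts(golden_answer)
    predicted = extract_facts(model_answer)
    if not golden:
        return 0.0
    return len(golden & predicted) / len(golden)
```

Because the score is computed per question, it can be sliced by crop type or respondent gender, which is how variation like the 3–3.5x gender-specific performance gap becomes visible.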
Tech for Dev (Platform):
- Similarity Metrics: Cosine similarity of AI responses to ground truth
- LLM-as-Judge Scoring: Secondary model judges response quality
- Multi-lingual Testing: Systematic evaluation across English, Hindi, romanized Hindi combinations
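To illustrate the similarity metric listed above, here is a bag-of-words cosine similarity. This is an assumption-laden sketch: a production platform would compute cosine over sentence embeddings, and the LLM-as-judge step is a separate secondary-model call not shown here.

```python
import math
from collections import Counter

# Hedged sketch of cosine similarity between an AI response and a
# ground-truth answer, using bag-of-words vectors for self-containment.

def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between token-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

The same scoring loop can be run over English, Hindi, and romanized-Hindi variants of each test question, which is the systematic multilingual testing the platform performs.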
AI Models & Tools Referenced:
- Claude, GPT, Gemini (LLM evaluation & response generation)
- Bhashini (multilingual NLP)
- Python packages for fact extraction (referenced, available publicly)
- Synthetic data generation for PII-safe evaluation
- PII scrubbing models (for data de-identification)
Data & Infrastructure:
- EPFO Database: 34 crore formal sector worker records; being integrated with NCS for real-time labor market signals
- National Career Service (NCS) Platform: Connecting job seekers, employers, APNA, Monster; AI-generated resume building, career counseling
- Periodic Labour Force Survey (PLFS) & Claims Data: Unemployment trends, labor force participation rates (declining unemployment despite AI)
- EvaluateFarmer Tool: Internal web platform for agriculture experts to annotate 22,000 Q&A pairs; flags wrong, incorrect, irrelevant, missing info
Published Resources:
- Papers on fact extraction, speech-to-text error analysis in agriculture AI (Digital Green)
- Python packages for fact-checking and domain-specific evaluation frameworks (publicly available)
- Open-source annotation guides and rubrics (available for reuse across domains)
Context: Broader Themes
This summit occurred at a defining moment for AI in India—a nation with 65% of its population under 35, a median age of 29, and "Viksit Bharat 2047" ambitions. Key tensions:
- Automation vs. Opportunity Expansion: Will AI concentrate wealth/access or democratize it?
- Skills Crisis: Shelf life of technical knowledge has collapsed; static credentials are obsolete; continuous reskilling is mandatory.
- Equity & Access: 12 million kirana shops (retail), millions of rural Indians outside formal credit, non-English speakers, women re-entering workforce—all risk being left behind without intentional design.
- Institutional Coordination: No single ministry or sector can solve this; whole-of-government approaches (labor, skills development, IT, education) are essential.
The summit reflects a shift from asking "Will AI replace jobs?" (settled: it reallocates value) to "How do we govern this transition equitably?"—requiring evaluation frameworks, policy-market alignment, and human-centered AI design.
