Building High-Quality AI for Education: From Innovation to System-Wide Impact
Executive Summary
This panel discussion at an AI summit examines the ecosystem required to implement AI in education safely and at scale, particularly in low- and middle-income countries (LMICs) in sub-Saharan Africa and South Asia. Rather than treating AI as an inherent solution, speakers emphasize that quality assurance, rigorous evaluation, contextual adaptation, and system-level coherence are prerequisites for achieving learning outcomes. The session bridges three critical areas: evaluation frameworks, benchmarking standards, and real-world scaling evidence from India.
Key Takeaways
- "Benchmarks are not about winners and losers. They're about making decisions that protect children and improve teaching and learning." — Quality assurance serves decision-makers and practitioners, not competition. Evaluation frameworks should reduce risk and guide evidence-based adoption.
- Scaling is not a bigger pilot; it's about the system, not the product. — Technical excellence is necessary but insufficient. Success depends on political economy, teacher incentives, curriculum alignment, administrative capacity, and social norms—factors beyond the tool itself.
- Context is non-negotiable. — Tools validated globally may fail locally if misaligned with curriculum, infrastructure, social norms, or data governance requirements. Contextual evaluation is a distinct layer, not an afterthought.
- Rapid, multi-method evaluation is feasible and valuable. — Organizations need not choose between rigor and speed. Benchmarks, rapid field evaluations (8–10 weeks), and traditional RCTs serve different decision-making needs at different product lifecycle stages.
- Leverage existing infrastructure and psychology. — Universal platforms (WhatsApp, messaging) and human behavior (the compulsion to respond to messages) can be harnessed to reduce deployment friction and cost, particularly in resource-constrained contexts.
Key Topics Covered
- Quality Assurance for AI in Education: Multi-layered evaluation frameworks covering global safety, pedagogical effectiveness, assessment accuracy, and local contextual fit
- Benchmarking and Standards: Technical benchmarks (pedagogical knowledge, visual reasoning) and their role in pre-deployment decision-making
- Evidence and Rapid Evaluation: Balancing rigor with speed in measuring impact; alternatives to traditional RCTs; cognitive offloading concerns
- Scaling Infrastructure: Technical and architectural requirements for population-scale deployment in government education systems
- System-Level Enablers: Political economy, teacher motivation, curriculum coherence, social norms, and social compliance
- Teacher Education & Inclusive Design: Perspectives from teacher-training institutions; sensitivity to diverse learner populations including students with disabilities
- Messaging Infrastructure as Foundation: Leveraging ubiquitous platforms (WhatsApp) for cost-effective, high-conversion AI delivery
Key Points & Insights
- Evaluation Evidence Gap: Only 9% of 352 mapped edtech tools in South Asia and sub-Saharan Africa had any evidence of effectiveness, despite millions of children using these tools. This creates urgency for standardized quality assurance.
- Multi-Layered Quality Assurance Model: FabAI's framework evaluates tools across four distinct layers—global (safety, bias), educational (learning outcomes), technical (assessment accuracy), and contextual (curriculum alignment, infrastructure, social norms)—recognizing that passing one layer doesn't guarantee success in others. (A minimal sketch of such a layered gate follows this list.)
- Language and Localization Degradation: Pedagogical benchmark performance drops ~15% on average when translated into major African languages (Kiswahili, Xhosa); small language models degrade even further. Human translation sometimes outperforms AI translation, suggesting over-reliance on automated localization.
- Quality Assurance ≠ Accuracy Alone: Traditional metrics like accuracy are insufficient. Quality requires measuring pedagogical soundness, harmful effects, cognitive offloading, and context-specific learning outcome improvements.
- Rapid Evaluation as Complement to RCTs: 8–10 week rapid evaluations with lower costs can assess whether tools improve learning outcomes, while traditional RCTs remain valuable for understanding effect persistence and sustainability.
- Constrained Information Sources for Safety: Systems like MomConnect (health chatbot) and teacher lesson-planning tools achieve quality by constraining AI to official government guidelines and curriculum knowledge graphs, preventing hallucination and misinformation.
- Conversational AI's Psychological Stickiness: Messaging interfaces (WhatsApp, SMS) leverage human psychology—users feel compelled to respond to messages—enabling 99% conversion rates and strong user engagement for educational content.
- The "Enoughs" Framework: Products scaling in education systems must be good enough (impact-proven), big enough (addressing large-scale need), simple enough (minimal teacher friction), and cheap enough (sustainable cost per student).
- Teacher Coherence Problem: Scaling failures occur when teacher expectations conflict with AI tool design—e.g., teachers prioritizing curriculum completion while AI tools adapt to individual student pace. System-level incentive alignment is critical.
- Infrastructure Unbundling at Scale: Population-scale systems require decoupling communication protocols, application layers, transactional data, and knowledge layers to handle 150,000–200,000+ requests per minute while meeting data sovereignty, privacy, and compliance requirements.
- Inclusive Design Beyond Foundational Literacy: Teacher-training institutions highlight the need for AI solutions serving diverse populations (students with disabilities, deaf and hard-of-hearing students, athletes) with personalized feedback and meaningful teacher-child relationships.
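The multi-layered model above lends itself to a simple gating structure. Below is a minimal sketch in Python; the layer names follow the summary, but every threshold, field name, and check function is an illustrative assumption, not FabAI's actual rubric:

```python
# A minimal layered-gate sketch. Layer names follow the summary; thresholds,
# fields, and check functions are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class LayerResult:
    layer: str
    passed: bool

def evaluate_tool(tool: dict, layers) -> list:
    # Run every layer independently: passing one layer never short-circuits
    # the others, since success in one does not guarantee success in the rest.
    return [LayerResult(name, check(tool)) for name, check in layers]

LAYERS = [
    ("global",      lambda t: t["safety_score"] >= 0.90 and t["bias_score"] >= 0.80),
    ("educational", lambda t: t["learning_gain"] > 0.0),
    ("technical",   lambda t: t["assessment_accuracy"] >= 0.85),
    ("contextual",  lambda t: t["curriculum_aligned"] and t["works_offline"]),
]

tool = {  # hypothetical tool profile
    "safety_score": 0.95, "bias_score": 0.85, "learning_gain": 0.12,
    "assessment_accuracy": 0.88, "curriculum_aligned": True, "works_offline": False,
}

for result in evaluate_tool(tool, LAYERS):
    print(result)  # the contextual layer fails even though the other three pass
```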
Notable Quotes or Statements
- Jonathan Stern (Gates Foundation): "AI in and of itself is not inherently an education solution. It doesn't improve learning just by the sake of having it in schools."
- Romana (FabAI): "What doesn't get measured doesn't get improved." — Justifying comprehensive quality assurance across product lifecycles.
- Romana (FabAI): "Only 9% [of 352 mapped tools in South Asia and sub-Saharan Africa] had any evidence on them." — Highlighting the evaluation crisis in edtech.
- DJ (Comius): "How do we get AI to the last mile right? Like how UPI has done it in fintech, how do we do that for education?" — Framing the scaling challenge with a fintech analogy.
- Mark Shotlin (IDinsight): "If it's not simple enough, it's going to be very, very, very difficult for it to scale within an education system." — On the "simple enough" condition for system-wide adoption.
- Jonathan Stern (closing): "The goal isn't AI in classrooms. It's better learning safely for every child."
- Professor Kalpana Sharma: "If that [teacher-child] connect is not created, we are not nurturing the child towards developing him into a human being for tomorrow." — On the limits of AI-mediated learning without meaningful human relationships.
Speakers & Organizations Mentioned
| Speaker | Role | Organization |
|---|---|---|
| Jonathan Stern | Deputy Director, Global Education Team | Gates Foundation |
| Romana | (role unclear from transcript) | FabAI |
| Mark Shotlin | Research & Evaluation Lead | IDinsight |
| DJ (also transcribed as "Garaj"/"Garage") | (role unclear from transcript) | Comius (ComioGenius) |
| Professor Kalpana Sharma | Vice Chancellor | Lakshmibai National Institute of Physical Education (LNIPE), Madhya Pradesh |
Other Organizations Mentioned:
- Gates Foundation — Funding and strategy (sub-Saharan Africa, South Asia focus)
- FabAI — Quality assurance, benchmarking, rapid evaluation
- IDinsight — Impact evaluation, data science, system scaling
- Comius/ComioGenius — Platform deployment in India (150M children, 800K schools)
- CSF Foundation — Tectona standards (India)
- Unit — AI for Good Framework
- UK Government — Recent standards refresh (manipulation, mental health)
- Google DeepMind — Rapid evaluation studies in India and Sierra Leone
- Reach Digital / Ministry of Health (South Africa) — MomConnect chatbot (health case study)
- Last Mile Health (Ethiopia) — Community health worker chatbot
- Ministry of Education (Senegal) — Teacher lesson-planning chatbot pilot
- Dalberg — Cross-sectoral evaluation playbook, alliance launching at summit
- Special Olympics International — Impact evaluation research
- VVOB, Prumam — Education scalability checklist frameworks
Technical Concepts & Resources
AI/ML Concepts
- Large Language Models (LLMs): Primary technology; cost implications noted at scale; token consumption drives per-student expenses
- Knowledge Graphs: Used to constrain LLM outputs to official curriculum (e.g., lesson-planning tools), preventing hallucination
- Conversational AI / Chatbots: Core interface (WhatsApp, SMS-based); predates LLMs; high engagement due to psychological stickiness
- Small Language Models: Cheaper alternative to large models; performance degradation noted, especially in non-English and low-resource languages
- Agentic Interfaces: Multi-bot, multi-transactional systems leveraging contextual data (school location, teacher presence, curriculum coverage)
- A/B Testing: Continuous comparison of model versions to assess learning outcome improvement rates
- Retrieval-Augmented Generation (RAG): Implicit in knowledge graph constraints; limiting the model to official sources (a minimal sketch follows this list)
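To make the knowledge-graph constraint concrete, here is a minimal sketch of curriculum-constrained prompting in the spirit of the Senegal lesson-planning tool; the toy graph, prerequisite logic, and prompt wording are all assumptions for illustration, not the deployed system:

```python
# Toy curriculum graph and constrained prompt builder (illustrative only).
CURRICULUM = {
    # topic -> (prerequisite topics, official curriculum snippet)
    "whole_numbers": (set(), "Count, read, and compare numbers up to 100."),
    "fractions": ({"whole_numbers"}, "Introduce halves and quarters with concrete objects."),
}

def allowed_topics(covered: set) -> set:
    """A topic may be suggested only when all prerequisites are covered,
    which prevents out-of-sequence lesson suggestions."""
    return {t for t, (prereqs, _) in CURRICULUM.items()
            if prereqs <= covered and t not in covered}

def build_prompt(topic: str) -> str:
    """Ground generation in the official snippet only (RAG-style), so the
    model has no room to hallucinate content outside the curriculum."""
    _, snippet = CURRICULUM[topic]
    return (f"Using ONLY the official curriculum content below, draft a lesson "
            f"plan for '{topic}':\n{snippet}")

for topic in allowed_topics({"whole_numbers"}):
    print(build_prompt(topic))  # only 'fractions' qualifies at this point
```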
Evaluation & Benchmarking
- Pedagogical Benchmark: First-of-its-kind benchmark measuring AI models' understanding of teaching practices; tested against real-world teacher exams
- Visual Reasoning Benchmark: Addressing AI weakness in visual problem-solving critical for foundational numeracy instruction
- Rapid Evaluation Protocols: 8–10 week field studies assessing learning outcome improvement at lower cost than standard RCTs
- Impact Evaluation / RCTs: Gold standard for measuring effect sizes and sustainability; used when a product is ready for scale
- Outcome Measurement: Effects on literacy and numeracy; in education, learning outcomes can be measured directly in-app (unlike health, where outcomes are harder to observe)
- Urgency Detection Systems: Algorithmic classification of high-priority queries (e.g., medical emergencies) requiring immediate escalation (a minimal routing sketch follows this list)
- Qualitative Data Collection: Tracking unanswered questions, model failure modes, gap identification
- Feedback Mechanisms: Thumbs-up/down quantitative feedback from users; monitoring dashboards
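A minimal sketch of the urgency-detection routing described for MomConnect; the keyword patterns and responses are illustrative assumptions, since a production system would combine a trained classifier with clinical triage rules:

```python
# Keyword-based urgency router (illustrative assumption, not MomConnect's
# actual classifier).
import re

URGENT_PATTERNS = [r"\bbleeding\b", r"\bsevere pain\b", r"\bunconscious\b",
                   r"\bcan'?t breathe\b", r"\bemergency\b"]

def route(message: str) -> str:
    """Escalate anything matching an urgent pattern; everything else is
    answered from the constrained knowledge base."""
    if any(re.search(p, message.lower()) for p in URGENT_PATTERNS):
        return "ESCALATE: flag for immediate human follow-up"
    return "QUEUE: answer from official guideline content"

for msg in ["My baby has severe pain and will not stop crying",
            "When is the next vaccination due?"]:
    print(f"{msg!r} -> {route(msg)}")
```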
Frameworks & Standards
- Tectona Standard (CSF Foundation, India) — Education-specific standards
- Unit AI for Good Framework — Generalist framework for AI quality
- UK Government Standards (2024) — Emerging focus on manipulation, mental health impacts
- Dalberg Playbook — Cross-sectoral evaluation framework (education-applicable)
- Education Scalability Checklist (VVOB, Prumam) — System-level readiness assessment covering political economy, finance, coherence, social norms
- "Enoughs" Framework (Kevin Starr, Makoto Foundation) — Product conditions: good enough, big enough, simple enough, cheap enough
Infrastructure & Compliance
- WhatsApp Business API: Rate-limited to roughly 30K–40K requests/minute; insufficient for city-scale operations and the trigger for building native messaging protocols (see the throughput sketch after this list)
- Digital Personal Data Protection (DPDP) Act (India) — Data sovereignty requirement; personally identifiable information must be hosted in customer infrastructure
- Synchronous & Asynchronous Messaging: Voice calls, video calls (expensive), text-based (scalable)
- Native Messaging Protocols: Required to handle 150K–200K+ requests/minute for state-level populations
- Knowledge Infrastructure / Learning Infrastructure: Unbundled system components for modularity and compliance
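The arithmetic behind the native-protocol decision is simple, and a generic token-bucket limiter shows what a per-minute ceiling looks like in code. The bucket below is a standard pattern, not WhatsApp's actual throttling mechanism; only the two rate constants come from the figures quoted above:

```python
# Generic token-bucket limiter illustrating the quoted per-minute ceilings.
import time

NEEDED_PER_MIN = 200_000   # state-level peak demand cited in the talk
API_CAP_PER_MIN = 40_000   # upper end of the quoted WhatsApp ceiling

class TokenBucket:
    def __init__(self, rate_per_min: float, capacity: int):
        self.rate = rate_per_min / 60.0     # tokens refilled per second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# The gap that motivated native messaging protocols:
print(f"Shortfall: {NEEDED_PER_MIN - API_CAP_PER_MIN:,} requests/min "
      f"({NEEDED_PER_MIN / API_CAP_PER_MIN:.0f}x the cap)")

bucket = TokenBucket(rate_per_min=API_CAP_PER_MIN, capacity=1_000)
print(bucket.allow())  # True while tokens remain; False once the bucket drains
```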
Case Studies & Data Points
- FabAI Mapping: 352 edtech tools in South Asia & sub-Saharan Africa; only 9% with evidence
- Pedagogical Benchmark on African Languages: ~15% performance drop on Kiswahili and Xhosa; larger drops for small language models (see the worked example after this list)
- ComioGenius Scale: 150 million children, 800K schools in India (~1 in 2 schools); 99% conversion rate on WhatsApp messaging
- Comius Impact Study (Michael Kremer): Student learning outcomes more than doubled over 17 months vs. a control group
- Rapid Evaluation Timeline: 8–10 weeks vs. standard multi-year RCTs
- MomConnect (Reach Digital, South Africa): Constrained chatbot reducing nurse backlog; urgency detection for critical cases
- Teacher Lesson Planning (Senegal): Knowledge graph-based curriculum alignment to prevent out-of-sequence suggestions
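For readers who want the localization-drop arithmetic made explicit, a short worked example follows; the per-language scores are made-up placeholders chosen so the relative drops land near the reported ~15%:

```python
# Worked example of the reported ~15% localization drop. Scores are
# placeholders; only the relative-drop arithmetic mirrors the talk.
scores = {"English": 0.78, "Kiswahili": 0.66, "Xhosa": 0.67}

baseline = scores["English"]
for lang, score in scores.items():
    drop = (baseline - score) / baseline * 100  # relative drop vs. English
    print(f"{lang:<10} score={score:.2f}  drop={drop:4.1f}%")
```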
Research Papers & References (Mentioned but not fully cited)
- GEEAP "Smart Buys" Reports — Evidence on what works in foundational learning
- Structured Pedagogy & Targeted Instruction Literature — Foundational evidence base for learning improvements
- Michael Kremer Study (Comius impact) — Recently released, not yet formally published
This summary reflects the transcript's emphasis on evidence, systems thinking, and the gap between technical capability and educational impact.
