Building Confidence in AI: Evaluation, Verification, and Assurance
Executive Summary
This panel discussion addresses the critical challenge of building a global ecosystem for AI assessment, governance, and risk management. Rather than relying solely on top-down regulation, panelists advocate for a collaborative, bottom-up approach combining industry self-regulation, government oversight, and third-party auditing to establish reliable assurance mechanisms that enable trustworthy AI deployment across diverse sectors and geographies.
Key Takeaways
- Assessment reliability depends on clarity of purpose, methodology, and evaluator qualifications. Before scaling AI assurance, organizations must transparently report what was tested, how it was tested, and whether it passed or failed clear pre-established criteria (see the reporting sketch after this list).
- Global assurance ecosystems require both common principles and local adaptation. Harmonize core audit principles (independence, transparency, feedback loops) while tailoring specific test cases, thresholds, and benchmarks to regional needs and use cases.
- Real-world deployment teaches lessons lab testing cannot. Sandbox programs and ground-level experimentation reveal implementation gaps; empirical data from actual systems should feed policy development, not the other way around.
- Ethics and human judgment cannot be automated away. Technical testing alone is insufficient; domain experts, affected communities, and ethical oversight must remain central to assurance processes, especially in high-stakes sectors.
- Building trust at scale requires distributed capability. Centralized audit teams cannot assess AI across diverse global deployments; universities, civil society, domain experts, and internal practitioners must be equipped with common standards, open-source tools, and training to conduct localized assessments.
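The first takeaway lends itself to a concrete shape. Below is a minimal sketch of a transparent assessment record in Python; all field names, thresholds, and example values are illustrative assumptions, not a schema from the panel.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    """One pre-established pass/fail criterion."""
    name: str         # what was tested, e.g. "hallucination rate"
    threshold: float  # agreed before testing began
    measured: float   # observed during the assessment

    @property
    def passed(self) -> bool:
        # Lower-is-better metrics assumed for this sketch.
        return self.measured <= self.threshold


@dataclass
class AssessmentReport:
    """Minimal transparent record: what was tested, how, and against what."""
    system: str
    assessment_type: str  # "governance" | "conformity" | "performance"
    methodology: str      # how it was tested
    evaluator: str        # who tested it, and their qualifications
    criteria: list[Criterion] = field(default_factory=list)

    def overall_pass(self) -> bool:
        return all(c.passed for c in self.criteria)


# Hypothetical usage; the system name and numbers are invented.
report = AssessmentReport(
    system="triage-chatbot v2",
    assessment_type="performance",
    methodology="held-out clinical Q&A benchmark, 3 independent runs",
    evaluator="accredited third-party lab",
    criteria=[Criterion("hallucination rate", threshold=0.02, measured=0.013)],
)
print(report.overall_pass())  # True
```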
Key Topics Covered
- AI Assurance Ecosystem Development – Creating frameworks for reliable assessment of AI systems across different sectors and regions
- Regulation vs. Market-Driven Assurance – Complementary roles of legal frameworks and voluntary third-party assessments
- Types of AI Assessments – Governance, conformity, and performance assessments with distinct methodologies
- Sector-Specific Challenges – Domain expertise requirements (healthcare, financial services, public systems) in AI evaluation
- Global South Perspectives – Institutional capacity gaps, resource constraints, and localization needs
- Standardization & Professionalization – Common terminologies, qualifications, and best practices for AI auditors
- Implementation Reality – Practical barriers: shortage of trained evaluators, fragmented standards, unclear incentives
- Distributed Capability Building – Decentralizing assurance through universities, civil society, and domain experts
- Ethical Integration – Embedding ethics, transparency, and human-in-the-loop approaches in assessments
- Policy-Industry Dialogue – Ongoing collaboration between regulators, businesses, academics, and civil society
Key Points & Insights
- Assessment Types Must Be Clearly Defined – Governance assessments evaluate organizational structures; conformity assessments check regulatory compliance; performance assessments measure quality metrics. Each has distinct purposes and methodologies that must be transparent to users.
- Aviation Analogy Breaks Down for AI – Unlike aviation, with its one clear safety standard, AI spans healthcare, finance, employment, and other sectors with vastly different risk profiles and regulatory landscapes. Context-specific testing is essential rather than one-size-fits-all standards.
- Baseline Resources Matter More Than Standards – The Global South lacks foundational resources (e.g., dictionaries of offensive words in minority languages) needed even to define benchmarks for testing. Building baseline tools and lexicons is a prerequisite to assessment.
- User/Domain Expert Involvement Is Non-Negotiable – The fetal ultrasound model example illustrated how lab-validated systems fail in real-world deployment when domain experts aren't involved in evaluation. Assurance that ignores end-user context is theoretically sound but practically useless.
- Bottom-Up Learning Must Inform Top-Down Standards – The global assurance sandbox approach (20+ active use cases) generates empirical data from real deployments, allowing regulators to understand achievable thresholds rather than imposing arbitrary pass/fail criteria (see the threshold sketch after this list).
- Assurance Is Becoming a Competitive Differentiator – Beyond compliance, forward-thinking businesses use third-party assessments to build customer trust, investor confidence, and brand distinction, treating AI governance as a strategic advantage rather than a regulatory burden.
- Multiple Stakeholder Groups Must Converge – Audits require industry, government regulators, civil society, academics, and affected communities to agree on terms and share oversight. Industry self-regulation alone fails; so do unilateral government mandates.
- Institutional Barriers (Not Technical Ones) Are Limiting – Global South deployment faces a shortage of trained evaluators, fragmented jurisdictional standards, a lack of affordable tools, unclear incentives for companies to share information, and an absence of cross-jurisdiction calibration mechanisms.
- Distributed Assessment Can Democratize Assurance – Universities, domain experts in agriculture and healthcare, and civil society groups can conduct assessments within their own domains, expanding capacity beyond centralized corporate auditing teams.
- Professionalization Requires Skill Building at Scale – Training educators, students, accountants, and practitioners globally on AI assurance principles, treating the discipline as transferable from existing audit and risk practice, is essential to creating a qualified workforce.
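As a rough illustration of the bottom-up point above, a sandbox can derive a candidate pass/fail bar from what deployed systems actually achieve instead of fixing one in advance. A minimal sketch; the numbers, the choice of median, and the slack factor are all invented for illustration.

```python
import statistics

# Error rates observed across comparable sandbox deployments; values are
# invented for illustration, not taken from the panel's sandbox.
observed_error_rates = [0.012, 0.019, 0.008, 0.031, 0.015, 0.022, 0.011]

def empirical_threshold(samples: list[float], slack: float = 1.25) -> float:
    """Candidate pass/fail bar: the median of real deployments plus slack,
    so the bar reflects what is demonstrably achievable."""
    return statistics.median(samples) * slack

print(f"candidate threshold: {empirical_threshold(observed_error_rates):.4f}")
```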
Notable Quotes or Statements
"How do you make sure that planes are safe? And that's where the technical assessment comes in. But unlike in aviation, you're going to have so many applications and so many different contexts in which AI is used."
— WC Lee (AI Verify Foundation)
"There is no way you can talk about assurance of your model unless the people who are using it are involved in the evaluation pipeline as well. Otherwise you're going to work in a vacuum."
— Ravi Balaraman (Center for Responsible AI)
"Industry self-regulation audits rarely ever work. You need community groups. You need government regulators. And government regulators on their own shouldn't forge ahead without consulting with industry."
— Philip Howard (University of Oxford, IPIE)
"If you're not ethical and come across as ethical as an AI assurance product and industry, no matter how clever the technical testing is, people will not trust it."
— Narinan Vajinathan (ACCA)
"The conversation around AI assessment and assurance assumes most deployers are large companies with compliance teams and access to specialized auditors. But in countries like India, deployment happens in public hospitals, state education systems, small fintechs, and local government bodies without that infrastructure."
— Jibu Elas (Stimson Center, Mozilla Responsible Computing)
"What's key is sharing information, sharing learning, and building a better understanding of how these models work in real life."
— Anne McCormack (EY)
Speakers & Organizations Mentioned
| Speaker | Role / Organization |
|---|---|
| WC Lee | Executive Director, AI Verify Foundation; Cluster Director, IMDA (Singapore) |
| Philip Howard | Professor, University of Oxford; President, International Panel on the Information Environment (IPIE) |
| Ravi Balaraman | Founding Head, Center for Responsible AI |
| Anne McCormack | Global Leader, Public Policy and Digital Technologies, EY |
| Narinan Vajinathan | Global Head of Policy Development, Association of Chartered Certified Accountants (ACCA) |
| Jibu Elas | Stimson Center; Mozilla Responsible Computing Challenge |
| Anskar Helgeson | Moderator (implied) |
Key Institutions:
- AI Verify Foundation
- IMDA (Infocomm Media Development Authority, Singapore)
- International Panel on the Information Environment (IPIE)
- Center for Responsible AI
- EY (Ernst & Young)
- ACCA (Association of Chartered Certified Accountants)
- Stimson Center
- Mozilla Responsible Computing Challenge
Technical Concepts & Resources
Assessment Frameworks & Standards
- Governance Assessments – Evaluate internal structures and oversight of AI systems within organizations
- Conformity Assessments – Verify compliance with laws, regulations, voluntary standards, and contractual requirements
- Performance Assessments – Measure AI systems against predefined quality and performance metrics
- ISO/IEC 42001 – AI management system standard, referenced for governance around AI system use
- NIST AI Risk Management Framework – Risk assessment methodology referenced in financial services contexts
- ISAE 3000 – International assurance engagement standard (mentioned in the context of establishing assurance methodologies)
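One way to picture the difference between the conformity and performance assessments defined above: the former reduces to binary checks against named requirement sources, while the latter compares measured metrics to numeric targets. A hypothetical sketch; the requirement sources, metrics, and values are illustrative only.

```python
# Conformity: binary checks against named requirement sources.
conformity_checklist = {
    ("ISO/IEC 42001", "AI policy documented"): True,
    ("EU AI Act", "risk classification recorded"): True,
    ("vendor contract", "audit log retention >= 12 months"): False,
}
conforms = all(conformity_checklist.values())

# Performance: numeric metrics against predefined targets.
performance_checks = [
    ("accuracy", 0.95, True),        # higher is better
    ("latency_p95_ms", 300, False),  # lower is better
]
measured = {"accuracy": 0.96, "latency_p95_ms": 280}

meets_performance = all(
    (measured[name] >= target) if higher_better else (measured[name] <= target)
    for name, target, higher_better in performance_checks
)

print(conforms, meets_performance)  # False True
```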
Tools & Initiatives
- Global Assurance Sandbox – 20+ active use cases testing AI systems in real-world contexts to inform policy development
- AI Starter Kit for Testing LLM Applications – Practical toolkit covering reliability, hallucination, undesirable content, and security testing
- Model Cards – Standardized documentation of model design and performance factors
- Data Provenance Documentation – Mechanisms to track and explain data sources used in model training
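Model cards are, in essence, structured documentation of design and performance factors. Below is a minimal sketch of the kind of record involved; the field names and values are illustrative assumptions, not a fixed schema.

```python
import json

# Minimal model card as a plain record; invented values throughout.
model_card = {
    "model": "fetal-biometry-net v1.3",  # hypothetical model name
    "intended_use": "gestational age estimation from ultrasound images",
    "out_of_scope": ["low-quality images", "populations absent from training data"],
    "training_data_provenance": {
        "sources": ["hospital archive, 2015-2021 (hypothetical)"],
        "consent_basis": "institutional review board approval",
    },
    "evaluation": {
        "benchmark": "held-out multi-site test set",
        "results_disaggregated_by": ["region", "ethnicity", "image quality"],
    },
    "known_limitations": "accuracy degrades on underrepresented groups",
}

print(json.dumps(model_card, indent=2))
```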
Technical Testing Dimensions
- Hallucination Detection – Identifying when AI systems generate false or misleading information
- Bias & Fairness Testing – Identifying disparate impacts across demographic groups (e.g., computational fetal age estimation models underestimating fetal age for Asian babies by 30-40%)
- Security Assessment – Evaluating vulnerability to infiltration and adversarial attacks
- Undesirable Content Filtering – Preventing offensive language, misinformation, hate speech (requires localized lexicons)
- System Reliability – Ensuring consistent, accurate performance in diverse deployment contexts
- Image Quality Control – Managing real-world measurement challenges (e.g., ultrasound quality affecting model predictions)
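The fetal age example cited above is precisely the failure that disaggregated evaluation is designed to surface: error is computed per demographic group rather than as one global average. A minimal sketch, with invented numbers and an arbitrary 1.5x disparity flag.

```python
from collections import defaultdict

# (group, absolute error) pairs from an evaluation set; numbers invented.
errors = [("group_a", 0.02), ("group_a", 0.03),
          ("group_b", 0.35), ("group_b", 0.31)]

by_group: dict[str, list[float]] = defaultdict(list)
for group, err in errors:
    by_group[group].append(err)

overall = sum(e for _, e in errors) / len(errors)
for group, errs in sorted(by_group.items()):
    mean = sum(errs) / len(errs)
    # Arbitrary flag: a group erring at >1.5x the overall mean needs review.
    flag = "DISPARITY" if mean > 1.5 * overall else "ok"
    print(f"{group}: mean error {mean:.3f} ({flag})")
```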
Methodological Concepts
- Bottom-Up vs. Top-Down Approach – Empirical data from real deployments informing policy rather than policy preceding implementation
- Practitioner's Expert Model – Non-technical professionals working alongside technical experts to conduct assurance
- Human-in-the-Loop Evaluation – Integrating domain expert judgment into technical assessments
- Cross-Jurisdiction Calibration – Making assessment results comparable across different regulatory regions
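Cross-jurisdiction calibration can be pictured as score normalization: express each raw result relative to its own jurisdiction's reference benchmark so that results become comparable despite different local lexicons and tests. A hypothetical sketch with invented baselines; this is one simple scheme, not a method described by the panel.

```python
# Raw scores from jurisdiction-specific tests are not directly comparable,
# because each jurisdiction uses its own lexicon and benchmark. Normalize
# against a local reference baseline first. All numbers are invented.
reference_baseline = {"jurisdiction_x": 0.040, "jurisdiction_y": 0.100}

results = [("jurisdiction_x", 0.030), ("jurisdiction_y", 0.080)]

for jurisdiction, raw in results:
    calibrated = raw / reference_baseline[jurisdiction]
    print(f"{jurisdiction}: raw={raw:.3f} calibrated={calibrated:.2f}")
# Both systems land at 0.75-0.80 of their local baseline, so their results
# are comparable even though the raw scores differ.
```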
Geographic & Contextual Frameworks
- Global South Considerations – Resource constraints, institutional capacity gaps, lack of trained evaluators, fragmented standards
- Frontier Models – Systems with high or unknown risk profiles, addressed through AI safety institutes and transparency requirements
- Mainstream Businesses – Mid-size AI adopters seeking compliance and competitive advantage
- Sector-Specific Deployment – Healthcare, financial services, employment, education, agriculture, public administration
Structural Gaps & Barriers Identified
| Category | Challenge |
|---|---|
| Human Capital | Severe shortage of trained AI evaluators globally; limited education in AI assurance discipline |
| Infrastructure | Lack of affordable assessment tools, especially for resource-constrained institutions |
| Standards | Fragmented standards across jurisdictions; missing benchmarks for non-English contexts (minority languages, cultural contexts) |
| Incentives | Unclear or absent incentives for companies to share assessment data; competitive concerns |
| Institutional | Public hospitals, state systems, small fintechs lack compliance infrastructure to engage assurance |
| Technical Resources | Missing foundational resources (e.g., offensive language dictionaries for minority languages) to even define test criteria |
| Interoperability | No cross-jurisdiction calibration mechanisms to ensure assessment results are comparable globally |
Policy Implications & Recommendations
- Blend regulation with market-driven assurance rather than relying on regulation alone
- Establish competency standards for AI auditors/assessors (qualifications, independence, accountability)
- Support open-source tooling and shared lexicons to reduce the cost of entry for Global South institutions
- Invest in distributed training and education at universities, professional bodies, and civil society levels
- Create sandbox environments to test assurance methodologies in real-world conditions before standardization
- Maintain ongoing policy-industry dialogue rather than top-down mandate; allow standards to evolve as implementation knowledge accumulates
- Embed ethics and human judgment as mandatory components alongside technical testing
- Calibrate assessment across jurisdictions to enable interoperability while respecting local needs
Report References: EY-ACCA joint report on AI governance and assurance (referenced via QR code in original event materials).
