Building Confidence in AI: Evaluation, Verification, and Assurance
Executive Summary
This panel discussion addresses the critical challenge of building a global ecosystem for AI assessment, governance, and risk management. Rather than relying solely on top-down regulation, panelists advocate for a collaborative, bottom-up approach combining industry self-regulation, government oversight, and third-party auditing to establish reliable assurance mechanisms that enable trustworthy AI deployment across diverse sectors and geographies.
Key Takeaways
- Assessment reliability depends on clarity of purpose, methodology, and evaluator qualifications. Before scaling AI assurance, organizations must transparently report what was tested, how it was tested, and whether it passed or failed clear pre-established criteria (see the reporting sketch after this list).
- Global assurance ecosystems require both common principles and local adaptation. Harmonize core audit principles (independence, transparency, feedback loops) while tailoring specific test cases, thresholds, and benchmarks to regional needs and use cases.
- Real-world deployment teaches lessons lab testing cannot. Sandbox programs and ground-level experimentation reveal implementation gaps; empirical data from actual systems should feed policy development, not the other way around.
- Ethics and human judgment cannot be automated away. Technical testing alone is insufficient; domain experts, affected communities, and ethical oversight must remain central to assurance processes, especially in high-stakes sectors.
- Building trust at scale requires distributed capability. Centralized audit teams cannot assess AI across diverse global deployments; universities, civil society, domain experts, and internal practitioners must be equipped with common standards, open-source tools, and training to conduct localized assessments.
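The first takeaway lends itself to a concrete shape. Below is a minimal sketch of a transparent assessment record in Python; all field names, thresholds, and example values are illustrative assumptions, not a schema from the panel.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    """One pre-established pass/fail criterion."""
    name: str         # what was tested, e.g. "hallucination rate"
    threshold: float  # agreed before testing began
    measured: float   # observed during the assessment

    @property
    def passed(self) -> bool:
        # Lower-is-better metrics assumed for this sketch.
        return self.measured <= self.threshold


@dataclass
class AssessmentReport:
    """Minimal transparent record: what was tested, how, and against what."""
    system: str
    assessment_type: str  # "governance" | "conformity" | "performance"
    methodology: str      # how it was tested
    evaluator: str        # who tested it, and their qualifications
    criteria: list[Criterion] = field(default_factory=list)

    def overall_pass(self) -> bool:
        return all(c.passed for c in self.criteria)


# Hypothetical usage; the system name and numbers are invented.
report = AssessmentReport(
    system="triage-chatbot v2",
    assessment_type="performance",
    methodology="held-out clinical Q&A benchmark, 3 independent runs",
    evaluator="accredited third-party lab",
    criteria=[Criterion("hallucination rate", threshold=0.02, measured=0.013)],
)
print(report.overall_pass())  # True
```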
Key Topics Covered
- AI Assurance Ecosystem Development – Creating frameworks for reliable assessment of AI systems across different sectors and regions
- Regulation vs. Market-Driven Assurance – Complementary roles of legal frameworks and voluntary third-party assessments
- Types of AI Assessments – Governance, conformity, and performance assessments with distinct methodologies
- Sector-Specific Challenges – Domain expertise requirements (healthcare, financial services, public systems) in AI evaluation
- Global South Perspectives – Institutional capacity gaps, resource constraints, and localization needs
- Standardization & Professionalization – Common terminologies, qualifications, and best practices for AI auditors
- Implementation Reality – Practical barriers: shortage of trained evaluators, fragmented standards, unclear incentives
- Distributed Capability Building – Decentralizing assurance through universities, civil society, and domain experts
- Ethical Integration – Embedding ethics, transparency, and human-in-the-loop approaches in assessments
- Policy-Industry Dialogue – Ongoing collaboration between regulators, businesses, academics, and civil society
Key Points & Insights
- Assessment Types Must Be Clearly Defined – Governance assessments evaluate organizational structures; conformity assessments check regulatory compliance; performance assessments measure quality metrics. Each has distinct purposes and methodologies that must be transparent to users.
- Aviation Analogy Breaks Down for AI – Unlike aviation, with its one clear safety standard, AI spans healthcare, finance, employment, and other sectors with vastly different risk profiles and regulatory landscapes. Context-specific testing is essential rather than one-size-fits-all standards.
- Baseline Resources Matter More Than Standards – The Global South lacks foundational resources (e.g., dictionaries of offensive words in minority languages) needed even to define benchmarks for testing. Building baseline tools and lexicons is a prerequisite to assessment.
- User/Domain Expert Involvement Is Non-Negotiable – The fetal ultrasound model example illustrated how lab-validated systems fail in real-world deployment when domain experts aren't involved in evaluation. Assurance that ignores end-user context is theoretically sound but practically useless.
- Bottom-Up Learning Must Inform Top-Down Standards – The global assurance sandbox approach (20+ active use cases) generates empirical data from real deployments, allowing regulators to understand achievable thresholds rather than imposing arbitrary pass/fail criteria (see the threshold sketch after this list).
- Assurance Is Becoming a Competitive Differentiator – Beyond compliance, forward-thinking businesses use third-party assessments to build customer trust, investor confidence, and brand distinction, treating AI governance as a strategic advantage rather than a regulatory burden.
- Multiple Stakeholder Groups Must Converge – Audits require industry, government regulators, civil society, academics, and affected communities to agree on terms and share oversight. Industry self-regulation alone fails; so do unilateral government mandates.
- Institutional Barriers (Not Technical Ones) Are Limiting – Global South deployment faces a shortage of trained evaluators, fragmented jurisdictional standards, a lack of affordable tools, unclear incentives for companies to share information, and an absence of cross-jurisdiction calibration mechanisms.
- Distributed Assessment Can Democratize Assurance – Universities, domain experts in agriculture and healthcare, and civil society groups can conduct assessments within their own domains, expanding capacity beyond centralized corporate auditing teams.
- Professionalization Requires Skill Building at Scale – Training educators, students, accountants, and practitioners globally on AI assurance principles, treating the discipline as transferable from existing audit and risk practice, is essential to creating a qualified workforce.
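As a rough illustration of the bottom-up point above, a sandbox can derive a candidate pass/fail bar from what deployed systems actually achieve instead of fixing one in advance. A minimal sketch; the numbers, the choice of median, and the slack factor are all invented for illustration.

```python
import statistics

# Error rates observed across comparable sandbox deployments; values are
# invented for illustration, not taken from the panel's sandbox.
observed_error_rates = [0.012, 0.019, 0.008, 0.031, 0.015, 0.022, 0.011]

def empirical_threshold(samples: list[float], slack: float = 1.25) -> float:
    """Candidate pass/fail bar: the median of real deployments plus slack,
    so the bar reflects what is demonstrably achievable."""
    return statistics.median(samples) * slack

print(f"candidate threshold: {empirical_threshold(observed_error_rates):.4f}")
```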
Notable Quotes or Statements
"How do you make sure that planes are safe? And that's where the technical assessment comes in. But unlike in aviation, you're going to have so many applications and so many different contexts in which AI is used."
— WC Lee (AI Verify Foundation)
"There is no way you can talk about assurance of your model unless the people who are using it are involved in the evaluation pipeline as well. Otherwise you're going to work in a vacuum."
— Ravi Balaraman (Center for Responsible AI)
"Industry self-regulation audits rarely ever work. You need community groups. You need government regulators. And government regulators on their own shouldn't forge ahead without consulting with industry."
— Philip Howard (University of Oxford, IPIE)
"If you're not ethical and come across as ethical as an AI assurance product and industry, no matter how clever the technical testing is, people will not trust it."
— Narinan Vajinathan (ACCA)
"The conversation around AI assessment and assurance assumes most deployers are large companies with compliance teams and access to specialized auditors. But in countries like India, deployment happens in public hospitals, state education systems, small fintechs, and local government bodies without that infrastructure."
— Jibu Elas (Stimson Center, Mozilla Responsible Computing)
"What's key is sharing information, sharing learning, and building a better understanding of how these models work in real life."
— Anne McCormack (EY)
Speakers & Organizations Mentioned
| Speaker | Role / Organization |
|---|---|
| WC Lee | Executive Director, AI Verify Foundation; Cluster Director, IMDA (Singapore) |
| Philip Howard | Professor, University of Oxford; President, International Panel on the Information Environment (IPIE) |
| Ravi Balaraman | Founding Head, Center for Responsible AI |
| Anne McCormack | Global Leader, Public Policy and Digital Technologies, EY |
| Narinan Vajinathan | Global Head of Policy Development, Association of Chartered Certified Accountants (ACCA) |
| Jibu Elas | Stimson Center; Mozilla Responsible Computing Challenge |
| Anskar Helgeson | Moderator (implied) |
Key Institutions:
- AI Verify Foundation
- IMDA (Infocomm Media Development Authority, Singapore)
- International Panel on the Information Environment (IPIE)
- Center for Responsible AI
- EY (Ernst & Young)
- ACCA (Association of Chartered Certified Accountants)
- Stimson Center
- Mozilla Responsible Computing Challenge
Technical Concepts & Resources
Assessment Frameworks & Standards
- Governance Assessments – Evaluate internal structures and oversight of AI systems within organizations
- Conformity Assessments – Verify compliance with laws, regulations, voluntary standards, and contractual requirements
- Performance Assessments – Measure AI systems against predefined quality and performance metrics
- ISO/IEC 42001 – AI management system standard, referenced for governance around AI system use
- NIST AI Risk Management Framework – Risk assessment methodology referenced in financial services contexts
- ISAE 3000 – International assurance engagement standard (mentioned in the context of establishing assurance methodologies)
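One way to picture the difference between the conformity and performance assessments defined above: the former reduces to binary checks against named requirement sources, while the latter compares measured metrics to numeric targets. A hypothetical sketch; the requirement sources, metrics, and values are illustrative only.

```python
# Conformity: binary checks against named requirement sources.
conformity_checklist = {
    ("ISO/IEC 42001", "AI policy documented"): True,
    ("EU AI Act", "risk classification recorded"): True,
    ("vendor contract", "audit log retention >= 12 months"): False,
}
conforms = all(conformity_checklist.values())

# Performance: numeric metrics against predefined targets.
performance_checks = [
    ("accuracy", 0.95, True),        # higher is better
    ("latency_p95_ms", 300, False),  # lower is better
]
measured = {"accuracy": 0.96, "latency_p95_ms": 280}

meets_performance = all(
    (measured[name] >= target) if higher_better else (measured[name] <= target)
    for name, target, higher_better in performance_checks
)

print(conforms, meets_performance)  # False True
```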
Tools & Initiatives
- Global Assurance Sandbox – 20+ active use cases testing AI systems in real-world contexts to inform policy development
- AI Starter Kit for Testing LLM Applications – Practical toolkit covering reliability, hallucination, undesirable content, and security testing
- Model Cards – Standardized documentation of model design and performance factors
- Data Provenance Documentation – Mechanisms to track and explain data sources used in model training
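Model cards are, in essence, structured documentation of design and performance factors. Below is a minimal sketch of the kind of record involved; the field names and values are illustrative assumptions, not a fixed schema.

```python
import json

# Minimal model card as a plain record; invented values throughout.
model_card = {
    "model": "fetal-biometry-net v1.3",  # hypothetical model name
    "intended_use": "gestational age estimation from ultrasound images",
    "out_of_scope": ["low-quality images", "populations absent from training data"],
    "training_data_provenance": {
        "sources": ["hospital archive, 2015-2021 (hypothetical)"],
        "consent_basis": "institutional review board approval",
    },
    "evaluation": {
        "benchmark": "held-out multi-site test set",
        "results_disaggregated_by": ["region", "ethnicity", "image quality"],
    },
    "known_limitations": "accuracy degrades on underrepresented groups",
}

print(json.dumps(model_card, indent=2))
```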
Technical Testing Dimensions
- Hallucination Detection – Identifying when AI systems generate false or misleading information
- Bias & Fairness Testing – Identifying disparate impacts across demographic groups (e.g., computational fetal age estimation models underestimating fetal age for Asian babies by 30-40%)
- Security Assessment – Evaluating vulnerability to infiltration and adversarial attacks
- Undesirable Content Filtering – Preventing offensive language, misinformation, hate speech (requires localized lexicons)
- System Reliability – Ensuring consistent, accurate performance in diverse deployment contexts
- Image Quality Control – Managing real-world measurement challenges (e.g., ultrasound quality affecting model predictions)
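The fetal age example cited above is precisely the failure that disaggregated evaluation is designed to surface: error is computed per demographic group rather than as one global average. A minimal sketch, with invented numbers and an arbitrary 1.5x disparity flag.

```python
from collections import defaultdict

# (group, absolute error) pairs from an evaluation set; numbers invented.
errors = [("group_a", 0.02), ("group_a", 0.03),
          ("group_b", 0.35), ("group_b", 0.31)]

by_group: dict[str, list[float]] = defaultdict(list)
for group, err in errors:
    by_group[group].append(err)

overall = sum(e for _, e in errors) / len(errors)
for group, errs in sorted(by_group.items()):
    mean = sum(errs) / len(errs)
    # Arbitrary flag: a group erring at >1.5x the overall mean needs review.
    flag = "DISPARITY" if mean > 1.5 * overall else "ok"
    print(f"{group}: mean error {mean:.3f} ({flag})")
```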
Methodological Concepts
- Bottom-Up vs. Top-Down Approach – Empirical data from real deployments informing policy rather than policy preceding implementation
- Practitioner's Expert Model – Non-technical professionals working alongside technical experts to conduct assurance
- Human-in-the-Loop Evaluation – Integrating domain expert judgment into technical assessments
- Cross-Jurisdiction Calibration – Making assessment results comparable across different regulatory regions
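Cross-jurisdiction calibration can be pictured as score normalization: express each raw result relative to its own jurisdiction's reference benchmark so that results become comparable despite different local lexicons and tests. A hypothetical sketch with invented baselines; this is one simple scheme, not a method described by the panel.

```python
# Raw scores from jurisdiction-specific tests are not directly comparable,
# because each jurisdiction uses its own lexicon and benchmark. Normalize
# against a local reference baseline first. All numbers are invented.
reference_baseline = {"jurisdiction_x": 0.040, "jurisdiction_y": 0.100}

results = [("jurisdiction_x", 0.030), ("jurisdiction_y", 0.080)]

for jurisdiction, raw in results:
    calibrated = raw / reference_baseline[jurisdiction]
    print(f"{jurisdiction}: raw={raw:.3f} calibrated={calibrated:.2f}")
# Both systems land at 0.75-0.80 of their local baseline, so their results
# are comparable even though the raw scores differ.
```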
Geographic & Contextual Frameworks
- Global South Considerations – Resource constraints, institutional capacity gaps, lack of trained evaluators, fragmented standards
- Frontier Models – Systems with high or unknown risk profiles, addressed through AI safety institutes and transparency requirements
- Mainstream Businesses – Mid-size AI adopters seeking compliance and competitive advantage
- Sector-Specific Deployment – Healthcare, financial services, employment, education, agriculture, public administration
Structural Gaps & Barriers Identified
| Category | Challenge |
|---|---|
| Human Capital | Severe shortage of trained AI evaluators globally; limited education in AI assurance discipline |
| Infrastructure | Lack of affordable assessment tools, especially for resource-constrained institutions |
| Standards | Fragmented standards across jurisdictions; missing benchmarks for non-English contexts (minority languages, cultural contexts) |
| Incentives | Unclear or absent incentives for companies to share assessment data; competitive concerns |
| Institutional | Public hospitals, state systems, small fintechs lack compliance infrastructure to engage assurance |
| Technical Resources | Missing foundational resources (e.g., offensive language dictionaries for minority languages) to even define test criteria |
| Interoperability | No cross-jurisdiction calibration mechanisms to ensure assessment results are comparable globally |
Policy Implications & Recommendations
- Blend regulation with market-driven assurance rather than relying on regulation alone
- Establish competency standards for AI auditors/assessors (qualifications, independence, accountability)
- Support open-source tooling and shared lexicons to reduce the cost of entry for Global South institutions
- Invest in distributed training and education at universities, professional bodies, and civil society levels
- Create sandbox environments to test assurance methodologies in real-world conditions before standardization
- Maintain ongoing policy-industry dialogue rather than top-down mandate; allow standards to evolve as implementation knowledge accumulates
- Embed ethics and human judgment as mandatory components alongside technical testing
- Calibrate assessment across jurisdictions to enable interoperability while respecting local needs
Report References: EY-ACCA joint report on AI governance and assurance (referenced via QR code in original event materials).
