Who Watches the Watchers? Building Trust in AI Governance
Executive Summary
This panel discussion examines the evolution of AI governance from 2023 to 2026, presenting a nuanced picture of technical progress in safety measures alongside persistent governance challenges. While technical safeguards have improved substantially—making model jailbreaking significantly harder—the core challenge has shifted from "can we make AI safe?" to "how do we ensure safe practices are uniformly adopted and verified across the entire ecosystem?" The panelists propose independent verification organizations (IVOs) as a novel governance mechanism to build trust among developers, deployers, regulators, and the public.
Key Takeaways
- The AI governance challenge has matured from "can we make systems safe?" to "how do we ensure safe practices scale uniformly?" Technical solutions exist; institutional/governance solutions do not yet.
- Trust is the missing infrastructure. Standardized, independent verification can serve the same role for AI that safety ratings serve for cars or certifications serve for aerospace—enabling non-experts to make informed choices and rewarding responsible actors.
- Old regulatory frameworks (hard law vs. soft law, command-and-control vs. industry self-regulation) are inadequate for AI's speed and complexity. Outcome-based, marketplace-driven verification with continuous methodological improvement is more credible than either extreme.
- Economic incentives matter more than legal mandates in emerging technology governance. Insurance, liability clarity, market advantage, and public procurement can drive compliance more effectively than regulatory sanctions, particularly in sectors where governments lack technical capacity.
- Information asymmetry and frontier-lab dominance in evaluation are genuine governance risks that require structural solutions (data-sharing commitments, published safety frameworks, partnerships with external evaluators) rather than wishful thinking about self-regulation.
Key Topics Covered
- Technical safety progress — advances in jailbreak resistance, frontier safety frameworks, and safeguard implementation
- Global regulatory approaches — comparison of the EU (holistic/hard law), US (ex post/liability-focused), and Japanese (sector-specific/soft law) models
- The trust problem — gaps in public, deployer, regulator, and developer confidence in AI system safety
- Independent verification organizations (IVOs) — marketplace-based model for third-party auditing and certification
- Evaluation gaps — limitations of current benchmarks and testing methodologies in capturing real-world risks
- Liability and standard of care — legal frameworks and how certification could clarify responsibility
- Incentive structures — economic drivers (insurance, procurement, market advantage) for compliance with safety standards
- Lessons from other industries — parallels to aerospace (AS9100), automotive (NHTSA), and underwriting standards
- Information asymmetry — challenges in accessing knowledge concentrated in frontier AI labs
- Societal resilience — adaptation to widespread AI deployment rather than prevention-only approaches
Key Points & Insights
- Technical safeguards have matured significantly: Models are now much harder to jailbreak. In early 2025, the UK AI Security Institute needed 7–10 hours to break the safeguards on the latest models, versus minutes previously. Simple techniques such as Swahili translation no longer work.
- The problem is no longer primarily technical—it's governance and adoption: Twelve leading AI developers have frontier safety frameworks, but they vary in scope and rigor. The toolkit exists; the challenge is ensuring it's applied consistently across the entire industry landscape.
- Regulatory approaches are converging, not diverging: All countries use both hard law and soft law. The distinction is not binary (EU hard vs. others soft) but rather one of scope (holistic vs. sector-specific) and enforcement culture (ex post/liability-driven vs. ex ante/preventive).
- A critical "trust gap" exists across the entire ecosystem:
- Public: doesn't know what's safe
- Deployers: don't know what they can trust
- Regulators: don't know how to confer earned trust
- Developers: face adoption decline if trust erodes
- Independent Verification Organizations offer a novel governance model:
- Government-authorized marketplace of independent verifiers
- Outcomes-based rather than procedural/checkbox compliance
- Creates democratic accountability while maintaining flexibility and speed
- Incentivizes continuous improvement in testing methodologies
- Liability and "standard of care" are underutilized governance levers: The current tort system forces juries to assess complex technical questions retrospectively. IVO certification could establish upfront a presumption of meeting a heightened standard of care, clarifying expectations before harm occurs.
- Insurance plays a crucial but underdeveloped role: Major insurers are already excluding AI from enterprise risk policies. This creates de facto regulation through market mechanisms. Insurance companies could use IVO certification to price premiums and determine coverage eligibility.
- Evaluation methodology is a critical bottleneck: Current benchmarks are too narrow and already outdated. Stochastic outputs, multi-turn interactions, deployment context variation, and intended vs. actual use cases make real-world risk assessment genuinely difficult. No single standardized approach exists.
- Information asymmetry between frontier labs and external actors is persistent: Labs hold both technical capacity and detailed deployment data. Third-party verification requires either mandatory transparency or voluntary partnerships—neither yet institutionalized.
- Public procurement can be a powerful incentive: Government recognition and procurement of verified models creates a strong economic incentive for developers without requiring heavy-handed regulation.
Notable Quotes or Statements
"The rubber is really hitting the road. Risks that even a year or two ago might have been theoretical are now very real and we're seeing emerging empirical evidence."
— Stephen Clare (on current state of AI risk manifestation)
"If he [Heroki] doesn't write about it, nobody in Washington DC knows about it. So it's important, his work."
— Greg (moderator, on the importance of multilingual AI policy documentation)
"There is no end to the story... the technology advances so fast and even though there is advancement, the next day you may find another risk. So how regulators should design the regulations is the main question all countries are facing."
— Hiroki Habuka (on the perpetual nature of AI governance challenges)
"The current approach to tech governance is not equipped to handle this trust problem very well."
— Shaina Mansbach (on inadequacy of traditional regulatory models)
"What this system would do is confer... a rebuttal presumption of having met a heightened standard of care. So what we're doing is clarifying and defining upfront before an actual harm happens what a developer or deployer is actually supposed to do."
— Shaina Mansbach (on how IVO certification addresses liability uncertainty)
"We need this like layered approach of just many different policies and practices at different parts of the stack... safety by degree where we want defense in depth."
— Stephen Clare (on ecosystem-wide responsibility distribution)
"The only way that you're going to solve this is to have an ecosystem where all of the actors are competing to have the best services, to have the best evaluations."
— Shaina Mansbach (on incentivizing evaluation quality)
"There's no single answer to that kind of question [about acceptable safety levels]. So we need to debate in a democratic manner as to what is our acceptable goal."
— Hiroki Habuka (on the necessity of democratic deliberation in safety standards)
Speakers & Organizations Mentioned
Primary Panelists
- Stephen Clare — Co-lead author, International AI Safety Report; AI safety researcher
- Hiroki Habuka — Research Professor, Kyoto University Graduate School of Law; former Japanese government policy maker; Non-resident Senior Associate at CSIS; expert on Japanese AI policy
- Shaina Mansbach — Vice President of Strategy and Communications, Fathom; organizer of Ashby conference series on AI
- Greg (moderator) — CSIS fellow; former aerospace/space industry professional
Organizations & Institutions
- UK AI Security Institute — conducts jailbreak and safeguard testing on frontier models
- Bletchley AI Safety Summit (2023; precursor event)
- European Union — EU AI Act
- Japan — Japanese government AI policy framework
- United States — Ex post/liability-driven regulatory approach
- NHTSA (National Highway Traffic Safety Administration) — referenced for its automotive safety rating model
- Underwriters Laboratories (UL) — referenced as model for independent certification
- Frontier AI labs (Anthropic, OpenAI, DeepMind, etc. — implied but not named individually)
- Fathom — think tank proposing the IVO marketplace model
Referenced Concepts & Frameworks
- Frontier Safety Frameworks — documents describing developer risk management plans (now adopted by 12 leading companies)
- EU AI Act — hard law regulatory approach with technical standards
- AS9100 — aerospace safety certification standard
- Responsible Scaling Policy — mentioned as an example of a voluntary transparency commitment (Anthropic)
Technical Concepts & Resources
Safety & Evaluation Mechanisms
- Jailbreaking resistance — breaking safeguards on the latest models took the UK AI Security Institute 7–10 hours in early 2025, versus minutes for earlier generations
- Safeguard evasion techniques (outdated):
- Emotional manipulation ("help me remember my grandmother")
- Language obfuscation (Swahili translation)
- Prompt injection variants
- Evaluations/benchmarks — narrow sets of test questions (e.g., biosecurity, cybersecurity) used to assess capability and risk
- Stochastic output problem — the same query produces different outputs, making consistent safety difficult to assess (see the sketch after this list)
- Multi-turn interactions — relationship-building with AI systems across multiple queries complicates safety assessment
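To make the stochastic output problem concrete, here is a minimal sketch of a benchmark harness. It is illustrative only: `query_model` is a hypothetical placeholder for a model API (not any real library call), and the refusal probability is invented. The point is that a single-run pass/fail score records one sample, while re-running the same question estimates a refusal rate, which is what an evaluator actually needs.

```python
# Minimal sketch (hypothetical names throughout): why single-run benchmarking
# understates variance when model outputs are stochastic.
import random
from statistics import mean

def query_model(question: str, seed: int) -> str:
    """Placeholder for a model call; real systems sample, so answers vary by run."""
    random.seed(hash((question, seed)))
    return "refuse" if random.random() < 0.7 else "comply"  # invented 70% refusal rate

def single_run_pass(question: str) -> bool:
    # A one-shot eval records only whether this particular sample was safe.
    return query_model(question, seed=0) == "refuse"

def refusal_rate(question: str, n_samples: int = 50) -> float:
    # Re-sampling the same question exposes the variance a single run hides.
    return mean(query_model(question, seed=i) == "refuse" for i in range(n_samples))

benchmark = ["hazardous-synthesis question", "malware-assistance question"]
for q in benchmark:
    print(q, "| single run:", single_run_pass(q), "| estimated refusal rate:", refusal_rate(q))
```

Multi-turn interactions compound the same estimation problem: each additional turn changes the distribution being sampled, so the per-question rate above would itself vary with conversation history.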
Governance Mechanisms
- Frontier Safety Frameworks — standardized documents describing risk management approaches (institutionalized via EU AI Act, Code of Practice)
- Independent Verification Organizations (IVOs) — proposed marketplace of government-authorized third-party auditors
- Standard of care (legal) — the level of responsibility developers and deployers are expected to meet; IVO certification would confer a rebuttable presumption that a heightened standard has been met
- Outcomes-based compliance — government specifies goals (e.g., "children's safety," "data privacy") rather than procedures
- Defense-in-depth — layered security approach across developer training, deployment monitoring, ecosystem surveillance, and societal resilience
Data & Measurement Gaps
- Information asymmetry — labs control access to:
- Model weights and internals
- Usage data and deployment patterns
- Safety testing methodologies
- Real-world impact data
- Benchmark limitations — existing evals capture narrow capability slices, not real-world deployment context
- Test methodology questions — how to establish validity across diverse use cases, control for deployment context, and keep pace with capability advancement
Policy & Legal Concepts
- Hard law vs. soft law distinction — oversimplified; all countries use both (EU AI Act + technical standards; US liability + soft guidelines; Japan sector-specific + soft guidance)
- Sector-specific regulation — approach used by the US and Japan, in contrast to the EU's holistic approach
- Ex post vs. ex ante enforcement:
- Ex post (US): rules are principle-based; enforcement via liability after harm
- Ex ante/preventive (Japan): rules set in advance; emphasis on compliance with given frameworks
- Public procurement — government procurement decisions can incentivize safety investment
- Liability insurance — major insurers excluding AI from coverage; creates de facto regulation
End of Summary
