Measuring Advanced AI: Science, Safety & Governance

Executive Summary

A panel of government AI safety institute leaders from the UK, US, and Singapore, joined by private-sector AI researcher Sara Hooker, discussed the global effort to standardize AI evaluation methods through the international Network for Advanced AI Measurement, Evaluation and Science (NAMES). The panel emphasized that rigorous, transparent measurement science is essential for understanding frontier AI capabilities, informing policy decisions, and enabling safe adoption, while acknowledging significant challenges in evaluating rapidly evolving systems such as agentic AI.

Key Takeaways

  1. Evaluation science is foundational infrastructure: Without rigorous, standardized measurement, governments cannot understand AI capabilities, make informed policy decisions, or ensure safe deployment. The international network is building this infrastructure collaboratively.

  2. Open benchmarks are broken; private test sets are essential: High commercial stakes have conflated measurement with marketing, making public benchmarks unreliable for serious evaluation. Moving forward requires unannounced private evaluations and holdout data, with careful norms about which findings are shared publicly.

  3. Agentic AI will require new evaluation paradigms: Current benchmarking and testing approaches designed for static models fail for autonomous agents with tool use, memory, and multi-step planning. This is an immature, high-priority research area requiring coordinated international effort.

  4. Evaluation only matters if it informs decisions: The ultimate test of AI Safety Institutes is whether their technical work influences government procurement, regulatory guidance, and deployment standards. Publishing reports without driving action is data collection, not governance.

  5. The network model works: Despite geopolitical tensions and different national priorities, technical experts across governments can collaborate effectively on shared measurement science, build trust, and exchange sensitive information when operating at the expert level insulated from diplomatic friction.

Key Topics Covered

  • International AI Safety Institutes (AISIs): Establishment, structure, and mandate of government AI safety evaluation centers worldwide
  • Evaluation Science & Benchmarking: Challenges with static benchmarks, gaming, overfitting, and the need for private test sets and reproducible methodologies
  • The NAMES Network: How 10+ AI Safety Institutes collaborate on shared standards and information exchange
  • Agentic AI Evaluation: Emerging challenges in testing autonomous AI systems with tool use and multi-step reasoning
  • Pre-deployment vs. Post-deployment Assessment: Timing and access constraints for third-party evaluators
  • Public vs. Private Reporting: Balancing transparency with information hazards and operational security
  • Real-world Harm vs. Academic Benchmarks: Gap between benchmark performance and actual societal impact
  • Resource Disparities: How countries with smaller budgets can participate in and benefit from shared evaluation work
  • Geopolitics & Technical Exchange: Maintaining scientific collaboration amid political tensions
  • Policy Translation: Converting evaluation findings into actionable government decisions on procurement, deployment, and regulation

Key Points & Insights

  1. NAMES Network Composition & Function: The international network comprises 10+ expert organizations (the UK AI Security Institute, the US Center for AI Standards and Innovation, the Singapore AI Safety Institute, and counterpart institutes in Australia, Japan, Korea, Kenya, and elsewhere) that conduct joint technical evaluations while maintaining national policy independence. The value lies in building trusted relationships and shared methodologies rather than dictating uniform policy outcomes.

  2. The Benchmark Gaming Crisis: Open benchmarks are systematically compromised because high commercial stakes incentivize overfitting and gaming. Frontier labs train directly on public benchmarks, rendering them useless. Private test sets, unannounced evaluations, and holdout data are necessary but create tensions with transparency ideals. Gaming can be subtle—e.g., submitting 34 variants of a model and reporting the best score.
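
The arithmetic behind variant gaming is easy to demonstrate. The simulation below is an illustration (not something presented by the panel): it scores 34 "variants" of identical true capability on a hypothetical 500-question benchmark and reports only the best, showing that selection noise alone manufactures an apparent improvement.

```python
"""Illustration: why best-of-N variant submission inflates benchmark scores.

Every simulated variant has the same true accuracy; the reported 'best'
score exceeds it through selection noise alone.
"""
import random

random.seed(0)

TRUE_ACCURACY = 0.70  # every variant's actual capability
N_QUESTIONS = 500     # hypothetical benchmark size
N_VARIANTS = 34       # the variant count cited in the summary above

def run_benchmark() -> float:
    """Score one variant: each question is an independent pass/fail trial."""
    correct = sum(random.random() < TRUE_ACCURACY for _ in range(N_QUESTIONS))
    return correct / N_QUESTIONS

scores = [run_benchmark() for _ in range(N_VARIANTS)]

print(f"true accuracy:       {TRUE_ACCURACY:.3f}")
print(f"mean observed score: {sum(scores) / len(scores):.3f}")
print(f"best-of-{N_VARIANTS} score:     {max(scores):.3f}")  # the number that gets reported
```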

  3. Pre-deployment Access is Critical: Evaluators currently gain access mainly at or near model release, leaving limited intervention points. The US CAISI and UK AISI have negotiated agreements with frontier labs for pre-deployment evaluations, enabling measurement during training and post-training stages, when course corrections are still possible.

  4. Agentic AI Requires New Evaluation Architectures: Testing autonomous AI agents is fundamentally different from testing static models. Challenges include:

    • Tracking tool calls, decision trees, and intermediate steps
    • Handling multilingual tool invocations (e.g., English tool names called from Japanese prompts)
    • Measuring emergent behaviors from agent-environment interactions
    • The inadequacy of existing cyber ranges and "capture-the-flag" approaches for multi-agent scenarios

    The network is conducting joint multilingual agentic testing exercises but acknowledges that this remains an immature field. A hypothetical trace-logging sketch follows below.
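
The panel did not describe a concrete logging format. As a purely hypothetical sketch of what "tracking tool calls, decision trees, and intermediate steps" could look like, the record below captures each call's position, arguments, and language context; every name and field here is invented for illustration.

```python
"""Hypothetical agent-trace schema for evaluation logging (illustrative only)."""
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    step: int             # position in the agent's multi-step plan
    tool_name: str        # tool identifier (often English, e.g. "web_search")
    arguments: dict       # arguments exactly as the agent supplied them
    prompt_language: str  # language of the surrounding task (e.g. "ja")
    result_summary: str   # truncated tool output retained for the transcript

@dataclass
class AgentTrace:
    task_id: str
    model_id: str
    calls: list[ToolCall] = field(default_factory=list)

    def record(self, call: ToolCall) -> None:
        """Append a call so evaluators can replay the full decision path."""
        self.calls.append(call)

# Example: a Japanese-language task invoking an English-named tool,
# the multilingual case noted in the challenge list above.
trace = AgentTrace(task_id="multilingual-agentic-001", model_id="model-under-test")
trace.record(ToolCall(
    step=1,
    tool_name="web_search",
    arguments={"query": "最新の気象データ"},  # "latest weather data" in Japanese
    prompt_language="ja",
    result_summary="3 results returned",
))
print(f"{len(trace.calls)} call(s) logged for task {trace.task_id}")
```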

  5. Measurement ≠ Policy: A core tension acknowledged by the panel is that governments must conduct rigorous, objective scientific evaluation while also making policy decisions informed by differing risk tolerances and national priorities. Evaluation provides evidence; policy interprets it. AISIs must demonstrate relevance by influencing actual government decisions on procurement, regulation, and deployment.

  6. Reproducibility & Transparency as Scientific Standards: Following presidential guidance on "gold standard science," the US CAISI published NIST AI 800-2 (best practices for automated benchmark evaluations), advocating for:

    • Clear specification of what is being measured and why
    • Transparent reporting of configuration, runs, and methodology
    • Publishing evaluation transcripts and raw data
    • Avoiding "chart crime"—misleading visualizations of results

    These practices are still emerging but represent best-practice targets; a sketch of run-level reporting follows below.
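
The transcript does not quote the publication's specific requirements. The sketch below simply illustrates the spirit of "transparent reporting of configuration, runs, and methodology": disclose the setup, publish every run, and report dispersion rather than a single best number. All values are placeholders.

```python
"""Sketch of run-level benchmark reporting (illustrative, not a NIST format)."""
import json
import statistics

# Configuration disclosed alongside results, per the practices listed above.
config = {
    "benchmark": "example-qa-v1",  # hypothetical benchmark name
    "model": "model-under-test",
    "temperature": 0.0,
    "n_runs": 5,
}

# Scores from repeated evaluation runs (placeholder values).
run_scores = [0.712, 0.698, 0.705, 0.721, 0.709]

report = {
    "config": config,
    "runs": run_scores,  # every run is published, not just the best
    "mean": round(statistics.mean(run_scores), 4),
    "stdev": round(statistics.stdev(run_scores), 4),  # dispersion, not one point
}
print(json.dumps(report, indent=2))
```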

  7. Real-world Harms Underrepresented in Benchmarks: Current academic benchmarks focus on clever adversarial attacks but miss systemic harms: fraudulent voice calls at scale, fake resumes, AI-enabled phishing, user vulnerability in AI companionship, and discrimination in hiring. System-level evaluation (model + deployment context + user interaction) is needed but remains underfunded and less prestigious than algorithmic robustness testing.

  8. Resource Sharing as Equity Strategy: Rather than expecting all nations to build capacity from scratch, the network publishes tools (e.g., the UK's Inspect framework, cyber evaluation tools), methodologies, and findings (e.g., the Frontier AI Trends Report, summarizing two years of evaluations across 30+ frontier models). This enables smaller AISIs and developing nations to leverage shared knowledge.
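
Inspect is distributed as the open-source inspect-ai Python package. A minimal task in the style of its public documentation looks roughly like the following; API details can vary between versions, and the dataset here is a toy placeholder.

```python
"""Minimal evaluation task using the UK AISI's Inspect framework.

Follows the hello-world pattern from Inspect's public docs (pip install
inspect-ai); exact APIs may differ across package versions.
"""
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def tiny_eval() -> Task:
    """Two toy samples: the model answers, match() scores the completion."""
    return Task(
        dataset=[
            Sample(input="What is the capital of France?", target="Paris"),
            Sample(input="What is 12 * 12?", target="144"),
        ],
        solver=generate(),  # single-turn generation; agent solvers also exist
        scorer=match(),     # checks the target against the model output
    )

# Run from a shell against a model of your choice, e.g.:
#   inspect eval tiny_eval.py --model <provider/model-name>
```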

  9. Trust Network as Primary Value: Onie (Singapore) emphasized that the network's core benefit is building institutions that trust each other and can exchange sensitive information. This enables rapid, coordinated response to emerging risks (e.g., picking up the phone when a novel threat emerges) and persistent collaboration despite changing policy priorities.

  10. Geopolitical Separation from Technical Collaboration: While acknowledging that geopolitical tensions exist, panelists stressed that the network operates primarily at the expert/technical level, insulating scientific exchange from diplomatic friction. Expert-to-expert Slack conversations and joint testing exercises proceed independently of higher-level diplomatic channels, accelerating knowledge transfer.


Notable Quotes or Statements

  • Austin (US CAISI): "Well, if I were wrong, you would have only needed one." (echoing Einstein's reported reply to the pamphlet "100 Authors Against Einstein") — framing CAISI's mission as providing the one correct scientific answer rather than blocking innovation.

  • Sara (Adaption Labs): "How do we return to private test sets? How do we really maintain true calibrated estimates of performance?" — diagnosing the core benchmark crisis.

  • Sara: "The whole point of benchmarking is to guide decisions and if you benchmark and you don't make a decision afterwards you're just collecting data." — questioning whether AISIs have permanence without decision-making impact.

  • Onie (Singapore AISI): "...this is the only network where we have a technical conversation with one another at a government level that allows us to exchange information in a trusted manner." — emphasizing the network's unique value as a trust-based technical forum.

  • Adam (UK AISI): "...there are things only governments can do because we hold either the data or the powers to be able to do something about different risks." — justifying government evaluation as distinct from industry or academic efforts.

  • Austin: "...the field is moving so quickly. We want our experts exchanging ideas... on Slack... How should this change the way we're doing our work?" — noting that rapid progress demands real-time expert coordination across borders.


Speakers & Organizations Mentioned

Panelists:

  • Adam Bowmont, Director, UK AI Security Institute (UK AISI)
  • Onie (name partially unclear), Head of AI Governance & Safety, Singapore Infocomm Media Development Authority (IMDA) / Singapore AISI
  • Austin Meron, Acting Director, US Center for AI Standards and Innovation (CAISI) / NIST
  • Sara Hooker, Co-founder, Adaption Labs (private AI research startup)
  • Chris (moderator, not further identified), hosted the panel discussion

Network members mentioned: Japan AISI, Australia AISI, Korea AISI, Kenya AISI; the EU was referenced in the context of standards.

Introduced for the following panel as the session closed:

  • Bachand BK, Lead, Startups & Innovation, Karnataka Digital Economy Mission
  • Sri Vishal Kumar Dave, Additional Chief Secretary, Odisha (India)
  • Dr. Mangala (IAS 2002), Karnataka (India)

Related institutions mentioned: National Institute of Standards and Technology (NIST), National Physical Laboratory (UK), Nanyang Technological University (Singapore Digital Trust Center)


Technical Concepts & Resources

Frameworks & Publications

  • NIST AI 800-2: "Best Practices for Automated Benchmark Evaluations" — NIST publication establishing standards for evaluation reporting, reproducibility, and transparency
  • Inspect Framework: Open-source evaluation framework published by the UK AISI and adopted by frontier labs and government institutions
  • Frontier AI Trends Report: Aggregated findings from the UK AISI's evaluations of 30+ frontier AI models over two years, addressing capabilities and risks
  • ETSI Standard for Secure AI Deployment: European standards document informed by the UK AISI's cyber evaluation findings

Evaluation Approaches

  • Cyber Ranges: Advanced environments for testing AI security capabilities (beyond basic "capture-the-flag" exercises)
  • Multilingual Agentic Evaluation: Joint testing exercise exploring how agents handle tool use and decision-making across languages
  • Private Test Sets: Unannounced, non-public datasets used to avoid gaming and maintain true performance calibration
  • Holdout Data: Portion of test data withheld from initial evaluation to detect overfitting post hoc (a sketch of this check follows the list)
  • Leaderboard Illusion: Research documenting how benchmark leaderboards are systematically gamed despite appearing competitive
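
As a sketch of the holdout idea above (an illustration, not a methodology attributed to the network): compare a model's score on the public split with its score on the withheld split; a large gap is evidence the public split leaked into training.

```python
"""Sketch: using a holdout split to flag benchmark overfitting (illustrative)."""

def overfit_gap(public_score: float, holdout_score: float) -> float:
    """Positive gap means the public split looks inflated relative to holdout."""
    return public_score - holdout_score

# Placeholder scores for two hypothetical models on the same benchmark.
models = {
    "model-A": (0.91, 0.88),  # small gap: consistent with genuine capability
    "model-B": (0.93, 0.71),  # large gap: consistent with training on the public split
}

GAP_THRESHOLD = 0.10  # illustrative cutoff, not a published standard

for name, (public, holdout) in models.items():
    gap = overfit_gap(public, holdout)
    flag = "SUSPECT" if gap > GAP_THRESHOLD else "ok"
    print(f"{name}: public={public:.2f} holdout={holdout:.2f} gap={gap:+.2f} [{flag}]")
```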

Risk Categories Evaluated

  • Cyber Security: Model capability to identify vulnerabilities, develop exploits, or enable attacks
  • Chemical/Biological: Capability to assist in development of harmful substances
  • Societal Impact: Effects on employment, discrimination, misinformation
  • Data Disclosure: Risks of unintended information leakage
  • Agentic Deployment Risks: Autonomous AI behavior in uncontrolled environments
  • Language & Cultural Representation: Gaps in model performance across languages and cultural contexts

Emerging Evaluation Topics

  • Intervenability & Rollback: Ability to intervene in or reverse AI deployments (underdeveloped)
  • Societal Impact Assessment: Broader systemic effects beyond technical benchmarks (acknowledged as underfunded)
  • Third-party Auditing Access: Pre-deployment vs. post-deployment evaluation windows and intervention points
  • Model Checkpointing: Aggressive checkpointing during training to enable rollback or analysis
  • Polysemanticity & Superposition: Architectural properties complicating model interpretability (raised as future challenge)

Tools & Methods

  • Benchmarking: Graduate-level science questions, coding challenges, multilingual translation
  • Transcripts: Recording and publishing exact evaluation runs for reproducibility
  • Gold Standard Science Principles: Transparency, reproducibility, reliability
  • Joint Testing Exercises: Coordinated multi-country evaluations on shared threat models

Important Caveats & Limitations

  • The transcript becomes fragmented and difficult to parse in the final ~10 minutes, with audio glitches and incomplete speaker identification.
  • The moderator (Chris) is never fully identified; several speakers are identified by first name or title only.
  • The discussion of geopolitics is acknowledged but not deeply explored; the panel emphasizes insulation of technical work from political pressures but does not detail specific geopolitical tensions.
  • No specific information on budget allocations or staffing numbers for most AISIs (except the UK: ~250 staff, ~100 technical), nor detailed timelines for network initiatives.
  • The transcript ends abruptly as a new panel is being introduced with India-based speakers.