Driving Social Good with AI: Evaluation and Open Source at Scale

Executive Summary

This panel discussion examines the intersection of AI evaluation, open-source software development, and social good applications, emphasizing the critical need for contextual, multilingual, and culturally-sensitive safety evaluations. The speakers argue that open-sourcing evaluation tools and frameworks is essential for enabling organizations—especially nonprofits and those serving global majority regions—to responsibly deploy AI systems, while acknowledging significant challenges around maintainability, agentic AI contributions, and the scalability of human-in-the-loop evaluation processes.

Key Takeaways

  1. Red team before you benchmark: Don't build expensive benchmarks without first identifying failure modes through contextual red teaming with subject-matter experts. Generic metrics measure the wrong things.

  2. Open-source evaluation tools are infrastructure for equity: Releasing evaluation software openly is relatively low-risk and high-impact, enabling organizations without resources to build evaluation layers into their AI systems—essential for responsible AI in the Global Majority.

  3. The open-source community needs clearer policies on AI-generated contributions: Automated PRs threaten the human process and social capital that sustain open source. Projects need explicit governance guidelines, and community norms should push back against treating code as a commodity.

  4. Context is irreducible; humility is necessary: No evaluation framework will achieve comprehensive context coverage across geographies and cultures. Organizations must remain humble, document their limitations publicly, and iterate with community input.

  5. AI evaluation is a multi-stakeholder, non-technical discipline: Success requires collaboration between engineers, program staff, domain experts, and affected communities. The field is early—similar to software testing 50+ years ago—and everyone has a role to play.

Key Topics Covered

  • AI Safety & Evaluation Methodologies: Contextual evaluation vs. hyperparameter optimization; AI red teaming; multicultural and multilingual testing
  • Open Source Software in AI Evaluation: Benefits and risks of releasing evaluation tools as open-source; ecosystem dynamics; community sustainability
  • Agentic AI & Open Source Maintainability: Impact of AI-generated pull requests; governance challenges; the future of automated contributions
  • Benchmarking Challenges: Problems with standardized benchmarks; importance of problem-space definition; resource constraints in low-capacity organizations
  • Global Majority & Localization Issues: Contextual evaluation for diverse geographies (India, sub-Saharan Africa, etc.); language-specific challenges; cultural sensitivity in safety guardrails
  • Scaling Red Teaming & Human-in-the-Loop Processes: Automation opportunities; ontological approaches; the irreducibility of human judgment
  • Community & Social Capital in Open Source: Credential systems; the human process underlying open source; challenges from automated contributions
  • Nonprofits & Social Sector AI: Capacity constraints; the gap between app development ease and evaluation rigor; need to make evaluation accessible to non-technical stakeholders

Key Points & Insights

  1. Contextual evaluation is non-negotiable: Safety evaluations must account for geographic, cultural, and linguistic context. Generic safety guardrails can actively harm users (e.g., an HIV prevention chatbot where discussions of sexual health are flagged as unsafe by default models but are essential to the service's mission). A minimal policy sketch follows this list.

  2. Open-source evaluation infrastructure democratizes AI safety: By releasing evaluation software openly, organizations in resource-constrained regions don't need to "reinvent the wheel"—they can adopt tested frameworks and adapt them locally, multiplying the impact of safety work and reducing duplication of effort.

  3. Benchmarks are often built on false premises: Organizations frequently build benchmarks without first identifying what they actually need to measure. Red teaming should precede benchmarking to identify real failure modes; benchmarking a wrong criterion is computationally expensive and unhelpful.

  4. AI-generated code contributions create maintenance crises: Large automated pull requests (particularly during events like Hacktoberfest) overwhelm already under-resourced maintainers. The Matplotlib incident showed how agentic AI can escalate conflict and damage community trust when its contributions are rejected.

  5. Human-in-the-loop is not eliminable at scale: LLM-as-judge approaches are susceptible to the same biases and language capability gaps as the models being evaluated. Spot-checking with humans—even at 0.5%—remains essential and cannot be fully automated. A sampling sketch follows this list.

  6. Open source and open data are decoupled concepts: Organizations often conflate "open-source software" with "open data." You can have proprietary software generating open data or open-source software generating closed data. This distinction is critical for evaluation transparency and data governance.

  7. Multicultural red teaming requires subject-matter expertise, not just linguistic diversity: Effective evaluation requires experts in domains like public health, food security, and education to design scenarios that reflect real-world stakes and use cases. Language mixing and adversarial prompting are useful techniques but insufficient without domain knowledge.

  8. The open-source community depends on social capital, not just code: Automated contributions erode the credentialing systems, badge systems, and human recognition that motivate participation. Maintainers and contributors value the social process; slop code threatens the ecosystem's fundamental incentive structure.

  9. Ontological approaches can scale red teaming methodology: Mapping problem spaces (human rights clauses, demographic structures, power relations) into ontologies before creating test scenarios ensures coverage is representative and replicable across model iterations and helps prioritize effort.

  10. Evaluation is not just a technical problem: Program staff, designers, nonprofit workers, and domain experts are as important as software engineers. Non-technical audiences must understand that building an AI chatbot is easy; evaluating whether it works safely for their use case is hard and requires their expertise.
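
To make point 1 concrete, here is a minimal Python sketch of the kind of domain-aware moderation layer the HIV-prevention example implies. The topic labels and configuration are hypothetical illustrations, not the panelists' implementation; the idea is simply that a deployment can allow topics a generic guardrail flags when those topics are central to its mission, while keeping genuinely harmful ones blocked.

```python
from dataclasses import dataclass, field

@dataclass
class ContextualSafetyPolicy:
    """Domain-aware moderation layered on top of a generic safety guardrail."""
    mission_critical_topics: set = field(default_factory=set)  # topics the service exists to discuss
    always_blocked_topics: set = field(default_factory=set)    # topics that stay blocked regardless

    def allow(self, flagged_topics: set) -> bool:
        """Decide whether a response flagged by the generic guardrail should pass."""
        if flagged_topics & self.always_blocked_topics:
            return False                                        # hard blocks always win
        return not (flagged_topics - self.mission_critical_topics)  # only mission topics were flagged

# Hypothetical configuration for an HIV-prevention chatbot.
policy = ContextualSafetyPolicy(
    mission_critical_topics={"sexual_health", "hiv_prevention", "condom_use"},
    always_blocked_topics={"self_harm_instructions"},
)
print(policy.allow({"sexual_health"}))                              # True: essential to the mission
print(policy.allow({"sexual_health", "self_harm_instructions"}))    # False: hard block applies
```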
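
Point 5 can be sketched the same way. The record format, verdict labels, and 0.5% rate below are assumptions drawn from the discussion rather than a described tool; the pattern is to route a small random slice of LLM-as-judge verdicts to human reviewers and track agreement.

```python
import random

def sample_for_human_review(judged_records, sample_rate=0.005, seed=0):
    """Pick a random slice (default 0.5%) of automated verdicts for human spot-checking.

    judged_records: list of dicts such as
    {"id": 17, "prompt": "...", "response": "...", "judge_verdict": "safe"}.
    """
    rng = random.Random(seed)
    k = max(1, round(len(judged_records) * sample_rate))  # always review at least one item
    return rng.sample(judged_records, k)

def judge_agreement(sampled, human_labels):
    """Fraction of sampled items where the human label matches the LLM judge."""
    hits = sum(1 for r in sampled if human_labels.get(r["id"]) == r["judge_verdict"])
    return hits / len(sampled)
```

A falling agreement rate for a particular language or domain is the cue to pull more of that slice back to human reviewers rather than trusting the automated judge.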


Notable Quotes or Statements

"It's very hard to get comprehensive context coverage. You'd make the best effort that you can, but there would still be places where it would remain unaddressed. We keep working at it and improve it. Realizing and be humble about it—it's a work in progress." — Opening remarks on contextual evaluation challenges

"If we are trying to build safer applications, build more robust applications in the global majority—in India—like, we do think open source is actually a big part of doing that." — TIAL representative, on open source's role in equitable AI safety

"The idea of generating a bunch of slop code and throwing that into a pull request diminishes the idea [of open source] and makes the already difficult job of maintainers even more impossible." — Discussion of AI-generated contributions' impact on maintainers

"We actually don't want the safeguards that the default models are operating with. ... We think that's counterproductive to the kind of support we want to provide." — Nonprofit case study: HIV prevention chatbot challenging default safety guardrails

"If you take somebody who knows nothing about human rights and then they create a policy around whether an output about human rights is good or bad, I would say that's not a good thing for the world. But that's probably going to happen regardless." — Humane Intelligence panelist, on risks of unqualified evaluators

"In the west, we have additive architecture. But here in India and a lot of eastern cultures, you have reductive architecture. ... That's kind of what AI evaluations are." — Analogy comparing Western software development to AI evaluation processes

"The second step—figuring out whether that bot is working for your use case—is where there is actually less investment at the moment." — On the gap between AI app development ease and evaluation rigor


Speakers & Organizations Mentioned

  • Humane Intelligence: AI safety evaluation, red teaming, releasing evaluation software as open source
  • NumFocus: Nonprofit fiscal sponsor for foundational AI/data science projects (NumPy, SciPy, Pandas, etc.); governance work on open-source policy
  • TIAL: Online harms research (6+ years); focus on global majority geographies; evaluation stack work
  • Google.org: Supporter of Humane Intelligence's open-source red teaming tool release
  • IIT Madras AI for Bharat: Launched Indic LM Arena (adapted from Berkeley's LM Arena for Indian languages and context)
  • GitHub: Mentioned for considering AI-generated PR tagging; previous employer of one panelist
  • Tech for Dev: Nonprofit cohort organization; subject of evaluation study
  • ML Commons: Mentioned for benchmark-of-benchmarks research
  • University of Moratuwa (Sri Lanka): Historical leader in Google Summer of Code contributions before Indian IITs
  • Indian Institutes of Technology (IITs): Major contributors to the open-source ecosystem; host of AI for Bharat labs

Identified Speakers:

  • Sank Dharma — Board member, NumFocus; open-source maintainer (focus on maintainability in the LLM age)
  • VA/Adar — Humane Intelligence (technical lead for red teaming software release)
  • Panelists from TIAL, IIT Madras, and other institutions (names not fully transcribed)

Technical Concepts & Resources

Methodologies & Approaches

  • AI Red Teaming: Structured scenario creation with subject-matter experts to probe model vulnerabilities; adapted from cybersecurity practices
  • Contextual Evaluation: Problem-space-focused evaluation vs. generic hyperparameter optimization or industry benchmarks
  • Ontological Approaches: Mapping domain knowledge (e.g., human rights clauses, power structures, demographics) to structure red teaming scenarios and ensure coverage (illustrated in the sketch after this list)
  • Human-in-the-Loop Evaluation: Mandatory spot-checking (even at 0.5%) to validate LLM-as-judge outputs; accounts for language capability gaps and bias amplification
  • Adversarial Machine Learning: Techniques from vision/ML (e.g., adversarial attacks, black-box/white-box red teaming) adapted for LLMs
  • Multicultural Red Teaming: Using language mixing, code-switching, different scripts, and culturally-specific prompts to identify vulnerabilities
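
As a rough illustration of the ontological approach above, the axes and values below are placeholders, not the panelists' ontology; the point is that crossing expert-defined axes before any prompts are written makes coverage enumerable, auditable, and repeatable across model versions.

```python
from itertools import product

# Illustrative ontology; in practice each axis comes from subject-matter experts
# (e.g. human-rights clauses, demographic structure, power relations).
ontology = {
    "rights_clause": ["freedom of movement", "access to education", "bodily autonomy"],
    "demographic": ["adolescent girl", "migrant worker", "older rural woman"],
    "power_relation": ["employer vs. employee", "official vs. citizen", "platform vs. user"],
}

def scenario_cells(ontology):
    """Enumerate every combination of ontology values as a red-teaming scenario seed."""
    axes = list(ontology)
    for values in product(*(ontology[axis] for axis in axes)):
        yield dict(zip(axes, values))

cells = list(scenario_cells(ontology))
print(len(cells))   # 27 cells to prioritize and hand to domain experts for prompt writing
print(cells[0])
```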

Tools & Frameworks (Released/Announced)

  • Humane Intelligence's AI Red Teaming Software — Open source release planned for later in the year; aims to make evaluation accessible to organizations without in-house evaluation expertise
  • Indic LM Arena — IIT Madras community-driven adaptation of Berkeley's LM Arena for evaluating models on Indian languages and contexts
  • Model Cards / Eval Cards — Proposed interoperable standard for documenting evaluation methodologies and outputs; work in progress toward standardization
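
Since eval cards are still being standardized, the record below is only a guess at the kind of fields an interoperable card might carry (methodology, coverage, judge setup, known gaps); it is shown as a plain dict so it can round-trip through JSON or YAML.

```python
import json

# Hypothetical eval card; field names are illustrative, not an agreed standard.
eval_card = {
    "system_under_test": "example-health-chatbot",
    "evaluation_type": "contextual red teaming",
    "languages_covered": ["en", "hi", "ta"],
    "domains": ["public health", "sexual health"],
    "methodology": "expert-designed scenarios derived from a rights/demographics ontology",
    "judge": {"type": "LLM-as-judge", "human_spot_check_rate": 0.005},
    "known_gaps": ["low-resource dialects", "code-switched slang"],
    "last_updated": "2025-01-01",
}
print(json.dumps(eval_card, indent=2, ensure_ascii=False))
```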

Datasets & Benchmarks

  • LM Arena (Berkeley) — Benchmark for evaluating LLM performance; basis for Indic LM Arena adaptation
  • Google Summer of Code — Mentioned as enabling open-source contribution culture in India

Open-Source Projects/Languages Mentioned

  • OCaml (functional programming language used in security contexts) — Example of a problematic AI-generated PR (13,000 lines, PR #14363)
  • Matplotlib — Python plotting library; prominent example of agentic AI PR rejection and subsequent backlash
  • NumPy, SciPy, Pandas — Foundational scientific Python libraries (NumFocus fiscal sponsorship)
  • Hacktoberfest — Annual event where AI-generated PRs have created maintenance burden; example of unintended consequences of gamification

Governance & Policy Concepts

  • NumFocus Policy Development: Developing organizational-level policies on AI-generated contributions to the scientific open-source stack
  • GitHub PR Labeling: Under consideration to tag AI-generated PRs for transparency and maintainer filtering
  • CBSE (Central Board of Secondary Education) — Indian national school board; mentioned as context for localized benchmarking (e.g., Class 5 mathematics learning outcomes)

Languages & Localization

  • Indic Languages (Hindi, Tamil, Telugu, etc.) — Specific focus; current LLMs struggle to generate natural text in these languages; punctuation/script variations can jailbreak models
  • Code-Switching (multilingual prompts) — Recognized adversarial technique for red teaming (see the perturbation sketch after this list)
  • Yoruba (Nigerian language) — Case study example in healthcare benchmarking context
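
A small sketch of the perturbation idea referenced above; the helper name and variants are illustrative, and real translations would come from native speakers rather than machine output. The point is to test one intent across native scripts, code-switched phrasing, and the punctuation or whitespace tweaks that have been observed to slip past guardrails.

```python
def prompt_variants(prompt_en, native_versions):
    """Expand one red-team prompt into script, code-switching, and punctuation variants.

    native_versions: dict mapping a language tag to a translation or transliteration
    written by a native speaker, e.g. {"hi": "<Hindi version>", "ta": "<Tamil version>"}.
    """
    variants = [prompt_en]
    for lang, text in native_versions.items():
        variants.append(text)                      # fully localized, native script
        variants.append(f"{prompt_en} / {text}")   # simple code-switched mix
    # Punctuation and whitespace tweaks that sometimes change guardrail behaviour.
    variants.append(prompt_en.replace(" ", "  "))
    variants.append(prompt_en.rstrip("?.") + " ??")
    return variants

print(prompt_variants(
    "What should I tell my partner about my HIV status?",
    {"hi": "<Hindi translation supplied by a native speaker>"},
))
```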

Domain-Specific Applications & Use Cases

  • Public Health / Primary Healthcare — Maternal health, sex determination bias, hallucinations
  • Food Security
  • Education — Girl-child education, Class 5 mathematics learning outcomes
  • HIV Prevention & Sexual Health — Nonprofit case study where default safety guardrails actively harm
  • Mental Health & Well-being
  • Human Rights — Ontological approaches; abuse and conflict contexts

Document Metadata:

  • Conference: AI Summit (location: India, based on context)
  • Talk Length: ~90+ minutes (full panel discussion)
  • Format: Panel discussion with Q&A
  • Key Themes: Social good + AI safety + open source + Global Majority perspectives