Scaling Intelligence: NVIDIA NeMo and the Future of National AI | AI Impact Summit 2026

Contents

Executive Summary

Bernard Wyn, Director of Engineering at NVIDIA, presented a comprehensive overview of advancing large language model (LLM) training at scale through NVIDIA's open-source NeMo framework. The talk emphasized the evolution from pre-training scaling to post-training and test-time scaling, with particular focus on reinforcement learning (RL) techniques that enable reasoning capabilities and agentic AI systems. NVIDIA's open-source commitment—providing not just model weights but also datasets, training recipes, and libraries—aims to democratize advanced AI development.

Key Takeaways

  1. Post-training RL is now the primary scaling lever: Iterative GRPO loops on fixed-size models yielded a 67% gain on the Artificial Analysis Intelligence Index (NeMo V2 → V3)—making efficient training and reasoning more important than raw parameter scaling.

  2. Agentic AI requires orchestrated ecosystems, not monolithic models: Success demands specialized agents using external tools, shared memory, and multi-agent coordination (e.g., NeMo Gym), not just a single large LLM.

  3. Long-horizon tasks remain unsolved: Models still fail catastrophically on tasks requiring 100+ sequential steps; solving self-correction and error accumulation is the next frontier.

  4. Open-source AI infrastructure is NVIDIA's competitive advantage: Providing complete training pipelines (data, code, recipes) on GitHub enables broader adoption and community contributions—more powerful than closed models alone.

  5. Hardware efficiency gains are orthogonal to algorithms: A 5x GPU upgrade (H100→GB300) compounds algorithmic improvements, but only if training recipes are co-optimized for new hardware (Megatron Bridge).

Key Topics Covered

  • Pre-training vs. Post-training Scaling: Evolution from scaling model size to scaling through fine-tuning and reasoning
  • Open-Source AI Philosophy: NVIDIA's commitment to releasing complete training artifacts (weights, datasets, recipes, libraries, blueprints)
  • NeMo Framework Architecture: End-to-end tools spanning data generation, pre-training, post-training, and inference
  • Reinforcement Learning for Reasoning: GRPO (Group Relative Policy Optimization) algorithm and test-time thinking
  • Agentic AI Systems: Multi-agent architectures with tool use, memory management, and specialized agents
  • Model Efficiency: Mixture of Experts (MoE) architecture enabling parameter efficiency at inference time
  • NeMo Gym: Orchestration framework for RL training with multiple environments and verifiers
  • NeMo RL: Post-training framework supporting multiple RL algorithms (GRPO, DPO, PPO, etc.)
  • NeMo 3 Models: Nano (3B active), Super, and Ultra variants with efficiency benchmarks
  • Hardware Optimization: Recipes for H100, Grace Blackwell GPUs with FP8, BF-16, and emerging NVFP4 precision

Key Points & Insights

  1. From Size Scaling to Refinement Scaling: The field has shifted from believing "bigger models = smarter models" to understanding that intelligent post-training iteration (via GRPO loops) can improve model capability by 67% without increasing parameter count, as demonstrated by the NeMo V2→V3 improvement.

  2. Test-Time vs. Pre-Training Scaling Laws: Modern scaling laws now emphasize iterations in post-training rather than model size growth, allowing edge deployment of capable models that previously required massive parameter counts.

  3. Reasoning as Learnable Capability: Models can be trained to "think" (generate reasoning tokens) through RL, with a critical trade-off between thinking depth and inference speed—too much thinking becomes inefficient, but the right amount unlocks multi-step reasoning.

  4. Reward Model Diversity Drives Improvement: Training uses multiple reward models simultaneously (verifiable rewards for math/code, model-based LM-as-judge for subjective qualities), enabling the model to improve across multiple dimensions in a single GRPO loop.

  5. Long-Horizon Task Limitation: Current models struggle with tasks requiring 100+ sequential steps because error propagation (e.g., 99% per-step accuracy = ~37% total success rate) remains unsolved; future work requires models to detect off-course execution and self-correct.

  6. Mixture of Experts Efficiency: NeMo 3 Nano uses 30B total parameters but only activates 3B at inference (10%), achieving 350-360 tokens/second—significantly outperforming same-weight-class competitors (GPT-4o 20B, Qwen 32B).

  7. Agentic AI as Tool-Mediated Reasoning: Agents accomplish complex tasks by routing work to specialized LLMs (reasoning, vision, information retrieval), accessing external tools (search, code execution, APIs via MCP protocol), and maintaining context through shared GPU memory (NVLink).

  8. Specialized Small Models Over One Large Model: Domain-specific smaller models (trained via RL on specialized data) are more cost-effective, easier to adapt, and enable IP control when deployed in trusted data centers—supporting "sovereign AI" themes.

  9. Hardware Performance Scaling: Moving from H100 to Grace Blackwell (GB300) yields 3-4x efficiency gains in standard precisions (BF-16, FP8), with NVFP4 expected to deliver 5x+ gains, enabling 5-day training jobs to complete in 1 day without algorithmic changes.

  10. Reproducibility Through Complete Open-Source Release: Unlike competitors releasing only weights, NVIDIA provides datasets, training recipes, libraries (Megatron, Nemo RL/Gym), and blueprints on GitHub—enabling researchers to reproduce and extend models rather than merely use them.
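The error-accumulation arithmetic in point 5 can be checked directly: if each step succeeds independently with probability p, an n-step task succeeds end-to-end with probability p**n.

```python
# Compound success probability for long-horizon agentic tasks:
# if each step succeeds independently with probability p,
# an n-step task succeeds end-to-end with probability p**n.

def task_success_rate(p: float, n: int) -> float:
    """Probability that all n sequential steps succeed."""
    return p ** n

success = task_success_rate(0.99, 100)
print(f"{success:.3f}")  # ~0.366, i.e. roughly the 37% figure cited above
```

This is why per-step accuracy that sounds excellent (99%) still collapses over long horizons, and why self-correction—rather than further per-step gains alone—is framed as the next frontier.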


Notable Quotes or Statements

  • "We try to provide everything that you need to learn and reproduce the models that come out of NVIDIA." — Bernard Wyn, on NVIDIA's open-source philosophy (contrasted with releasing only weights).

  • "If I want my model to be extremely intelligent, then it needs to get bigger and bigger and bigger. But that's not particularly practical because at some point you run out of resources." — On the fundamental limitation driving the shift from pre-training to post-training scaling.

  • "The models can actually think. You can ask it to think harder about a question or a prompt, and it'll give a more intelligent answer." — Describing test-time scaling and reasoning capabilities.

  • "At the end of 100 steps with 99% per-step accuracy, there's probably a 20% chance of success—which is a pretty low rate." — Illustrating the exponential error accumulation problem in long-horizon tasks.

  • "The algorithms in the RL space are evolving very quickly—what was state-of-the-art yesterday becomes outdated soon thereafter." — On the rapid evolution of post-training methods (SFT → PPO/DPO → GRPO).


Speakers & Organizations Mentioned

  • Bernard Wyn — Director of Engineering, NVIDIA; presenter and author of remarks on NeMo framework
  • NVIDIA — Primary organization; develops NeMo framework, produces models (NeMo 3 Nano/Super/Ultra)
  • OpenAI — Mentioned as non-open-source (ChatGPT), recent shift toward more openness
  • Implicit references: Competitors including GPT-4o, Claude Sonnet 4.6, DeepSeek V3, Qwen 3 (context for benchmarks)
  • CES (January 2026) — Jensen (CEO, presumed Huang) presented Alpha Mayio robotics work

Technical Concepts & Resources

Models & Architecture

  • NeMo 3 Nano/Super/Ultra — Efficiency-focused LLMs with Mixture of Experts (active parameters: 3B/10B/50B)
  • Hybrid Mamba 2 + Attention — Architecture combining efficient Mamba 2 blocks with sparse attention layers
  • Mixture of Experts (MoE) — Only activates a fraction of parameters at inference (e.g., 10% for 500B model)
  • 1 Million Token Context — Extended context window for NeMo 3
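The MoE efficiency claim—a small fraction of parameters active per token—can be illustrated with a toy top-k router. This is a minimal sketch of the general technique, not NVIDIA's actual router; the expert count, dimensions, and gating scheme here are illustrative assumptions.

```python
import numpy as np

# Toy Mixture-of-Experts routing: a gating network scores all experts per
# token, but only the top-k are evaluated, so the active parameter count is
# a small fraction of the total (e.g. 1 of 10 experts ~ "10% active").

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, D = 10, 1, 16

experts = [rng.standard_normal((D, D)) for _ in range(N_EXPERTS)]  # expert weights
gate_w = rng.standard_normal((D, N_EXPERTS))                       # gating network

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route token x through only the top-k scoring experts."""
    scores = x @ gate_w                          # gate logits, shape (N_EXPERTS,)
    top = np.argsort(scores)[-TOP_K:]            # indices of the selected experts
    w = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over chosen
    # Only TOP_K expert matmuls actually execute:
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_forward(rng.standard_normal(D))
```

All 10 expert weight matrices exist in memory (total parameters), but each forward pass touches only one (active parameters)—the same ratio logic behind "30B total, 3B active".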

Training Frameworks & Tools

  • NeMo Framework — NVIDIA's unified training platform
    • NeMo RL — Post-training via reinforcement learning; supports GRPO, PPO, DPO, GSPO, DAPO
    • NeMo Gym — Orchestration layer managing environments, verifiers, and distributed RL rollouts
    • Megatron Bridge — Hardware-optimized training recipes (H100, Grace Blackwell)
    • Automodel — PyTorch-native backend for HuggingFace model compatibility
    • Data Designer — Synthetic data generation for augmenting training datasets

Training Algorithms

  • SFT (Supervised Fine-Tuning) — Traditional imitation learning with prompt-answer pairs
  • GRPO (Group Relative Policy Optimization) — Core RL algorithm: samples N generations per prompt, rewards them, updates policy via relative ranking
  • RLVR (RL with Verifiable Rewards) — GRPO using verifiable reward models (math, code, instruction-following)
  • RLHF (Reinforcement Learning from Human Feedback) — Preference optimization using human-labeled data
  • Older methods: PPO, DPO (referenced as preceding GRPO in timeline)
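The "group relative" part of GRPO can be sketched in a few lines: sample N completions per prompt, reward each, then score every completion against its own group's mean and standard deviation. This is an illustrative sketch of the advantage computation only, not the NeMo RL API.

```python
import statistics

# Group-relative advantage, the core of GRPO: completions scoring above their
# group's mean reward get positive advantage (upweighted in the policy update);
# completions below it get negative advantage.

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one prompt's group of N sampled generations."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one prompt, scored by a verifiable reward
# (e.g. math correctness): two right, two wrong.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)  # correct answers get +1.0, incorrect get -1.0
```

Because the baseline is the group itself, no separate value network is needed—one reason GRPO is cheaper to run in iterative post-training loops than classic PPO.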

Reward Models

  • Verifiable Rewards — Rule-based, deterministic (math correctness, code execution, instruction compliance)
  • Model-Based Rewards — LLM-as-judge; subjective dimensions: helpfulness, coherence, accuracy, complexity, verbosity
  • Safety/Helpfulness Steering — Dataset with 5 categories (safety, helpfulness, accuracy, coherence, complexity)
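A verifiable reward, in the spirit of RLVR, is just a deterministic rule-based check—no learned judge involved. The exact-match math grader below is an illustrative sketch, not NVIDIA's verifier implementation.

```python
# Minimal verifiable reward: deterministic, rule-based math grading that
# returns 1.0 or 0.0 — contrast with model-based LLM-as-judge rewards,
# which score subjective dimensions like helpfulness or coherence.

def math_reward(model_answer: str, reference: str) -> float:
    """1.0 if the model's final answer matches the reference, else 0.0."""
    normalize = lambda s: s.strip().rstrip(".").lower()
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0

print(math_reward("  42. ", "42"))   # 1.0 — formatting differences forgiven
print(math_reward("41", "42"))       # 0.0
```

In a single GRPO loop, several such reward functions (math, code execution, instruction compliance) can run alongside an LLM judge, which is what lets one training pass improve multiple dimensions at once.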

Inference & Deployment

  • vLLM — Supported inference backend for RL rollouts
  • Megatron Inference — Optimized inference engine
  • SGLang — Emerging inference backend (upcoming support)
  • Ray — Orchestration framework for distributed RL loops
  • MCP (Model Context Protocol) — Standard for exposing external tools/APIs to agents

Data & Datasets

  • 15 Categories of Pre-training Data — Diverse pre-training corpus (unspecified in transcript)
  • Open-source Benchmarks — Math reasoning, science reasoning, coding, instruction-following, agentic tool use, long context (provided via Nemo Gym)
  • HuggingFace Integration — Models published to HuggingFace hub; automodel backend supports any HuggingFace model

Precision & Quantization

  • BF-16 — Baseline 16-bit floating point
  • FP8 (MXFP8) — 8-bit quantization; ~2x efficiency vs. BF-16 while matching accuracy
  • NVFP4 — Emerging 4-bit format; expected 5x+ efficiency gains (recipes in development)
  • Quantization Workflows — Post-training quantization (PTQ) to shrink BF-16 models to FP8 without retraining, plus QAT (Quantization-Aware Training) where accuracy recovery is needed

Hardware & Optimization

  • NVIDIA H100 GPU — Baseline for benchmarks
  • Grace Blackwell (GB300) — Latest GPU; 3-4x efficiency improvements over H100
  • NVLink — High-speed GPU interconnect enabling shared memory across the agent ensemble
  • Teraflops — Metric for GPU efficiency; FP8 and NVFP4 enable higher throughput than BF-16

Agentic AI Components

  • Tool Calling — Models invoke external functions (search, calculator, code execution, file access)
  • Memory Management — Agents share context via GPU memory (NVLink) rather than network passes
  • Multi-Agent Orchestration — Specialized agents (travel booking, rental cars, planning) communicate and collaborate
  • Perception-Reasoning-Action Cycle — Agent loop: understand prompt → reason through options → select action → execute tool
  • Blueprints — Pre-built agentic workflows (report generation with reasoning, video summarization, itinerary planning)

Resources & Results

  • GitHub: NVIDIA NeMo — Central hub for all frameworks, models, and datasets
  • NeMo V2 → V3 Performance — 15 → 25 on the Artificial Analysis Intelligence Index (67% improvement via RL post-training alone)
  • Verifiable Reward Environments — Math, science, coding, instruction-following, agentic tool use, long-context
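The perception–reasoning–action cycle described above can be sketched as a toy loop. The tool registry, the hard-coded tool choice, and the `run_agent` helper are all hypothetical stand-ins—a real agent would prompt an LLM to pick the tool and would expose tools via MCP rather than a Python dict.

```python
# Illustrative perception -> reasoning -> action loop (hypothetical tools,
# not a real NeMo or MCP API): the agent selects a tool, executes it, and
# feeds the observed result back into its working context.

def calculator(expr: str) -> str:
    """Toy tool: evaluate a pure-arithmetic expression."""
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}          # tool registry (stand-in for MCP)

def run_agent(task: str) -> str:
    context = [task]                        # shared working memory
    # "Reasoning" stand-in: a real agent would ask an LLM which tool to call.
    tool_name, args = "calculator", task
    result = TOOLS[tool_name](args)         # action: execute the chosen tool
    context.append(result)                  # perception: observe the result
    return context[-1]

print(run_agent("17 * 3"))  # -> "51"
```

The talk's multi-agent picture is this loop replicated per specialist agent, with the `context` list replaced by memory shared over NVLink rather than passed over the network.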

Benchmarking & Evaluation

  • Artificial Analysis Intelligence Index — Leaderboard metric for model ranking
  • Comparison Models: GPT-4o (20B), Qwen 3 (32B), DeepSeek V3, Claude Sonnet 4.6, GPT-5.2
  • Throughput Metric: Tokens per second (NeMo 3 Nano: 350-360 tokens/sec vs. competitors)
  • Generation Length Tracking — Monitors "thinking token" usage to balance reasoning depth vs. latency

Additional Context

  • Conference: AI Impact Summit 2026
  • Presentation Date: ~February 2026 (references to GPT-5.2 and Claude Sonnet 4.6 released early February; transcript has minor inconsistencies suggesting live delivery with asides)
  • Scope: Practical guide to reproducing state-of-the-art model training using open-source NVIDIA tools; not theoretical research paper
  • Audience: Practitioners, researchers, engineers considering building/fine-tuning LLMs or agentic systems