The Intelligent Cloud: How AI Is Transforming the Cloud-Native World
Executive Summary
This talk explores how Kubernetes and cloud-native infrastructure are becoming the foundational operating system for AI workloads, addressing the complexity of deploying intelligent applications at scale. The speakers argue that modern AI infrastructure requires not just distributed computing capabilities but also composable, multi-cloud platforms with unified control planes, and present k0rdent as an open-source solution that standardizes AI infrastructure provisioning.
Key Takeaways
- Kubernetes is no longer just for containerized apps: it is becoming the universal control plane for AI infrastructure, with 52% of cloud-native developers now AI/ML practitioners.
- Multi-tenancy and GPU optimization require orchestration at parity with cluster orchestration: multi-cluster GPU management must sit alongside multi-cluster Kubernetes management to avoid waste and fragmentation.
- Platform engineering for AI is a five-step discipline: developer experience → security → CI/CD → resilience → cost management. Skipping any step leads to operational debt and developer friction.
- Composable, template-driven infrastructure reduces setup time from weeks to minutes, democratizing AI infrastructure and freeing developers to focus on models rather than platform configuration.
- The future is integrated MCP servers: standardized interfaces that let clusters and applications autonomously interact with multiple services (observability, cost management, security) without custom integrations.
Key Topics Covered
- Kubernetes as the OS of the Future: Kubernetes' role as the universal control plane for cloud-native and AI workloads
- Cloud-Native AI Adoption: Current state and growth of AI/ML developers on Kubernetes (52% of 15.6M cloud-native developers)
- Infrastructure Complexity: Multi-cloud, multi-cluster environments and the challenges they introduce
- AI Infrastructure Challenges: Technical complexity, operational efficiency, multi-tenancy, and GPU resource optimization
- Kubernetes AI Conformance: Standards ensuring portable AI workloads across different Kubernetes distributions
- Open-Source AI Pillars: Training, inference, and agents as the three core components of modern AI stacks
- Platform Engineering for AI: A five-step framework for building resilient, AI-native platforms
- GPU Provisioning & Multi-Tenancy: Solutions for efficient GPU utilization across multiple tenants
- k0rdent Project: A composable, template-based approach to infrastructure as a service for AI-ready Kubernetes
- MCP Servers: Standardized interfaces for integrating cloud-native tooling
Key Points & Insights
- Kubernetes Adoption in AI Is Accelerating: 36% of cloud-native developers already run AI workloads on Kubernetes, with a further 18% in active planning phases, demonstrating rapid mainstream adoption beyond traditional use cases.
- Multi-Cloud Complexity Has Become the Norm: organizations are no longer single-cloud or single-cluster; they manage dozens or hundreds of clusters across multiple regions and providers. Traditional configuration management (hand-written YAML, IaC tools) becomes unscalable and fragmented at that scale.
- GPU Resource Management Is a Critical Bottleneck: high-end GPUs cost over $20,000 each and face global supply constraints (6+ weeks to provision on hyperscalers in some regions). Multi-tenancy, time-slicing, and fragmentation are major challenges without proper orchestration.
- Kubernetes AI Conformance Addresses Portability: the CNCF-backed Kubernetes AI Conformance initiative ensures AI workloads are portable across GKE, AKS, and private-cloud environments through standardized hardware-accelerator, operator, and security requirements.
- The Five-Pillar Platform Engineering Framework: successful AI platforms require (1) developer experience, (2) security/compliance, (3) a CI/CD foundation, (4) resilience engineering, and (5) cost management, not just infrastructure.
- Composability Over Infrastructure as Code: k0rdent avoids traditional IaC tools (Terraform, Ansible, Crossplane) in favor of Helm charts and YAML templating, lowering the barrier for non-specialists and reducing the maintenance burden.
- AI Infrastructure Setup Still Takes Weeks: manual provisioning, from GPU driver setup through Kubernetes deployment to service installation, remains a significant drag on development velocity. Template-driven provisioning can reduce this to 15–20 minutes.
- Open-Source Dominance in AI Tooling: PyTorch holds an 80% share in model training; tools such as KServe, llm-d, Kubeflow, and AIBrix are becoming standard in production. The ecosystem is converging on CNCF-graduated projects for production readiness.
- Regulation and Compliance Add Layers of Complexity: data sovereignty, audit trails, and cybersecurity requirements (DORA metrics, CVE management) demand visibility and consistency across multi-cloud environments; automation is non-negotiable.
- India-Native AI Models Are Lacking: the speaker highlighted a critical gap: while global foundation models are advancing rapidly, models trained on Indian languages and data are scarce, representing both a challenge and an opportunity for local AI development.
Notable Quotes or Statements
- "Kubernetes is not just the future—it's the present of intelligent operations." — Satyam (on Kubernetes' readiness for production AI workloads)
- "Modern infrastructure is harder, and without automation we can't iterate faster." — Satyam (on the necessity of automation in multi-cloud AI environments)
- "Developers don't want to learn new tooling and infrastructure beyond their core work. They want to write code, not spend time on platform engineering." — Priti Raj (on why developer experience is a foundational pillar)
- "GPU provisioning on hyperscalers can take 6 weeks, and in some countries, high-end GPUs aren't available at all." — Satyam (on the operational pain points in AI infrastructure)
- "Composability is critical because each team—development, QA, chaos engineering, DevOps—requires different platform configurations. Without composability, you're custom-building for every team." — Bharat (on why template-driven approaches matter)
- "We are lacking India-native AI models compared to the world. This is something we can take away from this session." — Bharat (on regional AI gaps)
Speakers & Organizations Mentioned
| Role | Name | Organization |
|---|---|---|
| Senior Community Manager, CNCF Ambassador | Priti Raj | Mirantis, KCD Bangalore organizer |
| Open Source Program Office (OSPO) Lead | Bharat | Mirantis |
| Software Engineer, OSPO | Satyam | Mirantis |
| (Mentioned contributor) | Kevin | Nvidia |
Organizations Referenced:
- Mirantis: Private-cloud pioneers (Docker Enterprise, OpenStack, Kubernetes, Lens, k0rdent)
- CNCF (Cloud Native Computing Foundation): Sets conformance standards, governs Kubernetes ecosystem
- Cloud Providers: AWS, Azure, GCP, OpenStack
- Hardware Vendors: Nvidia (primary), AMD
- Survey Source: CNCF Cloud-Native Survey (15.6M developers, 52% AI/ML focused)
Technical Concepts & Resources
Key Open-Source Projects & Tools
| Category | Tools | Notes |
|---|---|---|
| Model Training | PyTorch | 80% market share on Hugging Face |
| Inference | KServe, llm-d, AIBrix | CNCF-affiliated; distributed inference, KV cache optimization |
| Workflow Orchestration | Kubeflow | Bridges cloud environments and Kubernetes for AI workflows |
| GPU Management | Nvidia GPU Operator | Handles GPU provisioning and multi-tenancy (time-slicing, multi-instance GPU) |
| Kubernetes Distributions | k0s, k3s, kind, k3d | Lightweight Kubernetes distributions (kind and k3d run clusters in containers) |
| Cluster Provisioning | Cluster API (CAPI) | Standardized cluster lifecycle management across clouds |
| State Management | Sveltos | GitOps-integrated state and service deployment |
| Observability | Prometheus, Grafana, OpenCost, OpenTelemetry | CNCF-standard monitoring, logging, tracing |
| Debugging | K8sGPT | LLM-powered Kubernetes debugging and remediation |
| AI Agents | kagent | Framework for orchestrating agents in Kubernetes |
| Packaging | KitOps | Standardizes DevOps/ML model packaging via OCI artifacts |
| Developer Portals | Backstage, Nebius | Internal developer platforms (IDPs) for standardized workflows |
| API Management | Kong | Cloud-native API gateway |
| Feature Flags | OpenFeature | CNCF standard for dynamic feature toggling in applications |
| Cost Optimization | Kubecost, Cast AI | Cloud cost management |
| Security/Policy | OPA, Kyverno | Policy-as-code for software supply chain |
| Platform as a Service | k0rdent | Composable, template-driven AI infrastructure platform |
| Integrations | MCP Servers | Standardized interfaces for tool integration |
Key Standards & Specifications
- Kubernetes AI Conformance: Ensures portable AI workloads with standardized hardware accelerators, operators, scheduling, security (CVE/audit), and observability
- Dynamic Resource Allocation (GA in Kubernetes 1.34): Extends the claim-based model of dynamic volume provisioning to hardware accelerators such as GPUs
- DORA Metrics: DevOps Research and Assessment framework for evaluating deployment frequency, lead time, mean time to recovery, change failure rate
- OCI (Open Container Initiative): Artifact format for containerized models and DevOps practices
- GitOps Principles: Infrastructure and application state defined in Git repositories with automated synchronization
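The four DORA metrics listed above can be computed directly from deployment records. A minimal sketch, assuming a simple hypothetical record shape (not any particular tool's schema):

```python
from datetime import timedelta

def dora_metrics(deployments, window_days=30):
    """Compute the four DORA metrics from deployment records.

    Each record is a dict with (hypothetical fields):
      lead_time    - timedelta from commit to deploy
      failed       - whether the change caused a production failure
      restore_time - timedelta to recover, for failed changes
    """
    n = len(deployments)
    failures = [d for d in deployments if d["failed"]]
    # Mean of a list of timedeltas, expressed in hours
    mean_h = lambda tds: sum(t.total_seconds() for t in tds) / len(tds) / 3600
    return {
        "deploys_per_day": n / window_days,            # deployment frequency
        "lead_time_hours": mean_h([d["lead_time"] for d in deployments]),
        "change_failure_rate": len(failures) / n,
        "mttr_hours": mean_h([d["restore_time"] for d in failures]) if failures else 0.0,
    }

# Toy data: 4 deploys over 2 days, 2 of which failed
sample = [
    {"lead_time": timedelta(hours=4), "failed": False, "restore_time": None},
    {"lead_time": timedelta(hours=2), "failed": True,  "restore_time": timedelta(hours=1)},
    {"lead_time": timedelta(hours=6), "failed": False, "restore_time": None},
    {"lead_time": timedelta(hours=4), "failed": True,  "restore_time": timedelta(hours=3)},
]
metrics = dora_metrics(sample, window_days=2)
```

In practice these records would be derived from CI/CD and incident-tracking data; the point is that each metric is a simple aggregate once that data is centralized, which is why the talk treats cross-cloud visibility as a prerequisite.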
Hardware & Infrastructure Concepts
- GPU Time-Slicing & Multi-Instance GPU (MIG): Techniques for sharing GPUs among multiple tenants; introduce fragmentation and noisy-neighbor problems
- Bare Metal Provisioning: Direct hardware allocation via infrastructure-as-service layers
- Multi-Cloud/Multi-Cluster Orchestration: Managing workloads across AWS, Azure, GCP, and private cloud simultaneously
- KV Cache Optimization: Inference optimization technique for large language models (memory-efficient serving)
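The KV cache idea above can be sketched in a few lines of pure Python: at each decode step the new token's key/value projections are computed once and appended, so attention runs over all history without re-projecting it. This is a toy illustration with stand-in vectors, not a real model:

```python
import math

class KVCache:
    """Toy per-sequence KV cache: one key/value appended per decoded
    token, so step t attends over t cached positions instead of
    recomputing projections for the whole prefix."""
    def __init__(self):
        self.K, self.V = [], []
    def append(self, k, v):
        self.K.append(k)
        self.V.append(v)

def attend(q, K, V):
    """Single-query scaled dot-product attention over cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)                      # subtract max for numerical stability
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    # Softmax-weighted average of the cached values
    return [sum(wi * v[j] for wi, v in zip(w, V)) / z for j in range(len(V[0]))]

# Decode 5 toy tokens: with the cache, 5 key/value projections total;
# recomputing the prefix each step would cost 1+2+...+5 = 15, i.e.
# O(n^2) work instead of O(n) - the memory-for-compute trade-off
# that makes LLM serving memory-bound.
cache = KVCache()
outputs = []
for t in range(5):
    k = v = q = [float(t), 1.0]          # stand-in projections
    cache.append(k, v)
    outputs.append(attend(q, cache.K, cache.V))
```

Serving systems such as the inference tools named earlier build on this by paging, sharing, and evicting these cached tensors across requests.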
AI Infrastructure Stack Layers (k0rdent Example)
- Infrastructure Layer: Bare metal (via AWS, Azure, OpenStack, or private cloud)
- OS & Container Runtime: k0s + container runtime + GPU drivers
- Kubernetes Layer: Cluster API provisioning, node management
- GPU Provisioning Layer: Nvidia GPU Operator or AMD equivalents
- Service Layer: KServe, KNative, service mesh, observability
- Application Layer: User-facing AI applications, agents
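The layered stack above is what a composable template ultimately renders. A toy sketch of the composition idea, with a hypothetical layer structure (not k0rdent's actual schema): each layer carries defaults, a team overrides only what differs, and composing them yields one ordered provisioning plan.

```python
# Hypothetical layer defaults, mirroring the stack layers listed above.
DEFAULT_LAYERS = {
    "infrastructure": {"provider": "openstack"},
    "runtime":        {"distribution": "k0s", "gpu_drivers": True},
    "kubernetes":     {"provisioner": "cluster-api", "nodes": 3},
    "gpu":            {"operator": "nvidia-gpu-operator"},
    "services":       {"inference": "kserve", "observability": "prometheus"},
}

# Layers must be provisioned bottom-up.
ORDER = ["infrastructure", "runtime", "kubernetes", "gpu", "services"]

def compose(overrides=None):
    """Merge per-team overrides onto layer defaults and return the
    ordered provisioning plan as (layer, config) pairs."""
    overrides = overrides or {}
    return [(layer, {**DEFAULT_LAYERS[layer], **overrides.get(layer, {})})
            for layer in ORDER]

# A QA team swaps the cloud and shrinks the cluster; every other
# layer keeps its defaults - no custom-built platform per team.
qa_plan = compose({"infrastructure": {"provider": "aws"},
                   "kubernetes": {"nodes": 1}})
```

This is the essence of the composability argument in the talk: different teams (development, QA, chaos engineering, DevOps) diff against shared templates instead of maintaining divergent full-stack configurations.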
Additional Context
Current State Statistics (as of talk date)
- 15.6 million cloud-native developers globally
- 52% are AI/ML developers (≈8.1 million)
- 36% already running AI workloads on Kubernetes
- 18% in active planning/development phases
- PyTorch dominance: 80% of Hugging Face model training uses PyTorch
Future Outlook
- Expectation of 75% of cloud-native engineers being AI/ML specialists within a few years
- Growing emphasis on India-native AI models and localized solutions
- Evolution toward autonomous clusters with integrated MCP servers for self-managed observability, cost optimization, and security
Potential Gaps or Limitations in Talk
- Limited real-world case studies: The live demo was cut short; real enterprise examples of k0rdent-based deployments were not provided
- Pricing/cost comparisons: No quantitative comparison of manual vs. template-driven infrastructure costs
- Performance benchmarks: No specific metrics on GPU utilization improvements, latency reductions, or TCO savings
- Regulatory depth: Compliance discussion was high-level; no sector-specific (finance, healthcare) examples
- Non-Nvidia GPU ecosystem: AMD GPU support was mentioned but not deeply explored
