The Intelligent Cloud: How AI Is Transforming the Cloud-Native World
Executive Summary
This talk explores how Kubernetes and cloud-native infrastructure are becoming the foundational operating system for AI workloads, addressing the complexity of deploying intelligent applications at scale. The speakers argue that modern AI infrastructure requires not just distributed computing capabilities but also composable, multi-cloud platforms with unified control planes, and present k0rdent as an open-source solution that standardizes AI infrastructure provisioning.
Key Takeaways
- Kubernetes is no longer just for containerized apps: it is becoming the universal control plane for AI infrastructure, with 52% of cloud-native developers now AI/ML practitioners.
- Multi-tenancy and GPU optimization require orchestration at parity with cluster orchestration: multi-cluster GPU management must sit alongside multi-cluster Kubernetes management to avoid waste and fragmentation.
- Platform engineering for AI is a five-step discipline: developer experience → security → CI/CD → resilience → cost management. Skipping any step leads to operational debt and developer friction.
- Composable, template-driven infrastructure reduces setup time from weeks to minutes, democratizing AI infrastructure and freeing developers to focus on models rather than platform configuration.
- The future is integrated MCP servers: standardized interfaces that let clusters and applications autonomously interact with multiple services (observability, cost management, security) without custom integrations.
Key Topics Covered
- Kubernetes as the OS of the Future: Kubernetes' role as the universal control plane for cloud-native and AI workloads
- Cloud-Native AI Adoption: Current state and growth of AI/ML developers on Kubernetes (52% of 15.6M cloud-native developers)
- Infrastructure Complexity: Multi-cloud, multi-cluster environments and the challenges they introduce
- AI Infrastructure Challenges: Technical complexity, operational efficiency, multi-tenancy, and GPU resource optimization
- Kubernetes AI Conformance: Standards ensuring portable AI workloads across different Kubernetes distributions
- Open-Source AI Pillars: Training, inference, and agents as the three core components of modern AI stacks
- Platform Engineering for AI: A five-step framework for building resilient, AI-native platforms
- GPU Provisioning & Multi-Tenancy: Solutions for efficient GPU utilization across multiple tenants
- k0rdent Project: A composable, template-based approach to infrastructure as a service for AI-ready Kubernetes
- MCP Servers: Standardized interfaces for integrating cloud-native tooling
Key Points & Insights
- Kubernetes Adoption in AI Is Accelerating: 36% of cloud-native developers already run AI workloads on Kubernetes, with a further 18% in active planning phases, demonstrating rapid mainstream adoption beyond traditional use cases.
- Multi-Cloud Complexity Has Become the Norm: organizations are no longer single-cloud or single-cluster; they manage dozens or hundreds of clusters across multiple regions and providers. Traditional configuration management (hand-written YAML, IaC tools) becomes unscalable and fragmented at that scale.
- GPU Resource Management Is a Critical Bottleneck: high-end GPUs cost over $20,000 each and face global supply constraints (6+ weeks to provision on hyperscalers in some regions). Multi-tenancy, time-slicing, and fragmentation are major challenges without proper orchestration.
- Kubernetes AI Conformance Addresses Portability: the CNCF-backed Kubernetes AI Conformance initiative ensures AI workloads are portable across GKE, AKS, and private-cloud environments through standardized hardware-accelerator, operator, and security requirements.
- The Five-Pillar Platform Engineering Framework: successful AI platforms require (1) developer experience, (2) security/compliance, (3) a CI/CD foundation, (4) resilience engineering, and (5) cost management, not just infrastructure.
- Composability Over Infrastructure as Code: k0rdent avoids traditional IaC tools (Terraform, Ansible, Crossplane) in favor of Helm charts and YAML templating, lowering the barrier for non-specialists and reducing the maintenance burden.
- AI Infrastructure Setup Still Takes Weeks: manual provisioning, from GPU driver setup through Kubernetes deployment to service installation, remains a significant drag on development velocity. Template-driven provisioning can reduce this to 15–20 minutes.
- Open-Source Dominance in AI Tooling: PyTorch holds an 80% share in model training; tools such as KServe, llm-d, Kubeflow, and AIBrix are becoming standard in production. The ecosystem is converging on CNCF-graduated projects for production readiness.
- Regulation and Compliance Add Layers of Complexity: data sovereignty, audit trails, and cybersecurity requirements (DORA metrics, CVE management) demand visibility and consistency across multi-cloud environments; automation is non-negotiable.
- India-Native AI Models Are Lacking: the speaker highlighted a critical gap: while global foundation models are advancing rapidly, models trained on Indian languages and data are scarce, representing both a challenge and an opportunity for local AI development.
Notable Quotes or Statements
- "Kubernetes is not just the future—it's the present of intelligent operations." — Satyam (on Kubernetes' readiness for production AI workloads)
- "Modern infrastructure is harder, and without automation we can't iterate faster." — Satyam (on the necessity of automation in multi-cloud AI environments)
- "Developers don't want to learn new tooling and infrastructure beyond their core work. They want to write code, not spend time on platform engineering." — Priti Raj (on why developer experience is a foundational pillar)
- "GPU provisioning on hyperscalers can take 6 weeks, and in some countries, high-end GPUs aren't available at all." — Satyam (on the operational pain points in AI infrastructure)
- "Composability is critical because each team—development, QA, chaos engineering, DevOps—requires different platform configurations. Without composability, you're custom-building for every team." — Bharat (on why template-driven approaches matter)
- "We are lacking India-native AI models compared to the world. This is something we can take away from this session." — Bharat (on regional AI gaps)
Speakers & Organizations Mentioned
| Role | Name | Organization |
|---|---|---|
| Senior Community Manager, CNCF Ambassador | Priti Raj | Mirantis, KCD Bangalore organizer |
| Open Source Program Office (OSPO) Lead | Bharat | Mirantis |
| Software Engineer, OSPO | Satyam | Mirantis |
| (Mentioned contributor) | Kevin | Nvidia |
Organizations Referenced:
- Mirantis: Private-cloud pioneers (Docker Enterprise, OpenStack, Kubernetes, Lens, k0rdent)
- CNCF (Cloud Native Computing Foundation): Sets conformance standards, governs Kubernetes ecosystem
- Cloud Providers: AWS, Azure, GCP, OpenStack
- Hardware Vendors: Nvidia (primary), AMD
- Survey Source: CNCF Cloud-Native Survey (15.6M developers, 52% AI/ML focused)
Technical Concepts & Resources
Key Open-Source Projects & Tools
| Category | Tools | Notes |
|---|---|---|
| Model Training | PyTorch | 80% market share on Hugging Face |
| Inference | KServe, llm-d, AIBrix | CNCF-affiliated; distributed inference, KV cache optimization |
| Workflow Orchestration | Kubeflow | Bridges cloud environments and Kubernetes for AI workflows |
| GPU Management | Nvidia GPU Operator | Handles GPU provisioning and multi-tenancy (time-slicing, multi-instance GPU) |
| Kubernetes Distributions | k0s, k3s, kind, k3d | Lightweight Kubernetes distributions (kind and k3d run clusters in containers) |
| Cluster Provisioning | Cluster API (CAPI) | Standardized cluster lifecycle management across clouds |
| State Management | Sveltos | GitOps-integrated state and service deployment |
| Observability | Prometheus, Grafana, OpenCost, OpenTelemetry | CNCF-standard monitoring, logging, tracing |
| Debugging | K8sGPT | LLM-powered Kubernetes debugging and remediation |
| AI Agents | kagent | Framework for orchestrating agents in Kubernetes |
| Packaging | KitOps | Standardizes DevOps/ML model packaging via OCI artifacts |
| Developer Portals | Backstage, Nebius | Internal developer platforms (IDPs) for standardized workflows |
| API Management | Kong | Cloud-native API gateway |
| Feature Flags | OpenFeature | CNCF standard for dynamic feature toggling in applications |
| Cost Optimization | Kubecost, Cast AI | Cloud cost management |
| Security/Policy | OPA, Kyverno | Policy-as-code for software supply chain |
| Platform as a Service | k0rdent | Composable, template-driven AI infrastructure platform |
| Integrations | MCP Servers | Standardized interfaces for tool integration |
Key Standards & Specifications
- Kubernetes AI Conformance: Ensures portable AI workloads with standardized hardware accelerators, operators, scheduling, security (CVE/audit), and observability
- Dynamic Resource Allocation (GA in Kubernetes 1.34): Extends the claim-based model of dynamic volume provisioning to hardware accelerators such as GPUs
- DORA Metrics: DevOps Research and Assessment framework for evaluating deployment frequency, lead time, mean time to recovery, change failure rate
- OCI (Open Container Initiative): Artifact format for containerized models and DevOps practices
- GitOps Principles: Infrastructure and application state defined in Git repositories with automated synchronization
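The four DORA metrics listed above can be computed directly from deployment records. A minimal sketch, assuming a simple hypothetical record shape (not any particular tool's schema):

```python
from datetime import timedelta

def dora_metrics(deployments, window_days=30):
    """Compute the four DORA metrics from deployment records.

    Each record is a dict with (hypothetical fields):
      lead_time    - timedelta from commit to deploy
      failed       - whether the change caused a production failure
      restore_time - timedelta to recover, for failed changes
    """
    n = len(deployments)
    failures = [d for d in deployments if d["failed"]]
    # Mean of a list of timedeltas, expressed in hours
    mean_h = lambda tds: sum(t.total_seconds() for t in tds) / len(tds) / 3600
    return {
        "deploys_per_day": n / window_days,            # deployment frequency
        "lead_time_hours": mean_h([d["lead_time"] for d in deployments]),
        "change_failure_rate": len(failures) / n,
        "mttr_hours": mean_h([d["restore_time"] for d in failures]) if failures else 0.0,
    }

# Toy data: 4 deploys over 2 days, 2 of which failed
sample = [
    {"lead_time": timedelta(hours=4), "failed": False, "restore_time": None},
    {"lead_time": timedelta(hours=2), "failed": True,  "restore_time": timedelta(hours=1)},
    {"lead_time": timedelta(hours=6), "failed": False, "restore_time": None},
    {"lead_time": timedelta(hours=4), "failed": True,  "restore_time": timedelta(hours=3)},
]
metrics = dora_metrics(sample, window_days=2)
```

In practice these records would be derived from CI/CD and incident-tracking data; the point is that each metric is a simple aggregate once that data is centralized, which is why the talk treats cross-cloud visibility as a prerequisite.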
Hardware & Infrastructure Concepts
- GPU Time-Slicing & Multi-Instance GPU (MIG): Techniques for sharing GPUs among multiple tenants; introduce fragmentation and noisy-neighbor problems
- Bare Metal Provisioning: Direct hardware allocation via infrastructure-as-service layers
- Multi-Cloud/Multi-Cluster Orchestration: Managing workloads across AWS, Azure, GCP, and private cloud simultaneously
- KV Cache Optimization: Inference optimization technique for large language models (memory-efficient serving)
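The KV cache idea above can be sketched in a few lines of pure Python: at each decode step the new token's key/value projections are computed once and appended, so attention runs over all history without re-projecting it. This is a toy illustration with stand-in vectors, not a real model:

```python
import math

class KVCache:
    """Toy per-sequence KV cache: one key/value appended per decoded
    token, so step t attends over t cached positions instead of
    recomputing projections for the whole prefix."""
    def __init__(self):
        self.K, self.V = [], []
    def append(self, k, v):
        self.K.append(k)
        self.V.append(v)

def attend(q, K, V):
    """Single-query scaled dot-product attention over cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)                      # subtract max for numerical stability
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    # Softmax-weighted average of the cached values
    return [sum(wi * v[j] for wi, v in zip(w, V)) / z for j in range(len(V[0]))]

# Decode 5 toy tokens: with the cache, 5 key/value projections total;
# recomputing the prefix each step would cost 1+2+...+5 = 15, i.e.
# O(n^2) work instead of O(n) - the memory-for-compute trade-off
# that makes LLM serving memory-bound.
cache = KVCache()
outputs = []
for t in range(5):
    k = v = q = [float(t), 1.0]          # stand-in projections
    cache.append(k, v)
    outputs.append(attend(q, cache.K, cache.V))
```

Serving systems such as the inference tools named earlier build on this by paging, sharing, and evicting these cached tensors across requests.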
AI Infrastructure Stack Layers (k0rdent Example)
- Infrastructure Layer: Bare metal (via AWS, Azure, OpenStack, or private cloud)
- OS & Container Runtime: k0s + container runtime + GPU drivers
- Kubernetes Layer: Cluster API provisioning, node management
- GPU Provisioning Layer: Nvidia GPU Operator or AMD equivalents
- Service Layer: KServe, KNative, service mesh, observability
- Application Layer: User-facing AI applications, agents
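The layered stack above is what a composable template ultimately renders. A toy sketch of the composition idea, with a hypothetical layer structure (not k0rdent's actual schema): each layer carries defaults, a team overrides only what differs, and composing them yields one ordered provisioning plan.

```python
# Hypothetical layer defaults, mirroring the stack layers listed above.
DEFAULT_LAYERS = {
    "infrastructure": {"provider": "openstack"},
    "runtime":        {"distribution": "k0s", "gpu_drivers": True},
    "kubernetes":     {"provisioner": "cluster-api", "nodes": 3},
    "gpu":            {"operator": "nvidia-gpu-operator"},
    "services":       {"inference": "kserve", "observability": "prometheus"},
}

# Layers must be provisioned bottom-up.
ORDER = ["infrastructure", "runtime", "kubernetes", "gpu", "services"]

def compose(overrides=None):
    """Merge per-team overrides onto layer defaults and return the
    ordered provisioning plan as (layer, config) pairs."""
    overrides = overrides or {}
    return [(layer, {**DEFAULT_LAYERS[layer], **overrides.get(layer, {})})
            for layer in ORDER]

# A QA team swaps the cloud and shrinks the cluster; every other
# layer keeps its defaults - no custom-built platform per team.
qa_plan = compose({"infrastructure": {"provider": "aws"},
                   "kubernetes": {"nodes": 1}})
```

This is the essence of the composability argument in the talk: different teams (development, QA, chaos engineering, DevOps) diff against shared templates instead of maintaining divergent full-stack configurations.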
Additional Context
Current State Statistics (as of talk date)
- 15.6 million cloud-native developers globally
- 52% are AI/ML developers (≈8.1 million)
- 36% already running AI workloads on Kubernetes
- 18% in active planning/development phases
- PyTorch dominance: 80% of Hugging Face model training uses PyTorch
Future Outlook
- Expectation of 75% of cloud-native engineers being AI/ML specialists within a few years
- Growing emphasis on India-native AI models and localized solutions
- Evolution toward autonomous clusters with integrated MCP servers for self-managed observability, cost optimization, and security
Potential Gaps or Limitations in Talk
- Limited real-world case studies: The live demo was cut short; real enterprise examples of k0rdent-based deployments were not provided
- Pricing/cost comparisons: No quantitative comparison of manual vs. template-driven infrastructure costs
- Performance benchmarks: No specific metrics on GPU utilization improvements, latency reductions, or TCO savings
- Regulatory depth: Compliance discussion was high-level; no sector-specific (finance, healthcare) examples
- Non-Nvidia GPU ecosystem: AMD GPU support was mentioned but not deeply explored
