The Engines of Intelligence – Scaling AI Infrastructure & Security
Executive Summary
This panel discussion from an AI summit addresses the critical infrastructure, networking, security, and operational challenges required to scale AI systems safely and sustainably. Panelists emphasize that networking, security, and organizational operating models are frequently overlooked yet essential components of AI infrastructure—equally important as compute resources—and that India must develop sovereign AI capabilities through open standards, skilled workforce development, and balanced policy frameworks rather than complete isolation from global technologies.
Key Takeaways
- Networking and security are engines, not afterthoughts: Treat them as first-class infrastructure concerns from day one, not optional additions. They directly impact training speed, inference latency, and attack resilience.
- India must invest in three areas immediately: (a) skill development for AI networking and infrastructure engineering; (b) power and cooling infrastructure with inter-state policy coordination; (c) open-standards-based platform development (Ethernet, RoCE, modular APIs) to avoid vendor lock-in.
- Observability is the foundation of operational resilience: Without visibility across network and ML layers, you cannot diagnose or fix problems at scale. Invest in telemetry first, keep it domestic, and integrate network and ML observability rather than treating them separately.
- Data sovereignty requires regulatory discipline and trusted supply chains: Develop homegrown LLMs and infrastructure where sensitive workloads run, but acknowledge a 5–10 year timeline for full self-sufficiency. Use government verification portals for supply-chain trust while maintaining interoperability with open standards.
- Security requires behavioral intelligence and organizational discipline: Move beyond signature-based detection to AI/ML anomaly detection. Align organizational culture, operating models, and governance to enforce secure-by-design principles. This is a people and process problem, not just a technology problem.
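The "integrate network and ML observability" takeaway can be sketched as a tiny correlation step: network and training telemetry joined on a shared step index, so a slow training step can be matched to a network signal in one place. All names and thresholds here (`NetSample`, `pkt_loss_pct`, `loss_threshold`) are illustrative assumptions, not anything the panel specified.

```python
from dataclasses import dataclass

@dataclass
class NetSample:
    step: int             # training step during which this was measured
    pkt_loss_pct: float   # packet loss on the GPU fabric, percent
    p99_latency_us: float # tail latency on the fabric, microseconds

@dataclass
class MlSample:
    step: int
    step_time_s: float    # wall-clock time for this training step

def correlate(net, ml, loss_threshold=0.5):
    """Flag training steps that coincide with elevated packet loss."""
    net_by_step = {s.step: s for s in net}
    flagged = []
    for m in ml:
        n = net_by_step.get(m.step)
        if n and n.pkt_loss_pct > loss_threshold:
            flagged.append((m.step, m.step_time_s, n.pkt_loss_pct))
    return flagged
```

The point of the sketch is the join itself: with the two telemetry streams kept in separate silos, the same diagnosis requires manually lining up timestamps across tools.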
Key Topics Covered
- Networking as a Critical Layer: How networking has evolved from "plumbing" to "engine" for AI training and inference at scale
- Security as a Design Principle: The necessity of "secure by design" rather than treating security as an afterthought
- Infrastructure Challenges in India/South Asia: Adoption gaps, skill deficits, power/cooling constraints, and lack of standardized policy frameworks
- Sovereignty & Data Localization: Balancing national data sovereignty with practical reliance on global supply chains and open standards
- Observability & Telemetry: The critical importance of visibility across network and ML systems to detect and troubleshoot issues
- Threat Models for AI Systems: New attack surfaces including prompt injection, data poisoning, API vulnerabilities, and behavioral anomalies
- Operating Models & Governance: Organizational structure, risk ownership, and coordination across IT, legal, infrastructure, and leadership teams
- Hardware & Supply Chain Constraints: The reality of component sourcing, manufacturing timelines, and the path toward "Make in India" infrastructure
Key Points & Insights
- Networking is no longer ancillary infrastructure: Studies show 20–50% of AI training time is spent on network traffic. Network architecture, topology, and cabling failures are now primary bottlenecks, not afterthoughts.
- Ethernet + RDMA standardization is replacing proprietary protocols: The industry is converging on Ethernet with technologies such as RDMA and RoCEv2. This enables interoperability and redeployment across infrastructure while avoiding the vendor lock-in of legacy systems built on proprietary InfiniBand.
- Security requires behavioral/anomaly detection, not signature matching: AI-driven attacks evolve and camouflage themselves, so traditional signature-based detection fails. AI/ML behavioral-baseline models must detect even subtle deviations, and infer intent, before attacks manifest.
- Observability is non-negotiable at multi-node scale: As soon as workloads span multiple nodes, combined network and ML observability becomes mandatory. "If you can't see it, you can't troubleshoot it." Packet loss, packet buffering, latency, and tail latency are critical signals.
- Operating model and risk ownership are major blockers: Organizations struggle to assign responsibility clearly across IT, legal, infrastructure, and leadership. Unclear ownership makes organizations avoid taking on risk, delaying deployment and security hardening.
- India lags in adoption of evolved standards: While the US has moved to Ethernet, many Indian organizations still hesitate to adopt it or cling to legacy InfiniBand. Skill gaps in AI networking and infrastructure engineering persist.
- Power and cooling are India's binding constraint: AI datacenters require extreme per-rack kilowatt density that most existing Indian datacenters cannot meet, and only a few geographic pockets can host large-scale AI workloads. Policy-level coordination on power generation, especially renewable, is needed.
- Data sovereignty and API security are emerging imperatives: Every system is now a network of APIs, and undocumented ("shadow") and zombie APIs create attack surfaces. Data exfiltration, prompt injection, and model poisoning happen silently. Telemetry must remain within national borders; dependence on foreign LLMs for sensitive workloads represents an unquantified risk.
- Sovereignty ≠ isolation; it means control, auditability, and standards-based design: Practical sovereignty requires transparent architecture, auditable supply chains (via government portals), firmware verification, and open standards, not complete self-sufficiency. Chipset manufacturing takes 2–3 years, so short-term dependence on global components is unavoidable.
- Secure-by-design culture and dual-speed execution matter: Security cannot be bolted on later. It requires organizational culture (training, empowerment, a quality mindset) and running two workstreams simultaneously: sprints (monetizable quick wins) and marathons (long-term hardening), unless regulation mandates otherwise.
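The behavioral-baseline idea behind "anomaly detection, not signature matching" can be illustrated with a toy statistical model: learn what "normal" looks like for some traffic metric, then flag deviations. Production systems use far richer ML models than a mean and standard deviation; the metric, window, and `k=3` threshold here are all assumptions for illustration.

```python
import statistics

def fit_baseline(samples):
    """Learn a (mean, stddev) baseline from a window of normal traffic."""
    return statistics.fmean(samples), statistics.pstdev(samples)

def is_anomalous(value, baseline, k=3.0):
    """Flag a value more than k standard deviations from the baseline."""
    mean, std = baseline
    if std == 0:
        return value != mean
    return abs(value - mean) / std > k
```

The contrast with signature matching is that nothing here encodes a known attack: any behavior outside the learned envelope is surfaced, including novel or camouflaged activity.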
Notable Quotes or Statements
- "Networking stops being plumbing and starts being the engine" – Reflects the fundamental shift in how AI scales; network performance directly limits training and inference speed.
- "If you can't see it, you can't troubleshoot it. And you're troubleshooting all the time." – Clay's statement on observability as a non-negotiable foundation.
- "Security cannot be a gate at the end. It has to be woven into the fabric from the grassroots level." – Articulates the shift from perimeter defense to secure-by-design culture.
- "20 to 50% of training time is spent on network traffic" – Quantifies why networking is now a primary bottleneck, not secondary.
- "Cables were probably the largest problem we have with builds" – A 10,000-GPU cluster has 30,000 cables; if 50% fail, you lose half your capacity. Physical infrastructure discipline matters.
- "You can't isolate yourself from the world right away" – Acknowledges that complete self-sufficiency is unrealistic; sovereignty is about control, audit, and open standards, not isolation.
- "Attacks don't bring down infrastructure in one go; they operate silently—prompt injection, data poisoning, data exfiltration happening in the back end without anybody noticing." – Highlights new threat models requiring behavioral detection.
- "Why are we giving our code to ChatGPT? That's data exploitation." – Points to the risk of sensitive workloads running on foreign LLMs.
- "Manufacturing chipsets takes 2–3 years and requires dependent components. We have to be phased and strategic, not isolated." – Pragmatic stance on why India cannot achieve full self-sufficiency overnight.
- "Sovereignty threat is real—we've seen it before (1990s nuclear tests and denied party lists). We need to act, but it should have started 20 years ago." – Historical warning that geopolitical risks are concrete, not theoretical.
Speakers & Organizations Mentioned
- Clay – Network/infrastructure expert (Arista Networks implied based on context; 5-year tenure)
- Laxmi/Lakshmi – Security/threat detection expert; discussed real incidents involving Indian airports and financial institutions during state-sponsored attacks
- Renault/Rahul – Infrastructure/India & South Asia perspective; discussed adoption challenges in the region
- Midnesh – Works for an American company with Indian CEO; involved in "Make in India" manufacturing; mentioned setting up factories and component sourcing
- Moderator – Referred to as a consultant helping clients navigate AI scaling; identified as Rahul at the end
- IBM – Mentioned as exhibiting quantum computing hardware at the summit
- Arista Networks – Implied as a panelist's organization; known for Ethernet-based networking
- Broadcom – Referenced as a chipset manufacturer enabling modern Ethernet capabilities (RDMA, RoCEv2)
- Tesla, Oracle, Meta, Microsoft – Cited as large-scale AI deployment leaders who have adopted Ethernet
- Government of India bodies:
- Trusted telecom portal (supply chain verification)
- DPDP (Data Protection) compliance mentioned
- RBI (Reserve Bank of India) and SEBI (Securities & Exchange Board of India) guidelines referenced
- European Union – Referenced for GDPR compliance and stricter regulatory frameworks
- Pakistan – Mentioned in context of cyber attack attribution and geopolitical threats
Technical Concepts & Resources
Networking & Infrastructure
- Ethernet + RDMA (Remote Direct Memory Access): Standard replacing proprietary InfiniBand for low-latency GPU-to-GPU communication
- RoCEv2 (RDMA over Converged Ethernet v2): Modern standard enabling high-performance networking over Ethernet
- InfiniBand (legacy): Proprietary protocol historically used in HPC; being phased out in favor of Ethernet
- 1:1 network (no oversubscription): Full bisection bandwidth architecture; sometimes overprovisioned to add redundancy
- Packet buffering and packet loss: Primary metrics for network health at scale
- Ultra Ethernet Consortium: Industry standardization body for next-generation Ethernet standards (open architecture)
- Telemetry & observability: Network packet analysis, GPU metrics, latency/tail latency, floating-point anomalies
- Multi-node training: Training spanning more than one node; marks the threshold at which observability becomes mandatory
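A small illustration of why the telemetry bullets single out tail latency: in synchronized multi-node training, every step waits for the slowest flow, so a percentile such as p99 matters more than the average. A minimal nearest-rank percentile (an illustrative helper, not any specific monitoring tool's API):

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[rank]
```

For example, a fabric with a mean latency well under budget can still stall training every step if its p99 is an order of magnitude higher, which is why tail latency and packet loss are listed above as the primary health metrics.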
Security & Threat Detection
- Behavioral/anomaly detection: ML-based baseline modeling to detect deviations from normal traffic patterns (vs. signature matching)
- Prompt injection: Malicious input to LLMs to manipulate reasoning or extract sensitive information
- Data poisoning: Corrupting training data to degrade model performance or inject backdoors
- Data exfiltration: Unauthorized extraction of sensitive data or model reasoning chains
- Zombie/shadow APIs: Undocumented or legacy API endpoints creating uncontrolled attack surfaces
- DDoS (Distributed Denial of Service): State-sponsored or activist group attacks; incident cited: 2023 April attacks on 6 Indian airports
- Secure-by-design: Architectural principle of embedding security at all layers (network, API gateway, application, model) from inception
- Firmware verification: Auditing firmware to detect backdoors or unauthorized modifications
- API gateway security: Deep inspection of HTTP traffic and API payloads for malicious intent
- Network as single source of truth: All traffic flows through the network, making it ideal vantage point for security observability
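One way to make the shadow/zombie-API distinction concrete is to diff the endpoints observed in live traffic (the network as single source of truth) against a documented inventory such as an OpenAPI spec. This sketch assumes both lists are available as plain paths; the paths themselves are illustrative.

```python
def classify_endpoints(observed, documented):
    """Split endpoints into shadow (seen, undocumented) and
    zombie (documented, never seen) candidates."""
    observed, documented = set(observed), set(documented)
    return {
        "shadow": sorted(observed - documented),  # uncontrolled attack surface
        "zombie": sorted(documented - observed),  # stale, likely unpatched
    }
```

Usage: `classify_endpoints(["/v1/chat", "/internal/debug"], ["/v1/chat", "/v1/old"])` reports `/internal/debug` as a shadow candidate and `/v1/old` as a zombie candidate; both then warrant gateway-level inspection or retirement.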
Governance & Compliance
- DPDP (Digital Personal Data Protection): India's emerging data protection regulation analogous to GDPR
- GDPR (General Data Protection Regulation): European standard; organizations responsible even for third-party vendor breaches
- SEBI/RBI guidelines: India's financial sector compliance frameworks
- Trusted Telecom Portal: Government supply chain verification mechanism for telecom/infrastructure components
- Operating model: Organizational structure defining role clarity, risk ownership, and cross-functional accountability
AI-Specific Concepts
- LLM (Large Language Model): ChatGPT, Gemini, homegrown Indian models discussed as alternatives to foreign dependency
- Job Completion Time (JCT): Metric used to measure training/inference performance; network and security directly impact JCT
- Multi-node low-latency training: Distributed training across GPUs requiring synchronization and high-bandwidth interconnects
- Model serving APIs: REST/gRPC endpoints exposing trained models; now the primary attack surface in production AI systems
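Job Completion Time can be made concrete with a toy model: a synchronized job completes only when its slowest worker does, and the network's contribution to JCT is whatever time is not spent computing. Both helpers below are illustrative, with simulated rather than measured durations.

```python
def job_completion_time(worker_durations_s):
    """JCT of a synchronized job = the slowest worker's duration."""
    return max(worker_durations_s)

def network_overhead_fraction(compute_s, total_s):
    """Share of JCT not spent computing (transfer, stalls, stragglers)."""
    return (total_s - compute_s) / total_s
```

This is why the summary ties network and security directly to JCT: one straggler, whether caused by a flaky cable or an inspection bottleneck, sets the completion time for the entire job.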
Regional/Geopolitical Context
- Make in India: Government initiative to develop domestic manufacturing and intellectual property
- Denied Party List: Trade restriction mechanism (referenced from 1990s nuclear tests incident) preventing exports to sanctioned entities
- Pokhran-II: 1998 Indian nuclear tests; immediately triggered US sanctions and export restrictions
- GDPR vs. India's liberal stance: European regulators enforce strict compliance; India historically more permissive but evolving toward stricter governance
- "One District One Product": Referenced as a neighbor's (Pakistan's implied) successful regional specialization model that India could emulate for AI infrastructure clusters
Additional Context
- Summit Location: India (evident from references to Indian organizations, policy bodies, and government initiatives; likely held in New Delhi or Bangalore)
- Audience: Mix of infrastructure engineers, security professionals, policymakers, and enterprise IT leaders
- Event Scale: Multiple parallel sessions, exhibition floor, international attendees from major global tech companies
- Tone: Pragmatic, acknowledging both India's potential and current constraints; advocates for strategic investment rather than isolationism
