
AI in Modern Drug Discovery: From Genes to Clinical Trials | India AI Impact Summit 2026


Executive Summary

This panel discussion explores how AI and open standards can accelerate drug discovery in India, with emphasis on leveraging India's unique population diversity for genomic research and clinical trial optimization. The speakers argue that India possesses structural advantages, including consanguineous populations and a massive software talent pool, that position it uniquely to democratize AI-driven drug discovery while maintaining ethical data governance through federated learning approaches.

Key Takeaways

  1. India can lead genomic drug discovery if it acts now: Leverage endogamous populations, build federated data networks, and hire world-class bioinformaticians from India's software talent pool—this is a structural advantage equivalent to rare earth elements but vastly more valuable.

  2. Data collection and governance are more important than algorithm innovation: Multiple speakers emphasized that the bottleneck is not AI capability but access to well-structured, ethically-sourced, diverse data. Solve the data problem first; the AI will follow.

  3. Open standards enable scale without centralization: Federated learning, MCP (Model Context Protocol), and standards-based architectures allow hospitals, pharma companies, and researchers to collaborate without surrendering proprietary data or intellectual property, but this requires regulatory incentives and clear business models.

  4. Clinical trials can be accelerated through AI-powered patient discovery: Treating eligibility matching as a semantic search/embedding problem can address the enrollment shortfall, given that 80% of trials fail to meet their enrollment targets. This is immediately actionable for Indian hospitals and trial networks.

  5. Regulation should promote, not hinder, collaborative AI in health: Current frameworks (DPDP, informed consent, anonymization) are incomplete. Regulators must clarify pathways for federated learning, synthetic data validation, and cross-institutional data sharing to avoid chilling innovation while protecting patient rights.


Key Topics Covered

  • Open standards and democratization of AI — How open standards (Docker, Kubernetes, MCP) enable broader participation in AI development beyond large corporations
  • Identity and security in agentic AI systems — OAuth 2.1, authorization scopes, traceability, and preventing prompt injection attacks
  • Data governance and federated learning — Addressing data silos, privacy concerns, and the DPDP Act 2023 while enabling collaborative research
  • Genomic diversity in drug discovery — Why India's endogamous and consanguineous populations provide 100-1000x data efficiency advantages
  • Clinical trial challenges and AI solutions — Using AI to identify eligible patients from unstructured EHR data; 80% of trials fail to meet enrollment targets
  • AI's role vs. human expertise — Why AI augments rather than replaces physicians and domain experts
  • Synthetic data in medical research — Benefits and limitations; purpose-built datasets vs. general synthetic data
  • Regulatory frameworks — FDA/EMA requirements, informed consent, ABDM compliance, ethical oversight
  • Data infrastructure challenges — Unstructured medical notes, acronym disambiguation, and DPDP anonymization gaps

Key Points & Insights

  1. Open standards as equalizers: Docker, Kubernetes, and emerging standards like MCP (Model Context Protocol) from Anthropic enable smaller organizations and developers to build on established platforms without vendor lock-in. This mirrors how the Linux ecosystem enabled diverse organizations to build custom flavors (distributions).

  2. India's demographic advantage for genomics: India's 1.4 billion population, combined with endogamous pockets and consanguinity, provides 100-1000x greater statistical power to detect gene variants. This means India needs 1/100th to 1/1000th the sample size required in genetically diverse populations like Europe—a massive structural advantage.

  3. Global genomic data is severely skewed: Over 80% of genomic and medical data comes from people of European (primarily North European) ancestry. This homogeneity makes it harder to identify causal variants and creates regulatory/ethical problems—FDA and EMA now require diverse populations for drug approvals.

  4. Clinical trial enrollment is a critical bottleneck: 80% of clinical trials fail to meet enrollment targets. The current workflow relies on floor managers manually querying records to identify eligible patients. AI can recast this as a semantic search problem by embedding patient medical records and matching them against vectors that represent the eligibility criteria.

  5. Unstructured EHR data is the real obstacle: 80%+ of clinical information in India exists as unstructured narrative notes, not structured databases. Acronyms are overloaded (MI = myocardial infarction OR mitral insufficiency). Large language models show promise but this remains unsolved at scale.

  6. Federated learning works but lacks incentives: Seven major pharma companies tested federated approaches where they trained locally and shared only metadata/meta-analysis results—protecting proprietary libraries while advancing collective knowledge. But individual hospitals lack incentives to participate without clear benefit or regulatory requirements.

  7. Synthetic data is purpose-built, not universal: Synthetic patient records cannot be created as general-purpose datasets because they require sampling from high-dimensional distributions (20,000-30,000 parameters). Synthetic data only works when constrained to specific prediction problems (mortality, myocardial infarction, etc.).

  8. Data collection is expensive but necessary: Meaningful drug discovery requires ~100 million well-phenotyped and genotyped individuals. Purpose-collected data (with informed consent for specific biomarkers) is preferable to retrospective repurposing, though much more expensive upfront.

  9. Identity and traceability in agentic systems: As AI agents make tool calls on behalf of users, maintaining audit trails of who initiated requests and which permissions were granted becomes critical for security and compliance (especially for sensitive actions like e-commerce purchases or medical data access).

  10. DPDP Act 2023 lacks enforcement teeth: India's data protection law is "a statement of good intent" without regulatory guidance on practical implementation—particularly around anonymization techniques and consent mechanisms. Government initiatives like ABDM have low hospital compliance due to operational barriers.
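
Point 6 above describes sites training locally and sharing only derived results. A minimal sketch of one common variant, federated averaging, is shown here under stated assumptions: the linear model, the learning rate, the two sites, and their data are all hypothetical, and the panel did not specify which algorithm the pharma consortium used. Only model weights leave each site; the raw records never do.

```python
from typing import List, Tuple

Vector = List[float]
Site = List[Tuple[Vector, float]]  # (features, target) pairs held privately by one site


def local_update(weights: Vector, data: Site, lr: float = 0.3, epochs: int = 5) -> Vector:
    """Run a few gradient-descent steps on one site's private data (linear model, squared error)."""
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / len(data)
        w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return w


def federated_average(updates: List[Vector], sizes: List[int]) -> Vector:
    """Combine site models into one global model, weighting each by its local sample count."""
    total = sum(sizes)
    dim = len(updates[0])
    return [sum(u[j] * n for u, n in zip(updates, sizes)) / total for j in range(dim)]


def train(sites: List[Site], rounds: int = 200) -> Vector:
    """Each round: broadcast global weights, train locally, average the returned weights."""
    w = [0.0, 0.0]
    for _ in range(rounds):
        updates = [local_update(w, data) for data in sites]
        w = federated_average(updates, [len(d) for d in sites])
    return w


# Hypothetical, noise-free data following y = 2*x + 1; features are [x, 1.0] (slope, intercept).
site_a: Site = [([0.0, 1.0], 1.0), ([0.5, 1.0], 2.0)]
site_b: Site = [([0.25, 1.0], 1.5), ([0.75, 1.0], 2.5), ([1.0, 1.0], 3.0)]

global_w = train([site_a, site_b])
print(global_w)  # slope and intercept recovered without pooling raw records
```

In a real deployment the averaging step would run on a coordinating server, and shared weights can still leak information, so this sketch says nothing about the privacy guarantees or incentives the panel flagged as open problems.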


Notable Quotes or Statements

"If this was a debate and we were having a debate, I would say this house believes that India has two significant advantages over the rest of the world in doing drug discovery in AI... endogamous pockets of population... a hundred to a thousand times [data efficiency advantage]." — Parag Parikh (Venture Capitalist)

"We need to think about how do we not just get the data but how can we meaningfully use it in a way that protects the participants and the country... a federated educational approach where we bring the AI in." — Dr. Jonathan Picker (Harvard Geneticist)

"80% of clinical trials do not make their enrollment targets... there is no way that you can accelerate drug development or try to do this for a 100 therapies in one go... this is really where AI can step in and be a really big game changer." — Vibhu Agarwal (Mimansa AI, Clinical Trials)

"As soon as open telemetry came in as an open standard, the whole observability ecosystem is now stabilizing... the power of open source and open standards that they bring in." — Sam Bartuk (vCluster, Head of Developer Relations)

"Synthetic data is one of the most poorly understood notions to have arrived on the scene... synthetic data sets are purpose-built... those kind of data sets are good." — Vibhu Agarwal

"AI is going to make a fantastic partner and do a number of things. It's going to decrease the error rate significantly... Where I think AI needs physicians is in relating the data to the reality of the person." — Dr. Jonathan Picker

"I think synthetic data is... [a problem because] you're actually sampling from a ginormous multivaried distribution... It is non-trivial which is why a synthetic data set which mirrors a patient... doesn't exist." — Vibhu Agarwal


Speakers & Organizations Mentioned

Primary Speakers:

  • Parag Parikh — Venture capitalist in his fifth decade of investing; background in biotech and tech; host/moderator of the AI in Drug Discovery panel
  • Dr. Jonathan Picker — Professor of Genetics, Harvard University; trained surgeon and clinical/molecular geneticist
  • Vibhu Agarwal — Founder, Mimansa AI; computational linguist background, Stanford PhD, clinical trials focus; previously worked on text-to-speech for Indian languages
  • Sam Bartuk — Head of Developer Relations, vCluster; founder of Kubesimplify
  • Shvai — AI/ML background, CNCF code contributor, worked on Docker MCP gateway
  • James Lovegrove — Public Policy Lead, Red Hat
  • Simon — Sibo (company name; discussed data center infrastructure)

Organizations & Initiatives Mentioned:

  • Anthropic — Creator of MCP (Model Context Protocol); placing it in a foundation
  • Red Hat — Involved in InstructLab and data science pipeline projects
  • CNCF (Cloud Native Computing Foundation) — Maintains cloud-native ecosystem standards
  • Docker — Container standardization; contributed to Open Container Initiative
  • FDA (US Food & Drug Administration) — ~1,300-1,400 AI software as medical device approvals; requires diverse populations for drug approval
  • EMA (European Medicines Agency) — Similar requirements; emphasis on diverse population representation in synthetic data
  • Vanderbilt University — Launched NASH data-sharing initiative with Amgen, Regeneron
  • Harvard University — Dr. Picker's institution
  • Stanford University — Vibhu Agarwal's PhD institution
  • OECD — Discussed in context of open-source AI tools
  • DeepMind/Alphabet — AlphaFold reference (protein folding)
  • Genentech/David Baker Lab — Xaira startup (co-founded by Nobel laureate David Baker)
  • Government of India — Ayushman Bharat Digital Mission (ABDM); Digital Personal Data Protection Act (DPDP) 2023
  • Hospital chains mentioned: Medanta, Fortis, 100+ government medical colleges
  • Tempus AI — Data-sharing model company
  • Origin Life Sciences — Hyderabad-based company building retrosynthesis and NMR models
  • Hugging Face — Referenced as open-source model platform enabler

Technical Concepts & Resources

AI/ML Concepts:

  • Federated Learning — Training models locally, sharing only weights/metadata, not raw data; tested by 7 major pharmas
  • Embeddings — 768-dimensional floating-point vectors representing patient medical records in semantic space
  • Large Language Models (LLMs) — New approach to NLP tasks (unstructured medical notes) with "rich dense semantic representations"
  • Synthetic Data Generation — Sampling from high-dimensional multivariate distributions; purpose-built vs. general-purpose
  • Semantic Search — Matching patient records to eligibility criteria using vector similarity
  • Optimization Problems — Framing drug discovery as multi-objective optimization (specificity, off-target activity, absorption, distribution, metabolism)
  • MCP (Model Context Protocol) — Open standard for agentic AI tool calling; supports OAuth 2.1 authorization
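
The Embeddings and Semantic Search entries above can be illustrated with a small ranking sketch. It assumes embeddings have already been produced by some model; the 4-dimensional vectors, patient IDs, and the `rank_patients` helper are hypothetical stand-ins for the 768-dimensional vectors mentioned in the session.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_patients(
    criteria_vec: List[float],
    patient_vecs: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k patients whose record embeddings are closest to the criteria embedding."""
    scored = [(pid, cosine(criteria_vec, vec)) for pid, vec in patient_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical precomputed embeddings (real ones would come from an embedding model).
patients = {
    "patient_a": [0.9, 0.1, 0.0, 0.2],
    "patient_b": [0.1, 0.8, 0.3, 0.0],
    "patient_c": [0.7, 0.2, 0.1, 0.4],
}
criteria = [1.0, 0.0, 0.0, 0.3]  # embedding of the trial's eligibility text

shortlist = rank_patients(criteria, patients)
for pid, score in shortlist:
    print(pid, round(score, 3))
```

A similarity ranking like this only produces a shortlist for human review; hard inclusion and exclusion criteria would still need explicit checks.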

Data & Genomics:

  • Genotypes — Genetic variants/code
  • Phenotypes — Observable characteristics (height, weight, BMI, clinical conditions, radiology)
  • DNA/RNA extraction — From blood samples
  • Proteomics — Protein-level analysis
  • Consanguinity — Marriage between close blood relatives, such as cousins; raises the chance that offspring inherit two copies of the same recessive variant
  • Endogamy — High degree of intermarriage within population groups
  • Linkage studies — Identifying genes through family inheritance patterns
  • Gene variants/pathways — Disrupted biological pathways as drug targets
  • Rare diseases — Classified as rare in one population but affecting large absolute numbers in India (1.4B population)
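
To make the consanguinity entry above concrete, here is a small worked calculation using the standard population-genetics expression for recessive homozygote frequency under inbreeding, q² + F·q·(1 − q). This formula was not presented in the session; the allele frequency q = 0.001 is a hypothetical value, and F = 1/16 is the textbook inbreeding coefficient for offspring of first cousins. The sketch does not attempt to reproduce the panel's 100-1000x figure.

```python
def homozygous_recessive_freq(q: float, f: float = 0.0) -> float:
    """Expected frequency of recessive homozygotes for allele frequency q and inbreeding coefficient f."""
    return q * q + f * q * (1.0 - q)


q = 0.001  # hypothetical frequency of a rare recessive variant

random_mating = homozygous_recessive_freq(q)            # f = 0: Hardy-Weinberg q^2
first_cousins = homozygous_recessive_freq(q, f=1 / 16)  # offspring of first cousins
enrichment = first_cousins / random_mating

print(f"random mating: {random_mating:.2e}")
print(f"first cousins: {first_cousins:.2e}")
print(f"enrichment:    {enrichment:.1f}x")
```

The rarer the variant, the larger the relative enrichment, which is one reason recessive variants are easier to observe in such populations.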

Clinical & Regulatory:

  • EHR (Electronic Health Records) — Patient medical records; 80% unstructured in India
  • PHI (Protected Health Information) — Subject to DPDP and privacy regulations
  • Clinical trial registry — Mandatory protocol documentation (endpoints, objectives, design)
  • Equipoise — Ethical principle that control arm is justified only when uncertainty exists about treatment superiority
  • Real-world data (RWD) — Data from routine clinical care, not from controlled trials
  • Informed consent — Recommendation: <1 page, explained by medical professional
  • Anonymization — DPDP requirement; still legally/ethically ambiguous in practice
  • DPDP Act 2023 — India's data protection law; emphasizes consent but lacks implementation guidance
  • ABDM (Ayushman Bharat Digital Mission) — India's national health digital framework; low hospital compliance currently
  • Software as Medical Device (SaMD) — FDA-regulated AI tools; mostly approved as adjuncts to human decision-making, not autonomous
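
The acronym problem raised in the session (MI as myocardial infarction or mitral insufficiency) can be illustrated with a deliberately simple context-overlap heuristic. This is a toy sketch, not the LLM-based approach the speakers discussed; the cue-word lists and the `disambiguate` helper are hypothetical.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical cue words per sense; a real system would learn these from annotated notes.
SENSES: Dict[str, Dict[str, Set[str]]] = {
    "MI": {
        "myocardial infarction": {"troponin", "chest", "ecg", "stemi", "stent"},
        "mitral insufficiency": {"murmur", "valve", "regurgitation", "echo"},
    }
}


def disambiguate(acronym: str, note: str) -> Optional[str]:
    """Pick the sense whose cue words overlap most with the note; None if no cue or a tie."""
    tokens = set(re.findall(r"[a-z]+", note.lower()))
    scores = {sense: len(cues & tokens) for sense, cues in SENSES.get(acronym, {}).items()}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return None  # unknown acronym or no supporting context
    if list(scores.values()).count(scores[best]) > 1:
        return None  # ambiguous: leave for human review
    return best


print(disambiguate("MI", "Pt with chest pain, elevated troponin, h/o MI"))
print(disambiguate("MI", "Systolic murmur on exam; echo shows MI"))
print(disambiguate("MI", "h/o MI"))
```

Returning `None` instead of guessing reflects the session's point that this remains unsolved at scale: ambiguous cases should be surfaced rather than silently resolved.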

Standards & Infrastructure:

  • Docker/Kubernetes — Container orchestration standards enabling portability
  • Open Container Initiative (OCI) — Open standard for containers
  • OpenTelemetry — Open standard for observability; credited with stabilizing the ecosystem after proprietary approaches from Datadog and New Relic
  • OAuth 2.1 — Authorization standard supported by MCP for fine-grained scopes
  • SPIFFE/SPIRE — Workload identity standard (SPIFFE) and its reference implementation (SPIRE) for service-to-service authentication (mentioned with the A2A protocol)
  • FDA Approval Page for SaMD — Reference for regulatory landscape
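
The session's point about authorization scopes and traceability for agent tool calls can be sketched as a gateway that checks a scope and records every attempt. This is an illustrative sketch only: it is not the MCP SDK or an OAuth 2.1 implementation, token validation is omitted, and the class names, tool names, and scope strings are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Grant:
    """What a user has delegated to an agent (stands in for a validated access token)."""
    user_id: str
    agent_id: str
    scopes: FrozenSet[str]


@dataclass
class AuditEntry:
    timestamp: str
    user_id: str
    agent_id: str
    tool: str
    required_scope: str
    allowed: bool


@dataclass
class ToolGateway:
    """Checks the required scope before each tool call and logs allowed and denied attempts."""
    required_scopes: Dict[str, str]
    audit_log: List[AuditEntry] = field(default_factory=list)

    def call(self, grant: Grant, tool: str, handler: Callable[..., Any], *args: Any) -> Any:
        required = self.required_scopes[tool]
        allowed = required in grant.scopes
        self.audit_log.append(
            AuditEntry(
                timestamp=datetime.now(timezone.utc).isoformat(),
                user_id=grant.user_id,
                agent_id=grant.agent_id,
                tool=tool,
                required_scope=required,
                allowed=allowed,
            )
        )
        if not allowed:
            raise PermissionError(f"{tool} requires scope {required!r}")
        return handler(*args)


gateway = ToolGateway(required_scopes={"read_record": "records:read", "place_order": "orders:write"})
grant = Grant(user_id="user-1", agent_id="agent-7", scopes=frozenset({"records:read"}))

result = gateway.call(grant, "read_record", lambda pid: f"record for {pid}", "p-42")
try:
    gateway.call(grant, "place_order", lambda: "ordered")
    denied = False
except PermissionError:
    denied = True

print(result, denied, len(gateway.audit_log))
```

Logging the denied attempt as well as the successful one is the design choice that matters here: the audit trail should answer who initiated a request and which permission was missing, not only what succeeded.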

Datasets & Resources:

  • PubChem — Chemical compound database
  • USPTO — Patent database (chemistry/retrosynthesis)
  • NMR shift databases — Nuclear magnetic resonance data
  • Linux — Open-source OS enabling ecosystem diversity

Historical References:

  • Human Genome Project (2003) — Completed first annotated genome; did not transform medicine as expected
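  • AlphaFold — DeepMind's AI system for protein structure prediction; its developers shared the 2024 Nobel Prize in Chemistry with David Baker, who was recognized for computational protein design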
  • Docker revolution (2010s) — Transformed containerization, spawned Kubernetes and cloud-native ecosystem
  • Tuskegee Syphilis Study / Nazi experiments — Cautionary examples of unethical human experimentation; justify strong ethical regulatory frameworks

Limitations & Open Questions

  • Data standardization roadmap unclear: While ABDM exists, no concrete timeline for hospital compliance or enforcement mechanisms
  • Incentive alignment unsolved: How to motivate individual hospitals to share data without regulatory mandate or financial benefit
  • Synthetic data validation: No consensus on how to validate that synthetic patient data meets clinical standards
  • Scaling challenges: Unclear how federated learning can work across >100 Indian government medical colleges with varying IT infrastructure
  • Regulatory guidance gaps: DPDP Act lacks clarity on anonymization techniques, consent mechanisms, and cross-institutional data flows