AI in Modern Drug Discovery: From Genes to Clinical Trials | India AI Impact Summit 2026

Executive Summary

This panel discussion explores how AI and open standards can accelerate drug discovery in India, with emphasis on leveraging India's unique population diversity for genomic research and clinical trial optimization. The speakers argue that India possesses structural advantages, including endogamous and consanguineous populations and a massive software talent pool, that position it to democratize AI-driven drug discovery while maintaining ethical data governance through federated learning.

Key Takeaways

India can lead genomic drug discovery if it acts now: Leverage endogamous populations, build federated data networks, and hire world-class bioinformaticians from India's software talent pool—this is a structural advantage equivalent to rare earth elements but vastly more valuable.
Data collection and governance are more important than algorithm innovation: Multiple speakers emphasized that the bottleneck is not AI capability but access to well-structured, ethically-sourced, diverse data. Solve the data problem first; the AI will follow.
Open standards enable scale without centralization: Federated learning, MCP, and standards-based architectures allow hospitals, pharma companies, and researchers to collaborate without surrendering proprietary data or intellectual property, but this requires regulatory incentives and clear business models.
Clinical trials can be accelerated through AI-powered patient discovery: Treating eligibility matching as a semantic search/embedding problem targets the enrollment bottleneck, where 80% of trials miss their enrollment targets. This is immediately actionable for Indian hospitals and trial networks.
Regulation should promote, not hinder, collaborative AI in health: Current frameworks (DPDP, informed consent, anonymization) are incomplete. Regulators must clarify pathways for federated learning, synthetic data validation, and cross-institutional data sharing to avoid chilling innovation while protecting patient rights.

Key Topics Covered

Open standards and democratization of AI — How open standards (Docker, Kubernetes, MCP) enable broader participation in AI development beyond large corporations
Identity and security in agentic AI systems — OAuth 2.1, authorization scopes, traceability, and preventing prompt injection attacks
Data governance and federated learning — Addressing data silos, privacy concerns, and the DPDP Act 2023 while enabling collaborative research
Genomic diversity in drug discovery — Why India's endogamous and consanguineous populations provide 100-1000x data efficiency advantages
Clinical trial challenges and AI solutions — Using AI to identify eligible patients from unstructured EHR data; 80% of trials fail to meet enrollment targets
AI's role vs. human expertise — Why AI augments rather than replaces physicians and domain experts
Synthetic data in medical research — Benefits and limitations; purpose-built datasets vs. general synthetic data
Regulatory frameworks — FDA/EMA requirements, informed consent, ABDM compliance, ethical oversight
Data infrastructure challenges — Unstructured medical notes, acronym disambiguation, and DPDP anonymization gaps

Key Points & Insights

Open standards as equalizers: Docker, Kubernetes, and emerging standards such as Anthropic's MCP (Model Context Protocol) let smaller organizations and developers build on established platforms without vendor lock-in. This mirrors how the Linux ecosystem allowed diverse organizations to build their own distributions.
India's demographic advantage for genomics: India's population of 1.4 billion includes many endogamous communities and a high rate of consanguinity, which speakers said gives 100-1000x greater statistical power to detect gene variants. On that estimate, a study in India would need 1/100th to 1/1000th of the sample size required in outbred populations such as those of Europe, a massive structural advantage.
Global genomic data is severely skewed: Over 80% of genomic and medical data comes from people of European (primarily North European) ancestry. This homogeneity makes it harder to identify causal variants and creates regulatory/ethical problems—FDA and EMA now require diverse populations for drug approvals.
Clinical trial enrollment is a critical bottleneck: 80% of clinical trials fail to meet enrollment targets. Today, identifying eligible patients depends on floor managers querying records by hand. AI can treat this as a semantic search problem: embed patient medical records as vectors and match them against vectors built from the eligibility criteria.
Unstructured EHR data is the real obstacle: 80%+ of clinical information in India exists as unstructured narrative notes, not structured databases. Acronyms are overloaded (MI = myocardial infarction OR mitral insufficiency). Large language models show promise but this remains unsolved at scale.
Federated learning works but lacks incentives: Seven major pharma companies tested federated approaches where they trained locally and shared only metadata/meta-analysis results—protecting proprietary libraries while advancing collective knowledge. But individual hospitals lack incentives to participate without clear benefit or regulatory requirements.
Synthetic data is purpose-built, not universal: Synthetic patient records cannot be created as general-purpose datasets because they require sampling from high-dimensional distributions (20,000-30,000 parameters). Synthetic data only works when constrained to specific prediction problems (mortality, myocardial infarction, etc.).
Data collection is expensive but necessary: Meaningful drug discovery requires ~100 million well-phenotyped and genotyped individuals. Purpose-collected data (with informed consent for specific biomarkers) is preferable to retrospective repurposing, though much more expensive upfront.
Identity and traceability in agentic systems: As AI agents make tool calls on behalf of users, maintaining audit trails of who initiated requests and which permissions were granted becomes critical for security and compliance (especially for sensitive actions like e-commerce purchases or medical data access).
DPDP Act 2023 lacks enforcement teeth: India's data protection law is "a statement of good intent" without regulatory guidance on practical implementation—particularly around anonymization techniques and consent mechanisms. Government initiatives like ABDM have low hospital compliance due to operational barriers.
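
The semantic-search framing of trial enrollment described above can be sketched in a few lines. This is a minimal illustration, not the panel's implementation: the 4-dimensional vectors are toy stand-ins for real record embeddings, and `rank_patients`, the patient IDs, and all values are hypothetical. A real pipeline would first need an embedding model to turn notes and eligibility criteria into vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)

def rank_patients(criteria_vec, patient_vecs, top_k=2):
    """Return the top_k (patient_id, score) pairs, best match first."""
    scored = [
        (pid, cosine_similarity(criteria_vec, vec))
        for pid, vec in patient_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for embeddings of eligibility criteria and records.
criteria = [0.9, 0.1, 0.0, 0.4]
patients = {
    "patient_a": [0.8, 0.2, 0.1, 0.5],
    "patient_b": [0.0, 0.9, 0.7, 0.0],
    "patient_c": [0.7, 0.0, 0.0, 0.3],
}
shortlist = rank_patients(criteria, patients)
```

Here `shortlist` ranks `patient_c` first and `patient_a` second. A ranked shortlist like this would still go to a clinician for review, since similarity scores only suggest candidates.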

Notable Quotes or Statements

"If this was a debate and we were having a debate, I would say this house believes that India has two significant advantages over the rest of the world in doing drug discovery in AI... endogamous pockets of population... a hundred to a thousand times [data efficiency advantage]." — Parag Parikh (Venture Capitalist)

"We need to think about how do we not just get the data but how can we meaningfully use it in a way that protects the participants and the country... a federated educational approach where we bring the AI in." — Dr. Jonathan Picker (Harvard Geneticist)

"80% of clinical trials do not make their enrollment targets... there is no way that you can accelerate drug development or try to do this for a 100 therapies in one go... this is really where AI can step in and be a really big game changer." — Vibhu Agarwal (Mimansa AI, Clinical Trials)

"As soon as open telemetry came in as an open standard, the whole observability ecosystem is now stabilizing... the power of open source and open standards that they bring in." — Sam Bartuk (vCluster, Head of Developer Relations)

"Synthetic data is one of the most poorly understood notions to have arrived on the scene... synthetic data sets are purpose-built... those kind of data sets are good." — Vibhu Agarwal

"AI is going to make a fantastic partner and do a number of things. It's going to decrease the error rate significantly... Where I think AI needs physicians is in relating the data to the reality of the person." — Dr. Jonathan Picker

"I think synthetic data is... [a problem because] you're actually sampling from a ginormous multivaried distribution... It is non-trivial which is why a synthetic data set which mirrors a patient... doesn't exist." — Vibhu Agarwal

Speakers & Organizations Mentioned

Primary Speakers:

Parag Parikh — Venture capitalist, fifth decade of investing; background in biotech and tech; host/moderator for the AI in Drug Discovery panel
Dr. Jonathan Picker — Professor of Genetics, Harvard University; trained surgeon and clinical/molecular geneticist
Vibhu Agarwal — Founder, Mimansa AI; computational linguist background, Stanford PhD, clinical trials focus; previously worked on text-to-speech for Indian languages
Sam Bartuk — Head of Developer Relations, vCluster; founder of Cube Simplify
Shvai — AI/ML background, CNCF code contributor, worked on Docker MCP gateway
James Lovegrove — Public Policy Lead, Red Hat
Simon — Sibo (company name; discussed data center infrastructure)

Organizations & Initiatives Mentioned:

Anthropic — Creator of MCP (Model Context Protocol); placing it in a foundation
Red Hat — Involved in Instruct Lab and data science pipeline projects
CNCF (Cloud Native Computing Foundation) — Maintains cloud-native ecosystem standards
Docker — Container standardization; contributed to Open Container Initiative
FDA (US Food & Drug Administration) — ~1,300-1,400 AI software as medical device approvals; requires diverse populations for drug approval
EMA (European Medicines Agency) — Similar requirements; emphasis on diverse population representation in synthetic data
Vanderbilt University — Launched NASH data-sharing initiative with Amgen, Regeneron
Harvard University — Dr. Picker's institution
Stanford University — Vibhu Agarwal's PhD institution
OECD — Discussed in context of open-source AI tools
DeepMind/Alphabet — AlphaFold reference (protein folding)
Genentech/David Baker Lab — Xaira startup (co-founded by Nobel laureate David Baker)
Government of India — Ayushman Bharat Digital Mission (ABDM); Digital Personal Data Protection Act (DPDP) 2023
Hospital chains mentioned: Medanta, Fortis, 100+ government medical colleges
Tempus AI — Data-sharing model company
Origin Life Sciences — Hyderabad-based company building retrosynthesis and NMR models
Hugging Face — Referenced as open-source model platform enabler

Technical Concepts & Resources

AI/ML Concepts:

Federated Learning — Training models locally, sharing only weights/metadata, not raw data; tested by 7 major pharmas
Embeddings — 768-dimensional floating-point vectors representing patient medical records in semantic space
Large Language Models (LLMs) — New approach to NLP tasks (unstructured medical notes) with "rich dense semantic representations"
Synthetic Data Generation — Sampling from high-dimensional multivariate distributions; purpose-built vs. general-purpose
Semantic Search — Matching patient records to eligibility criteria using vector similarity
Optimization Problems — Framing drug discovery as multi-objective optimization (specificity, off-target activity, absorption, distribution, metabolism)
MCP (Model Context Protocol) — Open standard for agentic AI tool calling; supports OAuth 2.1 authorization
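
The federated learning entry above (train locally, share only weights) can be made concrete with a sketch of sample-weighted averaging, the combining step of the FedAvg algorithm. This is an assumption about how such a scheme could work, not a description of what the seven pharma companies ran; `federated_average` and the example numbers are hypothetical.

```python
def federated_average(site_updates):
    """Combine per-site weight vectors into one global vector.

    site_updates: list of (weights, n_samples) tuples, one per site.
    Each site's weights count in proportion to its local sample size,
    so no raw patient or compound data ever leaves a site.
    """
    if not site_updates:
        raise ValueError("need at least one site update")
    total = sum(n for _, n in site_updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(site_updates[0][0])
    global_weights = [0.0] * dim
    for weights, n in site_updates:
        if len(weights) != dim:
            raise ValueError("all sites must share one model shape")
        for i, w in enumerate(weights):
            global_weights[i] += w * n / total
    return global_weights

# Two hypothetical sites: a small one (100 samples) and a larger one (300).
updates = [
    ([0.2, 1.0], 100),
    ([0.4, 0.0], 300),
]
merged = federated_average(updates)
```

With these inputs `merged` is approximately `[0.35, 0.25]`, closer to the larger site's weights. In a full system this step would repeat over many training rounds.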

Data & Genomics:

Genotypes — Genetic variants/code
Phenotypes — Observable characteristics (height, weight, BMI, clinical conditions, radiology)
DNA/RNA extraction — From blood samples
Proteomics — Protein-level analysis
Consanguinity — Marriage between close blood relatives such as cousins; increases the chance that a child inherits two copies of the same recessive variant
Endogamy — High degree of intermarriage within population groups
Linkage studies — Identifying genes through family inheritance patterns
Gene variants/pathways — Disrupted biological pathways as drug targets
Rare diseases — Classified as rare in one population but affecting large absolute numbers in India (1.4B population)
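
To illustrate why consanguinity matters for finding recessive variants, the sketch below applies the textbook population-genetics relation P(aa) = q^2 + F*q*(1 - q), where q is the recessive allele frequency and F is the inbreeding coefficient (1/16 for children of first cousins). This is a standard formula used here for illustration, not the calculation behind the panel's 100-1000x figure; the allele frequency is a hypothetical value.

```python
def homozygote_frequency(q, f=0.0):
    """Expected frequency of the recessive homozygote (aa).

    q: frequency of the recessive allele in the population
    f: inbreeding coefficient (0.0 means random mating)
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    if not 0.0 <= f <= 1.0:
        raise ValueError("inbreeding coefficient must be between 0 and 1")
    return q * q + f * q * (1.0 - q)

F_FIRST_COUSINS = 1 / 16  # inbreeding coefficient for children of first cousins

q = 0.001  # hypothetical rare recessive allele frequency
baseline = homozygote_frequency(q)                       # random mating
with_cousins = homozygote_frequency(q, F_FIRST_COUSINS)  # first-cousin parents
enrichment = with_cousins / baseline
```

For this allele frequency the enrichment is about 63-fold, and it grows as the allele gets rarer. That is the intuition behind needing far fewer samples to observe recessive effects in such populations.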

Clinical & Regulatory:

EHR (Electronic Health Records) — Patient medical records; 80% unstructured in India
PHI (Protected Health Information) — Subject to DPDP and privacy regulations
Clinical trial registry — Mandatory protocol documentation (endpoints, objectives, design)
Equipoise — Ethical principle that control arm is justified only when uncertainty exists about treatment superiority
Real-world data (RWD) — Data from routine clinical care, not from controlled trials
Informed consent — Recommendation: <1 page, explained by medical professional
Anonymization — DPDP requirement; still legally/ethically ambiguous in practice
DPDP Act 2023 — India's data protection law; emphasizes consent but lacks implementation guidance
ABDM (Ayushman Bharat Digital Mission) — India's national health digital framework; low hospital compliance currently
Software as Medical Device (SaMD) — FDA-regulated AI tools; mostly approved as adjuncts to human decision-making, not autonomous
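
The acronym problem in unstructured notes (MI as myocardial infarction or mitral insufficiency) can be illustrated with a deliberately simple keyword baseline. The panel pointed to large language models for this task; the approach below is a swapped-in toy technique, and the cue-word lists and `expand_acronym` are hypothetical.

```python
import re

# Hypothetical cue words per expansion; a real system would learn these
# from annotated clinical notes rather than a hand-written table.
SENSES = {
    "MI": {
        "myocardial infarction": {"troponin", "chest", "stemi", "ecg", "angioplasty"},
        "mitral insufficiency": {"murmur", "valve", "regurgitation", "echo", "leaflet"},
    }
}

def expand_acronym(acronym, note):
    """Pick the expansion whose cue words overlap most with the note.

    Returns None when no cue word appears, so ambiguous cases can be
    routed to a human reviewer instead of being guessed.
    """
    tokens = set(re.findall(r"[a-z]+", note.lower()))
    best, best_hits = None, 0
    for expansion, cues in SENSES.get(acronym, {}).items():
        hits = len(tokens & cues)
        if hits > best_hits:
            best, best_hits = expansion, hits
    return best

cardiac = expand_acronym("MI", "Pt with chest pain, troponin elevated, ?MI")
valvular = expand_acronym("MI", "Echo shows valve regurgitation, MI grade II")
unclear = expand_acronym("MI", "MI noted")
```

The first note resolves to myocardial infarction, the second to mitral insufficiency, and the third returns `None` because it carries no context. Notes like the third are why the problem remains hard at scale.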

Standards & Infrastructure:

Docker/Kubernetes — Containerization (Docker) and container orchestration (Kubernetes) platforms enabling portability
Open Container Initiative (OCI) — Open standard for containers
OpenTelemetry — Open standard for observability; credited with stabilizing an ecosystem previously split across proprietary approaches such as Datadog and New Relic
OAuth 2.1 — Authorization standard supported by MCP for fine-grained scopes
SPIFFE/SPIRE — Workload identity standard (SPIFFE) and its reference implementation (SPIRE) for service-to-service authentication; mentioned together with the A2A protocol
FDA Approval Page for SaMD — Reference for regulatory landscape
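
The traceability concern for agentic systems (who initiated a tool call and which permissions were granted) can be sketched as a scope check that logs every attempt. This is an illustrative pattern only, not the MCP or OAuth 2.1 API; `AgentSession`, `call_tool`, and the scope strings are hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AgentSession:
    """Who the agent acts for, what it may do, and what it has tried."""
    user_id: str
    granted_scopes: frozenset
    audit_log: list = field(default_factory=list)

def call_tool(session, tool_name, required_scope):
    """Allow the call only if the scope was granted; log every attempt."""
    allowed = required_scope in session.granted_scopes
    session.audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": session.user_id,
        "tool": tool_name,
        "scope": required_scope,
        "allowed": allowed,
    })
    if not allowed:
        raise PermissionError(f"{tool_name} requires scope {required_scope!r}")
    return f"{tool_name} executed for {session.user_id}"

# A session that may read records but not export them.
demo = AgentSession("clinician_1", frozenset({"records:read"}))
result = call_tool(demo, "search_records", "records:read")
```

Denied calls are logged before the error is raised, so the audit trail also records what an agent tried and failed to do. That matters when investigating a suspected prompt injection.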

Datasets & Resources:

PubChem — Chemical compound database
USPTO — Patent database (chemistry/retrosynthesis)
NMR shift databases — Nuclear magnetic resonance data
Linux — Open-source OS enabling ecosystem diversity

Historical References:

Human Genome Project (2003) — Completed first annotated genome; did not transform medicine as expected
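AlphaFold — AI protein structure prediction from DeepMind; its developers Demis Hassabis and John Jumper shared the 2024 Nobel Prize in Chemistry with David Baker, who was recognized for computational protein design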
Docker revolution (2010s) — Transformed containerization, spawned Kubernetes and cloud-native ecosystem
Tuskegee Syphilis Study / Nazi experiments — Cautionary examples of unethical human experimentation; justify strong ethical regulatory frameworks

Limitations & Open Questions

Data standardization roadmap unclear: While ABDM exists, no concrete timeline for hospital compliance or enforcement mechanisms
Incentive alignment unsolved: How to motivate individual hospitals to share data without regulatory mandate or financial benefit
Synthetic data validation: No consensus on how to validate synthetic patient data meets clinical standards
Scaling challenges: Unclear how federated learning can work across >100 Indian government medical colleges with varying IT infrastructure
Regulatory guidance gaps: DPDP Act lacks clarity on anonymization techniques, consent mechanisms, and cross-institutional data flows