UC Berkeley · MIDS Capstone · 2025

DrugPredict

Informed Drug Development

A machine learning framework for predicting how effective a drug will be across different patient populations — before clinical trials conclude. GLP-1 receptor agonists for Type 2 Diabetes serve as our proof of concept, demonstrating a generalizable approach to personalized drug efficacy prediction.


GLP-1

Drug Class

XGBoost

Core Model

SHAP

Interpretability

T2D

Target Disease

01 — Problem

Drug development is expensive, slow, and unpredictable

Bringing a new drug to market costs over $2.5 billion on average and takes more than a decade. Despite advances in clinical research, the majority of drug candidates fail in late-stage trials — often due to insufficient understanding of which patient populations will respond to treatment.

For GLP-1 receptor agonists in Type 2 Diabetes, this challenge is especially acute. Heterogeneous patient profiles mean that efficacy varies widely, yet traditional trial designs lack the tools to predict responders early and precisely.

"If we can identify patients who will strongly respond to a GLP-1 agonist based on the drug's molecular structure, we can design smarter trials, reduce waste, and get effective treatments to patients faster."

$2.5B

Average cost to bring a new drug to market

10–15 yr

Typical drug development timeline from discovery to approval

>90%

Drug candidates that fail in clinical trials

02 — Approach

A multi-source ML pipeline for efficacy prediction

We built an end-to-end pipeline combining chemical structure, pharmacokinetic properties, and clinical outcomes data to train and interpret an XGBoost model that predicts GLP-1 agonist response in Type 2 Diabetes patients.


Molecular Fingerprints

Fingerprints that encode each drug's chemical structure as a binary feature vector the model can consume.

RDKit · ECFP · Morgan


ADMET Properties

Absorption, distribution, metabolism, excretion, and toxicity features that characterize each drug's pharmacokinetic and physiological behavior.

ADMET-AI · SwissADME


XGBoost Model

Gradient-boosted trees trained on combined feature sets, tuned with cross-validation for clinical predictive accuracy.

XGBoost · scikit-learn · Python


SHAP Analysis

SHapley Additive exPlanations reveal which molecular and ADMET features most influence efficacy predictions.

SHAP · interpretability
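To make the "combined feature sets" idea concrete, here is a minimal sketch of how the three feature blocks might be concatenated into one model-ready row per trial arm. The `TrialArm` schema, the key names, and all values are hypothetical and chosen only for illustration; the project's actual column layout is not described on this page.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TrialArm:
    """One training example: a drug as given to one trial arm (hypothetical schema)."""
    trial_id: str
    fingerprint_bits: List[int]      # e.g. Morgan/ECFP bits computed upstream
    admet: Dict[str, float]          # e.g. predicted ADMET properties
    demographics: Dict[str, float]   # arm-level averages such as age or baseline HbA1c


def feature_names(n_bits: int, admet_keys: List[str], demo_keys: List[str]) -> List[str]:
    """Column names in the same order that build_feature_row emits values."""
    return (
        [f"fp_{i}" for i in range(n_bits)]
        + [f"admet_{key}" for key in admet_keys]
        + [f"demo_{key}" for key in demo_keys]
    )


def build_feature_row(arm: TrialArm, admet_keys: List[str], demo_keys: List[str]) -> List[float]:
    """Concatenate fingerprint bits, ADMET properties, and demographics into one row."""
    row = [float(bit) for bit in arm.fingerprint_bits]
    row += [arm.admet[key] for key in admet_keys]
    row += [arm.demographics[key] for key in demo_keys]
    return row


# Made-up values for illustration only; real fingerprints are far longer.
ADMET_KEYS = ["logp", "half_life_h"]
DEMO_KEYS = ["age", "baseline_hba1c", "bmi"]
arm = TrialArm(
    trial_id="TRIAL-A",
    fingerprint_bits=[0, 1, 1, 0, 0, 0, 1, 0],
    admet={"logp": 1.2, "half_life_h": 160.0},
    demographics={"age": 56.0, "baseline_hba1c": 8.1, "bmi": 32.4},
)
row = build_feature_row(arm, ADMET_KEYS, DEMO_KEYS)
names = feature_names(len(arm.fingerprint_bits), ADMET_KEYS, DEMO_KEYS)
```

Keeping the key lists explicit fixes the column order, so the same function can build rows at training time and at inference time without the columns drifting apart.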

03 — Results

Model performance & key findings

On held-out test data, our XGBoost classifier reached an AUC-ROC of 0.830, and the regressor explained about 45% of the variance in HbA1c change (R² = 0.453). SHAP analysis surfaced interpretable, clinically plausible feature drivers.
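For readers less familiar with the metrics reported in this section, the sketch below computes AUC-ROC, RMSE, and R² from first principles in plain Python. It is an illustration of the definitions, not the project's evaluation code, which presumably relies on library implementations; the function names are our own.

```python
import math
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error, in the same units as the target (here, HbA1c change)."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of target variance explained; 0 means no better than predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Read this way, an AUC-ROC of 0.830 means that a randomly chosen responder arm receives a higher score than a randomly chosen non-responder arm about 83% of the time.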

Model Metrics

AUC-ROC

0.830

R² (Regressor)

0.453

Regressor RMSE: 0.488 percentage points of ΔHbA1c · Trained on 231 GLP-1 trial arms (train/test split with GroupShuffleSplit, grouped by trial ID)
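Splitting by trial ID matters because arms from the same trial share a protocol and population, so letting them land on both sides of the split would leak information into the test set. The sketch below is a plain-Python stand-in for the idea behind scikit-learn's GroupShuffleSplit, not its actual implementation; the function name, the rounding rule, and the trial IDs are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def group_shuffle_split(
    groups: Sequence[str], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx) so that no group appears on both sides."""
    unique_groups = sorted(set(groups))
    if len(unique_groups) < 2:
        raise ValueError("need at least two groups to form a train/test split")
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # The fraction applies to groups (trials), not to individual rows (arms).
    n_test = min(len(unique_groups) - 1, max(1, round(test_fraction * len(unique_groups))))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical trial IDs: each entry is one trial arm.
trial_ids = ["T1", "T1", "T2", "T2", "T3", "T4", "T4", "T5"]
train_idx, test_idx = group_shuffle_split(trial_ids, test_fraction=0.2, seed=42)
```

Because the split is made over trials rather than arms, the test set can be larger or smaller than 20% of the rows, depending on how many arms the held-out trials contain.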

Top SHAP Features — Efficacy Drivers

Age

top 1

Baseline HbA1c

top 2

Obesity flag

top 3

Weight (kg)

top 4

BMI

top 5

Diabetes duration

top 6

Patient demographics had the most influence, followed by molecular fingerprints.
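The rankings above come from SHAP, which attributes each prediction to individual features using Shapley values. As a rough illustration of what those values mean, the sketch below computes exact Shapley values for a tiny toy model by enumerating feature subsets. This is not the SHAP library's tree algorithm, the toy model and its coefficients are made up, and replacing "absent" features with a single baseline row is a simplifying assumption.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, Set


def shapley_values(
    model: Callable[[Dict[str, float]], float],
    instance: Dict[str, float],
    baseline: Dict[str, float],
) -> Dict[str, float]:
    """Exact Shapley values; features outside a coalition take their baseline value."""
    names = list(instance)
    n = len(names)

    def value(coalition: Set[str]) -> float:
        mixed = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return model(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {name}) - value(coalition))
        phi[name] = total
    return phi


def toy_model(x: Dict[str, float]) -> float:
    """Made-up predictor of HbA1c change with one interaction term (illustration only)."""
    return (
        -0.5 * (x["baseline_hba1c"] - 7.0)
        - 0.01 * (x["age"] - 55.0)
        - 0.2 * x["obese"]
        - 0.1 * x["obese"] * (x["baseline_hba1c"] - 7.0)
    )


instance = {"age": 65.0, "baseline_hba1c": 9.0, "obese": 1.0}
baseline = {"age": 55.0, "baseline_hba1c": 8.0, "obese": 0.0}
phi = shapley_values(toy_model, instance, baseline)
```

The attributions always sum to the difference between the prediction for the instance and the prediction for the baseline. Exact enumeration grows exponentially with the number of features, which is why practical tools rely on model-specific shortcuts for tree ensembles.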

04 — Conclusion

What we found, and where this goes

DrugPredict demonstrates that combining molecular cheminformatics with real clinical outcomes data can produce a meaningful, interpretable signal for drug efficacy prediction — even at the scale of publicly available trial data.

Key Takeaways

An AUC-ROC of 0.830 shows the classifier reliably separates likely responders from non-responders across held-out trial arms — a strong signal for a proof-of-concept trained on aggregate public data.

Patient demographics — particularly age and baseline HbA1c — dominate predictions, consistent with clinical literature. The model learned clinically plausible patterns without domain-specific supervision.

As a proof of concept, the project shows that GLP-1 efficacy is predictable from publicly available data alone. Previous work in this area has been limited to predicting the likelihood that a drug is approved, based on historical approval rates. DrugPredict builds on and departs from that approach by estimating the effect a drug could actually have on patients.

The pipeline generalizes to hypothetical molecules: novel drug candidates designed via RDKit reaction SMARTS can be scored end-to-end using the same inference function, with no retraining required.

Future Directions

Individual patient-level data

Replacing trial-arm averages with individual EHR records would allow far richer heterogeneity modeling and would likely improve regression R².

Expanded drug classes

Applying the same pipeline to SGLT-2 inhibitors, DPP-4 inhibitors, or insulin analogs would test generalizability beyond GLP-1 agonists.

3D molecular representations

Incorporate the order of structural units along each molecule. GLP-1 agonists are peptides, so the sequence in which those units appear can change a drug's effect. Our fingerprint bits capture local substructures but not the full ordering of the molecule, which matters for GLP-1 effectiveness. Sequence-aware inputs could also open the door to transformer-based architectures, which model dependencies between sequential groups.

Prospective clinical validation

Partnering with a clinical site to prospectively score patients before randomization would provide ground-truth validation of the responder predictions.

05 — Team

The researchers

Farhan Quadri

Data Scientist · UC Berkeley MIDS

Data scientist with expertise in real-world evidence, longitudinal analysis, and ML for health data. Currently at Abbott analyzing FreeStyle Libre CGM data.


Kevin Coppa

Data Scientist · UC Berkeley MIDS

Data scientist with 8+ years of experience in healthcare and expertise in applying NLP. Currently at Northwell Health managing the Data Science team.

