UC Berkeley · MIDS Capstone · 2025

DrugPredict

Informed Drug Development

A machine learning framework for predicting how effective a drug will be across different patient populations — before clinical trials conclude. GLP-1 receptor agonists for Type 2 Diabetes serve as our proof of concept, demonstrating a generalizable approach to personalized drug efficacy prediction.

Drug Class: GLP-1
Core Model: XGBoost
Interpretability: SHAP
Target Disease: T2D

Drug development is expensive, slow, and unpredictable

Bringing a new drug to market costs over $2.5 billion on average and takes more than a decade. Despite advances in clinical research, the majority of drug candidates fail in late-stage trials — often due to insufficient understanding of which patient populations will respond to treatment.

For GLP-1 receptor agonists in Type 2 Diabetes, this challenge is especially acute. Heterogeneous patient profiles mean that efficacy varies widely, yet traditional trial designs lack the tools to predict responders early and precisely.

"If we can identify patients who will strongly respond to a GLP-1 agonist based on the drug's molecular structure, we can design smarter trials, reduce waste, and get effective treatments to patients faster."

Average cost to bring a new drug to market: $2.5B
Typical drug development timeline from discovery to approval: 10–15 yr
Drug candidates that fail in clinical trials: >90%

A multi-source ML pipeline for efficacy prediction

We built an end-to-end pipeline combining chemical structure, pharmacokinetic properties, and clinical outcomes data to train and interpret an XGBoost model that predicts GLP-1 agonist response in Type 2 Diabetes patients.

01
Molecular Fingerprints
Fingerprints that encode each drug's chemical structure as a binary feature vector a model can consume.
RDKit ECFP Morgan
02
ADMET Properties
Absorption, distribution, metabolism, excretion, and toxicity (ADMET) features that characterize each drug's pharmacokinetic and physiological behavior.
ADMET-AI SwissADME
03
XGBoost Model
Gradient-boosted trees trained on combined feature sets, tuned with cross-validation for clinical predictive accuracy.
XGBoost scikit-learn Python
04
SHAP Analysis
SHapley Additive exPlanations reveal which molecular and ADMET features most influence efficacy predictions.
SHAP interpretability
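
The first two steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the SMILES string is aspirin (a small-molecule stand-in, since GLP-1 agonists are large peptides), the radius of 2 and 2048 bits are assumed settings, the ADMET and patient values are made-up placeholders, and the helper names `morgan_bits` and `build_feature_row` are hypothetical.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def morgan_bits(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    """Encode a molecule as a binary Morgan (ECFP-style) fingerprint vector."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    return generator.GetFingerprintAsNumPy(mol).astype(np.uint8)


def build_feature_row(smiles: str, admet: dict, patient: dict) -> np.ndarray:
    """Concatenate fingerprint bits, ADMET descriptors, and trial-arm covariates."""
    fp = morgan_bits(smiles).astype(np.float32)
    # Sorting the keys keeps the column order stable from row to row.
    admet_vec = np.array([admet[k] for k in sorted(admet)], dtype=np.float32)
    patient_vec = np.array([patient[k] for k in sorted(patient)], dtype=np.float32)
    return np.concatenate([fp, admet_vec, patient_vec])


# Illustrative call: every number below is a placeholder, not project data.
row = build_feature_row(
    "CC(=O)Oc1ccccc1C(=O)O",  # aspirin, used only as a small stand-in molecule
    admet={"logP": 1.2, "tpsa": 63.6},
    patient={"age": 58.0, "baseline_hba1c": 8.1, "bmi": 32.0},
)
```

A row built this way has 2048 fingerprint columns followed by the ADMET and patient columns, which is the kind of combined feature set a gradient-boosted tree model can train on.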

Model performance & key findings

Our XGBoost classifier separated likely responders from non-responders well on held-out trial arms, while the regressor explained about 45% of the variance in HbA1c change. SHAP analysis revealed interpretable, clinically plausible feature drivers.

Model Metrics

AUC-ROC: 0.830
R² (Regressor): 0.453

Regressor RMSE: 0.488 percentage points of ΔHbA1c · Trained on 231 GLP-1 trial arms (split with GroupShuffleSplit by trial ID)
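
Splitting by trial ID keeps every arm of a given trial on the same side of the split, so the model is never evaluated on a trial it partly saw during training. A minimal sketch with scikit-learn's `GroupShuffleSplit` is shown here; the arrays are synthetic placeholders, the trial IDs are hypothetical, and the 80/20 split and random seed are assumptions rather than the project's settings.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in: 12 trial arms drawn from 5 trials (IDs are hypothetical).
trial_ids = np.array(
    ["T1", "T1", "T1", "T2", "T2", "T3", "T3", "T3", "T4", "T4", "T5", "T5"]
)
X = rng.normal(size=(len(trial_ids), 4))  # placeholder feature matrix
y = rng.normal(size=len(trial_ids))       # placeholder ΔHbA1c targets

# test_size is a fraction of groups (trials), not of individual arms.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=trial_ids))

train_trials = set(trial_ids[train_idx])
test_trials = set(trial_ids[test_idx])
```

Because `test_size` applies to groups, the number of held-out arms depends on which trials are drawn; the guarantee is only that no trial appears in both sets.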

Top SHAP Features — Efficacy Drivers
Top 1: Age
Top 2: Baseline HbA1c
Top 3: Obesity flag
Top 4: Weight (kg)
Top 5: BMI
Top 6: Diabetes duration

Patient demographics had the most influence, followed by molecular fingerprints.
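
A feature ranking of this kind is typically produced by averaging the absolute SHAP value of each feature across samples. The sketch below shows only that aggregation step on a small made-up matrix. In a real pipeline the matrix would come from a SHAP explainer such as `shap.TreeExplainer` run on the trained model; the function name `rank_features`, the feature names, and the numbers are placeholders, not project results.

```python
import numpy as np


def rank_features(shap_values: np.ndarray, feature_names: list[str]) -> list[tuple[str, float]]:
    """Rank features by mean absolute SHAP value (a common global importance score)."""
    if shap_values.shape[1] != len(feature_names):
        raise ValueError("One feature name is required per SHAP column")
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1]  # largest importance first
    return [(feature_names[i], float(importance[i])) for i in order]


# Made-up SHAP matrix: 3 samples x 3 features (illustrative only).
toy_shap = np.array([
    [0.30, -0.10, 0.02],
    [-0.25, 0.15, 0.01],
    [0.35, -0.05, -0.03],
])
ranking = rank_features(toy_shap, ["feature_a", "feature_b", "feature_c"])
```

Taking the absolute value first matters: a feature that pushes predictions up for some arms and down for others would otherwise average out to near zero and look unimportant.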

What we found, and where this goes

DrugPredict demonstrates that combining molecular cheminformatics with real clinical outcomes data can produce a meaningful, interpretable signal for drug efficacy prediction — even at the scale of publicly available trial data.

Key Takeaways

01

An AUC-ROC of 0.830 shows the classifier reliably separates likely responders from non-responders across held-out trial arms — a strong signal for a proof-of-concept trained on aggregate public data.

02

Patient demographics — particularly age and baseline HbA1c — dominate predictions, consistent with clinical literature. The model learned clinically plausible patterns without domain-specific supervision.

03

As a proof of concept, the project shows that GLP-1 efficacy is predictable from publicly available data alone. Previous work in this area has been limited to predicting the likelihood that a drug is approved, based on approval rates. Our project builds on that approach and differs from it by estimating the impact a drug could actually have on patients.

04

The pipeline generalizes to hypothetical molecules: novel drug candidates designed via RDKit reaction SMARTS can be scored end-to-end using the same inference function, with no retraining required.
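
To illustrate how a hypothetical candidate might be generated with RDKit reaction SMARTS, here is a minimal sketch. The amide-coupling template and the two small reactants are generic textbook examples chosen for brevity, not the reactions or molecules used in the project, and `enumerate_products` is a hypothetical helper name.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Generic amide-coupling template (acid + amine -> amide); an assumed example reaction.
AMIDE_COUPLING = AllChem.ReactionFromSmarts("[C:1](=[O:2])O.[N:3]>>[C:1](=[O:2])[N:3]")


def enumerate_products(acid_smiles: str, amine_smiles: str) -> list[str]:
    """Apply the template and return unique, sanitized product SMILES."""
    acid = Chem.MolFromSmiles(acid_smiles)
    amine = Chem.MolFromSmiles(amine_smiles)
    if acid is None or amine is None:
        raise ValueError("Could not parse one of the reactant SMILES")
    products = set()
    for product_set in AMIDE_COUPLING.RunReactants((acid, amine)):
        for product in product_set:
            try:
                Chem.SanitizeMol(product)
            except Exception:
                continue  # skip products that fail sanitization
            products.add(Chem.MolToSmiles(product))
    return sorted(products)


candidates = enumerate_products("CC(=O)O", "NC")  # acetic acid + methylamine
```

Each product SMILES could then be fingerprinted and passed to the same inference function used for known drugs; that function is not shown here.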

Future Directions

Individual patient-level data

Replacing trial-arm averages with individual EHR records would allow far richer modeling of patient heterogeneity and would likely improve regression R².

Expanded drug classes

Applying the same pipeline to SGLT-2 inhibitors, DPP-4 inhibitors, or insulin analogs would test generalizability beyond GLP-1 agonists.

3D molecular representations

Incorporate the amino-acid sequence of each molecule. GLP-1 agonists are peptides, and the order of their residues can change their effect. Our fingerprint bits captured local substructures but not the full ordering of the molecule, which matters for GLP-1 effectiveness. Modeling that order could enable transformer-based architectures that account for dependencies between sequential groups.
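
The difference between order-free and order-aware encodings can be shown with a toy example. In the sketch below, a residue-count vector stands in for any order-insensitive representation (it is a simplified analogy, not the Morgan fingerprint itself), while integer tokens preserve order in the form a sequence model would consume. The two peptides are short made-up fragments, and the function names and `max_len` value are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
TOKEN_ID = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 is reserved for padding


def composition(seq: str) -> tuple[int, ...]:
    """Order-free encoding: how many times each residue occurs."""
    counts = Counter(seq)
    return tuple(counts.get(aa, 0) for aa in AMINO_ACIDS)


def tokenize(seq: str, max_len: int = 40) -> list[int]:
    """Order-preserving encoding: one integer per residue, right-padded with 0."""
    if len(seq) > max_len:
        raise ValueError("Sequence is longer than max_len")
    return [TOKEN_ID[aa] for aa in seq] + [0] * (max_len - len(seq))


peptide_a = "HAEGTF"  # toy fragment, not a full drug sequence
peptide_b = "FTGEAH"  # same residues in reverse order

same_composition = composition(peptide_a) == composition(peptide_b)
same_tokens = tokenize(peptide_a) == tokenize(peptide_b)
```

Here `same_composition` is true while `same_tokens` is false: the count vector cannot tell the two peptides apart, whereas the token sequence can.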

Prospective clinical validation

Partnering with a clinical site to prospectively score patients before randomization would provide ground-truth validation of the responder predictions.

The researchers

Farhan Quadri
Data Scientist · UC Berkeley MIDS

Data scientist with expertise in real-world evidence, longitudinal analysis, and ML for health data. Currently at Abbott analyzing FreeStyle Libre CGM data.

Kevin Coppa
Data Scientist · UC Berkeley MIDS

Data scientist with 8+ years of experience in healthcare and expertise in applying NLP. Currently at Northwell Health managing the Data Science team.
