Informed Drug Development
A machine learning framework for predicting how effective a drug will be across different patient populations — before clinical trials conclude. GLP-1 receptor agonists for Type 2 Diabetes serve as our proof of concept, demonstrating a generalizable approach to personalized drug efficacy prediction.
Bringing a new drug to market costs over $2.5 billion on average and takes more than a decade. Despite advances in clinical research, the majority of drug candidates fail in late-stage trials — often due to insufficient understanding of which patient populations will respond to treatment.
For GLP-1 receptor agonists in Type 2 Diabetes, this challenge is especially acute. Heterogeneous patient profiles mean that efficacy varies widely, yet traditional trial designs lack the tools to predict responders early and precisely.
"If we can identify patients who will strongly respond to a GLP-1 agonist based on the drug's molecular structure, we can design smarter trials, reduce waste, and get effective treatments to patients faster."
We built an end-to-end pipeline combining chemical structure, pharmacokinetic properties, and clinical outcomes data to train and interpret an XGBoost model that predicts GLP-1 agonist response in Type 2 Diabetes patients.
Our XGBoost model achieved strong predictive performance across held-out test sets, with SHAP analysis revealing interpretable, clinically meaningful feature drivers.
Regressor RMSE: 0.488 percentage points of ΔHbA1c · Trained on 231 GLP-1 trial arms (GroupShuffleSplit by trial ID)
Patient demographics had the most influence on predictions, followed by molecular fingerprints.
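To make the evaluation setup concrete, here is a minimal, standard-library sketch of what a grouped split and an RMSE calculation do. It is not the project's code: the project names scikit-learn's GroupShuffleSplit, and the helper `group_shuffle_split`, the trial IDs, and the ΔHbA1c values below are all hypothetical stand-ins. The point it illustrates is that every arm of a given trial lands entirely in train or entirely in test, so the model is never scored on a trial it has already seen.

```python
import math
import random


def group_shuffle_split(groups, test_fraction=0.2, seed=0):
    """Split row indices so that no group appears in both train and test.

    Hypothetical helper that mimics the idea behind a grouped shuffle split.
    """
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    n_test = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical trial IDs, one entry per trial arm (made-up data).
trial_ids = ["T1", "T1", "T2", "T2", "T2", "T3", "T4", "T4", "T5", "T5"]
train_idx, test_idx = group_shuffle_split(trial_ids, test_fraction=0.2, seed=0)

train_trials = {trial_ids[i] for i in train_idx}
test_trials = {trial_ids[i] for i in test_idx}

# Hypothetical observed vs. predicted change in HbA1c (percentage points).
observed = [-1.0, -1.5]
predicted = [-1.2, -1.1]
error = rmse(observed, predicted)
```

Splitting by trial rather than by arm matters because arms of the same trial share a population and protocol; a random row-level split would leak that shared context into the test set and make the error look better than it is.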
DrugPredict demonstrates that combining molecular cheminformatics with real clinical outcomes data can produce a meaningful, interpretable signal for drug efficacy prediction — even at the scale of publicly available trial data.
An AUC-ROC of 0.830 shows the classifier reliably separates likely responders from non-responders across held-out trial arms — a strong signal for a proof-of-concept trained on aggregate public data.
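One way to read an AUC-ROC value is as the probability that a randomly chosen responder receives a higher score than a randomly chosen non-responder. The sketch below computes AUC directly from that definition using only the standard library. The function `auc_roc`, the labels, and the scores are hypothetical illustrations, not the project's data or evaluation code.

```python
def auc_roc(labels, scores):
    """AUC as the fraction of (responder, non-responder) pairs ranked correctly.

    Ties between a responder and a non-responder count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = responder arm, 0 = non-responder arm.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = auc_roc(labels, scores)  # 8 of 9 pairs are ranked correctly
```

Under this reading, an AUC of 0.830 means roughly 83% of responder/non-responder pairs are ordered correctly, while 0.5 would be no better than chance.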
Patient demographics — particularly age and baseline HbA1c — dominate predictions, consistent with clinical literature. The model learned clinically plausible patterns without domain-specific supervision.
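For readers unfamiliar with how SHAP assigns credit to features, the sketch below computes exact Shapley values for a tiny toy model by enumerating every feature coalition. Everything in it is an assumption made for illustration: `toy_predict`, its coefficients, the feature name `fp_bit_17`, and the patient values are invented and are not the trained XGBoost model. SHAP libraries typically rely on tree-specific or approximate algorithms rather than this brute-force enumeration, which only scales to a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    names = list(x)
    n = len(names)
    phi = {}
    for name in names:
        others = [f for f in names if f != name]
        total = 0.0
        for k in range(n):  # k = size of the coalition that excludes `name`
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                without_f = {f: (x[f] if f in coalition else baseline[f]) for f in names}
                with_f = dict(without_f)
                with_f[name] = x[name]
                total += weight * (predict(with_f) - predict(without_f))
        phi[name] = total
    return phi


def toy_predict(row):
    """Hypothetical linear stand-in for a predicted change in HbA1c."""
    return (
        -1.0
        - 0.4 * (row["baseline_hba1c"] - 8.0)
        - 0.01 * (row["age"] - 55)
        - 0.2 * row["fp_bit_17"]
    )


# Made-up trial arm and reference values.
arm = {"baseline_hba1c": 9.5, "age": 62, "fp_bit_17": 1}
reference = {"baseline_hba1c": 8.0, "age": 55, "fp_bit_17": 0}

phi = shapley_values(toy_predict, arm, reference)
# The attributions sum to the gap between this prediction and the reference one.
gap = toy_predict(arm) - toy_predict(reference)
```

In this toy setup the baseline HbA1c term receives the largest attribution in magnitude, which mirrors the kind of ranking a SHAP summary reports; the actual ranking for the project comes from its trained model, not from this example.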
We demonstrated, as a proof of concept, that GLP-1 efficacy is predictable from publicly available data alone. Previous work in this area has been limited to predicting the likelihood that a drug is approved, based on historical approval rates. Our project builds on and differs from that approach by estimating the impact a drug could actually have on patients.
The pipeline generalizes to hypothetical molecules: novel drug candidates designed via RDKit reaction SMARTS can be scored end-to-end using the same inference function, with no retraining required.
Replacing trial-arm averages with individual EHR records would enable far richer heterogeneity modeling and could substantially improve regression R².
Applying the same pipeline to SGLT-2 inhibitors, DPP-4 inhibitors, or insulin analogs would test generalizability beyond GLP-1 agonists.
Incorporate the sequence of molecular structures. GLP-1 agonists are peptides, so the order of their building blocks can change how well they work. Our molecular bits captured local substructures but not the full order of the molecule, which matters for GLP-1 effectiveness. Encoding that order could make it possible to use transformer-based architectures that model dependencies between sequential groups.
Partnering with a clinical site to prospectively score patients before randomization would provide ground-truth validation of the responder predictions.
Data scientist with expertise in real-world evidence, longitudinal analysis, and ML for health data. Currently at Abbott analyzing FreeStyle Libre CGM data.
Data scientist with 8+ years of experience in healthcare and expertise in applying NLP. Currently at Northwell Health managing the Data Science team.