
junclemente/heart-disease-prediction-multi-cohort

 
 


Machine Learning Beyond the Cleveland Dataset: Cross-Cohort Coronary Disease Prediction using Expanded Clinical Features

ADS503 - Applied Predictive Modeling

Team 1

Installation

To get started with this project, clone the repository onto your local machine using the commands below:

> git clone https://github.com/gw-00/ads503_project_g1.git  
> cd ads503_project_g1

Contributors

Methods

  • Pre-processing
  • Exploratory Data Analysis
  • Data visualization
  • Statistical Modeling and Machine Learning
    • Logistic Regression
    • Random Forest
    • PLS Discriminant Analysis
    • k-Nearest Neighbors
    • Penalized Regression
      • Lasso Penalization
      • Ridge Penalization
      • Elastic Net Model
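
For reference, the three penalized variants listed above can be written as a single objective. The form below follows the elastic net parametrization used by the glmnet R package; whether this project fits its models with glmnet is an assumption, and the symbols (λ, α, β) are introduced here only for illustration.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{n}\sum_{i=1}^{n}\Big[\,y_i \log p_i + (1-y_i)\log(1-p_i)\Big]
+ \lambda\Big[\alpha\,\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Big],
\qquad
p_i = \frac{1}{1+e^{-(\beta_0 + x_i^\top\beta)}}
```

Setting α = 1 gives the lasso penalty, α = 0 gives ridge, and 0 < α < 1 gives the elastic net; λ ≥ 0 controls the overall penalty strength.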

Technologies

  • RStudio
  • Quarto
  • R
  • Generative AI
    • ChatGPT

Abstract

The goal of this study was to develop a predictive model for identifying coronary heart disease using patient data from four medical centers around the globe. Using the complete 76-feature heart disease data set from the UCI Machine Learning Repository, records from the Veterans Administration in Long Beach, the Hungarian Institute of Cardiology, the University Hospital in Zurich, and the Cleveland Clinic were merged, pre-processed, and then modeled. Exploratory data analysis (EDA), data cleaning, and imputation were performed to handle extensive missing values and highly correlated features, so that neither degraded model performance and so that the bias and variance of the resulting models were kept low. Multiple classification models were developed: Logistic Regression, Random Forest, Partial Least Squares Discriminant Analysis (PLS-DA), K-Nearest Neighbors (KNN), and Penalized Logistic Regression (Lasso, Ridge, and Elastic Net).
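
The merge, imputation, and correlation-filtering steps described above can be sketched roughly as follows. This is an illustrative, standard-library Python sketch and not the project's R code: the toy rows, the feature names (`age`, `chol`, `chol_x2`), the `-9` missing-value sentinel, median imputation, and the 0.9 correlation cutoff are all assumptions made for the example.

```python
from statistics import median

MISSING = -9  # assumed missing-value sentinel; adjust to the actual raw files


def merge_cohorts(cohorts):
    """Stack per-site row lists into one list, tagging each row with its site."""
    merged = []
    for site, rows in cohorts.items():
        for row in rows:
            merged.append({**row, "cohort": site})  # copy so inputs stay untouched
    return merged


def impute_median(rows, feature):
    """Replace missing values of one numeric feature with the observed median."""
    observed = [r[feature] for r in rows if r[feature] != MISSING]
    fill = median(observed)
    for r in rows:
        if r[feature] == MISSING:
            r[feature] = fill
    return fill


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def drop_correlated(rows, features, cutoff=0.9):
    """Greedily keep features whose |r| with every kept feature is below cutoff."""
    kept = []
    for f in features:
        col = [r[f] for r in rows]
        if all(abs(pearson(col, [r[k] for r in rows])) < cutoff for k in kept):
            kept.append(f)
    return kept


# Toy data only (not real patient records); chol_x2 duplicates chol on purpose.
toy = {
    "cleveland": [
        {"age": 63, "chol": 230, "chol_x2": 460},
        {"age": 58, "chol": 280, "chol_x2": 560},
    ],
    "hungarian": [{"age": 41, "chol": 200, "chol_x2": 400}],
    "switzerland": [{"age": 54, "chol": MISSING, "chol_x2": MISSING}],
    "long_beach_va": [{"age": 37, "chol": 250, "chol_x2": 500}],
}

merged = merge_cohorts(toy)
fills = {f: impute_median(merged, f) for f in ("chol", "chol_x2")}
kept = drop_correlated(merged, ["age", "chol", "chol_x2"])
print(fills, kept)
```

Keeping a `cohort` column on every merged row is a design choice in this sketch: it makes it possible to check later whether a model behaves differently across sites.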

Problem Statement

Previous predictive models for coronary heart disease have focused on a simplified 14-feature data set and have typically used only the Cleveland subset. These approaches offer accessibility and a complete data set for modeling, but they omit the 62 other potentially valuable predictors in the full data set.

Goal

Enhance the predictive accuracy of coronary heart disease classification by employing a richer, more detailed feature set, with the aim of improving performance metrics across the multiple classification algorithms developed.

Non-goals

  1. Individual Health Tracking: Data collected will not involve personally identifiable health data.
  2. Medical or Clinical Recommendations: Medical treatments, vaccination protocols, and individual health interventions will not be prescribed or evaluated.

Data Sources

Acknowledgements

Portions of this codebase and documentation were developed with assistance from Generative AI, ChatGPT (OpenAI), June 2025.

References

Presentations and Projects

  1. Project Presentation:
  2. Project Slides:
  3. Document Link:
  4. Project Repo: https://github.com/gw-00/ads503_project_g1

About

Comparative modeling of coronary heart disease using expanded clinical features across multiple international patient cohorts.
