End-to-end machine learning projects spanning tabular ML, computer vision feature engineering, and retrieval-augmented generation. Each sub-directory is a self-contained project with its own data, code, and documentation.
Three end-to-end scikit-learn pipelines covering the canonical supervised learning tasks on real-world tabular data:
- California Housing — regression on the 1990 census housing block dataset (20K rows). Custom feature engineering (distance-from-anchor city centers), log-scaling, KNN imputation, KNN regression with grid-searched hyperparameters. Test MAE ≈ $44,084.
- Diabetes Health Indicators — binary classification on a 70K-row CDC BRFSS subset. Decision tree, KNN, logistic regression baselines + GridSearchCV. Test accuracy ≈ 0.74.
- Dry Bean — 7-class classification on UCI dry bean morphological features. Same model family + GridSearchCV. Test accuracy ≈ 0.90.
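The distance-from-anchor features mentioned for California Housing can be sketched as follows. The README does not say which cities or which distance metric the project uses, so the `ANCHORS` coordinates, the haversine formula, and the function names here are all illustrative assumptions rather than the project's actual code.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Hypothetical anchor points (approximate city-center coordinates).
ANCHORS = {
    "los_angeles": (34.05, -118.24),
    "san_francisco": (37.77, -122.42),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_features(lat, lon):
    """Return one distance feature per anchor for a single housing block."""
    return {
        f"dist_to_{name}_km": haversine_km(lat, lon, a_lat, a_lon)
        for name, (a_lat, a_lon) in ANCHORS.items()
    }


# Example: a block located at the Los Angeles anchor itself.
features = distance_features(34.05, -118.24)
```

Features like these give a distance-based model such as KNN a more direct signal than raw latitude and longitude, which is one plausible reason to engineer them.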
Highlights: EDA, train/val/test splitting, ColumnTransformer preprocessing pipelines, custom transformers, model comparison, and joblib-serialized end-to-end pipelines.
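As a rough illustration of how these pieces fit together, the sketch below chains a ColumnTransformer, log-scaling, KNN imputation, a grid-searched KNN regressor, and joblib serialization. The data is synthetic, and the column names, parameter grid, and file name are assumptions made for the example; the notebooks may be organized differently.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "median_income": rng.lognormal(1.0, 0.5, n),
    "total_rooms": rng.lognormal(7.0, 0.8, n),
    "latitude": rng.uniform(32.5, 42.0, n),
})
y = 50_000 * X["median_income"] + rng.normal(0, 10_000, n)
X.loc[rng.choice(n, 10, replace=False), "total_rooms"] = np.nan  # missing values

# Log-scale the skewed columns, then standardize everything.
log_then_scale = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("skewed", log_then_scale, ["median_income", "total_rooms"]),
    ("other", StandardScaler(), ["latitude"]),
])

# Impute after scaling so neighbour distances use comparable units.
model = Pipeline([
    ("preprocess", preprocess),
    ("impute", KNNImputer(n_neighbors=5)),
    ("knn", KNeighborsRegressor()),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
search = GridSearchCV(
    model,
    param_grid={"knn__n_neighbors": [3, 5, 9],
                "knn__weights": ["uniform", "distance"]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X_train, y_train)

# Serialize the whole fitted pipeline, then reload it for inference.
path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(search.best_estimator_, path)
restored = joblib.load(path)
predictions = restored.predict(X_test)
```

Because preprocessing lives inside the pipeline, the reloaded object accepts raw feature frames directly, and cross-validation refits the scalers and imputer on each training fold.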
Multi-class classification of semiconductor wafer-map failure patterns (Center, Edge-Loc, Scratch, Donut, Near-full) on a subset of the WM-811K dataset (2,746 real-world wafers).
Ten hand-engineered morphological features are computed from the salient defect region with scikit-image: area ratio, perimeter ratio, max/min distance from center, major/minor axis ratios, solidity, eccentricity, yield loss, and edge yield loss. A Decision Tree and an RBF-kernel SVC (with StandardScaler) are trained on these features.
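To give a feel for this kind of feature, here is a simplified sketch of a few of them. It uses plain NumPy instead of the scikit-image region analysis the project relies on, and it treats every failing die as the defect region rather than isolating the salient one. The pixel encoding (0 = off-wafer, 1 = passing die, 2 = failing die), the `edge_fraction` parameter, and the exact feature definitions are assumptions for illustration.

```python
import numpy as np


def wafer_features(wafer_map, edge_fraction=0.2):
    """Compute a few illustrative features from a 2-D wafer map.

    Assumed encoding: 0 = off-wafer, 1 = passing die, 2 = failing die.
    The map must contain at least one on-wafer die.
    """
    on_wafer = wafer_map > 0
    failing = wafer_map == 2

    # Distance of every die from the geometric center of the array,
    # normalized by the farthest on-wafer die.
    rows, cols = np.indices(wafer_map.shape)
    center_r = (wafer_map.shape[0] - 1) / 2.0
    center_c = (wafer_map.shape[1] - 1) / 2.0
    dist = np.hypot(rows - center_r, cols - center_c)
    norm_dist = dist / dist[on_wafer].max()

    # Hypothetical edge definition: the outer `edge_fraction` of the radius.
    edge = on_wafer & (norm_dist >= 1.0 - edge_fraction)

    n_fail = int(failing.sum())
    return {
        "yield_loss": n_fail / int(on_wafer.sum()),
        "edge_yield_loss": int((failing & edge).sum()) / int(edge.sum()),
        "max_dist_from_center": float(norm_dist[failing].max()) if n_fail else 0.0,
        "min_dist_from_center": float(norm_dist[failing].min()) if n_fail else 0.0,
    }


# Toy 5x5 wafer with a single failing die at the center.
toy = np.ones((5, 5), dtype=int)
toy[2, 2] = 2
toy_features = wafer_features(toy)
```

A Center-type pattern like this toy example shows small distances from center and zero edge yield loss, while an Edge-Loc pattern would show the opposite, which is what makes such features separable by a tree or an SVC.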
Validation accuracy: 92.7% (Decision Tree), 96.0% (SVC). Both models also clear the 80% accuracy threshold on the test set.
Local retrieval-augmented generation app for PDF question-answering. Drop a PDF into the sidebar and ask questions; the app retrieves the most relevant page-level snippets and grounds the LLM's answer in them.
Stack:
- UI: Streamlit
- Embeddings: `sentence-transformers/all-mpnet-base-v2` (768-d)
- Vector store: in-memory dot-product retrieval
- Generation: Mistral 7B via Ollama, with token streaming
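In-memory dot-product retrieval amounts to one matrix-vector product followed by a sort. The sketch below shows the idea with tiny 2-d vectors in place of the 768-d embeddings; the function name and return format are hypothetical, not the app's actual interface.

```python
import numpy as np


def top_k_by_dot_product(query_vec, doc_matrix, k=3):
    """Return (row_index, score) pairs for the k highest dot-product scores.

    doc_matrix has one embedding per row; query_vec has the same dimension.
    """
    scores = doc_matrix @ query_vec
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in top]


# Toy "embeddings" for three snippets.
docs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
])
query = np.array([1.0, 0.0])
hits = top_k_by_dot_product(query, docs, k=2)
```

Dot-product ranking matches cosine-similarity ranking only when the embeddings are unit-normalized. Whether the app normalizes its vectors is not stated here, so treat that as something to check in the encoder code.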
Demonstrates a complete RAG pipeline: parse → chunk → embed → retrieve → augment prompt → stream response. Modular design with abstract base classes for the database, encoder, and LLM, so swapping in alternatives (e.g. an OpenAI encoder, a different local model) is straightforward.
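One way to picture the modular design is the sketch below, which wires three abstract interfaces into a retrieve, augment, and stream flow. The class and method names (`Encoder`, `VectorDatabase`, `LLM`, `answer`) are hypothetical, and the concrete classes are toy stand-ins so the example runs without Ollama or an embedding model.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List, Sequence


class Encoder(ABC):
    @abstractmethod
    def encode(self, text: str) -> List[float]: ...


class VectorDatabase(ABC):
    @abstractmethod
    def add(self, vector: Sequence[float], text: str) -> None: ...

    @abstractmethod
    def search(self, vector: Sequence[float], k: int) -> List[str]: ...


class LLM(ABC):
    @abstractmethod
    def stream(self, prompt: str) -> Iterator[str]: ...


class KeywordEncoder(Encoder):
    """Toy stand-in for an embedding model: counts fixed vocabulary terms."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def encode(self, text):
        words = text.lower().split()
        return [float(words.count(term)) for term in self.vocabulary]


class InMemoryDatabase(VectorDatabase):
    """Stores (vector, text) pairs and ranks them by dot product."""

    def __init__(self):
        self._rows = []

    def add(self, vector, text):
        self._rows.append((list(vector), text))

    def search(self, vector, k):
        def score(row):
            return sum(a * b for a, b in zip(row[0], vector))
        ranked = sorted(self._rows, key=score, reverse=True)
        return [text for _, text in ranked[:k]]


class EchoLLM(LLM):
    """Toy stand-in for a streaming model: yields the prompt word by word."""

    def stream(self, prompt):
        for token in prompt.split():
            yield token + " "


def answer(question, encoder, database, llm, k=2):
    """Retrieve the top-k snippets, build a grounded prompt, stream the reply."""
    snippets = database.search(encoder.encode(question), k)
    prompt = "Context:\n" + "\n".join(snippets) + "\n\nQuestion: " + question
    return llm.stream(prompt)


encoder = KeywordEncoder(["wafer", "housing"])
database = InMemoryDatabase()
for snippet in ["wafer maps show defects", "housing prices vary"]:
    database.add(encoder.encode(snippet), snippet)

reply = "".join(answer("what about wafer defects?", encoder, database, EchoLLM(), k=1))
```

Because `answer` depends only on the abstract interfaces, replacing `EchoLLM` with a class that calls a real model, or `KeywordEncoder` with a sentence-transformers wrapper, would not require changes to the retrieval flow.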
| Layer | Tools |
|---|---|
| Core | Python 3.10, NumPy, pandas |
| Classical ML | scikit-learn, scikit-image |
| Deep / NLP | PyTorch, Hugging Face Transformers, sentence-transformers |
| RAG infra | Ollama (Mistral 7B), Streamlit, pdfminer.six |
| Tooling | Jupyter, joblib, matplotlib, seaborn, plotly |
Each project has its own README.md with project-specific setup, but the common steps are:
git clone https://github.com/Beeferky/applied-ml-projects.git
cd applied-ml-projects
# Recommended: one virtual env per project to keep dependencies isolated
python3 -m venv .venv
source .venv/bin/activate
pip install -r <project-name>/requirements.txt

For the tabular and wafer projects, opening the notebook in Jupyter / VS Code and running cells top-to-bottom is enough; the data is included in each project's data/ directory.
For the RAG app, additional setup is needed (Ollama install + model pull). See its README.
All datasets used in these projects are publicly available:
- California Housing Prices — Kaggle (originally from Pace & Barry, 1997)
- Diabetes Health Indicators (CDC BRFSS 2015) — Kaggle
- Dry Bean Dataset — UCI ML Repository
- WM-811K Wafer Map Dataset — MIR Lab