Applied ML Projects

End-to-end machine learning projects spanning tabular ML, computer vision feature engineering, and retrieval-augmented generation. Each sub-directory is a self-contained project with its own data, code, and documentation.

Projects

`tabular-ml-pipelines/`

Three end-to-end scikit-learn pipelines covering the canonical supervised learning tasks on real-world tabular data:

California Housing — regression on the 1990 census housing block dataset (20K rows). Custom feature engineering (distance-from-anchor city centers), log-scaling, KNN imputation, KNN regression with grid-searched hyperparameters. Test MAE ≈ $44,084.
Diabetes Health Indicators — binary classification on a 70K-row CDC BRFSS subset. Decision tree, KNN, logistic regression baselines + GridSearchCV. Test accuracy ≈ 0.74.
Dry Bean — 7-class classification on UCI dry bean morphological features. Same model family + GridSearchCV. Test accuracy ≈ 0.90.

Highlights: EDA, train/val/test splitting, ColumnTransformer preprocessing pipelines, custom transformers, model comparison, and joblib-serialized end-to-end pipelines.
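
As a rough illustration of how these pieces can fit together, the sketch below chains a custom transformer, a ColumnTransformer, KNN imputation, and a grid-searched KNN regressor. It is not the project's actual code: the `DistanceToAnchors` class, the anchor coordinates, the synthetic data, and the parameter grid are all assumptions made for the example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


class DistanceToAnchors(TransformerMixin, BaseEstimator):
    """Append the straight-line distance (in degrees) to each anchor point."""

    def __init__(self, anchors=((34.05, -118.24), (37.77, -122.42))):
        # Hypothetical (lat, lon) anchors, roughly Los Angeles and San Francisco.
        self.anchors = anchors

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        anchors = np.asarray(self.anchors, dtype=float)
        diff = X[:, None, :] - anchors[None, :, :]  # shape: (rows, anchors, 2)
        return np.hstack([X, np.sqrt((diff ** 2).sum(axis=2))])


# Synthetic stand-in data: latitude, longitude, income (with some gaps).
rng = np.random.default_rng(0)
n = 300
lat = rng.uniform(32.5, 42.0, n)
lon = rng.uniform(-124.3, -114.3, n)
income = rng.lognormal(1.0, 0.5, n)
y = (50_000 * income
     - 10_000 * np.hypot(lat - 34.05, lon + 118.24)
     + rng.normal(0, 5_000, n))
X = np.column_stack([lat, lon, income])
X[rng.random(n) < 0.05, 2] = np.nan  # knock out about 5% of income values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

preprocess = ColumnTransformer([
    ("geo", DistanceToAnchors(), [0, 1]),         # lat/lon plus anchor distances
    ("log", FunctionTransformer(np.log1p), [2]),  # log-scale the skewed column
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("impute", KNNImputer(n_neighbors=5)),
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": [3, 5, 9],
                "knn__weights": ["uniform", "distance"]},
    scoring="neg_mean_absolute_error",
    cv=3,  # cross-validation folds stand in for a separate validation split
)
search.fit(X_train, y_train)
test_mae = mean_absolute_error(y_test, search.predict(X_test))
print(search.best_params_, round(test_mae))
```

A fitted `search.best_estimator_` could then be persisted with `joblib.dump` and restored with `joblib.load`. When a pipeline contains a custom transformer like this one, the class has to be importable wherever the file is loaded.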

`wafer-defect-classification/`

Multi-class classification of semiconductor wafer-map failure patterns (Center, Edge-Loc, Scratch, Donut, Near-full) on a subset of the WM-811K dataset (2,746 real-world wafers).

Ten hand-engineered morphological features are extracted from the salient defect region with scikit-image: area ratio, perimeter ratio, maximum and minimum distance from center, major- and minor-axis ratios, solidity, eccentricity, yield loss, and edge yield loss. Two classifiers are trained on these features: a Decision Tree and an RBF-kernel SVC (with StandardScaler).
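
To make the feature extraction more concrete, here is a minimal sketch of how a few of these features might be computed with `skimage.measure`. It assumes the usual WM-811K encoding (0 = off-wafer, 1 = passing die, 2 = failing die) and takes the largest connected defect cluster as the "salient region". The function name and the exact feature definitions are assumptions for illustration, not the project's implementation.

```python
import numpy as np
from skimage import measure


def salient_region_features(wafer_map):
    """Return a few morphological features of the largest defect cluster.

    Assumed encoding: 0 = off-wafer, 1 = passing die, 2 = failing die.
    """
    wafer_map = np.asarray(wafer_map)
    on_wafer = wafer_map > 0
    defects = wafer_map == 2

    # Connected-component labelling; connectivity=2 also joins diagonal neighbours.
    labels = measure.label(defects, connectivity=2)
    regions = measure.regionprops(labels)
    if not regions:
        return None  # no failing dies on this wafer
    salient = max(regions, key=lambda region: region.area)

    # Distance from the wafer centre to every die in the salient region,
    # normalised by half the larger map dimension (an assumed radius).
    center = (np.array(wafer_map.shape) - 1) / 2.0
    radius = max(wafer_map.shape) / 2.0
    dist = np.linalg.norm(salient.coords - center, axis=1) / radius

    return {
        "area_ratio": salient.area / on_wafer.sum(),
        "solidity": salient.solidity,
        "eccentricity": salient.eccentricity,
        "max_dist_from_center": dist.max(),
        "min_dist_from_center": dist.min(),
        "yield_loss": defects.sum() / on_wafer.sum(),
    }


# Toy 21x21 wafer: a disk of passing dies with a 5x5 failing block at the centre
# and one isolated failing die near the edge.
size = 21
yy, xx = np.mgrid[:size, :size]
disk = (yy - 10) ** 2 + (xx - 10) ** 2 <= 10 ** 2
demo = np.where(disk, 1, 0)
demo[8:13, 8:13] = 2
demo[2, 10] = 2

features = salient_region_features(demo)
print(features)
```

The isolated die counts toward yield loss but not toward the salient-region features, which is the main reason for separating the two.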

Validation accuracy: 92.7% (Decision Tree), 96.0% (SVC). Both models also clear the 80% accuracy threshold on the test set.
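
A comparison of this kind can be sketched as follows. The data here is a synthetic stand-in (five blob-shaped classes with ten features) generated with `make_blobs`, and the hyperparameters are assumptions, so the resulting scores say nothing about the wafer results reported above.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature table: 5 classes, 10 features.
X, y = make_blobs(n_samples=500, centers=5, n_features=10,
                  cluster_std=2.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Trees split on thresholds, so they do not need feature scaling.
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    # The RBF kernel is distance-based, so features are standardised first.
    "svc_rbf": make_pipeline(StandardScaler(),
                             SVC(kernel="rbf", C=1.0, gamma="scale")),
}

val_accuracy = {
    name: model.fit(X_train, y_train).score(X_val, y_val)
    for name, model in models.items()
}
print(val_accuracy)
```

Wrapping the scaler and the SVC in one pipeline means the scaling statistics are learned from the training split only, which avoids leaking validation data into preprocessing.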

`rag-pdf-assistant/`

Local retrieval-augmented generation app for PDF question-answering. Drop a PDF into the sidebar and ask questions; the app retrieves the most relevant page-level snippets and grounds the LLM's answer in them.

Stack:

UI — Streamlit
Embeddings — sentence-transformers/all-mpnet-base-v2 (768-d)
Vector store — in-memory dot-product retrieval
Generation — Mistral 7B via Ollama, with token streaming
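
In-memory dot-product retrieval can be as small as one matrix multiplication followed by a sort. The sketch below is a guess at the general shape, not the app's code: the `InMemoryVectorStore` class and its method names are hypothetical, and the vectors are 4-dimensional toys rather than 768-d embeddings.

```python
import numpy as np


class InMemoryVectorStore:
    """Keep embeddings in one NumPy matrix and rank snippets by dot product."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.texts = []

    def add(self, texts, embeddings):
        embeddings = np.asarray(embeddings, dtype=np.float32).reshape(-1, self.dim)
        if len(texts) != len(embeddings):
            raise ValueError("texts and embeddings must have the same length")
        self.vectors = np.vstack([self.vectors, embeddings])
        self.texts.extend(texts)

    def search(self, query_embedding, k=3):
        query = np.asarray(query_embedding, dtype=np.float32)
        scores = self.vectors @ query    # one dot product per stored row
        top = np.argsort(-scores)[:k]    # indices of the k highest scores
        return [(self.texts[i], float(scores[i])) for i in top]


store = InMemoryVectorStore(dim=4)
store.add(
    ["page 1: installation", "page 2: configuration", "page 3: troubleshooting"],
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0.6, 0.8, 0]],
)
hits = store.search([0, 0.8, 0.6, 0], k=2)
print(hits)
```

The dot product equals cosine similarity only when the stored and query vectors are unit-length, so embeddings would normally be normalized before being added.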

Demonstrates a complete RAG pipeline: parse → chunk → embed → retrieve → augment prompt → stream response. Modular design with abstract base classes for the database, encoder, and LLM, so swapping in alternatives (e.g. an OpenAI encoder, a different local model) is straightforward.
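
The modular design described above might look roughly like the following. Every class and method name here is hypothetical, and the toy encoder, store, and model exist only so the sketch runs without any model downloads. The real app would plug in sentence-transformers and Ollama behind the same kind of interfaces.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List, Sequence


class Encoder(ABC):
    @abstractmethod
    def encode(self, texts: Sequence[str]) -> List[List[float]]:
        """Map each text to a fixed-length vector."""


class Database(ABC):
    @abstractmethod
    def add(self, texts: Sequence[str], vectors: Sequence[Sequence[float]]) -> None:
        """Store texts alongside their vectors."""

    @abstractmethod
    def search(self, vector: Sequence[float], k: int) -> List[str]:
        """Return the k stored texts most similar to the vector."""


class LLM(ABC):
    @abstractmethod
    def stream(self, prompt: str) -> Iterator[str]:
        """Yield the response piece by piece."""


class KeywordEncoder(Encoder):
    """Toy encoder: counts occurrences of a fixed vocabulary."""

    def __init__(self, vocabulary: Sequence[str]):
        self.vocabulary = list(vocabulary)

    def encode(self, texts):
        rows = []
        for text in texts:
            words = text.lower().split()
            rows.append([float(words.count(term)) for term in self.vocabulary])
        return rows


class ListDatabase(Database):
    """Toy store: linear scan ranked by dot product."""

    def __init__(self):
        self.items = []

    def add(self, texts, vectors):
        self.items.extend(zip(texts, vectors))

    def search(self, vector, k):
        def score(item):
            return sum(a * b for a, b in zip(item[1], vector))

        ranked = sorted(self.items, key=score, reverse=True)
        return [text for text, _ in ranked[:k]]


class CannedLLM(LLM):
    """Toy model: records the prompt and streams a fixed reply word by word."""

    def __init__(self, reply: str):
        self.reply = reply
        self.last_prompt = None

    def stream(self, prompt):
        self.last_prompt = prompt
        for word in self.reply.split():
            yield word + " "


def answer(question, encoder, database, llm, k=2):
    """Retrieve snippets, augment the prompt, and stream the response."""
    context = database.search(encoder.encode([question])[0], k)
    prompt = ("Answer using only this context:\n" + "\n".join(context)
              + "\n\nQuestion: " + question)
    yield from llm.stream(prompt)


pages = ["the warranty lasts two years", "shipping takes five days",
         "returns need a receipt"]
encoder = KeywordEncoder(["warranty", "shipping", "returns", "receipt"])
database = ListDatabase()
database.add(pages, encoder.encode(pages))
llm = CannedLLM("Two years.")
reply = "".join(answer("how long is the warranty", encoder, database, llm, k=1))
print(reply)
```

Because `answer` depends only on the three abstract interfaces, replacing any one component means writing a new subclass and passing it in. The retrieval and prompting logic stays unchanged.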

Tech Stack

| Layer | Tools |
| --- | --- |
| Core | Python 3.10, NumPy, pandas |
| Classical ML | scikit-learn, scikit-image |
| Deep / NLP | PyTorch, Hugging Face Transformers, sentence-transformers |
| RAG infra | Ollama (Mistral 7B), Streamlit, pdfminer.six |
| Tooling | Jupyter, joblib, matplotlib, seaborn, plotly |

Running the Projects

Each project has its own README.md with project-specific setup, but the common steps are:

git clone https://github.com/Beeferky/applied-ml-projects.git
cd applied-ml-projects

# Recommended: one virtual env per project to keep dependencies isolated
python3 -m venv .venv
source .venv/bin/activate
pip install -r <project-name>/requirements.txt

For the tabular and wafer projects, opening the notebook in Jupyter / VS Code and running cells top-to-bottom is enough — data is included in each project's data/ directory.

For the RAG app, additional setup is needed (Ollama install + model pull). See its README.

Datasets

All datasets used in these projects are publicly available:

California Housing Prices — Kaggle (originally from Pace & Barry, 1997)
Diabetes Health Indicators (CDC BRFSS 2015) — Kaggle
Dry Bean Dataset — UCI ML Repository
WM-811K Wafer Map Dataset — MIR Lab
