Beeferky/applied-ml-projects

Applied ML Projects

End-to-end machine learning projects spanning tabular ML, computer vision feature engineering, and retrieval-augmented generation. Each sub-directory is a self-contained project with its own data, code, and documentation.

Projects

The tabular ML project contains three end-to-end scikit-learn pipelines covering regression, binary classification, and multi-class classification on real-world tabular data:

  • California Housing: regression on the 1990 census block-group housing dataset (about 20K rows). Custom feature engineering (distance from anchor city centers), log-scaling, KNN imputation, and KNN regression with grid-searched hyperparameters. Test MAE ≈ $44,084.
  • Diabetes Health Indicators: binary classification on a 70K-row CDC BRFSS subset. Decision tree, KNN, and logistic regression baselines, tuned with GridSearchCV. Test accuracy ≈ 0.74.
  • Dry Bean: 7-class classification on the UCI Dry Bean morphological features. Same model family, tuned with GridSearchCV. Test accuracy ≈ 0.90.
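The California Housing steps listed above (a distance-from-anchor feature, KNN imputation, and a grid-searched KNN regressor) could be wired together roughly as follows. This is a minimal sketch on synthetic data, not the project's code: the anchor coordinates, the `AnchorDistance` class name, the column layout, and the hyperparameter grid are all assumptions made for illustration, and the log-scaling step is omitted.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical anchors as (lat, lon); the project's actual anchor cities are not stated.
DEFAULT_ANCHORS = ((34.05, -118.24),   # Los Angeles
                   (37.77, -122.42))   # San Francisco


class AnchorDistance(BaseEstimator, TransformerMixin):
    """Append the haversine distance (km) from each row to every anchor.

    Assumes columns 0 and 1 of X hold latitude and longitude in degrees.
    """

    def __init__(self, anchors=DEFAULT_ANCHORS):
        self.anchors = anchors

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        anchors = np.asarray(self.anchors, dtype=float)
        lat, lon = np.radians(X[:, [0]]), np.radians(X[:, [1]])
        a_lat, a_lon = np.radians(anchors[:, 0]), np.radians(anchors[:, 1])
        h = (np.sin((lat - a_lat) / 2) ** 2
             + np.cos(lat) * np.cos(a_lat) * np.sin((lon - a_lon) / 2) ** 2)
        dist_km = 2 * 6371.0 * np.arcsin(np.sqrt(h))  # shape (n_rows, n_anchors)
        return np.hstack([X, dist_km])


# Synthetic stand-in for the housing table: lat, lon, income (with some gaps).
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([rng.uniform(32.5, 42.0, n),
                     rng.uniform(-124.3, -114.3, n),
                     rng.lognormal(1.0, 0.4, n)])
y = 50_000 * X[:, 2] + rng.normal(0, 10_000, n)
X[rng.random(n) < 0.05, 2] = np.nan  # missing incomes for the imputer to fill

pipe = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),
    ("geo", AnchorDistance()),
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])
param_grid = {"knn__n_neighbors": [3, 5, 9],
              "knn__weights": ["uniform", "distance"]}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error", cv=3)
search.fit(X_train, y_train)
test_mae = float(np.mean(np.abs(search.predict(X_test) - y_test)))
print(search.best_params_, round(test_mae))
```

Keeping the imputer, the custom transformer, and the scaler inside the `Pipeline` means each cross-validation fold refits them on its own training split, which avoids leaking validation rows into preprocessing.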

Highlights: EDA, train/val/test splitting, ColumnTransformer preprocessing pipelines, custom transformers, model comparison, and joblib-serialized end-to-end pipelines.
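As an illustration of the highlights above, the sketch below shows one way to combine a train/val/test split, a `ColumnTransformer` preprocessing step, and joblib serialization of the whole fitted pipeline. The data is synthetic and the column roles (two numeric columns, one integer-coded categorical column), the 60/20/20 split, and the choice of logistic regression are assumptions, not details taken from the projects.

```python
import io

import joblib
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic table: columns 0-1 are numeric, column 2 is an integer-coded category.
rng = np.random.default_rng(42)
n = 600
bmi = rng.normal(28, 5, n)
age = rng.uniform(20, 80, n)
activity = rng.integers(0, 3, n)
X = np.column_stack([bmi, age, activity])
y = (bmi + 0.1 * age + rng.normal(0, 2, n) > 33).astype(int)

# 60/20/20 split: carve out the test set first, then split the rest into train/val.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0, stratify=y_tmp)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), [0, 1]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
val_acc = model.score(X_val, y_val)

# Serialize preprocessing and model together; a file path works the same way.
buffer = io.BytesIO()
joblib.dump(model, buffer)
buffer.seek(0)
restored = joblib.load(buffer)
print(f"validation accuracy: {val_acc:.3f}")
```

Because the serialized object is the full pipeline, the restored model accepts raw feature rows and applies the same scaling and encoding it was fitted with.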

The wafer defect project is a multi-class classifier for semiconductor wafer-map failure patterns (Center, Edge-Loc, Scratch, Donut, Near-full), trained on a subset of the WM-811K dataset (2,746 real-world wafers).

Ten morphological features are hand-engineered from the salient defect region using scikit-image: area ratio, perimeter ratio, max/min distance from center, major/minor axis ratios, solidity, eccentricity, yield loss, and edge yield loss. Two classifiers are trained on them: a Decision Tree and an RBF-kernel SVC (with StandardScaler).
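To make the feature extraction concrete, here is a rough sketch of how a few of these features might be computed with `skimage.measure.regionprops`. It assumes the usual WM-811K encoding (0 = off-wafer, 1 = passing die, 2 = failing die) and takes the largest connected failing region as the "salient" one. The function name and the exact feature definitions (for example, normalizing distances by the wafer radius) are assumptions, and only a subset of the ten features is shown.

```python
import numpy as np
from skimage import measure


def salient_region_features(wafer_map):
    """Return a dict of morphological features, or None if no die fails."""
    wafer_map = np.asarray(wafer_map)
    failing = wafer_map == 2
    labels = measure.label(failing, connectivity=2)  # 8-connected components
    regions = measure.regionprops(labels)
    if not regions:
        return None
    region = max(regions, key=lambda r: r.area)  # largest failing region

    n_dies = np.count_nonzero(wafer_map)  # passing + failing dies
    center = (np.array(wafer_map.shape) - 1) / 2.0
    wafer_radius = np.linalg.norm(np.argwhere(wafer_map > 0) - center, axis=1).max()
    dists = np.linalg.norm(region.coords - center, axis=1)

    return {
        "area_ratio": region.area / n_dies,
        "perimeter_ratio": region.perimeter / (2 * np.pi * wafer_radius),
        "max_dist_ratio": dists.max() / wafer_radius,
        "min_dist_ratio": dists.min() / wafer_radius,
        "solidity": region.solidity,
        "eccentricity": region.eccentricity,
        "yield_loss": failing.sum() / n_dies,
    }


# Synthetic 41x41 wafer with a "Center"-style defect of radius 5 dies.
size = 41
yy, xx = np.mgrid[:size, :size]
c = (size - 1) / 2
r = np.hypot(yy - c, xx - c)
wafer = np.where(r <= 20, 1, 0)
wafer[r <= 5] = 2

feats = salient_region_features(wafer)
print({name: round(float(value), 3) for name, value in feats.items()})
```

Normalizing by wafer size keeps the features comparable across wafer maps of different resolutions, which matters because WM-811K maps vary in shape.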

Validation accuracy: 92.7% (Decision Tree), 96.0% (SVC). Both models pass the 80% test-set threshold.
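The two-model comparison can be sketched as below. The data here is synthetic (`make_classification` with 10 features and 5 classes stands in for the wafer feature table), so the printed accuracies have no relation to the figures reported above, and the default hyperparameters are an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 10 features, 5 classes (not the WM-811K features).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=6, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, class_sep=2.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    # Trees split on per-feature thresholds, so no scaling is needed.
    "decision_tree": DecisionTreeClassifier(random_state=0),
    # The RBF kernel depends on distances, so features are standardized first.
    "svc_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}
val_accuracy = {
    name: model.fit(X_train, y_train).score(X_val, y_val)
    for name, model in models.items()
}
for name, acc in val_accuracy.items():
    print(f"{name}: {acc:.3f}")
```

Wrapping the scaler and the SVC in one pipeline ensures the scaling statistics come from the training split only.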

The RAG PDF assistant is a local retrieval-augmented generation app for PDF question answering. Drop a PDF into the sidebar and ask questions; the app retrieves the most relevant page-level snippets and grounds the LLM's answer in them.

Stack:

  • UI: Streamlit
  • Embeddings: sentence-transformers/all-mpnet-base-v2 (768-d)
  • Vector store: in-memory dot-product retrieval
  • Generation: Mistral 7B via Ollama, with token streaming

Demonstrates a complete RAG pipeline: parse → chunk → embed → retrieve → augment prompt → stream response. Modular design with abstract base classes for the database, encoder, and LLM, so swapping in alternatives (e.g. an OpenAI encoder, a different local model) is straightforward.
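The embed → retrieve → augment → stream portion of that flow, with abstract base classes for the three swappable components, might look roughly like this. Every class and function name below is hypothetical. `HashingEncoder` is a toy stand-in for the sentence-transformers model and `EchoLLM` is a stub in place of a streamed Ollama call, so the sketch runs without any model downloads. PDF parsing is skipped: `pages` plays the role of already-extracted page-level chunks.

```python
import hashlib
import re
from abc import ABC, abstractmethod
from typing import Iterator

import numpy as np


class Encoder(ABC):
    @abstractmethod
    def encode(self, texts: list[str]) -> np.ndarray: ...


class VectorStore(ABC):
    @abstractmethod
    def add(self, texts: list[str], vectors: np.ndarray) -> None: ...

    @abstractmethod
    def search(self, query_vector: np.ndarray, k: int) -> list[tuple[str, float]]: ...


class LLM(ABC):
    @abstractmethod
    def stream(self, prompt: str) -> Iterator[str]: ...


class HashingEncoder(Encoder):
    """Toy encoder: hashed bag-of-words, L2-normalized (not a real embedding model)."""

    def __init__(self, dim: int = 512):
        self.dim = dim

    def encode(self, texts):
        out = np.zeros((len(texts), self.dim))
        for i, text in enumerate(texts):
            for token in re.findall(r"\w+", text.lower()):
                idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % self.dim
                out[i, idx] += 1.0
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        return out / np.maximum(norms, 1e-12)


class InMemoryStore(VectorStore):
    """Keeps all vectors in one array and ranks by dot product."""

    def __init__(self):
        self.texts, self.vectors = [], None

    def add(self, texts, vectors):
        self.texts.extend(texts)
        self.vectors = vectors if self.vectors is None else np.vstack([self.vectors, vectors])

    def search(self, query_vector, k=2):
        scores = self.vectors @ query_vector
        top = np.argsort(scores)[::-1][:k]
        return [(self.texts[i], float(scores[i])) for i in top]


class EchoLLM(LLM):
    """Stub generator; a real implementation would stream tokens from a local model."""

    def stream(self, prompt):
        for word in f"(stub) received {len(prompt)} prompt characters".split():
            yield word + " "


def build_prompt(question: str, snippets: list[str]) -> str:
    context = "\n\n".join(snippets)
    return (f"Answer using only the context below.\n\nContext:\n{context}"
            f"\n\nQuestion: {question}\nAnswer:")


pages = [
    "KNN regression predicts house prices from nearby blocks.",
    "Wafer maps show defect patterns such as Donut and Scratch.",
    "Streamlit renders the sidebar where PDFs are uploaded.",
]
encoder, store, llm = HashingEncoder(), InMemoryStore(), EchoLLM()
store.add(pages, encoder.encode(pages))

question = "Which defect patterns appear on wafer maps?"
results = store.search(encoder.encode([question])[0], k=2)
prompt = build_prompt(question, [text for text, _ in results])
answer = "".join(llm.stream(prompt))
print(results[0][0])
print(answer)
```

Because the vectors are L2-normalized, the dot product used for ranking is equivalent to cosine similarity. Swapping in a different encoder or model only requires a new subclass of the relevant base class.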

Tech Stack

| Layer | Tools |
| --- | --- |
| Core | Python 3.10, NumPy, pandas |
| Classical ML | scikit-learn, scikit-image |
| Deep / NLP | PyTorch, Hugging Face Transformers, sentence-transformers |
| RAG infra | Ollama (Mistral 7B), Streamlit, pdfminer.six |
| Tooling | Jupyter, joblib, matplotlib, seaborn, plotly |

Running the Projects

Each project has its own README.md with project-specific setup, but the common steps are:

git clone https://github.com/Beeferky/applied-ml-projects.git
cd applied-ml-projects

# Recommended: one virtual env per project to keep dependencies isolated
python3 -m venv .venv
source .venv/bin/activate
pip install -r <project-name>/requirements.txt

For the tabular and wafer projects, opening the notebook in Jupyter / VS Code and running cells top-to-bottom is enough — data is included in each project's data/ directory.

For the RAG app, additional setup is needed (Ollama install + model pull). See its README.

Datasets

All datasets used in these projects are publicly available:

  • California Housing Prices: Kaggle (originally from Pace & Barry, 1997)
  • Diabetes Health Indicators (CDC BRFSS 2015): Kaggle
  • Dry Bean Dataset: UCI ML Repository
  • WM-811K Wafer Map Dataset: MIR Lab
