chonzadaniel/README.md

Salary Classification Web Application (Flask + Machine Learning)

A machine learning web application that predicts whether a person's annual salary is above $50K or at most $50K, based on demographic, education, and employment attributes.

The model is trained using Scikit-learn pipelines and deployed locally using Flask, allowing users to input their information through a web interface and receive real-time predictions.


Project Overview

This project demonstrates an end-to-end machine learning workflow, including:

  • Data exploration and visualization
  • Feature preprocessing
  • Machine learning model training
  • Hyperparameter tuning
  • Model comparison
  • Pipeline serialization
  • Web deployment using Flask
  • Interactive user interface for predictions

The final deployed model is a tuned Random Forest Classifier integrated within a preprocessing pipeline.


Problem Statement

Predict whether an individual's annual salary exceeds $50K based on demographic and employment attributes.

This is a binary classification problem where the target variable is:

salary ∈ {<=50K, >50K}


Machine Learning Pipeline

The project implements a Scikit-learn pipeline to ensure consistent preprocessing during both training and prediction.

Pipeline Components

  1. Feature separation

    • Numerical features
    • Categorical features
  2. Numerical preprocessing

    • StandardScaler
  3. Categorical preprocessing

    • OneHotEncoder
  4. Feature transformation

    • ColumnTransformer
  5. Model training

    • RandomForestClassifier
  6. Pipeline serialization

    • Saved using joblib
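The components above can be sketched roughly as follows. This is a minimal illustration, not the project's actual training code: the column names come from the Dataset Features table in this README, while the numerical/categorical split, the step names (`preprocess`, `model`), the `handle_unknown="ignore"` setting, the hyperparameters, and the helper names `build_pipeline` and `save_pipeline` are all assumptions.

```python
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed split of the README's feature list into numerical and categorical columns.
NUMERICAL = ["age", "fnlwgt", "education_num",
             "capital_gain", "capital_loss", "hours_per_week"]
CATEGORICAL = ["workclass", "education", "marital_status", "occupation",
               "relationship", "race", "sex", "native_country"]


def build_pipeline(random_state=42):
    """Build an unfitted preprocessing + Random Forest pipeline (hypothetical helper)."""
    preprocessor = ColumnTransformer(
        transformers=[
            # Scale numerical columns to zero mean and unit variance.
            ("num", StandardScaler(), NUMERICAL),
            # One-hot encode categorical columns; categories unseen during
            # training are encoded as all zeros instead of raising an error.
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ]
    )
    return Pipeline(
        steps=[
            ("preprocess", preprocessor),
            ("model", RandomForestClassifier(n_estimators=100,
                                             random_state=random_state)),
        ]
    )


def save_pipeline(pipeline, path):
    """Serialize the fitted pipeline with joblib (hypothetical helper)."""
    joblib.dump(pipeline, path)
```

Because the preprocessing steps live inside the pipeline, a call such as `joblib.load(path).predict(frame)` applies the same scaling and encoding that were fitted during training.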

Dataset Features

Feature           Description
age               Age of the individual
workclass         Type of employment
fnlwgt            Final sampling weight
education         Highest education level
education_num     Total years of education
marital_status    Marital status
occupation        Job occupation
relationship      Family relationship
race              Race category
sex               Gender
capital_gain      Income from investments
capital_loss      Loss from investments
hours_per_week    Working hours per week
native_country    Country of origin

Target Variable

salary: <=50K or >50K


Model Comparison

Two machine learning models were evaluated:

Model                  Description
Logistic Regression    Linear baseline classifier
Random Forest          Ensemble tree-based classifier

Evaluation Metrics

  • Accuracy
  • Precision
  • Recall
  • F1 Score
  • ROC-AUC
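As a rough sketch, the five metrics listed above could be computed for one model with `sklearn.metrics`. The helper name `evaluate` and the choice of `>50K` as the positive class are assumptions; the README does not state which class was treated as positive.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)


def evaluate(y_true, y_pred, y_score, positive_label=">50K"):
    """Compute the listed metrics for one model (hypothetical helper).

    y_pred holds predicted labels; y_score holds the predicted probability
    of the positive class for each row.
    """
    # ROC-AUC needs a binary indicator of the positive class.
    y_true_binary = [int(label == positive_label) for label in y_true]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, pos_label=positive_label),
        "recall": recall_score(y_true, y_pred, pos_label=positive_label),
        "f1": f1_score(y_true, y_pred, pos_label=positive_label),
        "roc_auc": roc_auc_score(y_true_binary, y_score),
    }
```

Running the same function on the predictions of both models gives two dictionaries that can be compared metric by metric.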

Final Model

The tuned Random Forest model achieved the best performance and was selected for deployment.
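The README does not say which search strategy or parameter grid produced the tuned model. One common approach is a cross-validated grid search, sketched below; the use of `GridSearchCV`, the grid values, the scoring metric, and the helper name `tune_random_forest` are all assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def tune_random_forest(X, y, param_grid=None, scoring="roc_auc", cv=3):
    """Run a cross-validated grid search over Random Forest settings.

    Hypothetical helper; the default grid below is illustrative only.
    """
    if param_grid is None:
        param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5, 10]}
    search = GridSearchCV(
        estimator=RandomForestClassifier(random_state=42),
        param_grid=param_grid,
        scoring=scoring,
        cv=cv,
    )
    # Fits one model per parameter combination and fold, then refits the
    # best combination on all of the data.
    search.fit(X, y)
    return search
```

The best settings are then available as `search.best_params_` and the refitted model as `search.best_estimator_`. If the classifier sits inside a scikit-learn `Pipeline`, the grid keys need the step-name prefix, for example `model__n_estimators` for a step named `model`.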


Project Structure

Salary-Classification-Flask-App/
│
├── training.py
├── salary_classification_app.py
│   Flask application that loads the trained ML pipeline
│   and serves predictions through a web interface.
│
├── model_artifacts/
│   Saved machine learning artifacts.
│   │
│   ├── random_forest_tuned.pkl
│   │   Final trained pipeline containing preprocessing + model.
│   │
│   └── random_forest_tuned_pickle.pkl
│       Alternate serialized model object.
│
├── templates/
│   HTML templates rendered by Flask using Jinja2.
│   │
│   ├── index.html
│   │   User interface form for entering input features.
│   │
│   └── model_results.html
│       Displays salary prediction results.
│
├── static/
│   Static assets used by the web interface.
│   │
│   └── style.css
│       CSS styling for the application layout.
│
├── notebooks/
│   Jupyter notebooks used during experimentation.
│   │
│   └── salary_classification_pipeline.ipynb
│       Data exploration, visualization, model training,
│       and pipeline serialization.
│
├── requirements.txt
│   Python dependencies required to run the project.
│
└── README.md


Application Workflow

User Input (Web Form)
        │
        ▼
Flask Server
        │
        ▼
Input Converted to Pandas DataFrame
        │
        ▼
Saved ML Pipeline (Preprocessing + Random Forest)
        │
        ▼
Prediction
        │
        ▼
Render Result Page
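The workflow above might look roughly like this in Flask. This is a sketch under stated assumptions, not the contents of salary_classification_app.py: the `/predict` route, the helper names `form_to_frame` and `create_app`, the form field names (taken from the Dataset Features table), and the inline template are all hypothetical. The real application presumably loads the pipeline from model_artifacts/ and renders model_results.html instead.

```python
import pandas as pd
from flask import Flask, render_template_string, request

# Assumed form field names, in the order of the README's feature table.
FEATURES = ["age", "workclass", "fnlwgt", "education", "education_num",
            "marital_status", "occupation", "relationship", "race", "sex",
            "capital_gain", "capital_loss", "hours_per_week", "native_country"]
NUMERIC = {"age", "fnlwgt", "education_num",
           "capital_gain", "capital_loss", "hours_per_week"}


def form_to_frame(form):
    """Convert submitted form fields into a one-row DataFrame."""
    row = {}
    for name in FEATURES:
        value = form[name]
        # Form values arrive as strings; numeric features are cast to float.
        row[name] = float(value) if name in NUMERIC else value
    return pd.DataFrame([row], columns=FEATURES)


def create_app(pipeline):
    """Create a Flask app around an already loaded, fitted pipeline."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        frame = form_to_frame(request.form)
        label = pipeline.predict(frame)[0]
        # Inline template keeps the sketch self-contained.
        return render_template_string("Predicted Salary: {{ label }}", label=label)

    return app
```

Passing the pipeline into `create_app` keeps model loading separate from request handling, which also makes the route easy to exercise with Flask's test client and a stub model.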


Installation

Clone the Repository

git clone https://github.com/chonzadaniel/salary-classification-flask-app.git


Install Dependencies

pip install -r requirements.txt


Running the Application

Start the Flask server:

python salary_classification_app.py

Open your browser and navigate to:

http://127.0.0.1:5000/

Enter the required information and click Predict Salary.


Example Prediction Output

Predicted Salary: >50K
Probability: 82.47%
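An output in this shape could be produced from a classifier's class probabilities as sketched below. The helper name `format_prediction` is hypothetical, and it assumes the displayed probability is that of the predicted class, which the README does not confirm.

```python
def format_prediction(classes, probabilities):
    """Pick the most likely class and format it like the example output.

    classes: class labels, e.g. ["<=50K", ">50K"]
    probabilities: one probability per class, in the same order
    """
    best = max(range(len(classes)), key=lambda i: probabilities[i])
    return (f"Predicted Salary: {classes[best]}\n"
            f"Probability: {probabilities[best]:.2%}")
```

With a fitted scikit-learn pipeline, the inputs would typically be `pipeline.classes_` and the first row of `pipeline.predict_proba(frame)`.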


Technologies Used

Backend

  • Python
  • Flask

Machine Learning

  • Scikit-learn
  • RandomForestClassifier
  • LogisticRegression
  • Pipeline
  • ColumnTransformer

Data Processing

  • Pandas
  • NumPy

Visualization

  • Matplotlib
  • Seaborn

Frontend

  • HTML5
  • CSS3
  • Jinja2 Templates

Future Improvements

Possible enhancements include:

  • Deploying the application on AWS / Render / Heroku
  • Containerizing the application using Docker
  • Adding input validation
  • Implementing feature importance visualization
  • Integrating SHAP explainability
  • Creating a REST API endpoint
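As one illustration of the input validation item, a check could run on the submitted form before it reaches the model. Everything here is an assumption: the helper name `validate_form`, the field groupings (taken from the Dataset Features table), and the rules themselves, which only require numeric fields to be finite and non-negative and categorical fields to be non-empty.

```python
import math

NUMERIC_FIELDS = ("age", "fnlwgt", "education_num",
                  "capital_gain", "capital_loss", "hours_per_week")
CATEGORICAL_FIELDS = ("workclass", "education", "marital_status", "occupation",
                      "relationship", "race", "sex", "native_country")


def validate_form(form):
    """Return a list of error messages; an empty list means the input passed."""
    errors = []
    for name in NUMERIC_FIELDS:
        raw = str(form.get(name, "")).strip()
        try:
            value = float(raw)
        except ValueError:
            errors.append(f"{name} must be a number")
            continue
        # Reject NaN, infinity, and negative values.
        if not math.isfinite(value) or value < 0:
            errors.append(f"{name} must be a non-negative finite number")
    for name in CATEGORICAL_FIELDS:
        if not str(form.get(name, "")).strip():
            errors.append(f"{name} is required")
    return errors
```

Returning all messages at once, rather than stopping at the first problem, lets the form show every issue in a single round trip.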

Author

Emmanuel Daniel Chonza

Data Scientist | Monitoring & Evaluation Expert | Generative AI Enthusiast

GitHub:
https://github.com/chonzadaniel


License

This project is licensed under the MIT License.
