Emmanuel Daniel Chonza chonzadaniel

Salary Classification Web Application (Flask + Machine Learning)

A machine learning web application that predicts whether a person's annual salary exceeds $50K or is at most $50K, based on demographic, education, and employment attributes.

The model is trained using Scikit-learn pipelines and deployed locally using Flask, allowing users to input their information through a web interface and receive real-time predictions.

Project Overview

This project demonstrates an end-to-end machine learning workflow, including:

Data exploration and visualization
Feature preprocessing
Machine learning model training
Hyperparameter tuning
Model comparison
Pipeline serialization
Web deployment using Flask
Interactive user interface for predictions

The final deployed model is a tuned Random Forest Classifier integrated within a preprocessing pipeline.

Problem Statement

Predict whether an individual's annual salary exceeds $50K based on demographic and employment attributes.

This is a binary classification problem where the target variable is:

salary ∈ {<=50K, >50K}

Machine Learning Pipeline

The project implements a Scikit-learn pipeline to ensure consistent preprocessing during both training and prediction.

Pipeline Components

Feature separation
- Numerical features
- Categorical features
Numerical preprocessing
- StandardScaler
Categorical preprocessing
- OneHotEncoder
Feature transformation
- ColumnTransformer
Model training
- RandomForestClassifier
Pipeline serialization
- Saved using joblib
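
The components listed above could be wired together roughly as follows. This is a minimal sketch, not the project's actual training code: the split of columns into numeric and categorical lists, `handle_unknown="ignore"`, the forest settings, the helper name `build_pipeline`, and the temporary file name are all assumptions, and the four training rows are made up so the example runs on its own.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column split, based on the feature names in this README.
NUMERIC_FEATURES = ["age", "fnlwgt", "education_num",
                    "capital_gain", "capital_loss", "hours_per_week"]
CATEGORICAL_FEATURES = ["workclass", "education", "marital_status", "occupation",
                        "relationship", "race", "sex", "native_country"]


def build_pipeline(random_state=42):
    """Preprocessing + model in one object (hypothetical helper)."""
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), NUMERIC_FEATURES),
            # Ignore categories at prediction time that were unseen in training.
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL_FEATURES),
        ]
    )
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    return Pipeline(steps=[("preprocess", preprocessor), ("model", model)])


# Made-up rows for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "age": [25, 47, 38, 52],
    "workclass": ["Private", "Self-emp-inc", "Private", "Local-gov"],
    "fnlwgt": [226802, 89814, 336951, 160323],
    "education": ["HS-grad", "Masters", "Bachelors", "Doctorate"],
    "education_num": [9, 14, 13, 16],
    "marital_status": ["Never-married", "Married", "Married", "Divorced"],
    "occupation": ["Sales", "Exec-managerial", "Tech-support", "Prof-specialty"],
    "relationship": ["Own-child", "Husband", "Wife", "Not-in-family"],
    "race": ["White", "Black", "White", "Asian"],
    "sex": ["Male", "Male", "Female", "Female"],
    "capital_gain": [0, 5000, 0, 12000],
    "capital_loss": [0, 0, 0, 0],
    "hours_per_week": [40, 50, 40, 45],
    "native_country": ["United-States"] * 4,
})
labels = ["<=50K", ">50K", "<=50K", ">50K"]

pipeline = build_pipeline().fit(toy, labels)

# Serialize the whole pipeline, then load it back as the web app would.
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = os.path.join(tmp_dir, "pipeline_sketch.pkl")
    joblib.dump(pipeline, artifact_path)
    restored = joblib.load(artifact_path)

print(restored.predict(toy))
```

Because the scaler and encoder are saved inside the pipeline, the loaded object accepts raw feature values and applies the same preprocessing that was used during training.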

Dataset Features

Feature	Description
age	Age of the individual
workclass	Type of employment
fnlwgt	Final sampling weight
education	Highest education level
education_num	Total years of education
marital_status	Marital status
occupation	Job occupation
relationship	Family relationship
race	Race category
sex	Gender
capital_gain	Income from investments
capital_loss	Loss from investments
hours_per_week	Working hours per week
native_country	Country of origin

Target Variable

salary: <=50K or >50K
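
One way to separate the target from the features, and the numeric columns from the categorical ones, is to rely on pandas dtypes. The sketch below assumes the data is already in a DataFrame whose column names match the table above; the three rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the feature table.
df = pd.DataFrame({
    "age": [25, 47, 38],
    "workclass": ["Private", "Self-emp-inc", "Private"],
    "fnlwgt": [226802, 89814, 336951],
    "education": ["HS-grad", "Masters", "Bachelors"],
    "education_num": [9, 14, 13],
    "marital_status": ["Never-married", "Married", "Married"],
    "occupation": ["Sales", "Exec-managerial", "Tech-support"],
    "relationship": ["Own-child", "Husband", "Wife"],
    "race": ["White", "Black", "White"],
    "sex": ["Male", "Male", "Female"],
    "capital_gain": [0, 5000, 0],
    "capital_loss": [0, 0, 0],
    "hours_per_week": [40, 50, 40],
    "native_country": ["United-States"] * 3,
    "salary": ["<=50K", ">50K", "<=50K"],
})

# Split the target from the 14 input features.
X = df.drop(columns="salary")
y = df["salary"]

# Numeric dtypes go to the scaler, everything else to the one-hot encoder.
numeric_features = X.select_dtypes(include="number").columns.tolist()
categorical_features = X.select_dtypes(exclude="number").columns.tolist()

print(numeric_features)
print(categorical_features)
```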

Model Comparison

Two machine learning models were evaluated:

Model	Description
Logistic Regression	Linear baseline classifier
Random Forest	Ensemble tree-based classifier

Evaluation Metrics

Accuracy
Precision
Recall
F1 Score
ROC-AUC
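
The five metrics can be computed for both candidate models with `sklearn.metrics`. The sketch below uses synthetic data from `make_classification` in place of the salary dataset, so its scores say nothing about the project's actual results; the helper name `evaluate` and all model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Return the five metrics for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    # ROC-AUC needs a score for the positive class, not hard labels.
    y_score = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, y_score),
    }


# Synthetic stand-in data with 0/1 labels. With string labels such as
# "<=50K" and ">50K", precision, recall and F1 would need pos_label=">50K".
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {
    name: evaluate(model.fit(X_train, y_train), X_test, y_test)
    for name, model in candidates.items()
}

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```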

Final Model

The tuned Random Forest model achieved the best performance and was selected for deployment.
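
This README does not state which search strategy or parameter grid was used for tuning. As one possible approach, a cross-validated grid search over a small, hypothetical grid might look like this, again on synthetic data rather than the salary dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real project would use the salary features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid; the values actually searched are not documented here.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",  # assumed selection metric
    cv=3,
)
search.fit(X, y)

# With the default refit=True, best_estimator_ is retrained on all of X, y.
tuned_model = search.best_estimator_
print(search.best_params_)
```

When the forest sits inside a `Pipeline`, the grid keys take the step name as a prefix, for example `model__n_estimators` for a step named `model`.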

Project Structure

Salary-Classification-Flask-App/
│
├── training.py
├── salary_classification_app.py
│       Flask application that loads the trained ML pipeline
│       and serves predictions through a web interface.
│
├── model_artifacts/
│   Saved machine learning artifacts.
│   │
│   ├── random_forest_tuned.pkl
│   │       Final trained pipeline containing preprocessing + model.
│   │
│   └── random_forest_tuned_pickle.pkl
│           Alternate serialized model object.
│
├── templates/
│   HTML templates rendered by Flask using Jinja2.
│   │
│   ├── index.html
│   │       User interface form for entering input features.
│   │
│   └── model_results.html
│           Displays salary prediction results.
│
├── static/
│   Static assets used by the web interface.
│   │
│   └── style.css
│           CSS styling for the application layout.
│
├── notebooks/
│   Jupyter notebooks used during experimentation.
│   │
│   └── salary_classification_pipeline.ipynb
│           Data exploration, visualization, model training,
│           and pipeline serialization.
│
├── requirements.txt
│       Python dependencies required to run the project.
│
└── README.md

Application Workflow

User Input (Web Form)
        │
        ▼
Flask Server
        │
        ▼
Input Converted to Pandas DataFrame
        │
        ▼
Saved ML Pipeline (Preprocessing + Random Forest)
        │
        ▼
Prediction
        │
        ▼
Render Result Page
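
The request flow above can be sketched as a single Flask route. This is an illustration under stated assumptions, not the code in salary_classification_app.py: the `/predict` route, the `create_app` factory, the inline result template, and `StubPipeline` are hypothetical. In the real application the pipeline would be loaded with joblib from model_artifacts/ and the result rendered from templates/model_results.html. Reporting the highest class probability is also an assumption.

```python
import pandas as pd
from flask import Flask, render_template_string, request

FEATURES = ["age", "workclass", "fnlwgt", "education", "education_num",
            "marital_status", "occupation", "relationship", "race", "sex",
            "capital_gain", "capital_loss", "hours_per_week", "native_country"]
NUMERIC = {"age", "fnlwgt", "education_num",
           "capital_gain", "capital_loss", "hours_per_week"}

# Inline stand-in for the result page template.
RESULT_TEMPLATE = "Predicted Salary: {{ label }}\nProbability: {{ probability }}%"


def create_app(pipeline):
    """Build the app around any object exposing predict and predict_proba."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Form values arrive as strings, so numeric fields are cast first.
        record = {
            name: float(request.form[name]) if name in NUMERIC else request.form[name]
            for name in FEATURES
        }
        # One-row DataFrame with the column names the pipeline was trained on.
        frame = pd.DataFrame([record], columns=FEATURES)
        label = pipeline.predict(frame)[0]
        probability = float(max(pipeline.predict_proba(frame)[0]))
        return render_template_string(
            RESULT_TEMPLATE, label=label, probability=f"{probability * 100:.2f}"
        )

    return app


class StubPipeline:
    """Hypothetical stand-in so the sketch runs without the saved .pkl file."""

    def predict(self, frame):
        return [">50K" if frame.loc[0, "hours_per_week"] > 40 else "<=50K"]

    def predict_proba(self, frame):
        return [[0.25, 0.75]]


# Exercise the route with Flask's built-in test client (no server needed).
sample_form = {
    "age": "47", "workclass": "Private", "fnlwgt": "89814",
    "education": "Masters", "education_num": "14", "marital_status": "Married",
    "occupation": "Exec-managerial", "relationship": "Husband", "race": "Black",
    "sex": "Male", "capital_gain": "5000", "capital_loss": "0",
    "hours_per_week": "50", "native_country": "United-States",
}
client = create_app(StubPipeline()).test_client()
response = client.post("/predict", data=sample_form)
print(response.get_data(as_text=True))
```

Passing the pipeline into a factory keeps the route testable: the stub can be swapped for the object returned by `joblib.load` without changing the handler.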

Installation

Clone the Repository

git clone https://github.com/chonzadaniel/salary-classification-flask-app.git
cd salary-classification-flask-app

Install Dependencies

pip install -r requirements.txt

Running the Application

Start the Flask server:

python salary_classification_app.py

Open your browser and navigate to:

http://127.0.0.1:5000/

Enter the required information and click Predict Salary.

Example Prediction Output

Predicted Salary: >50K
Probability: 82.47%

Technologies Used

Backend

Python
Flask

Machine Learning

Scikit-learn
RandomForestClassifier
LogisticRegression
Pipeline
ColumnTransformer

Data Processing

Pandas
NumPy

Visualization

Matplotlib
Seaborn

Frontend

HTML5
CSS3
Jinja2 Templates

Future Improvements

Possible enhancements include:

Deploying the application on AWS / Render / Heroku
Containerizing the application using Docker
Adding input validation
Implementing feature importance visualization
Integrating SHAP explainability
Creating a REST API endpoint

Author

Emmanuel Daniel Chonza

Data Scientist | Monitoring & Evaluation Expert | Generative AI Enthusiast

GitHub:
https://github.com/chonzadaniel

License

This project is licensed under the MIT License.
