A machine learning web application that predicts whether a person's annual salary is above $50K or at most $50K, based on demographic, education, and employment attributes.
The model is trained using Scikit-learn pipelines and deployed locally using Flask, allowing users to input their information through a web interface and receive real-time predictions.
This project demonstrates an end-to-end machine learning workflow, including:
- Data exploration and visualization
- Feature preprocessing
- Machine learning model training
- Hyperparameter tuning
- Model comparison
- Pipeline serialization
- Web deployment using Flask
- Interactive user interface for predictions
The final deployed model is a tuned Random Forest Classifier integrated within a preprocessing pipeline.
Predict whether an individual's annual salary exceeds $50K based on demographic and employment attributes.
This is a binary classification problem where the target variable is:
salary ∈ {<=50K, >50K}
The project implements a Scikit-learn pipeline to ensure consistent preprocessing during both training and prediction.
- Feature separation
  - Numerical features
  - Categorical features
- Numerical preprocessing: StandardScaler
- Categorical preprocessing: OneHotEncoder
- Feature transformation: ColumnTransformer
- Model training: RandomForestClassifier
- Pipeline serialization
  - Saved using joblib
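The pipeline steps listed above could be wired together roughly as follows. This is a minimal sketch, not the project's actual `training.py`: the split of columns into numerical and categorical groups, the `handle_unknown="ignore"` setting, the forest size, and the four toy rows are all assumptions made for illustration. Only the column names and the artifact file name come from this README.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed grouping of the dataset columns (not confirmed by the project).
NUMERICAL = ["age", "fnlwgt", "education_num", "capital_gain",
             "capital_loss", "hours_per_week"]
CATEGORICAL = ["workclass", "education", "marital_status", "occupation",
               "relationship", "race", "sex", "native_country"]


def build_pipeline(random_state=42):
    """Preprocessing and model combined in one scikit-learn Pipeline."""
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), NUMERICAL),
            # Ignore categories unseen during training instead of raising.
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ]
    )
    model = RandomForestClassifier(n_estimators=50, random_state=random_state)
    return Pipeline(steps=[("preprocess", preprocessor), ("model", model)])


# Four made-up rows, only to show that the pipeline fits and round-trips.
toy = pd.DataFrame({
    "age": [25, 47, 38, 52],
    "workclass": ["Private", "Self-emp", "Private", "Gov"],
    "fnlwgt": [226802, 89814, 336951, 160323],
    "education": ["HS-grad", "Masters", "Bachelors", "Doctorate"],
    "education_num": [9, 14, 13, 16],
    "marital_status": ["Never-married", "Married", "Married", "Divorced"],
    "occupation": ["Sales", "Exec", "Tech", "Prof"],
    "relationship": ["Own-child", "Husband", "Wife", "Not-in-family"],
    "race": ["White", "Black", "White", "Asian"],
    "sex": ["Male", "Male", "Female", "Female"],
    "capital_gain": [0, 5000, 0, 15000],
    "capital_loss": [0, 0, 0, 0],
    "hours_per_week": [40, 50, 40, 60],
    "native_country": ["United-States"] * 4,
})
labels = ["<=50K", ">50K", "<=50K", ">50K"]

pipeline = build_pipeline().fit(toy, labels)

# Serialize with joblib and load it back, as the deployed app would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "random_forest_tuned.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

predictions = restored.predict(toy)
print(list(predictions))
```

Because the scaler and encoder are stored inside the same object as the classifier, the restored pipeline accepts raw, untransformed rows at prediction time.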
| Feature | Description |
|---|---|
| age | Age of the individual |
| workclass | Type of employment |
| fnlwgt | Final sampling weight |
| education | Highest education level |
| education_num | Total years of education |
| marital_status | Marital status |
| occupation | Job occupation |
| relationship | Family relationship |
| race | Race category |
| sex | Gender |
| capital_gain | Income from investments |
| capital_loss | Loss from investments |
| hours_per_week | Working hours per week |
| native_country | Country of origin |
salary: <=50K or >50K
Two machine learning models were evaluated:
| Model | Description |
|---|---|
| Logistic Regression | Linear baseline classifier |
| Random Forest | Ensemble tree-based classifier |
- Accuracy
- Precision
- Recall
- F1 Score
- ROC-AUC
The tuned Random Forest model achieved the best performance and was selected for deployment.
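A comparison of this kind could be sketched as below. The README does not state how tuning was performed, so `GridSearchCV`, the parameter grid, the F1 scoring choice, and the synthetic data from `make_classification` are assumptions used only to keep the example self-contained. The scores it prints say nothing about the real project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (NOT the salary dataset).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)


def evaluate(model):
    """Compute the five metrics listed above on the held-out split."""
    predicted = model.predict(X_test)
    scores = model.predict_proba(X_test)[:, 1]  # probability of class 1
    return {
        "accuracy": accuracy_score(y_test, predicted),
        "precision": precision_score(y_test, predicted, zero_division=0),
        "recall": recall_score(y_test, predicted, zero_division=0),
        "f1": f1_score(y_test, predicted, zero_division=0),
        "roc_auc": roc_auc_score(y_test, scores),
    }


# Linear baseline.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hypothetical small grid; the project's actual search space is unknown.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    scoring="f1",
    cv=3,
)
search.fit(X_train, y_train)

results = {
    "Logistic Regression": evaluate(baseline),
    "Random Forest (tuned)": evaluate(search.best_estimator_),
}
for name, metrics in results.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Reporting precision, recall, F1, and ROC-AUC alongside accuracy matters for a target like this one, since a classifier can reach high accuracy on an imbalanced target while rarely predicting the minority class.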
Salary-Classification-Flask-App/
│
├── training.py
├── salary_classification_app.py
│       Flask application that loads the trained ML pipeline
│       and serves predictions through a web interface.
│
├── model_artifacts/
│   │   Saved machine learning artifacts.
│   │
│   ├── random_forest_tuned.pkl
│   │       Final trained pipeline containing preprocessing + model.
│   │
│   └── random_forest_tuned_pickle.pkl
│           Alternate serialized model object.
│
├── templates/
│   │   HTML templates rendered by Flask using Jinja2.
│   │
│   ├── index.html
│   │       User interface form for entering input features.
│   │
│   └── model_results.html
│           Displays salary prediction results.
│
├── static/
│   │   Static assets used by the web interface.
│   │
│   └── style.css
│           CSS styling for the application layout.
│
├── notebooks/
│   │   Jupyter notebooks used during experimentation.
│   │
│   └── salary_classification_pipeline.ipynb
│           Data exploration, visualization, model training,
│           and pipeline serialization.
│
├── requirements.txt
│       Python dependencies required to run the project.
│
└── README.md
User Input (Web Form)
        │
        ▼
Flask Server
        │
        ▼
Input Converted to Pandas DataFrame
        │
        ▼
Saved ML Pipeline (Preprocessing + Random Forest)
        │
        ▼
Prediction
        │
        ▼
Render Result Page
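The request flow above might look something like the following sketch. It is not the code in `salary_classification_app.py`: the `/predict` route, the `create_app` and `form_to_frame` helpers, and the assumption that form field names match the dataset column names are all hypothetical. The real application renders `model_results.html`; a plain string is returned here so the example runs without templates.

```python
import pandas as pd
from flask import Flask, request

# Column order expected by the saved pipeline (names from the feature table).
FEATURES = ["age", "workclass", "fnlwgt", "education", "education_num",
            "marital_status", "occupation", "relationship", "race", "sex",
            "capital_gain", "capital_loss", "hours_per_week", "native_country"]
# Assumed numeric columns; everything else is kept as a string.
NUMERIC = {"age", "fnlwgt", "education_num", "capital_gain",
           "capital_loss", "hours_per_week"}


def form_to_frame(form):
    """Turn submitted form fields into a one-row DataFrame."""
    row = {}
    for name in FEATURES:
        value = form[name]
        row[name] = float(value) if name in NUMERIC else value
    return pd.DataFrame([row], columns=FEATURES)


def create_app(pipeline):
    """Build a Flask app around any object with predict/predict_proba."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])  # hypothetical route name
    def predict():
        frame = form_to_frame(request.form)
        label = pipeline.predict(frame)[0]
        probability = max(pipeline.predict_proba(frame)[0])
        return f"Predicted Salary: {label} Probability: {probability:.2%}"

    return app
```

In the deployed application the pipeline passed to `create_app` would presumably be the object restored with `joblib.load` from `model_artifacts/random_forest_tuned.pkl`. Passing it in as an argument also makes the route easy to exercise with a stub model and Flask's test client.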
Clone the repository and install the dependencies:
git clone https://github.com/chonzadaniel/salary-classification-flask-app.git
cd salary-classification-flask-app
pip install -r requirements.txt
Start the Flask server:
python salary_classification_app.py
Open your browser and navigate to the local address that Flask prints when the server starts.
Enter the required information and click Predict Salary.
Predicted Salary: >50K
Probability: 82.47%
- Python
- Flask
- Scikit-learn
  - RandomForestClassifier
  - LogisticRegression
  - Pipeline
  - ColumnTransformer
- Pandas
- NumPy
- Matplotlib
- Seaborn
- HTML5
- CSS3
- Jinja2 Templates
Possible enhancements include:
- Deploying the application on AWS / Render / Heroku
- Containerizing the application using Docker
- Adding input validation
- Implementing feature importance visualization
- Integrating SHAP explainability
- Creating a REST API endpoint
Emmanuel Daniel Chonza
Data Scientist | Monitoring & Evaluation Expert | Generative AI Enthusiast
GitHub: https://github.com/chonzadaniel
This project is licensed under the MIT License.
