This project focuses on predicting air ticket base fare prices using machine learning techniques. We analyze flight itinerary data and build regression models to estimate ticket prices based on flight features such as airports, airlines, travel duration, cabin type, and flight dates.
The goal is to compare multiple machine learning models and identify the best-performing model for price prediction.
- 242UT2449P SEE CHWAN KAI
- 242UT24490 TEO JING AN
- 242UT244B2 TEE KIAN HAO
- 242UT2449Z KHO WEI CONG
- Source: Kaggle Flight Prices Dataset
- Link: https://www.kaggle.com/datasets/justinmitchel/flightprices-min
- Records: 50,000 rows
- Target Variable: baseFare
- Python
- Pandas, NumPy
- Matplotlib, Seaborn
- Scikit-learn
Machine Learning Models:
- Linear Regression
- K-Nearest Neighbors (KNN)
- Random Forest Regressor
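
The three models above could be compared along these lines. This is only a minimal sketch on synthetic stand-in data, not the notebook's actual code: the column names (`airline`, `cabin`, `travelMinutes`), the hyperparameters, and the 80/20 split are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns, not the Kaggle dataset)
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "airline": rng.choice(["AA", "DL", "UA"], size=n),
    "cabin": rng.choice(["coach", "first"], size=n),
    "travelMinutes": rng.integers(60, 600, size=n),
})
toy["baseFare"] = (
    0.4 * toy["travelMinutes"]
    + np.where(toy["cabin"] == "first", 250.0, 0.0)
    + rng.normal(0, 10, size=n)
)

categorical = ["airline", "cabin"]
numerical = ["travelMinutes"]


def build_pipeline(model):
    """Wrap a regressor with one-hot encoding and standardization."""
    preprocess = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numerical),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])


# Hyperparameters below are illustrative assumptions
models = {
    "Linear Regression": LinearRegression(),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

X_train, X_test, y_train, y_test = train_test_split(
    toy[categorical + numerical], toy["baseFare"], test_size=0.2, random_state=0
)

scores = {}
for name, model in models.items():
    pipe = build_pipeline(model).fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)  # R² on the held-out split
```

Keeping the encoder and scaler inside the pipeline means they are fitted on the training split only, which avoids leaking test-set statistics into the model.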
We performed data cleaning and feature engineering:
- Removed missing values
- Extracted flight date features:
  - flightMonth
  - flightDayOfWeek
- Converted travel duration (ISO format → minutes)
- Extracted fare class from fareBasisCode
- Estimated number of stops
- Encoded categorical variables using OneHotEncoder
- Standardized numerical features using StandardScaler
- Removed outliers (1st–99th percentile)
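
Some of these steps can be sketched with pandas as follows. This is a hedged illustration rather than the notebook's code: it assumes the raw columns are named `flightDate` and `travelDuration`, that durations look like `PT2H29M`, and that the outlier filter is applied to `baseFare`. The helper names and the `travelMinutes` column are hypothetical.

```python
import re

import pandas as pd

# Matches ISO 8601 durations such as "PT2H29M" or "P1DT3H" (assumed format)
_DURATION = re.compile(r"^P(?:(\d+)D)?T(?:(\d+)H)?(?:(\d+)M)?$")


def duration_to_minutes(iso):
    """Convert an ISO 8601 duration string to minutes; None if unparseable."""
    match = _DURATION.match(str(iso))
    if match is None:
        return None
    days, hours, minutes = (int(g) if g else 0 for g in match.groups())
    return days * 24 * 60 + hours * 60 + minutes


def engineer_features(df):
    """Drop missing rows, add date and duration features, trim fare outliers."""
    df = df.dropna().copy()
    dates = pd.to_datetime(df["flightDate"])
    df["flightMonth"] = dates.dt.month
    df["flightDayOfWeek"] = dates.dt.dayofweek  # Monday = 0, Sunday = 6
    df["travelMinutes"] = df["travelDuration"].map(duration_to_minutes)
    # Keep rows whose fare lies between the 1st and 99th percentile
    low, high = df["baseFare"].quantile([0.01, 0.99])
    return df[df["baseFare"].between(low, high)]
```

For example, `duration_to_minutes("PT2H29M")` returns `149`.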
Linear Regression:
- R² Score: 0.6307
- MAE: 63.84
- RMSE: 88.88

K-Nearest Neighbors (KNN):
- R² Score: 0.8341
- MAE: 20.11
- RMSE: 59.57

Random Forest Regressor:
- R² Score: 0.8722
- MAE: 29.15
- RMSE: 52.30
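
The three metrics reported above can be computed with scikit-learn roughly as follows. The helper name `regression_report` and the sample values are hypothetical, chosen only to show the calculation.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return R², MAE and RMSE for a set of predictions."""
    return {
        "R2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        # RMSE is the square root of the mean squared error
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
    }


# Made-up fares, for illustration only
report = regression_report([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does. That is how one set of results can show a lower MAE (20.11 vs 29.15) yet a higher RMSE (59.57 vs 52.30).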
- Random Forest captures non-linear relationships best
- Flight duration, airline, and cabin type strongly affect price
- KNN performs well but is sensitive to tuning
- Linear Regression underfits complex patterns
- Actual vs Predicted plots
- Feature importance (Random Forest)
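
A feature-importance ranking like the one mentioned above can be read from a fitted Random Forest. The sketch below uses synthetic data in which only a hypothetical `travelMinutes` column drives the target; the data, column names, and hyperparameters are assumptions, not the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on travelMinutes only (illustrative)
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "travelMinutes": rng.uniform(60, 600, size=n),
    "flightMonth": rng.integers(1, 13, size=n),
    "flightDayOfWeek": rng.integers(0, 7, size=n),
})
y = 0.5 * X["travelMinutes"] + rng.normal(0, 5, size=n)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort from most to least important
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
```

The resulting Series could then be drawn as a bar chart, for instance with `importances.plot.barh()`.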
git clone https://github.com/chwankai/Flight-Ticket-Fares-Prediction-using-Machine-Learning.git
cd Flight-Ticket-Fares-Prediction-using-Machine-Learning
pip install pandas numpy matplotlib seaborn scikit-learn
Open:
Air_Ticket_Price_Prediction.ipynb