# Natural Language Processing with Disaster Tweets

Kaggle NLP Getting Started Competition. Dataset: https://www.kaggle.com/competitions/nlp-getting-started/data

This project implements a disaster tweet classification pipeline based on TF-IDF features and Logistic Regression. The goal is to determine whether a tweet refers to a real disaster event. Two versions were developed: an initial baseline pipeline and an improved version with enhanced preprocessing, validation strategy, and threshold optimization.
The initial implementation focuses on building a simple and reproducible NLP classification workflow:
- Load the Kaggle disaster tweets dataset
- Extract the `text` field as input features and `target` as labels
- Apply basic text preprocessing:
  - (1) Convert text to lowercase
  - (2) Remove leading and trailing whitespace
- Convert text into TF-IDF feature vectors:
  - (1) Limit vocabulary size to control feature dimensionality
- Split data into training and validation sets using `train_test_split`
- Train a Logistic Regression classifier
- Evaluate performance using F1 score
This baseline establishes a straightforward classical NLP pipeline.
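A minimal sketch of this baseline, assuming the standard Kaggle `train.csv` layout with `text` and `target` columns (the file path, vocabulary cap, and split ratio here are illustrative assumptions, not the exact original settings):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Load the Kaggle disaster tweets training data (path is illustrative).
train = pd.read_csv("train.csv")

# Basic preprocessing: lowercase and strip leading/trailing whitespace.
texts = train["text"].str.lower().str.strip()
labels = train["target"]

# TF-IDF features with a capped vocabulary to control dimensionality.
vectorizer = TfidfVectorizer(max_features=10000)  # cap is an assumption
X = vectorizer.fit_transform(texts)

# Single train/validation split.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

# Train Logistic Regression and evaluate with the F1 score.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
print("Validation F1:", f1_score(y_val, clf.predict(X_val)))
```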
The improved version introduces several optimizations to enhance model robustness and evaluation reliability:
Improved text preprocessing:
- (1) Replace URLs with a placeholder token `URL`
- (2) Replace user mentions with a placeholder token `USER`
- (3) Remove the hashtag symbol while keeping its content
- (4) Normalize whitespace using regular expressions

This helps reduce noise and improve generalization.
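A sketch of this cleaning step, using the `URL`/`USER` placeholder tokens named above (the exact regex patterns are illustrative assumptions):

```python
import re

def clean_tweet(text: str) -> str:
    text = text.lower()
    # (1) Replace URLs with a placeholder token.
    text = re.sub(r"https?://\S+|www\.\S+", "URL", text)
    # (2) Replace @user mentions with a placeholder token.
    text = re.sub(r"@\w+", "USER", text)
    # (3) Drop the '#' symbol but keep the hashtag content.
    text = re.sub(r"#(\w+)", r"\1", text)
    # (4) Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Flooding in #California! More at https://t.co/xyz via @cnn"))
# -> "flooding in california! more at URL via USER"
```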
Better feature extraction:
- (1) Add a minimum document frequency (`min_df=2`)
- (2) Control vocabulary size (`max_features=20000`)

This reduces rare, noisy tokens.
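Continuing the sketches above (reusing `train` and `clean_tweet`), the vectorizer with both frequency filters applied:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Ignore tokens seen in fewer than 2 documents and cap the vocabulary size.
vectorizer = TfidfVectorizer(min_df=2, max_features=20000)
X = vectorizer.fit_transform(train["text"].map(clean_tweet))
```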
More reliable validation. Instead of a single train–validation split:
- (1) Use Stratified K-Fold cross-validation
- (2) Preserve class distribution across folds
- (3) Collect probability predictions across folds

This provides more stable evaluation.
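A sketch of the out-of-fold probability collection (the fold count of 5 and the random seed are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Stratified folds preserve the class distribution in each split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_proba = np.zeros(X.shape[0])

for train_idx, val_idx in skf.split(X, labels):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], labels.iloc[train_idx])
    # Store out-of-fold positive-class probabilities for later thresholding.
    oof_proba[val_idx] = clf.predict_proba(X[val_idx])[:, 1]
```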
Threshold optimization. Rather than using the default `0.5` decision threshold:
- (1) Search thresholds from `0.1` to `0.9`
- (2) Select the threshold that maximizes the F1 score

This improves classification performance, especially under class imbalance.
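A sketch of the threshold search over the out-of-fold probabilities (the step size of 0.05 is an assumption; the reported optimum of roughly 0.45 is consistent with such a grid):

```python
from sklearn.metrics import f1_score

# Evaluate F1 at each candidate threshold and keep the best one.
thresholds = np.arange(0.1, 0.9, 0.05)
scores = [f1_score(labels, oof_proba >= t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]
print(f"Best threshold: {best_threshold:.2f}, F1: {max(scores):.4f}")
```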
Final model training and prediction:
- (1) Retrain Logistic Regression on the full dataset
- (2) Apply the optimized threshold
- (3) Generate the Kaggle submission file
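A sketch of this final step, continuing the snippets above and assuming the standard Kaggle `test.csv` layout with `id` and `text` columns:

```python
# Retrain on the full training set.
final_clf = LogisticRegression(max_iter=1000)
final_clf.fit(X, labels)

# Vectorize the test set with the already-fitted vectorizer.
test = pd.read_csv("test.csv")
X_test = vectorizer.transform(test["text"].map(clean_tweet))

# Apply the F1-optimized threshold instead of the default 0.5.
test_pred = (final_clf.predict_proba(X_test)[:, 1] >= best_threshold).astype(int)

pd.DataFrame({"id": test["id"], "target": test_pred}).to_csv(
    "submission.csv", index=False
)
```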
Comparison of the two versions:

| Aspect | Baseline | Improved Version |
|---|---|---|
| Preprocessing | Lowercase + strip | Regex cleaning + normalization |
| Validation | Single split | Stratified K-Fold |
| Features | TF-IDF | TF-IDF with frequency filtering |
| Decision threshold | Default 0.5 | Optimized via F1 |
| Robustness | Basic | Improved generalization |
Example evaluation metrics:
- Validation F1 score: 0.7619
- Optimal classification threshold: 0.45
Potential future improvements:
- Transformer-based models (e.g., BERT fine-tuning)
- Data augmentation techniques
- Error analysis and calibration