# Natural Language Processing with Disaster Tweets

Kaggle NLP Getting Started Competition. Dataset: https://www.kaggle.com/competitions/nlp-getting-started/data

This project implements a disaster tweet classification pipeline based on TF-IDF features and Logistic Regression. The goal is to determine whether a tweet refers to a real disaster event. Two versions were developed: an initial baseline pipeline and an improved version with enhanced preprocessing, validation strategy, and threshold optimization.
The initial implementation focuses on building a simple and reproducible NLP classification workflow:
- Load the Kaggle disaster tweets dataset
- Extract the `text` field as input features and `target` as labels
- Apply basic text preprocessing:
  - (1) Convert text to lowercase
  - (2) Remove leading and trailing whitespace
- Convert text into TF-IDF feature vectors:
  - (1) Limit vocabulary size to control feature dimensionality
- Split data into training and validation sets using `train_test_split`
- Train a Logistic Regression classifier
- Evaluate performance using F1 score
This baseline establishes a straightforward classical NLP pipeline.
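A minimal sketch of this baseline, assuming the standard Kaggle `train.csv` layout with `text` and `target` columns (the file path, vocabulary cap, and split ratio here are illustrative assumptions, not the exact original settings):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Load the Kaggle disaster tweets training data (path is illustrative).
train = pd.read_csv("train.csv")

# Basic preprocessing: lowercase and strip leading/trailing whitespace.
texts = train["text"].str.lower().str.strip()
labels = train["target"]

# TF-IDF features with a capped vocabulary to control dimensionality.
vectorizer = TfidfVectorizer(max_features=10000)  # cap is an assumption
X = vectorizer.fit_transform(texts)

# Single train/validation split.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

# Train Logistic Regression and evaluate with the F1 score.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
print("Validation F1:", f1_score(y_val, clf.predict(X_val)))
```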
The improved version introduces several optimizations to enhance model robustness and evaluation reliability:
Improved text preprocessing:
- (1) Replace URLs with a placeholder token `URL`
- (2) Replace user mentions with a placeholder token `USER`
- (3) Remove the hashtag symbol while keeping its content
- (4) Normalize whitespace using regular expressions

This helps reduce noise and improve generalization.
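A sketch of this cleaning step, using the `URL`/`USER` placeholder tokens named above (the exact regex patterns are illustrative assumptions):

```python
import re

def clean_tweet(text: str) -> str:
    text = text.lower()
    # (1) Replace URLs with a placeholder token.
    text = re.sub(r"https?://\S+|www\.\S+", "URL", text)
    # (2) Replace @user mentions with a placeholder token.
    text = re.sub(r"@\w+", "USER", text)
    # (3) Drop the '#' symbol but keep the hashtag content.
    text = re.sub(r"#(\w+)", r"\1", text)
    # (4) Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Flooding in #California! More at https://t.co/xyz via @cnn"))
# -> "flooding in california! more at URL via USER"
```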
Better feature extraction:
- (1) Add a minimum document frequency (`min_df=2`)
- (2) Control vocabulary size (`max_features=20000`)

This reduces rare, noisy tokens.
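Continuing the sketches above (reusing `train` and `clean_tweet`), the vectorizer with both frequency filters applied:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Ignore tokens seen in fewer than 2 documents and cap the vocabulary size.
vectorizer = TfidfVectorizer(min_df=2, max_features=20000)
X = vectorizer.fit_transform(train["text"].map(clean_tweet))
```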
More reliable validation. Instead of a single train–validation split:
- (1) Use Stratified K-Fold cross-validation
- (2) Preserve class distribution across folds
- (3) Collect probability predictions across folds

This provides more stable evaluation.
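A sketch of the out-of-fold probability collection (the fold count of 5 and the random seed are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Stratified folds preserve the class distribution in each split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_proba = np.zeros(X.shape[0])

for train_idx, val_idx in skf.split(X, labels):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], labels.iloc[train_idx])
    # Store out-of-fold positive-class probabilities for later thresholding.
    oof_proba[val_idx] = clf.predict_proba(X[val_idx])[:, 1]
```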
Threshold optimization. Rather than using the default `0.5` decision threshold:
- (1) Search thresholds from `0.1` to `0.9`
- (2) Select the threshold that maximizes the F1 score

This improves classification performance, especially under class imbalance.
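A sketch of the threshold search over the out-of-fold probabilities (the step size of 0.05 is an assumption; the reported optimum of roughly 0.45 is consistent with such a grid):

```python
from sklearn.metrics import f1_score

# Evaluate F1 at each candidate threshold and keep the best one.
thresholds = np.arange(0.1, 0.9, 0.05)
scores = [f1_score(labels, oof_proba >= t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]
print(f"Best threshold: {best_threshold:.2f}, F1: {max(scores):.4f}")
```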
Final model training and prediction:
- (1) Retrain Logistic Regression on the full dataset
- (2) Apply the optimized threshold
- (3) Generate the Kaggle submission file
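A sketch of this final step, continuing the snippets above and assuming the standard Kaggle `test.csv` layout with `id` and `text` columns:

```python
# Retrain on the full training set.
final_clf = LogisticRegression(max_iter=1000)
final_clf.fit(X, labels)

# Vectorize the test set with the already-fitted vectorizer.
test = pd.read_csv("test.csv")
X_test = vectorizer.transform(test["text"].map(clean_tweet))

# Apply the F1-optimized threshold instead of the default 0.5.
test_pred = (final_clf.predict_proba(X_test)[:, 1] >= best_threshold).astype(int)

pd.DataFrame({"id": test["id"], "target": test_pred}).to_csv(
    "submission.csv", index=False
)
```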
Comparison of the two versions:

| Aspect | Baseline | Improved Version |
|---|---|---|
| Preprocessing | Lowercase + strip | Regex cleaning + normalization |
| Validation | Single split | Stratified K-Fold |
| Features | TF-IDF | TF-IDF with frequency filtering |
| Decision threshold | Default 0.5 | Optimized via F1 |
| Robustness | Basic | Improved generalization |
Example evaluation metrics:
- Validation F1 score: 0.7619
- Optimal classification threshold: 0.45
Potential future improvements:
- Transformer-based models (e.g., BERT fine-tuning)
- Data augmentation techniques
- Error analysis and calibration