
Disaster Tweet Classification (NLP)

📝 Overview

This project implements a disaster tweet classification pipeline based on TF-IDF features and Logistic Regression. The goal is to determine whether a tweet refers to a real disaster event.

Two versions were developed: an initial baseline pipeline and an improved version with enhanced preprocessing, a more reliable validation strategy, and threshold optimization.


📍 Project Topic

Natural Language Processing with Disaster Tweets

📈 Data Source

Kaggle NLP Getting Started Competition Dataset: https://www.kaggle.com/competitions/nlp-getting-started/data


🧪 Methodology

1. Baseline Pipeline (Initial Version)

The initial implementation focuses on building a simple and reproducible NLP classification workflow:

  • Load the Kaggle disaster tweets dataset
  • Extract the "text" field as input features and "target" as labels
  • Apply basic text preprocessing:
    • (1) Convert text to lowercase
    • (2) Remove leading and trailing whitespace
  • Convert text into TF-IDF feature vectors
    • (1) Limit vocabulary size to control feature dimensionality
  • Split data into training and validation sets using "train_test_split"
  • Train a Logistic Regression classifier
  • Evaluate performance using F1 score

This baseline establishes a straightforward classical NLP pipeline; a minimal sketch is shown below.
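
The following is a minimal sketch of the baseline, assuming the Kaggle train.csv with "text" and "target" columns; the split ratio, vocabulary cap, and solver settings are illustrative rather than the repository's exact values.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Load the Kaggle disaster tweets training data
df = pd.read_csv("train.csv")

# Basic preprocessing: lowercase and strip surrounding whitespace
texts = df["text"].str.lower().str.strip()
labels = df["target"]

# TF-IDF features with a capped vocabulary (cap value is illustrative)
vectorizer = TfidfVectorizer(max_features=20000)
X = vectorizer.fit_transform(texts)

# Single train/validation split
X_tr, X_val, y_tr, y_val = train_test_split(X, labels, test_size=0.2, random_state=42)

# Train a Logistic Regression classifier and evaluate with F1
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
print("Validation F1:", f1_score(y_val, clf.predict(X_val)))
```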


2. Improved Pipeline (Enhanced Version)

The improved version introduces several optimizations to enhance model robustness and evaluation reliability (illustrative sketches of each step follow this list):

  • Improved text preprocessing

    • (1) Replace URLs with a placeholder token "URL"

    • (2) Replace user mentions with a placeholder token "USER"

    • (3) Remove the hashtag symbol while keeping its content

    • (4) Normalize whitespace using regular expressions

      This helps reduce noise and improve generalization.

  • Better feature extraction

    • (1) Add a minimum document frequency (min_df=2)

    • (2) Cap the vocabulary size (max_features=20000)

      This filters out rare, noisy tokens.

  • More reliable validation

    Instead of a single train–validation split:

    • (1) Use Stratified K-Fold cross-validation

    • (2) Preserve class distribution across folds

    • (3) Collect probability predictions across folds

      This provides more stable evaluation.

  • Threshold optimization

    Rather than using the default 0.5 decision threshold:

    • (1) Search thresholds from 0.1 to 0.9

    • (2) Select threshold maximizing F1 score

      This improves classification performance, especially under class imbalance.

  • Final model training and prediction

    • (1) Retrain Logistic Regression on the full dataset
    • (2) Apply optimized threshold
    • (3) Generate Kaggle submission file
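
A possible implementation of the cleaning step; the exact regular expressions are assumptions chosen to match the four operations listed above, not necessarily the repository's patterns.

```python
import re

def clean_text(text: str) -> str:
    """Clean one tweet; patterns are illustrative."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " URL ", text)  # (1) URLs -> "URL"
    text = re.sub(r"@\w+", " USER ", text)                  # (2) mentions -> "USER"
    text = re.sub(r"#(\w+)", r"\1", text)                   # (3) drop '#', keep the word
    return re.sub(r"\s+", " ", text).strip()                # (4) normalize whitespace

print(clean_text("Forest fire near #LaRonge! @user https://t.co/xyz"))
# -> "forest fire near laronge! USER URL"
```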
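The corresponding vectorizer configuration (min_df and max_features come from the list above; everything else is left at scikit-learn defaults):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=2 drops tokens seen in fewer than two tweets;
# max_features=20000 caps the vocabulary size.
vectorizer = TfidfVectorizer(min_df=2, max_features=20000)

texts = ["fire in the city", "fire near the forest", "earthquake in the city"]
X = vectorizer.fit_transform(texts)
print(sorted(vectorizer.vocabulary_))  # rare tokens like "earthquake" are filtered out
```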
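A sketch covering both the stratified cross-validation and the threshold search. It assumes X (TF-IDF matrix) and y (labels) as produced above; the fold count and the 0.05 grid step are assumptions, since the description only fixes the 0.1–0.9 range.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def oof_probabilities(X, y, n_splits=5, seed=42):
    """Out-of-fold positive-class probabilities; folds preserve class balance."""
    y = np.asarray(y)
    oof = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(X, y):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[train_idx], y[train_idx])
        oof[val_idx] = clf.predict_proba(X[val_idx])[:, 1]
    return oof

def best_threshold(y, probs):
    """Pick the decision threshold in [0.1, 0.9] that maximizes F1."""
    thresholds = np.arange(0.1, 0.9001, 0.05)
    scores = [f1_score(y, (probs >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]

# oof = oof_probabilities(X, y)
# threshold, f1 = best_threshold(y, oof)
```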
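Finally, a sketch of retraining on the full labeled dataset and writing the submission. The file names and id/target columns follow the Kaggle competition's format; clean_text comes from the sketch above, and 0.45 is the optimum reported in the Results section.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Reuse the cleaning function from the preprocessing sketch above
vectorizer = TfidfVectorizer(min_df=2, max_features=20000)
X_full = vectorizer.fit_transform(train["text"].map(clean_text))
X_test = vectorizer.transform(test["text"].map(clean_text))

# Retrain Logistic Regression on all labeled data
clf = LogisticRegression(max_iter=1000)
clf.fit(X_full, train["target"])

# Apply the optimized threshold instead of the default 0.5
threshold = 0.45  # value reported in the Results section
probs = clf.predict_proba(X_test)[:, 1]
submission = pd.DataFrame({"id": test["id"], "target": (probs >= threshold).astype(int)})
submission.to_csv("submission.csv", index=False)
```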

3. Summary of Improvements

Aspect               Baseline            Improved Version
Preprocessing        Lowercase + strip   Regex cleaning + normalization
Validation           Single split        Stratified K-Fold
Features             TF-IDF              TF-IDF with frequency filtering
Decision threshold   Default 0.5         Optimized via F1
Robustness           Basic               Improved generalization

4. Results

Example evaluation metrics from the improved pipeline:

  • Validation F1 score: 0.7619
  • Optimal classification threshold: 0.45

🚀 Future improvements include:

  • Transformer-based models (e.g., BERT fine-tuning)
  • Data augmentation techniques
  • Error analysis and calibration
