A deep learning implementation of the Disaster Tweets classification task using a Transformer-based model (BERT).
This project is an upgraded version of my previous classical machine learning solution, moving from traditional feature engineering to pretrained language models.
The goal is to classify whether a tweet describes a real disaster event or not.
Compared with the earlier ML pipeline (TF-IDF + Logistic Regression / Embedding + MLP), this version:
- Uses a pretrained Transformer encoder (BERT)
- Applies minimal but effective text cleaning
- Implements a full PyTorch training pipeline
- Supports evaluation metrics and model checkpoint saving
Encoder:
- `bert-base-uncased` (HuggingFace Transformers): pretrained contextual embeddings
Classification Head:
- Dropout layer (regularization)
- Linear classification layer (binary classification)
Model pipeline:
Tweet Text
↓
Tokenizer
↓
Transformer Encoder
↓
Pooling
↓
Dropout
↓
Linear Layer
↓
Prediction
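For concreteness, here is a minimal sketch of this architecture in PyTorch. It assumes HuggingFace's `BertModel`; the class name `DisasterClassifier` and the dropout probability are illustrative, not taken from the project code.

```python
import torch.nn as nn
from transformers import BertModel

class DisasterClassifier(nn.Module):
    """Sketch: pretrained BERT encoder + dropout + linear binary head."""

    def __init__(self, model_name="bert-base-uncased", num_labels=2, dropout=0.3):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)  # regularization before the head
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output  # pooled [CLS] representation
        return self.classifier(self.dropout(pooled))
```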
Text cleaning steps:
- Replace URLs → URL
- Replace user mentions → USER
- Normalize whitespace
This keeps semantic information while reducing noise.
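A regex-based version of this cleaning step might look like the following sketch; the exact patterns used in the project may differ.

```python
import re

def clean_tweet(text: str) -> str:
    """Minimal cleaning: URLs -> URL, @mentions -> USER, collapse whitespace."""
    text = re.sub(r"https?://\S+|www\.\S+", "URL", text)  # replace URLs
    text = re.sub(r"@\w+", "USER", text)                  # replace mentions
    text = re.sub(r"\s+", " ", text).strip()              # normalize whitespace
    return text

print(clean_tweet("Fire near me!!  @user http://t.co/abc"))
# -> "Fire near me!! USER URL"
```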
Framework:
- PyTorch
- HuggingFace Transformers
Key configurations:
- Batch size: 16 (train) / 32 (validation)
- Max sequence length: 128
- Optimizer: AdamW
- Learning rate: 2e-5
- Scheduler: Linear warmup + decay
- Loss function: CrossEntropyLoss
- Epochs: 3
- Gradient clipping
- Stratified train/validation split
- Best model checkpoint saving (based on F1 score)
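The sketch below wires these settings together for a single training step, reusing the `DisasterClassifier` sketch above. The sample tweet, warmup step count, and batches-per-epoch value are placeholders.

```python
import torch
from torch.nn.utils import clip_grad_norm_
from transformers import BertTokenizer, get_linear_schedule_with_warmup

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = DisasterClassifier()  # sketch defined earlier

# Tokenize one tiny batch, truncated/padded to max length 128.
batch = tokenizer(
    ["Forest fire near La Ronge Sask. Canada"],
    padding="max_length", truncation=True, max_length=128,
    return_tensors="pt",
)
labels = torch.tensor([1])  # 1 = real disaster

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
total_steps = 3 * 100  # epochs * batches per epoch (placeholder count)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=total_steps)
loss_fn = torch.nn.CrossEntropyLoss()

# One training step with gradient clipping.
logits = model(batch["input_ids"], batch["attention_mask"])
loss = loss_fn(logits, labels)
loss.backward()
clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```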
The model is evaluated using:
- Accuracy
- F1 Score (primary metric)
- Validation loss
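A validation pass computing these metrics could look like the following sketch, assuming the loader yields (tokenized batch, labels) pairs; `evaluate` is a hypothetical helper built on scikit-learn's metrics.

```python
import numpy as np
import torch
from sklearn.metrics import accuracy_score, f1_score

@torch.no_grad()
def evaluate(model, loader, loss_fn, device="cpu"):
    """Hypothetical helper: mean loss, accuracy, and F1 over a loader."""
    model.eval()
    losses, preds, targets = [], [], []
    for batch, labels in loader:  # assumed (tokenized batch, labels) pairs
        logits = model(batch["input_ids"].to(device),
                       batch["attention_mask"].to(device))
        losses.append(loss_fn(logits, labels.to(device)).item())
        preds.extend(logits.argmax(dim=-1).cpu().tolist())
        targets.extend(labels.tolist())
    return np.mean(losses), accuracy_score(targets, preds), f1_score(targets, preds)
```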
Example output format:
Epoch 1/3
Train loss: 0.4845
Val loss: 0.3791
Val acc: 0.8444
Val f1: 0.8121
Epoch 2/3
Train loss: 0.3344
Val loss: 0.3727
Val acc: 0.8536
Val f1: 0.8200
Epoch 3/3
Train loss: 0.2481
Val loss: 0.4566
Val acc: 0.8424
Val f1: 0.8148
The best model will be saved as: best_model.pt
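Checkpointing on the primary metric can be as simple as the following sketch; `save_if_best` is a hypothetical helper, not the project's actual function.

```python
import torch

def save_if_best(model, val_f1, best_f1, path="best_model.pt"):
    """Hypothetical helper: save state_dict when validation F1 improves."""
    if val_f1 > best_f1:
        torch.save(model.state_dict(), path)
        return val_f1  # new best F1
    return best_f1

# Usage inside the epoch loop:
# best_f1 = save_if_best(model, val_f1, best_f1)
```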
Project structure:
data/
└── raw/
├── sample_submission.csv
├── train.csv
└── test.csv
src/
└── training scripts
best_model.pt
README.md
Install dependencies:
pip install torch transformers scikit-learn pandas numpy
Place the dataset at:
data/raw/train.csv
data/raw/test.csv
(Kaggle Disaster Tweets dataset)
Run training:
python train.py
Through this project I:
- Implemented Transformer fine-tuning in PyTorch
- Learned tokenizer–encoder integration
- Understood training dynamics of pretrained LMs
- Practiced evaluation and checkpointing
- Compared classical ML vs deep learning NLP pipelines
Potential next steps:
- Hyperparameter tuning
- Larger pretrained models (RoBERTa, DeBERTa)
- Data augmentation
- LoRA / parameter-efficient fine-tuning
- Error analysis & explainability
Kaggle:
Natural Language Processing with Disaster Tweets
https://www.kaggle.com/competitions/nlp-getting-started/data