Natural Language Processing with Disaster Tweets
This project is licensed under The GNU GPL v3
Please have a look at the version history of each notebook.
Statistical models:
Deep learning models:
- RNN
- RNN with Attention
- CNN
- Multi-channel CNN with RNN - unidirectional & bidirectional
- Multi-channel CNN with RNN (concat) - unidirectional & bidirectional
- LLMs
To view the help for a training or testing script, run:
python <script>.py --help
Note: use these scripts at your own risk, since I don't normally re-train the models on my personal PC.
The implementations use different text preprocessing methods, but most of them follow these steps:
- Removing emojis
- Removing html
- Removing URLs
- Removing punctuation
- Lowercasing and collapsing multiple spaces
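The steps above can be sketched with the standard library alone. This is a minimal illustration, not the code used in the notebooks: the function name `clean_tweet`, the emoji ranges, and the regular expressions are all assumptions chosen for the example.

```python
import html
import re
import string

# Rough emoji ranges; an assumption for illustration, not an exhaustive list.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\uFE0F"                 # variation selector that often follows an emoji
    "]+"
)
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Apply the listed steps in order (hypothetical helper)."""
    text = EMOJI_RE.sub(" ", text)        # remove emojis
    text = HTML_TAG_RE.sub(" ", text)     # remove HTML tags
    text = html.unescape(text)            # decode entities such as &amp;
    text = URL_RE.sub(" ", text)          # remove URLs
    text = text.translate(PUNCT_TABLE)    # remove ASCII punctuation
    text = text.lower()                   # lowercase
    return re.sub(r"\s+", " ", text).strip()  # collapse multiple spaces


print(clean_tweet("Forest <b>FIRE</b> near La Ronge!! 🔥 https://t.co/abc123 &amp; more"))
# forest fire near la ronge more
```

Removing URLs before punctuation matters here: stripping punctuation first would break `https://...` into plain word fragments that the URL pattern no longer matches.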
However, some pretrained models require their own preprocessing method instead:
- BERTweet uses TweetTokenizer to mask and replace some tokens.
- Twitter RoBERTa Sentiment requires masking usernames and URLs as specific tokens.
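As a rough illustration of the username and URL masking, the sketch below follows the placeholder convention (`@user` and `http`) that I recall from the CardiffNLP model cards. Treat the exact placeholder strings and the function name `mask_tweet` as assumptions and check the model card of the checkpoint you fine-tune.

```python
def mask_tweet(text: str) -> str:
    """Replace user mentions and links with placeholder tokens (assumed convention)."""
    masked = []
    for token in text.split(" "):
        if token.startswith("@") and len(token) > 1:
            token = "@user"   # any mention becomes the same placeholder
        elif token.startswith("http"):
            token = "http"    # any link becomes the same placeholder
        masked.append(token)
    return " ".join(masked)


print(mask_tweet("@bob fire at http://t.co/x now"))
# @user fire at http now
```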
- Training data: the full training set.
- Hyperparameters: `sklearn.model_selection.GridSearchCV` automatically picks the best combination.
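A minimal sketch of how such a search can look with a TF-IDF vectorizer and one of the statistical models. The toy texts, the parameter grid, the `f1` scoring, and `cv=2` are assumptions for illustration; the notebooks may search different parameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data standing in for the tweets (1 = disaster, 0 = not a disaster).
texts = [
    "forest fire near the town", "flood waters rising fast",
    "earthquake shakes the city", "wildfire evacuation ordered",
    "i love this sunny day", "great coffee this morning",
    "watching a movie tonight", "my new phone is awesome",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: step name + "__" + parameter name selects what to vary.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

Putting the vectorizer inside the pipeline means it is re-fitted on each training fold, so the cross-validation score is not inflated by vocabulary learned from the held-out fold.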
| Hyperparameter | Value |
|---|---|
| Train:test | 8:2 |
| Batch size (train/test) | 64/32 |
| Learning rate | 1e-4 |
| Embedding dim | 64 |
| Epochs | 10 |
| Vocab size | 10000 |
Click to view
| Hyperparameter | Value |
|---|---|
| Train:test | 8:2 |
| Batch size (Train/test) | 64/32 |
| Filter size | 100 |
| Window size | [3, 4, 5] |
| L2 regularization | 3 |
| Dropout rate | 0.5 |
| Dense unit | 64 |
| Learning rate | 1e-4 |
| Epochs | 100 |
| Vocab size | 10000 |
| Early stopping | 20 epochs |
| Classification threshold | 0.5 |
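The early stopping row can be read as a patience value: training stops once that many epochs pass without improvement. The plain-Python sketch below shows the rule. The monitored quantity (validation loss) and the helper name `run_with_early_stopping` are assumptions, since the table does not state them.

```python
def run_with_early_stopping(val_losses, patience=20, max_epochs=100):
    """Return (epochs_run, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            # Improvement: remember it and reset the waiting period.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop.
            break
    return epochs_run, best_epoch


# The loss improves for 3 epochs and then stalls.
losses = [1.0, 0.8, 0.7] + [0.9] * 50
print(run_with_early_stopping(losses, patience=20))
# (23, 3)
```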
CNN & RNN feed model:
| Hyperparameter | Value |
|---|---|
| Train:test | 8:2 |
| Batch size (Train/test) | 64/32 |
| Recurrent units | 512 |
| Filter size | 200 |
| Window size | [1, 2, 3] |
| Dropout rate | 0.5 |
| Dense unit | 64 |
| Learning rate | 1e-4 |
| Epochs | 100 |
| Vocab size | 10000 |
| Early stopping | 20 epochs |
| Classification threshold | 0.5 |
CNN & BiRNN feed model:
| Hyperparameter | Value |
|---|---|
| Train:test | 8:2 |
| Batch size (Train/test) | 64/32 |
| Recurrent units | 512 |
| Filter size | 200 |
| Window size | [1, 2, 3] |
| Dropout rate | 0.5 |
| Dense unit | 64 |
| Learning rate | 1e-4 |
| Epochs | 100 |
| Vocab size | 10000 |
| Early stopping | 10 epochs |
| Classification threshold | 0.5 |
CNN & RNN concat model:
| Hyperparameter | Value |
|---|---|
| Train:test | 8:2 |
| Batch size (Train/test) | 64/32 |
| Recurrent units | 512 |
| Filter size | 200 |
| Window size | [1, 2, 3] |
| Dropout rate | 0.5 |
| Dense unit | 64 |
| Learning rate | 1e-4 |
| Epochs | 100 |
| Vocab size | 10000 |
| Early stopping | 5 epochs |
| Classification threshold | 0.5 |
CNN & BiRNN concat model:
| Hyperparameter | Value |
|---|---|
| Train:test | 8:2 |
| Batch size (Train/test) | 64/32 |
| Recurrent units | 512 |
| Filter size | 200 |
| Window size | [1, 2, 3] |
| Dropout rate | 0.5 |
| Dense unit | 64 |
| Learning rate | 1e-4 |
| Epochs | 100 |
| Vocab size | 10000 |
| Early stopping | 10 epochs |
| Classification threshold | 0.5 |
Click to view
| Hyperparameter | Value |
|---|---|
| Train:dev:test ratio | 6:2:2 |
| Batch size | 64 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Epochs | 50 |
| Early stopping | 5 epochs |
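A 6:2:2 train:dev:test split can be sketched as follows. The helper name `split_indices` is hypothetical, and using seed 42 for the shuffle is an assumption based on the random seed listed in the experiment setup; the notebooks may split the data differently.

```python
import random


def split_indices(n, ratios=(6, 2, 2), seed=42):
    """Shuffle range(n) reproducibly and cut it into train/dev/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)

    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_dev = n * ratios[1] // total

    train = indices[:n_train]
    dev = indices[n_train:n_train + n_dev]
    test = indices[n_train + n_dev:]  # the remainder absorbs rounding leftovers
    return train, dev, test


train, dev, test = split_indices(100)
print(len(train), len(dev), len(test))
# 60 20 20
```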
Some large LLMs cannot be trained with the hyperparameters in the LLMs section. To fit those models into the Kaggle GPU's memory, I reduced the batch size and learning rate to the following values:
| Hyperparameter | Value |
|---|---|
| Train:dev:test ratio | 6:2:2 |
| Batch size | 32 |
| Learning rate | 1e-5 |
| Weight decay | 0.01 |
| Epochs | 50 |
| Early stopping | 5 epochs |
All remaining hyperparameters stay the same as for LLMs.
Experiment setup: All experiments were conducted under the same Kaggle environment:
| Configuration | Value |
|---|---|
| CPU | Intel Xeon 2.20 GHz, 4 vCPU cores |
| Memory | 32 GB |
| GPU | NVIDIA Tesla T4 (x2) (LLMs) or P100 (RNNs, CNNs) |
| Random seed | 42 |
Click to view
| Model group | Model | Vectorizer | Training configurations | Public F1 |
|---|---|---|---|---|
| | K-Means | TF-IDF | [1] | 0.50658 |
| Linear Models | Logistic Regression | TF-IDF | [1] | 0.80171 |
| | Stochastic Gradient Descent | TF-IDF | [1] | 0.80386 |
| | Support Vector Machine | TF-IDF | [1] | 0.80140 |
| | Random Forest | TF-IDF | [1] | 0.78792 |
| AdaBoost | Decision Tree | TF-IDF | [1] | 0.72847 |
| Bagging | Decision Tree | TF-IDF | [1] | 0.74348 |
| | Decision Tree | TF-IDF | [1] | 0.71069 |
| | Gradient Boosting | TF-IDF | [1] | 0.73889 |
| | XGBoost | TF-IDF | [1] | 0.74992 |
| Naive Bayes | Multinomial Naive Bayes | TF-IDF | [1] | 0.80447 |
| | Complement Naive Bayes | TF-IDF | [1] | 0.79589 |
| | Multilayer Perceptrons | TF-IDF | [1] | 0.75911 |
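For reference, the F1 score reported in these tables is the harmonic mean of precision and recall on the positive (disaster) class. The sketch below computes it from predicted probabilities using the 0.5 classification threshold listed in the hyperparameter tables. The function name `f1_at_threshold` and the choice of `>=` at the threshold are assumptions.

```python
def f1_at_threshold(y_true, y_prob, threshold=0.5):
    """Binarize probabilities at `threshold` and return F1 for the positive class."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.8, 0.1]
print(round(f1_at_threshold(y_true, y_prob), 4))
# 0.6667
```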
Click to view
| Model (with paper link) | Variant | Pretrain parameters | Training configurations | Public F1 | Notes |
|---|---|---|---|---|---|
| RNN | 1-layer Bidirectional LSTM | 714,369 | [3] | 0.77352 | |
| | 2-layers stacked Bidirectional LSTM | 751,489 | [3] | 0.78026 | |
| | 1-layer Bidirectional GRU | 698,241 | [3] | 0.77536 | |
| | 2-layers stacked Bidirectional GRU | 725,249 | [3] | 0.77566 | |
| RNN + Attention | 1-layer Bidirectional LSTM + Dot Attention | 714,369 | [3] | 0.76892 | |
| | 1-layer Bidirectional GRU + Dot Attention | 698,241 | [3] | 0.78516 | |
| | 1-layer Bidirectional LSTM + General Attention | 730,881 | [3] | 0.77995 | |
| | 1-layer Bidirectional GRU + General Attention | 714,753 | [3] | 0.77719 | |
| | 1-layer Bidirectional LSTM + Concatenate Attention | 730,946 | [3] | 0.78148 | |
| | 1-layer Bidirectional GRU + Concatenate Attention | 714,818 | [3] | 0.77873 | |
| Deep CNN (random + pretrained embedding) | CNN non-static (random embedding) | 299,629 | [3] | 0.71345 | Embedding dimension = 25 (equal to the GloVe vector size) |
| | CNN static (glove-twitter-25) | 299,629 | [3] | 0.77689 | |
| | CNN static (glove-twitter-50) | 579,629 | [3] | 0.78700 | |
| | CNN static (glove-twitter-100) | 1,139,629 | [3] | 0.79374 | |
| | CNN static (glove-twitter-200) | 2,259,629 | [3] | 0.79711 | |
| | CNN static (fasttext-wiki-news-subwords-300) | 3,379,629 | [3] | 0.57033 | |
| | CNN non-static (glove-twitter-25) | 299,629 | [3] | 0.80478 | |
| | CNN non-static (glove-twitter-50) | 579,629 | [3] | 0.79619 | |
| | CNN non-static (glove-twitter-100) | 1,139,629 | [3] | 0.79987 | |
| | CNN non-static (glove-twitter-200) | 2,259,629 | [3] | 0.80140 | |
| | CNN non-static (fasttext-wiki-news-subwords-300) | 3,379,629 | [3] | 0.73980 | |
| Multi-channel CNN and RNN | Random embedding (static) + Unidirectional LSTM | 3,326,169 | [3] | 0.67391 | |
| | Random embedding (static) + Bidirectional LSTM | 4,411,609 | [3] | 0.68709 | |
| | Random embedding (static) + Unidirectional GRU | (todo) | [3] | (todo) | |
| | Random embedding (static) + Bidirectional GRU | (todo) | [3] | (todo) | |
| | GloVe (glove-twitter-25, static) + Unidirectional LSTM | 1,366,169 | [3] | 0.68372 | |
| | GloVe (glove-twitter-25, static) + Bidirectional LSTM | 2,451,609 | [3] | 0.78976 | |
| | GloVe (glove-twitter-50, static) + Unidirectional LSTM | 1,646,169 | [3] | 0.77781 | |
| | GloVe (glove-twitter-50, static) + Bidirectional LSTM | 2,731,609 | [3] | 0.78148 | |
| | GloVe (glove-twitter-100, static) + Unidirectional LSTM | 2,206,169 | [3] | 0.73460 | |
| | GloVe (glove-twitter-100, static) + Bidirectional LSTM | 3,291,609 | [3] | 0.78700 | |
| | GloVe (glove-twitter-200, static) + Unidirectional LSTM | 3,326,169 | [3] | 0.71835 | |
| | GloVe (glove-twitter-200, static) + Bidirectional LSTM | 4,411,609 | [3] | 0.76310 | |
| | Random embedding (nonstatic) + Unidirectional LSTM | 3,326,169 | [3] | 0.71284 | |
| | Random embedding (nonstatic) + Bidirectional LSTM | 4,411,609 | [3] | 0.75390 | |
| | Random embedding (nonstatic) + Unidirectional GRU | (todo) | [3] | (todo) | |
| | Random embedding (nonstatic) + Bidirectional GRU | (todo) | [3] | (todo) | |
| | GloVe (glove-twitter-25, nonstatic) + Unidirectional LSTM | 1,366,169 | [3] | 0.75942 | |
| | GloVe (glove-twitter-25, nonstatic) + Bidirectional LSTM | 2,451,609 | [3] | 0.79436 | |
| | GloVe (glove-twitter-50, nonstatic) + Unidirectional LSTM | 1,646,169 | [3] | 0.78240 | |
| | GloVe (glove-twitter-50, nonstatic) + Bidirectional LSTM | 2,731,609 | [3] | 0.79957 | |
| | GloVe (glove-twitter-100, nonstatic) + Unidirectional LSTM | 2,206,169 | [3] | 0.78700 | |
| | GloVe (glove-twitter-100, nonstatic) + Bidirectional LSTM | 3,291,609 | [3] | 0.76064 | |
| | GloVe (glove-twitter-200, nonstatic) + Unidirectional LSTM | 3,326,169 | [3] | 0.78179 | |
| | GloVe (glove-twitter-200, nonstatic) + Bidirectional LSTM | 4,411,609 | [3] | 0.77474 | |
| Multi-channel CNN and RNN (concat) | Random embedding (static) + Unidirectional LSTM | 3,772,121 | [3] | 0.78394 | Embedding dimension = 200 |
| | Random embedding (static) + Bidirectional LSTM | 5,265,113 | [3] | 0.78700 | |
| | Random embedding (static) + Unidirectional GRU | 3,408,601 | [3] | 0.78302 | |
| | Random embedding (static) + Bidirectional GRU | 4,538,073 | [3] | 0.77627 | |
| | GloVe (glove-twitter-25, static) + Unidirectional LSTM | 1,453,721 | [3] | 0.80110 | |
| | GloVe (glove-twitter-25, static) + Bidirectional LSTM | 2,588,313 | [3] | 0.79436 | |
| | GloVe (glove-twitter-25, static) + Unidirectional GRU | 1,179,801 | [3] | 0.80294 | |
| | GloVe (glove-twitter-25, static) + Bidirectional GRU | 2,040,473 | [3] | 0.79528 | |
| | GloVe (glove-twitter-50, static) + Unidirectional LSTM | 1,784,921 | [3] | 0.81091 | |
| | GloVe (glove-twitter-50, static) + Bidirectional LSTM | 2,970,713 | [3] | 0.81366 | |
| | GloVe (glove-twitter-50, static) + Unidirectional GRU | 1,498,201 | [3] | 0.80907 | |
| | GloVe (glove-twitter-50, static) + Bidirectional GRU | 2,397,273 | [3] | 0.80937 | |
| | GloVe (glove-twitter-100, static) + Unidirectional LSTM | 2,447,321 | [3] | 0.80539 | |
| | GloVe (glove-twitter-100, static) + Bidirectional LSTM | 3,735,513 | [3] | 0.81305 | |
| | GloVe (glove-twitter-100, static) + Unidirectional GRU | (todo) | [3] | (todo) | |
| | GloVe (glove-twitter-100, static) + Bidirectional GRU | 3,110,873 | [3] | 0.80907 | |
| | GloVe (glove-twitter-200, static) + Unidirectional LSTM | 3,772,121 | [3] | 0.80723 | |
| | GloVe (glove-twitter-200, static) + Bidirectional LSTM | 5,265,113 | [3] | 0.81152 | |
| | GloVe (glove-twitter-200, static) + Unidirectional GRU | 3,408,601 | [3] | 3,408,601 | |
| | GloVe (glove-twitter-200, static) + Bidirectional GRU | 4,538,073 | [3] | 0.80815 | |
| | Random embedding (nonstatic) + Unidirectional LSTM | 3,772,121 | [3] | 0.74164 | |
| | Random embedding (nonstatic) + Bidirectional LSTM | 5,265,113 | [3] | 0.77444 | |
| | Random embedding (nonstatic) + Unidirectional GRU | 3,408,601 | [3] | 0.80171 | |
| | Random embedding (nonstatic) + Bidirectional GRU | 4,538,073 | [3] | 0.80049 | |
| | GloVe (glove-twitter-25, nonstatic) + Unidirectional LSTM | 1,453,721 | [3] | 0.80876 | |
| | GloVe (glove-twitter-25, nonstatic) + Bidirectional LSTM | 2,588,313 | [3] | 0.79834 | |
| | GloVe (glove-twitter-25, nonstatic) + Unidirectional GRU | 1,179,801 | [3] | 0.80815 | |
| | GloVe (glove-twitter-25, nonstatic) + Bidirectional GRU | 2,040,473 | [3] | 0.79650 | |
| | GloVe (glove-twitter-50, nonstatic) + Unidirectional LSTM | 1,784,921 | [3] | 0.80539 | |
| | GloVe (glove-twitter-50, nonstatic) + Bidirectional LSTM | 2,970,713 | [3] | 0.81213 | |
| | GloVe (glove-twitter-50, nonstatic) + Unidirectional GRU | 1,498,201 | [3] | 0.80968 | |
| | GloVe (glove-twitter-50, nonstatic) + Bidirectional GRU | 2,397,273 | [3] | 0.80386 | |
| | GloVe (glove-twitter-100, nonstatic) + Unidirectional LSTM | 2,447,321 | [3] | 0.81029 | |
| | GloVe (glove-twitter-100, nonstatic) + Bidirectional LSTM | 3,735,513 | [3] | 0.80968 | |
| | GloVe (glove-twitter-100, nonstatic) + Unidirectional GRU | 2,135,001 | [3] | 0.80570 | |
| | GloVe (glove-twitter-100, nonstatic) + Bidirectional GRU | 3,110,873 | [3] | 0.80815 | |
| | GloVe (glove-twitter-200, nonstatic) + Unidirectional LSTM | 3,772,121 | [3] | 0.80508 | |
| | GloVe (glove-twitter-200, nonstatic) + Bidirectional LSTM | 5,265,113 | [3] | 0.81182 | |
| | GloVe (glove-twitter-200, nonstatic) + Unidirectional GRU | 3,408,601 | [3] | 0.81244 | |
| | GloVe (glove-twitter-200, nonstatic) + Bidirectional GRU | 4,538,073 | [3] | 0.80999 | |
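The parameter counts for the two 1-layer bidirectional RNN rows can be reproduced by hand. The sketch below takes the vocabulary size (10000) and embedding dimension (64) from the hyperparameter table. The rest of the layout is an assumption that is not stated in this README: 64 recurrent units per direction, a 64-unit dense layer, a single output unit, and a GRU with two bias vectors per gate (the Keras `reset_after=True` convention). That layout was picked only because it matches the reported totals.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and one bias vector.
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim, units):
    # 3 gates; two bias vectors per gate (assumed Keras reset_after=True layout).
    return 3 * (units * (input_dim + units) + 2 * units)


def dense_params(input_dim, units):
    return input_dim * units + units


def bi_rnn_classifier_params(cell_params, vocab=10_000, emb=64, units=64, dense=64):
    """Embedding -> bidirectional RNN -> dense -> 1 output unit (assumed layout)."""
    embedding = vocab * emb
    recurrent = 2 * cell_params(emb, units)  # forward + backward direction
    head = dense_params(2 * units, dense) + dense_params(dense, 1)
    return embedding + recurrent + head


print(bi_rnn_classifier_params(lstm_params))  # 714369
print(bi_rnn_classifier_params(gru_params))   # 698241
```

The embedding layer alone accounts for 640,000 of these parameters, which is why the LSTM and GRU totals are so close.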
Click to view
| Model (with paper link) | Variant | Pretrain parameters | Training configurations | Public F1 | Notes |
|---|---|---|---|---|---|
| ALBERT | base-v1 | 11M (huggingface) | [2] | 0.80907 | View list of parameters by huggingface here |
| | large-v1 | 17M (huggingface) | [2] | 0.80416 | |
| | xlarge-v1 | 58M (huggingface) | [4] | 0.81182 | |
| | xxlarge-v1 | 223M (huggingface) | [4] | 0.78853 | |
| | base-v2 | 11M (huggingface) | [2] | 0.79528 | |
| | large-v2 | 17M (huggingface) | [2] | 0.81520 | |
| | xlarge-v2 | 58M (huggingface) | [4] | 0.81703 | |
| | xxlarge-v2 | 223M (huggingface) | [4] | 0.80570 | |
| BART | base | 140M (facebook-research) | [2] | 0.82684 | View list of parameters by facebook-research here |
| | large | 400M (facebook-research) | [2] | 0.83726 | |
| | large-mnli | 400M (facebook-research) | [2] | 0.83450 | |
| | large-cnn | 400M (facebook-research) | [2] | 0.82347 | |
| BERT | base uncased | 110M (huggingface) | [2] | 0.82899 | View list of parameters by huggingface here |
| | base cased | 110M (huggingface) | [2] | 0.81060 | |
| | large uncased | 340M (huggingface) | [2] | 0.83052 | |
| | large cased | 340M (huggingface) | [2] | 0.82194 | |
| | large uncased whole word masking | 335M (huggingface) | [2] | 0.82255 | |
| | large cased whole word masking | 336M (huggingface) | [2] | 0.81244 | |
| | multilingual uncased | 168M (huggingface) | [2] | 0.81887 | |
| | multilingual cased | 179M (huggingface) | [2] | 0.81918 | |
| BERTweet | base | 135M (vinai) | [2] | 0.83726 | View list of parameters by vinai here |
| | covid19-base-uncased | 135M (vinai) | [2] | 0.84002 | |
| | covid19-base-cased | 135M (vinai) | [2] | 0.82960 | |
| | large | 335M (vinai) | [2] | 0.82899 | |
| BORT | base | 56.1M (amazon) | [2] | 0.74563 | Parameters from the original paper |
| DeBERTa | base | 100M (microsoft) | [2] | 0.81642 | View list of parameters by microsoft here |
| | base-mnli | 86M (microsoft) | [2] | 0.80661 | |
| | large | 350M (microsoft) | [4] | 0.84308 | |
| | large-mnli | 350M (microsoft) | [4] | 0.83757 | |
| DeBERTa v3 | xsmall | 22M (microsoft) | [2] | 0.80815 | View list of parameters by microsoft here |
| | small | 44M (microsoft) | [2] | 0.82408 | |
| | base | 86M (microsoft) | [2] | 0.83205 | |
| | large | 304M (microsoft) | [4] | 0.82745 | |
| | mdeberta-v3-base | 86M (microsoft) | [2] | 0.82929 | |
| DistilBERT | base uncased | 66M (huggingface) | [2] | 0.82439 | View list of parameters by huggingface here |
| | base cased | 65M (huggingface) | [2] | 0.82163 | |
| | multilingual cased | 134M (huggingface) | [2] | 0.80049 | |
| ELECTRA (discriminator) | small | 14M (google) | [2] | 0.81887 | View list of parameters by google here |
| | base | 110M (google) | [2] | 0.82776 | |
| | large | 335M (google) | [2] | 0.83726 | |
| RoBERTa | base | 125M (huggingface) | [2] | 0.82868 | View list of parameters by huggingface here |
| | large | 335M (huggingface) | [2] | 0.84033 | |
| | large | 355M (huggingface) | [2] | 0.84033 | |
| | distilroberta-large | 82M (huggingface) | [2] | 0.82960 | |
| SqueezeBERT | uncased | 51M (huggingface) | [2] | 0.80324 | View list of parameters by huggingface here |
| | mnli | 51M (huggingface) | [2] | 0.79987 | |
| | mnli-headless | 51M (huggingface) | [2] | 0.80416 | |
| Twitter RoBERTa Sentiment | base | N/A | [2] | 0.83389 | CardiffNLP has a huge list of Twitter pretrained models and these are just 3 of them. Try finetuning others (if you have time). |
| | base latest | N/A | [2] | 0.82776 | |
| | base 2021 | 124M (cardiffnlp) | [2] | 0.83083 | |
| XLM-RoBERTa | base | 270M (huggingface) | [2] | 0.82439 | View list of parameters by huggingface here |
| | large | 550M (huggingface) | [2] | 0.82500 | |
| XLNet | base cased | 110M (huggingface) | [2] | 0.82592 | View list of parameters by huggingface here |
| | large cased | 340M (huggingface) | [4] | 0.81612 | |