E-commerce platforms receive millions of product reviews daily. Manually reading and categorising them for quality control, product feedback, and customer satisfaction monitoring is not scalable. This project builds an automated sentiment classifier that processes reviews at scale, helping product and operations teams prioritise customer feedback efficiently.
Business question: Can we automatically classify product reviews as positive or negative with high enough accuracy to replace manual review tagging at scale?
- Source: Amazon Fine Food Reviews (Kaggle)
- Link: https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews
- Full size: 568,454 reviews (2002–2012)
- Used: 20,000 reviews (stratified sample for speed)
- Target variable: Sentiment derived from star rating
- Positive: 4-5 stars
- Negative: 1-2 stars
- Neutral (3 stars): removed from analysis
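Labels come straight from the star rating. A minimal sketch of the rule (the `Score` column name is from the Kaggle dataset; the repo's notebook may differ):

```python
import pandas as pd

# Load the raw Kaggle export and derive binary sentiment labels
df = pd.read_csv("data/raw/Reviews.csv")
df = df[df["Score"] != 3]                         # drop neutral 3-star reviews
df["sentiment"] = (df["Score"] >= 4).astype(int)  # 1 = positive (4-5 stars), 0 = negative (1-2)
```

Resulting class distribution in the 20,000-review sample: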
| Class | Count | % of sample |
|---|---|---|
| Positive | 16,842 | 84.2% |
| Negative | 3,158 | 15.8% |
Class imbalance noted — F1-score used as primary metric, not accuracy.
nlp-sentiment-analysis/
|
|-- app/
| |-- api.py # FastAPI backend for prediction requests
| `-- streamlit_app.py # Streamlit interface for interactive testing
|
|-- data/
| |-- raw/
| | `-- Reviews.csv # Original Amazon Fine Food Reviews data
| `-- processed/
| `-- cleaned_reviews.csv
|
|-- models/
| |-- lr_model.pkl # Saved Logistic Regression model
| `-- tfidf.pkl # Saved TF-IDF vectorizer
|
|-- notebooks/
| |-- 01_data_cleaning.ipynb
| |-- 02_eda.ipynb
| |-- 03_model_training.ipynb
| `-- 04_transformer_comparison.ipynb
|
|-- outputs/
| `-- figures/
| |-- class_distribution.png
| |-- pos_wc.png
| `-- neg_wc.png
|
|-- src/
| |-- preprocessing.py # Text cleaning
| |-- features.py # TF-IDF feature extraction
| |-- train.py # Model training
| |-- evaluate.py # Model evaluation
| `-- predict.py # Inference helper used by API and Streamlit
|
|-- requirements.txt
|-- requirements-dev.txt
|-- setup.py
`-- README.md
Raw review text
-> Convert to string
-> Lowercase
-> Remove HTML tags
-> Expand n't to not
-> Remove non-alphabetic characters
-> Return cleaned text
Example:
ORIGINAL: "This is <b>AMAZING</b>!! Best dog food I've ever bought. 5/5 stars!!!"
CLEANED: "this is amazing best dog food ive ever bought stars"
The preprocessing logic is implemented in src/preprocessing.py.
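A minimal sketch of those steps, assuming a single `clean_text` function (the exact implementation in src/preprocessing.py may differ):

```python
import re

def clean_text(text):
    """Clean one raw review following the pipeline above."""
    text = str(text).lower()               # convert to string, lowercase
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = text.replace("n't", " not")     # expand n't to not
    text = re.sub(r"[^a-z\s]", "", text)   # remove non-alphabetic characters
    return " ".join(text.split())          # normalise whitespace

print(clean_text("This is <b>AMAZING</b>!! Best dog food I've ever bought. 5/5 stars!!!"))
# this is amazing best dog food ive ever bought stars
```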
TF-IDF converts cleaned review text into numerical features that machine learning models can use.
Settings used:
- `max_features=5000`: top 5,000 vocabulary terms
- `ngram_range=(1, 2)`: single words and two-word phrases
- `min_df=5`: ignore terms appearing in fewer than 5 reviews
The vectorizer is built in src/features.py and saved as models/tfidf.pkl.
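A sketch of the vectorizer setup with those settings (the `cleaned_text` column name is an assumption):

```python
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("data/processed/cleaned_reviews.csv")

vectorizer = TfidfVectorizer(
    max_features=5000,   # top 5,000 vocabulary terms
    ngram_range=(1, 2),  # single words and two-word phrases
    min_df=5,            # ignore terms in fewer than 5 reviews
)
X = vectorizer.fit_transform(df["cleaned_text"])
joblib.dump(vectorizer, "models/tfidf.pkl")
```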
The main model is Logistic Regression trained on TF-IDF features.
Training details:
- `class_weight='balanced'` to handle class imbalance
- `max_iter=1000`
- `GridSearchCV` over `C=[0.1, 1, 5]`
- 3-fold cross-validation
- F1-score optimization
The trained model is saved as models/lr_model.pkl.
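Putting those details together, training might look like the sketch below (`X` comes from the feature step above; the `sentiment` column name and `random_state` are assumptions):

```python
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

y = df["sentiment"]  # binary labels derived from star ratings

# 80/20 stratified train-test split as used in notebook 03
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

grid = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.1, 1, 5]},
    cv=3,          # 3-fold cross-validation
    scoring="f1",  # optimise for F1-score
)
grid.fit(X_train, y_train)
joblib.dump(grid.best_estimator_, "models/lr_model.pkl")
```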
Because the dataset is imbalanced, accuracy alone is misleading. F1-score, precision, recall, and ROC-AUC are used to evaluate model performance.
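Continuing from the training sketch, the held-out metrics can be produced with scikit-learn directly:

```python
from sklearn.metrics import classification_report, roc_auc_score

y_pred = grid.predict(X_test)
y_prob = grid.predict_proba(X_test)[:, 1]  # probability of the positive class

print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))
print("ROC-AUC:", round(roc_auc_score(y_test, y_prob), 3))
```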
The inference helper in src/predict.py (sketched after this list):
- loads the saved Logistic Regression model
- loads the saved TF-IDF vectorizer
- cleans new review text
- converts it to TF-IDF features
- returns the predicted sentiment label and confidence score
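A minimal sketch of that helper (the actual signature in src/predict.py may differ, and `clean_text` is the assumed name of the preprocessing function):

```python
import joblib

from src.preprocessing import clean_text

model = joblib.load("models/lr_model.pkl")
vectorizer = joblib.load("models/tfidf.pkl")

def predict(text):
    """Return the predicted sentiment label and a confidence score."""
    features = vectorizer.transform([clean_text(text)])
    prob_positive = model.predict_proba(features)[0, 1]
    if prob_positive >= 0.5:
        return "Positive", round(prob_positive, 2)
    return "Negative", round(1 - prob_positive, 2)
```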
app/api.py exposes a /predict endpoint that accepts review text and
returns the model prediction as JSON.
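The endpoint can be as small as the sketch below; field names match the JSON examples, but the actual app/api.py may differ:

```python
from fastapi import FastAPI
from pydantic import BaseModel

from src.predict import predict  # inference helper sketched above

app = FastAPI()

class Review(BaseModel):
    text: str

@app.post("/predict")
def predict_sentiment(review: Review):
    label, confidence = predict(review.text)
    return {"label": label, "confidence": confidence}
```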
Run the backend:
```
uvicorn app.api:app --reload
```

Example request body:

```json
{
  "text": "This product is fresh and delicious."
}
```

Example response:

```json
{
  "label": "Positive",
  "confidence": 0.97
}
```

app/streamlit_app.py provides a simple browser interface where users can
enter a review and instantly view the predicted sentiment.
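The interface needs only a few lines of Streamlit; a sketch with illustrative widget labels:

```python
import streamlit as st

from src.predict import predict  # inference helper sketched earlier

st.title("Review Sentiment Classifier")
review = st.text_area("Paste a product review:")

if st.button("Classify") and review.strip():
    label, confidence = predict(review)
    st.write(f"Prediction: {label} (confidence: {confidence:.0%})")
```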
Run the app:
```
streamlit run app/streamlit_app.py
```

Results from notebooks/03_model_training.ipynb using an 80/20 stratified
train-test split:
| Model | Accuracy | Weighted Precision | Weighted Recall | Weighted F1 | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression + TF-IDF | 91.0% | 0.92 | 0.91 | 0.91 | 0.954 |
Class-level performance:
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Negative | 0.66 | 0.83 | 0.74 | 620 |
| Positive | 0.97 | 0.92 | 0.94 | 3,380 |
The transformer comparison notebook uses
distilbert-base-uncased-finetuned-sst-2-english on a 1,000-review sample
and reports 85% accuracy with weighted F1-score of 0.87.
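That comparison uses the off-the-shelf SST-2 checkpoint with no fine-tuning; loading it is a one-liner with the transformers pipeline API (a sketch, not the notebook's exact code):

```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("This coffee is absolutely amazing!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```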
Logistic Regression selected as production model because it is fast, lightweight, interpretable, and already saved for inference through the Python helper, FastAPI backend, and Streamlit app.
Most indicative positive terms:
excellent, love, perfect, great, fresh, delicious, best, wonderful, amazing, highly, recommend, fantastic, quality, pleased, satisfied

Most indicative negative terms:
terrible, awful, disappointed, waste, return, horrible, disgusting, never, unfortunately, bad, worse, bland, stale, rancid, useless
The model struggles most with:
- Sarcasm — "Oh great, another broken product" (classified as positive)
- Conditional praise — "Not bad for the price" (ambiguous)
- Short reviews — "Ok." or "Fine." (insufficient signal)
These are known limitations of bag-of-words approaches. The DistilBERT notebook explores a transformer baseline, but the saved production model remains the TF-IDF + Logistic Regression pipeline.
At 100,000 reviews/month, the Logistic Regression model can:
- Auto-tag ~91,000 reviews correctly (extrapolating from the 91% test accuracy)
- Flag ~9,000 reviews for manual review or confidence-based checks (sketched after this list)
- Process the entire batch in under 2 minutes
- Reduce manual review time by an estimated 85%
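The confidence-based checks could simply threshold the helper's confidence score; a sketch (the 0.80 cutoff is illustrative, not from the project):

```python
from src.predict import predict  # inference helper sketched earlier

def triage(reviews, threshold=0.80):
    """Auto-tag confident predictions; queue the rest for manual review."""
    auto_tagged, needs_review = [], []
    for text in reviews:
        label, confidence = predict(text)
        if confidence >= threshold:
            auto_tagged.append((text, label))
        else:
            needs_review.append(text)
    return auto_tagged, needs_review
```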
Test the model on any custom review text:
predict("This coffee is absolutely amazing, best I have ever had!")
# Positive (confidence: 96%)
predict("Terrible product, broke after one use. Complete waste of money.")
# Negative (confidence: 99%)
predict("It is okay, nothing special but does the job.")
# Negative (confidence: 75%)git clone https://github.com/Sumant40/NLP-Sentiment-Analysis.git
cd nlp-sentiment-analysis
pip install -r requirements.txt
# Download NLTK stopwords (first time only)
python -c "import nltk; nltk.download('stopwords')"
jupyter notebook
# Run notebooks in order: 01 → 02 → 03 → 04pandas==2.0.3
numpy==1.24.3
matplotlib==3.7.2
seaborn==0.12.2
scikit-learn==1.3.0
nltk==3.8.1
wordcloud==1.9.2
transformers==4.30.0
torch==2.0.1
jupyter==1.0.0
| Factor | TF-IDF + Logistic Regression | DistilBERT |
|---|---|---|
| Evaluation used | 4,000-review stratified test split | 1,000-review sample |
| Accuracy | 91.0% | 85.0% |
| Weighted F1-score | 0.91 | 0.87 |
| Inference speed | Very fast | Slower on CPU |
| Memory usage | Low | Higher |
| Explainability | High (feature weights) | Low (black box) |
| Project role | Saved production model | Baseline comparison notebook |
- Training data spans 2002–2012; language patterns may have evolved since
- Model trained on food reviews — may not generalise to other domains without fine-tuning
- Does not handle multi-language reviews
- Future work: fine-tune DistilBERT on domain-specific data, add aspect-based sentiment (e.g. packaging vs taste vs value), and move the FastAPI service to a production deployment
Sumant Jadiyappagoudar
Bioengineering graduate | Data Science & Computational Biology
LinkedIn | GitHub | Email
Part of my data science portfolio. Other projects: HR Attrition ML | SQL + Dashboard | A/B Testing | Pharma Analytics