Glassdoor Reviews Mining: Topic Discovery, Clustering & Association Rules

An unsupervised-learning analysis of ~77,000 Glassdoor employee reviews that identifies what drives negative workplace experiences and surfaces actionable patterns for HR teams.

📄 View Full Report on RPubs | 🔗 Dataset on Kaggle

Business Question

What drives negative employee reviews on Glassdoor, and what actionable patterns can HR teams use to improve employee satisfaction?

Data

| Attribute | Value |
|---|---|
| Source | Kaggle Glassdoor Job Reviews |
| Time Period | 2020 (subset) |
| Raw Size | 700,000+ reviews |
| After Cleaning | 77,397 reviews |
| Features Used | `pros`, `cons`, `headline` (text); `overall_rating`, `recommend`, sub-ratings |

Preprocessing:

Text cleaning (lowercase, punctuation, numbers, whitespace removal)
Stopword removal + lemmatization
Z-score outlier removal (|z| > 3 on text length)
TF-IDF vectorization → 10,823 vocabulary terms
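
As a rough illustration of the steps above, the following base-R sketch cleans a few reviews, applies the z-score length filter, and builds a small TF-IDF matrix. The sample reviews, the tiny stopword list, and the helper names `clean_text` and `keep_by_length` are assumptions made up for this example. Lemmatization is omitted, and the actual pipeline in `analysis/main.R` may differ (the Tech Stack lists text2vec, tidytext and textstem for these steps).

```r
# Hypothetical sample reviews (not taken from the dataset).
reviews <- c(
  "Great benefits, but management is POOR!!",
  "Long hours; 12 hour shifts and low pay.",
  "Good culture and supportive leadership."
)

# Lowercase, replace punctuation and digits with spaces, collapse whitespace.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)
  x <- gsub("[[:digit:]]+", " ", x)
  trimws(gsub("\\s+", " ", x))
}
cleaned <- clean_text(reviews)

# Z-score filter on text length: keep documents with |z| <= threshold.
keep_by_length <- function(x, threshold = 3) {
  len <- nchar(x)
  z <- (len - mean(len)) / sd(len)
  abs(z) <= threshold
}
cleaned <- cleaned[keep_by_length(cleaned)]

# Tokenize on spaces and drop a tiny illustrative stopword list.
stopwords <- c("and", "but", "is", "the", "a")
tokens <- lapply(strsplit(cleaned, " ", fixed = TRUE),
                 function(tk) tk[!(tk %in% stopwords) & nzchar(tk)])

# Term-frequency matrix (documents x terms), normalized by document length.
vocab <- sort(unique(unlist(tokens)))
tf <- t(vapply(tokens, function(tk) {
  counts <- table(factor(tk, levels = vocab))
  as.numeric(counts) / max(length(tk), 1)
}, numeric(length(vocab))))
colnames(tf) <- vocab

# Inverse document frequency: log(N / number of documents containing the term).
idf <- log(nrow(tf) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)
```

With only three documents no |z| can exceed 3, so the length filter is a no-op here; it only removes reviews at corpus scale.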

Methods

Dimensionality Reduction

LDA (Latent Dirichlet Allocation): 10 topics extracted from TF-IDF matrix
Reduces the 10,823-term vocabulary to 10 interpretable topic probabilities per document

Clustering

Spherical K-Means (cosine distance) - k=11 clusters
Hierarchical Clustering (Ward's method) - for validation
Optimal k selected via silhouette analysis
Adjusted Rand Index = 0.63 (substantial agreement between methods)
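
The agreement check between the two clusterings can be sketched in base R. This is an illustration under assumptions, not the project's code: the data are synthetic, k is 2 rather than 11, and spherical k-means is approximated by running ordinary `kmeans` on rows scaled to unit length (for unit vectors, squared Euclidean distance equals 2 × (1 − cosine similarity)). The project itself lists the `skmeans` package for this step. `adjusted_rand_index` is a hypothetical helper written from the standard contingency-table formula.

```r
set.seed(42)

# Synthetic stand-in for a document-topic matrix: two groups of 20 documents.
x <- rbind(
  matrix(rnorm(60, mean = c(5, 1, 1), sd = 0.3), ncol = 3, byrow = TRUE),
  matrix(rnorm(60, mean = c(1, 5, 1), sd = 0.3), ncol = 3, byrow = TRUE)
)

# Scale each row to unit length so Euclidean distance tracks cosine distance.
x_unit <- x / sqrt(rowSums(x^2))

k <- 2
km <- kmeans(x_unit, centers = k, nstart = 10)                   # k-means labels
hc <- cutree(hclust(dist(x_unit), method = "ward.D2"), k = k)    # Ward labels

# Adjusted Rand Index from the contingency table of two labelings.
adjusted_rand_index <- function(a, b) {
  tab <- table(a, b)
  pairs <- function(n) n * (n - 1) / 2      # "n choose 2"
  sum_ij <- sum(pairs(tab))
  sum_a <- sum(pairs(rowSums(tab)))
  sum_b <- sum(pairs(colSums(tab)))
  expected <- sum_a * sum_b / pairs(length(a))
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

ari <- adjusted_rand_index(km$cluster, hc)
```

The index is 1 when two partitions are identical up to relabeling and close to 0 when agreement is no better than chance, which is the scale on which the reported 0.63 should be read.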

Association Rules

Apriori algorithm (support ≥ 0.01, confidence ≥ 0.30)
Transactions: top topics + discretized ratings + recommend status
287 rules generated, filtered for negative outcome predictors
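
To make the thresholds above concrete, here is a minimal base-R sketch that computes support, confidence and lift for a single rule. The transactions are invented, the item labels only mimic the naming used in this README, and `support` and `rule_metrics` are hypothetical helpers; the project itself lists `arules` for the actual mining.

```r
# Invented transactions: each review is a set of items.
transactions <- list(
  c("topic_8", "recommend_no", "rating_low"),
  c("topic_8", "recommend_no", "rating_low"),
  c("topic_8", "recommend_yes", "rating_high"),
  c("topic_3", "recommend_yes", "rating_high"),
  c("topic_3", "recommend_no", "rating_low"),
  c("topic_6", "recommend_yes", "rating_mid"),
  c("topic_6", "recommend_yes", "rating_high"),
  c("topic_8", "recommend_no", "rating_mid")
)

# Fraction of transactions that contain every item in `items`.
support <- function(items, txns) {
  mean(vapply(txns, function(txn) all(items %in% txn), logical(1)))
}

# Support, confidence and lift for the rule lhs -> rhs.
rule_metrics <- function(lhs, rhs, txns) {
  supp <- support(c(lhs, rhs), txns)
  conf <- supp / support(lhs, txns)
  lift <- conf / support(rhs, txns)
  c(support = supp, confidence = conf, lift = lift)
}

m <- rule_metrics(c("recommend_no", "topic_8"), "rating_low", transactions)

# Thresholds stated above: support >= 0.01, confidence >= 0.30.
passes <- m[["support"]] >= 0.01 && m[["confidence"]] >= 0.30
```

A lift above 1 means the antecedent makes the consequent more likely than its base rate, which is why lift rather than confidence alone is used to rank rules.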

Visualization

t-SNE for 2D cluster visualization

Findings

1. Management Quality is the #1 Predictor of Negative Outcomes

| Rule | Confidence | Lift |
|---|---|---|
| {recommend_no, topic_8} → {rating_low} | 60.3% | 4.72 |
| {recommend_no, topic_3, topic_8} → {rating_low} | 71.5% | 5.60 |
| {rating_low, topic_8} → {recommend_no} | 96.8% | 2.92 |

Topic 8 = "management", "manager", "bad", "poor", "staff"
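
Because lift is confidence divided by the base rate of the consequent, the table above also implies how common each outcome is overall. The short check below uses only the confidence and lift values reported in the table; the column name `implied_base_rate` is an assumption for this example.

```r
# Confidence and lift exactly as reported in the rules table.
rules <- data.frame(
  rule = c("{recommend_no, topic_8} -> {rating_low}",
           "{recommend_no, topic_3, topic_8} -> {rating_low}",
           "{rating_low, topic_8} -> {recommend_no}"),
  confidence = c(0.603, 0.715, 0.968),
  lift = c(4.72, 5.60, 2.92)
)

# lift = confidence / P(consequent)  =>  P(consequent) = confidence / lift
rules$implied_base_rate <- rules$confidence / rules$lift
print(rules[, c("rule", "implied_base_rate")], digits = 3)
```

Both `rating_low` rules imply that roughly 13% of reviews carry a low rating, and the third implies that roughly 33% of reviewers would not recommend their employer, so the three rows are internally consistent.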

2. Problem Cluster Identified

Cluster 8: Mean rating 2.67, only 33% would recommend - the most dissatisfied employee segment.

3. Topic Labels Discovered

| Topic | Top Words | Label |
|---|---|---|
| 8 | management, manager, bad, poor | Management Issues |
| 6 | leadership, culture, process, change | Leadership & Culture |
| 3 | benefit, time, leave, health | Benefits & Time Off |
| 4 | hour, shift, pay, customer | Hourly/Shift Conditions |

Business Recommendations

1. Implement manager feedback loops and leadership training
   - Topic 8 dominates negative reviews
   - Expected impact: Reduce negative reviews by 15–20%
2. Audit departments with management + benefits complaints
   - Topic 8 + Topic 3 co-occurrence strongly predicts negative outcomes
   - Expected impact: Improve retention by 10–15%
3. Conduct culture assessments in low-rated business units
   - Topic 6 (leadership/culture) also appears in negative rules
   - Expected impact: Improve "would recommend" rate by 10%

Project Structure

├── analysis/
│   └── main.R                 <- Full analysis pipeline
├── data/
│   ├── raw/                   <- Original Glassdoor dataset
│   └── processed/             <- Cleaned data, tokens, topic distributions
├── models/
│   ├── lda_model.rds          <- Fitted LDA model
│   └── association_rules.rds  <- Apriori rules object
├── outputs/
│   └── figures/               <- t-SNE plots, rule visualizations
├── R/
│   └── utils.R                <- Helper functions
├── report/
│   └── final-report.Rmd       <- RPubs report source
└── article.Rmd                <- Main analysis report

Tech Stack

Language: R

Key Packages:

text2vec - TF-IDF, LDA
skmeans - Spherical K-Means
arules / arulesViz - Association rule mining
Rtsne - t-SNE visualization
tidytext / textstem - Text preprocessing

Limitations

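Glassdoor bias: Disgruntled employees may be over-represented among reviewers
Correlational only: Rules and clusters describe co-occurrence and do not support causal claims
2020 data only: COVID-19 effects may influence results
Topic coherence: Topic labels come from interpreting top words, which is subjective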

References

Singh, H. - Clustering of text documents by implementation of K-means algorithms
Blei, D., Ng, A., Jordan, M. (2003) - Latent Dirichlet Allocation
