This repository contains a collection of data science, machine learning, and big data analytics projects completed as part of graduate coursework and independent research. The projects focus on large-scale data processing, unsupervised machine learning, topic modeling, and cloud-based analytics using Apache Spark, Databricks, Microsoft Azure, and R.
File: KMeans_Databricks.ipynb
This project demonstrates the implementation of K-Means clustering using Apache Spark and Databricks to analyze large datasets and identify meaningful data groupings.
Key Topics:
- Unsupervised Machine Learning
- Cluster Analysis
- Apache Spark MLlib
- Databricks
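To show what the clustering step computes, here is a minimal sketch of Lloyd's algorithm (the classic K-Means iteration) in plain Python. It is not the notebook's code: the notebook relies on Spark MLlib, which distributes this work across a cluster and defaults to k-means|| initialization. The function names, the toy data, and the random initialization below are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty set of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points: Sequence[Point], k: int, max_iter: int = 20,
           seed: int = 1) -> Tuple[List[Point], List[int]]:
    """Hypothetical helper: cluster `points` into `k` groups."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen input points.
    centers = rng.sample(list(points), k)
    assignments: List[int] = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = mean_point(members)
    return centers, assignments


if __name__ == "__main__":
    # Toy data: two well-separated groups of 2-D points.
    data = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.4),
            (9.0, 9.0), (9.5, 8.8), (8.7, 9.3)]
    centers, labels = kmeans(data, k=2)
    print("centers:", centers)
    print("labels:", labels)
```

On data of the size the project targets, this single-machine loop would not be practical, which is the reason for using Spark.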
File: LDA_Databricks.ipynb
This project applies Latent Dirichlet Allocation (LDA) to discover hidden topics within large collections of text data.
Key Topics:
- Natural Language Processing (NLP)
- Topic Modeling
- Text Analytics
- Apache Spark
- Databricks
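As a rough illustration of how LDA assigns words to topics, here is a small sketch in plain Python. It is not the notebook's code, and it swaps in a different inference method: Spark MLlib fits LDA with EM or online variational optimizers, while this sketch uses collapsed Gibbs sampling because it fits in a few lines. The function name, hyperparameter values, and toy documents are assumptions for illustration only.

```python
import random
from typing import List, Sequence, Tuple


def lda_gibbs(docs: Sequence[Sequence[str]], num_topics: int,
              alpha: float = 0.1, beta: float = 0.01,
              iterations: int = 200, seed: int = 0,
              top_n: int = 3) -> Tuple[List[List[float]], List[List[str]]]:
    """Hypothetical helper: LDA via collapsed Gibbs sampling.

    Returns per-document topic proportions and the top words per topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables updated as topic assignments change.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start by giving every token a random topic.
    assignments: List[List[int]] = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Weight of each topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed topic proportions per document.
    doc_mix = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    # Most frequent words per topic.
    top_words = [
        [vocab[v] for v in
         sorted(range(vocab_size), key=lambda v: -topic_word[k][v])[:top_n]]
        for k in range(num_topics)
    ]
    return doc_mix, top_words


if __name__ == "__main__":
    # Toy tokenized documents (made up for this sketch).
    toy_docs = [
        ["plot", "ggplot", "axis", "plot"],
        ["ggplot", "legend", "axis"],
        ["regression", "model", "fit"],
        ["model", "fit", "predict", "regression"],
    ]
    mix, words = lda_gibbs(toy_docs, num_topics=2)
    for k, topic in enumerate(words):
        print(f"topic {k}: {topic}")
    for d, proportions in enumerate(mix):
        print(f"doc {d}: {[round(p, 2) for p in proportions]}")
```

Sampling is random, so the topics found depend on the seed and on the number of iterations. Real text would also need tokenization and stop-word removal before this step.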
Files:
- R_Questions.csv
- R_Answers.csv
- R_Answers_LDA.csv
- R_Tags.csv
Datasets used to analyze Stack Overflow discussions and identify trends, topics, and relationships within technical communities.
File: TermPaper.pdf
A comprehensive analysis of machine learning techniques, big data technologies, and cloud-based analytics platforms used throughout the project.
Files:
- Tutorial_KMeans_Azure.pdf
- Tutorial_LDA_Azure.pdf
Step-by-step documentation describing the implementation of K-Means clustering and LDA topic modeling within Microsoft Azure and Spark environments.
File: Update_Azure ML and Spark ML Analysis of Stack.pptx
Presentation summarizing project objectives, methodologies, findings, and lessons learned from applying machine learning techniques to large-scale datasets.
Technologies and Skills:
- Python
- Apache Spark
- Spark MLlib
- Databricks
- Microsoft Azure
- Machine Learning
- Natural Language Processing (NLP)
- K-Means Clustering
- Latent Dirichlet Allocation (LDA)
- Data Analytics
- Big Data Processing
These projects were completed to develop practical experience in:
- Large-scale data analysis
- Machine learning model development
- Cloud-based analytics platforms
- Distributed computing with Apache Spark
- Data visualization and reporting
- Natural language processing techniques
Greg Mamoyac
Graduate Studies in Computer Information Systems and Data Analytics