gregmamoyac/DataScience-MachineLearning-Projects

Data Science and Machine Learning Projects

This repository contains a collection of data science, machine learning, and big data analytics projects completed as part of graduate coursework and independent research. The projects focus on large-scale data processing, unsupervised machine learning, topic modeling, and cloud-based analytics using Apache Spark, Databricks, Microsoft Azure, and R.

Project Overview

K-Means Clustering with Databricks

File: KMeans_Databricks.ipynb

This notebook implements K-Means clustering with Apache Spark MLlib on Databricks to partition large datasets into groups of similar records.

Key Topics:

  • Unsupervised Machine Learning
  • Cluster Analysis
  • Apache Spark MLlib
  • Databricks
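
For readers new to the algorithm, the following is a minimal plain-Python sketch of the two alternating steps behind K-Means (often called Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. It is not the notebook's code. The notebook relies on Spark MLlib, which distributes this kind of computation across a cluster. The function names `kmeans` and `sq_dist` and the sample points are illustrative assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Cluster `points` (tuples of floats) into `k` groups.

    Hypothetical helper for illustration only; returns (centroids, labels).
    """
    rng = random.Random(seed)
    # Start from k distinct points chosen at random.
    centroids = rng.sample(points, k)

    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: each centroid moves to the mean of its members.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids

    labels = [
        min(range(k), key=lambda i: sq_dist(p, centroids[i])) for p in points
    ]
    return centroids, labels


# Made-up sample data: two well-separated groups.
sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(sample, k=2)
print(centroids, labels)
```

The result depends on the initial centroids, so in practice the choice of `k` and the seed both merit experimentation.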

Latent Dirichlet Allocation (LDA) Topic Modeling

File: LDA_Databricks.ipynb

This project applies Latent Dirichlet Allocation (LDA) to discover hidden topics within large collections of text data.

Key Topics:

  • Natural Language Processing (NLP)
  • Topic Modeling
  • Text Analytics
  • Apache Spark
  • Databricks
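
To make the idea of "hidden topics" concrete, here is a small plain-Python sketch of LDA fitted by collapsed Gibbs sampling. This is a swapped-in technique chosen for brevity: Spark MLlib's LDA uses variational (online) or EM optimizers rather than Gibbs sampling, and the notebook's actual code is not reproduced here. The function name `lda_gibbs`, the hyperparameter defaults, and the toy documents are all assumptions for illustration.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA on `docs` (a list of token lists) by collapsed Gibbs sampling.

    Hypothetical helper for illustration only.
    Returns (theta, phi, vocab): per-document topic distributions,
    per-topic word distributions, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                 # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # tokens per (topic, word)
    topic_total = [0] * num_topics                               # tokens per topic

    # Give every token a random initial topic and record the counts.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed estimates of the two distributions LDA is after.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(n + beta) / (topic_total[k] + vocab_size * beta) for n in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi, vocab


# Made-up toy corpus, already tokenized.
toy_docs = [["spark", "cluster", "spark"], ["topic", "word", "topic"], ["spark", "topic"]]
theta, phi, vocab = lda_gibbs(toy_docs, num_topics=2, iterations=50)
print(vocab)
print(theta)
```

Each row of `theta` is one document's mix of topics and each row of `phi` is one topic's distribution over the vocabulary; real text would first need tokenization and stop-word removal.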

Stack Overflow Data Analysis

Files:

  • R_Questions.csv
  • R_Answers.csv
  • R_Answers_LDA.csv
  • R_Tags.csv

These datasets support the analysis of Stack Overflow discussions, identifying trends, topics, and relationships within technical communities.
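
As a rough sketch of how question and answer files like these might be combined, the snippet below counts answers per question and records the highest answer score using only the standard library. The column names (`Id`, `Title`, `ParentId`, `Score`) are assumptions about the CSV layout and are not confirmed by this repository; the in-memory rows are made-up sample data, and `summarize_answers` is a hypothetical helper.

```python
import csv
import io


def summarize_answers(questions_file, answers_file):
    """Map each question Id to its title, answer count, and top answer score.

    Assumes columns Id/Title in the questions file and ParentId/Score in
    the answers file; adjust the names to match the real CSV headers.
    """
    summary = {}
    for row in csv.DictReader(questions_file):
        summary[row["Id"]] = {"title": row["Title"], "answers": 0, "top_score": None}

    for row in csv.DictReader(answers_file):
        question = summary.get(row["ParentId"])
        if question is None:
            continue  # answer refers to a question outside this file
        question["answers"] += 1
        score = int(row["Score"])
        if question["top_score"] is None or score > question["top_score"]:
            question["top_score"] = score
    return summary


# Made-up sample rows standing in for the real files.
questions = io.StringIO("Id,Title\n1,How do I read a CSV?\n2,What is a factor?\n")
answers = io.StringIO("Id,ParentId,Score\n10,1,5\n11,1,12\n12,3,4\n")
for qid, info in summarize_answers(questions, answers).items():
    print(qid, info)
```

With the real files, the same function could be called on handles opened with `open(path, newline="", encoding=...)`, where the right encoding depends on how the CSVs were exported.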


Research Paper

File: TermPaper.pdf

A comprehensive analysis of machine learning techniques, big data technologies, and cloud-based analytics platforms used throughout the project.


Azure Machine Learning Tutorials

Files:

  • Tutorial_KMeans_Azure.pdf
  • Tutorial_LDA_Azure.pdf

Step-by-step documentation describing the implementation of K-Means clustering and LDA topic modeling within Microsoft Azure and Spark environments.


Final Presentation

File: Update_Azure ML and Spark ML Analysis of Stack.pptx

Presentation summarizing project objectives, methodologies, findings, and lessons learned from applying machine learning techniques to large-scale datasets.

Technologies Used

  • Python
  • Apache Spark
  • Spark MLlib
  • Databricks
  • Microsoft Azure
  • Machine Learning
  • Natural Language Processing (NLP)
  • K-Means Clustering
  • Latent Dirichlet Allocation (LDA)
  • Data Analytics
  • Big Data Processing

Learning Objectives

These projects were completed to develop practical experience in:

  • Large-scale data analysis
  • Machine learning model development
  • Cloud-based analytics platforms
  • Distributed computing with Apache Spark
  • Data visualization and reporting
  • Natural language processing techniques

Author

Greg Mamoyac

Graduate Studies in Computer Information Systems and Data Analytics
