Predictive Analytics Project Report - Maya Sopiah Lubis

Project Domain

The sky is closely tied to human life. For centuries it has attracted great interest from mankind as a subject of research, and it has inspired many dreams and theories about the universe. Among them are theories about the creation of the universe, descriptions of the structure of the solar system, and missions to search for planets in other galaxies that have the same structure as Earth. For this reason, the Sloan Digital Sky Survey (SDSS) project was launched in 1998 to explore space and look for answers to long-standing questions about outer space.

One of the many tasks of SDSS is mapping the sky [1], [2]. The dataset for this classification project is the result of sky mapping in the seventeenth data release (DR17) of SDSS Stage 4. According to the paper "The Seventeenth Data Release of the Sloan Digital Sky Surveys: Complete Release of MaNGA, MaStar, and APOGEE-2 Data", the SDSS telescopes have cumulatively observed 2,659,178 individual infrared spectra, which represent stellar objects in the sky [3]. Infrared spectra need to be observed and processed before they can be classified into categories. Given the huge number of stellar objects that need to be categorized, this task would be very time-consuming if done manually, considering the many mathematical calculations involved in the categorization process.

Machine learning is one of the fastest developing technologies today. As it evolves, it becomes a versatile technology which can be adapted to many fields, including astronomy. One of its capabilities is classification. Using the variables or features provided in the DR17 dataset from SDSS, such as redshift and the photometric system, machine learning can classify a large number of stellar objects in a short amount of time and with high reliability. Therefore, in this project, the author applies machine learning to classify stellar objects into three categories (galaxy, star, and quasar). Prediction uses four machine learning algorithms for classification: Logistic Regression, K-Nearest Neighbor (KNN), Random Forest, and Extreme Gradient Boosting (XGBoost).

Project journal reference: 'The Seventeenth Data Release of the Sloan Digital Sky Surveys: Complete Release of MaNGA, MaStar and APOGEE-2 Data'

Business Understanding

Problem Statements

Based on the project domain stated above, the problem statements of this project are as follows.

How are stellar objects distributed across all categories?
Which features most strongly distinguish the categories of stellar objects?
What is the position of each stellar object category in sky longitude and latitude?
How are stellar objects distributed across the photometric system?
What is the relationship between redshift and the photometric system in each stellar object category?
What is the best machine learning model for classifying stellar objects?

Goals

Based on the problem statements, the goals of this project are as follows.

To determine the amount and percentage of stellar objects in each category.
To analyze which variables or features of stellar objects have the highest correlation and mutual information values.
To examine the position of each stellar object category in sky longitude and latitude.
To analyze the distribution of stellar objects in each band of the photometric system, which consists of u, g, r, i, and z.
To identify relationship patterns between the redshift value and each band of the photometric system in each stellar object category.
To find the machine learning model with the highest accuracy for classifying stellar objects.

Solution Statements

The solutions for achieving the goals are as follows.

Execute the Exploratory Data Analysis (EDA) process by using suitable diagrams or plots to accomplish the goals of:
- determining the amount and percentage of stellar objects in each category by using a bar plot,
- analyzing which variables or features of stellar objects have the highest correlation and mutual information values by using a correlation heatmap and a mutual information bar plot,
- examining the position of each stellar object category in sky longitude and latitude by using strip plots,
- analyzing the distribution of stellar objects in each band of the photometric system, which consists of u, g, r, i, and z, by using strip plots and histograms,
- identifying relationship patterns between the redshift value and each band of the photometric system in each stellar object category by using strip plots, histograms, and scatter plots.
Build four machine learning models for classification by using the four algorithms below.
- Logistic Regression,
- K-Nearest Neighbor (KNN),
- Random Forest,
- Extreme Gradient Boosting (XGBoost).
Determine the best machine learning model by using the four metrics below.
- Confusion matrix,
- Error rate of training data prediction,
- Error rate of test data prediction,
- Accuracy score.

Data Understanding

The dataset for this project is available on the Kaggle platform on this link. It was published three years ago by a user named FEDESORIANO. Kaggle gives this dataset a usability rating of 10.0. The dataset consists of a single CSV file.

Features Description

The feature names and feature details are listed below.

Feature Name	Feature Details
obj_ID	A unique identifier which identifies each stellar object in the dataset
alpha	Right ascension angle, which is the celestial longitude. The unit is degrees.
delta	Declination angle, which is the celestial latitude. The unit is degrees.
u	Ultraviolet filter in the photometric system
g	Green filter in the photometric system
r	Red filter in the photometric system
i	Near-infrared filter in the photometric system
z	Infrared filter in the photometric system
run_ID	Run number, identifies the specific scan
rerun_ID	Rerun number, specifies how the image was processed
cam_col	Camera column, identifies the scanline within the run
field_ID	Field number, identifies each field
spec_obj_ID	Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same spec_obj_ID must share the output class)
class	Stellar object class. `GALAXY` represents galaxy. `STAR` represents star. `QSO` represents quasar.
redshift	Redshift value based on the increase in wavelength
plate	Plate ID, identifies each plate in SDSS
MJD	Modified Julian Date, indicates when a given piece of SDSS data was taken
fiber_ID	Fiber ID, identifies the fiber that pointed the light at the focal plane in each observation

Dataset Information

Data rows amount

There are 100,000 data rows in the dataset.

Features amount and type

There are 18 features in the dataset. Based on the output of the info() function, the data types of the DataFrame features are:

Float type: 10 features,
Integer type: 7 features,
Object (or text) type: 1 feature.

Missing data

There is no missing or empty data in the dataset.

Duplicated Data

There is no duplicated data in the dataset.
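The checks in this subsection can be reproduced with a few pandas calls. The following is a minimal sketch on a tiny made-up DataFrame; the name `demo_df` and all values are hypothetical stand-ins for the real Kaggle CSV.

```python
import pandas as pd

# Hypothetical miniature stand-in for the stellar dataset (values are made up)
demo_df = pd.DataFrame({
    "alpha": [135.69, 144.83, 142.19],
    "delta": [32.49, 31.27, 35.58],
    "redshift": [0.6348, 0.7791, 0.0001],
    "plate": [5812, 10445, 4576],
    "class": ["GALAXY", "GALAXY", "STAR"],
})

n_rows, n_features = demo_df.shape
dtype_counts = demo_df.dtypes.astype(str).value_counts()  # features per data type
n_missing = int(demo_df.isna().sum().sum())                # total empty cells
n_duplicated = int(demo_df.duplicated().sum())             # fully repeated rows

print(f"{n_rows} rows, {n_features} features")
print(dtype_counts)
print(f"missing cells: {n_missing}, duplicated rows: {n_duplicated}")
```

On the real dataset the same calls would be expected to report 100,000 rows, 18 features, and zero missing or duplicated rows, matching the figures stated above.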

Features Statistical Description

	obj_ID	alpha	delta	u	g	r	i	z	run_ID	rerun_ID	cam_col	field_ID	spec_obj_ID	class	redshift	plate	MJD	fiber_ID
count	100000	100000	100000	100000	100000	100000	100000	100000	100000	100000	100000	100000	100000	100000	100000	100000	100000	100000
unique	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	3	nan	nan	nan	nan
top	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	GALAXY	nan	nan	nan	nan
freq	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	59445	nan	nan	nan	nan
mean	1.23766e+18	177.629	24.1353	21.9805	20.5314	19.6458	19.0849	18.6688	4481.37	301	3.51161	186.131	5.78388e+18	nan	0.576661	5137.01	55588.6	449.313
std	8.43856e+12	96.5022	19.6447	31.7693	31.7503	1.85476	1.75789	31.7282	1964.76	0	1.58691	149.011	3.32402e+18	nan	0.730707	2952.3	1808.48	272.498
min	1.23765e+18	0.00552783	-18.7853	-9999	-9999	9.82207	9.4699	-9999	109	301	1	11	2.99519e+17	nan	-0.00997067	266	51608	1
25%	1.23766e+18	127.518	5.14677	20.3524	18.9652	18.1358	17.7323	17.4607	3187	301	2	82	2.84414e+18	nan	0.0545168	2526	54234	221
50%	1.23766e+18	180.901	23.6459	22.1791	21.0998	20.1253	19.4051	19.0046	4188	301	4	146	5.61488e+18	nan	0.424173	4987	55868.5	433
75%	1.23767e+18	233.895	39.9016	23.6874	22.1238	21.0448	20.3965	19.9211	5326	301	5	241	8.33214e+18	nan	0.704154	7400.25	56777	645
max	1.23768e+18	360	83.0005	32.7814	31.6022	29.5719	32.1415	29.3837	8162	301	6	989	1.41269e+19	nan	7.01124	12547	58932	1000

In general, the only readily interpretable feature description at this point is class. According to it, the DataFrame contains three types of stellar object, and most of the data falls in the GALAXY category, with more than fifty-nine thousand rows. As for data abnormalities, some features can be suspected of having negative outliers, based on a comparison between their mean and min values. All columns other than class have a positive mean value, yet some features have a negative min value. These features are:

delta
u
g
z
redshift

Outliers

Outliers are data points which diverge from the rest of the observations for various reasons, such as natural anomalies. The main reason for detecting and filtering outliers is that they can severely distort components of statistical analysis such as the mean value. Distorted statistics like this can cause serious issues when building a machine learning model [4].

`delta` feature

The conclusion from the boxplot of the delta feature above is that none of the three classes of stellar object has outliers in this feature. A delta value smaller than zero is not an anomaly.

`u`, `g`, and `z` features

From the boxplots above, the boxes cannot be read for any of these features because the plotted range is dominated by values far from the rest of the data. The boxplots indicate an outlier in each feature whose value is extremely small compared to the position of the box. On closer inspection of each feature, the outlier in every feature points to one and the same data row. The outliers do not originate from different data rows.
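One way to confirm that the extreme values in u, g, and z share a single row is to flag the placeholder value in each column and keep the rows flagged in all three. This is a minimal sketch on made-up rows; it assumes the extreme value is the -9999 seen in the min row of the statistics table, and `demo_df` is a hypothetical name.

```python
import pandas as pd

# Hypothetical rows; -9999 mimics the placeholder seen in the min row of the statistics table
demo_df = pd.DataFrame({
    "u": [23.88, -9999.0, 25.26],
    "g": [22.28, -9999.0, 22.66],
    "r": [20.40, 18.17, 20.61],
    "z": [18.79, -9999.0, 18.95],
})

# Boolean mask per feature, then the rows flagged in every one of u, g and z
is_sentinel = demo_df[["u", "g", "z"]].eq(-9999)
shared_rows = demo_df.index[is_sentinel.all(axis=1)].tolist()
print(shared_rows)
```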

`redshift` feature

From the boxplot of the redshift feature above, no outliers are identified below zero in any class. Above zero, outliers are identified when the redshift value is more than 1.2 for the GALAXY class and more than 3.9 for the QSO class. At this scale, no outliers are visible for the STAR class. This is likely because the largest redshift outliers are around 7.0, which is far too large compared to the STAR class range. The boxplot of the redshift feature for the STAR class alone shows that the statistical range of this class is extremely thin. With a max value slightly greater than 0.0006 and a min value slightly less than -0.0008, the statistical range of the STAR class is only about 0.0014. This is tiny compared to the distribution of redshift values in the GALAXY and QSO classes. At that scale, quite a lot of outliers also exist above the max value and below the min value.

All features

The conclusions based on the boxplots above are as follows.

Three features contain outliers in all classes.
- r: Outliers exist above the max value and below the min value.
- i: Outliers exist above the max value and below the min value.
- field_ID: Outliers exist above the max value only.
Three features contain outliers in some classes.
- spec_obj_ID: An outlier exists above the max value of the STAR class.
- plate: An outlier exists above the max value of the STAR class.
- MJD: Some outliers exist below the min value of the QSO class.
Five features contain no outliers.
- obj_ID
- alpha
- run_ID
- cam_col
- fiber_ID
Only one feature has no statistical range, which is rerun_ID. Since this feature has a single value for the entire dataset, it will be dropped.

Handling Outliers - Outliers Cleaning

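# Quartiles and IQR of the numeric features only (class is text)
numeric_df = stellar_df.select_dtypes(include="number")
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1

# A row is an outlier if any numeric feature lies outside the 1.5 * IQR fences
is_outlier = ((numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))).any(axis=1)
outliers = stellar_df[is_outlier]
print(f"Amount of outliers: {len(outliers)} rows")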

Output:

Amount of outliers: 14403 rows

Using the Interquartile Range (IQR) technique to detect outliers in the dataset, as in the code above [4], it is concluded that there are more than fourteen thousand outlier rows in the Stellar Object DataFrame. This is roughly 14% of the whole dataset.

Since the amount of non-outlier data is sufficient, the outliers can be dropped.

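# Keep only the rows where every numeric feature lies inside the 1.5 * IQR fences
numeric_df = stellar_df.select_dtypes(include="number")
stellar_df = stellar_df[~((numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))).any(axis=1)]
print(f"Amount of stellar_df (after dropping outliers): {len(stellar_df)} rows")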

Output:

Amount of stellar_df (after dropping outliers): 85596 rows

After the outliers are dropped, more than eighty-five thousand rows remain in the dataset.

Exploratory Data Analysis - Univariate

`class` Feature

The only categorical feature in the stellar objects dataset is the class feature. As mentioned, it consists of three classes, which are GALAXY, STAR, and QSO. The class feature is visualized in the bar plot below.

From the bar plot above, it can be concluded that the majority of the data belongs to the GALAXY class, which holds more than half of the whole dataset (64.95%), followed by the STAR (23.85%) and QSO (11.20%) classes. The smallest share of the data belongs to QSO. This shows that the stellar objects observed in the sky by the SDSS telescope mostly consist of galaxies, while stars and quasars make up a smaller share. The bar plot also indicates that the data for machine learning is imbalanced. Thus, the data needs to be balanced to get good prediction performance later on.
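The counts and percentages behind such a bar plot can be computed directly with pandas. This is a minimal sketch on a made-up class column with a 6:3:1 split, not the real counts; the `classes` and `summary` names are hypothetical.

```python
import pandas as pd

# Hypothetical class column with a 6:3:1 split (not the real counts)
classes = pd.Series(["GALAXY"] * 6 + ["STAR"] * 3 + ["QSO"] * 1, name="class")

counts = classes.value_counts()                                   # rows per class
percentages = (classes.value_counts(normalize=True) * 100).round(2)
summary = pd.DataFrame({"count": counts, "percent": percentages})
print(summary)
```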

Numerical Features

In this phase, every numerical feature in the DataFrame is plotted as a histogram in order to observe its data distribution. Sixteen numerical features are observed. The histograms above describe the distribution of the numerical features of the stellar object dataset. The conclusions are as follows.

Features with a nearly normal data distribution are obj_ID, alpha, u, g, r, i, z, run_ID, and cam_col. All photometric colors belong to this category.
Features with a right-skewed data distribution are delta, field_ID, spec_obj_ID, redshift, plate, and fiber_ID.
The only feature with a left-skewed data distribution is MJD.
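The report reads skewness off the histograms. As a complementary numeric check, the sample skewness can be computed with pandas: values near zero suggest a roughly symmetric distribution, positive values a right skew, and negative values a left skew. The sketch below uses synthetic columns with hypothetical names, not the real features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns with known shapes (hypothetical names, not dataset features)
demo_df = pd.DataFrame({
    "near_normal": rng.normal(loc=20.0, scale=2.0, size=5000),
    "right_skewed": rng.exponential(scale=1.0, size=5000),
    "left_skewed": -rng.exponential(scale=1.0, size=5000),
})

skewness = demo_df.skew()  # sample skewness per column
print(skewness.round(2))
```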

Exploratory Data Analysis - Multivariate

Multivariate Analysis of Features Relationship

In this phase, the relationships between all features are analyzed. This is important since the features of each object are recorded at the same time. Therefore, it is possible that the value of one feature is related to the values of other features, either positively or negatively. The feature relationships are observed with two tools, which are the Correlation Heatmap and the Mutual Information Bar Plot.

Correlation Heatmap

A correlation heatmap is a heatmap which displays the strength of the correlation between every pair of features [4]. There are three types of relationships in a correlation heatmap [4]:

Positive correlation, which means the values of two features move in the same direction. The correlation value $x$ satisfies $0 < x \le 1$.
No correlation, which means the values of two features have no linear relationship. The correlation value $x$ is equal to zero.
Negative correlation, which means the values of two features move in opposite directions. The correlation value $x$ satisfies $-1 \le x < 0$.

A correlation heatmap can only detect the relationship between one numerical feature and another numerical feature. The visualization of the correlation heatmap is shown below. According to the correlation heatmap above, the conclusions are as follows.

The features with generally low positive or low negative correlations are:
1. obj_ID
2. alpha
3. delta
4. run_ID
5. cam_col
6. field_ID
7. fiber_ID
The features with mostly strong positive correlations are:
1. All photometric features, which are u, g, r, i, and z.
2. spec_obj_ID
3. redshift
4. plate
5. MJD
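The matrix behind a correlation heatmap can be computed with pandas before plotting; a plotting library such as seaborn can then render it as a heatmap. The sketch below uses synthetic data in which a z-like column is constructed to rise with redshift and an alpha-like column is unrelated; all values are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
redshift = rng.uniform(0.0, 2.0, size=1000)

# Synthetic stand-in columns (made-up relationships, not the real data)
demo_df = pd.DataFrame({
    "redshift": redshift,
    "z": 17.0 + 1.5 * redshift + rng.normal(0.0, 0.5, size=1000),  # rises with redshift
    "alpha": rng.uniform(0.0, 360.0, size=1000),                    # unrelated to redshift
    "class": ["GALAXY"] * 1000,                                     # text, excluded below
})

# Pearson correlation between numeric columns only
corr = demo_df.corr(numeric_only=True)
print(corr.round(2))
```

The text column is left out of the result, which illustrates why the heatmap cannot relate the features to the class target directly.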

Mutual Information Bar Plot

Mutual information is a univariate metric which measures how certain the target value becomes when the value of a feature is known. The higher the mutual information value of a feature, the more certain that feature is in determining the target value. In contrast to the correlation heatmap, mutual information can measure the relationship between a feature and a target of different data types, which makes it easier to use. However, since mutual information is a univariate metric, it cannot measure the relationship between every pair of features in the dataset as the correlation heatmap does [5]. The visualization of the mutual information bar plot is shown below. Based on the mutual information bar plot above, the conclusions are as follows.

The highest mutual information value belongs to the redshift feature. It means that the redshift value is a strong indicator for determining or predicting the value of the class feature as the target. This is in line with the correlation heatmap.
Surprisingly, obj_ID and run_ID, which have low correlation values, are quite certain in predicting the value of the class feature. This shows that certainty is not always in line with correlation.
All photometric features have low certainty in predicting the target value, even though they are strongly correlated with each other.
alpha, delta, fiber_ID, field_ID, and cam_col are low in both correlation and prediction certainty.
spec_obj_ID, plate, and MJD are quite strong in both correlation and prediction certainty.
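Mutual information between numeric features and a categorical target can be estimated with scikit-learn's `mutual_info_classif`. The report does not say which implementation was used, so this is only a sketch under that assumption, on synthetic data: one column is built so that its typical value depends on the class, and the other is pure noise.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(2)
n = 600
labels = np.repeat([0, 1, 2], n // 3)  # hypothetical encoding: 0=GALAXY, 1=QSO, 2=STAR

# Hypothetical redshift-like feature whose typical value depends on the class
informative = np.concatenate([
    rng.normal(0.4, 0.1, n // 3),
    rng.normal(1.5, 0.3, n // 3),
    rng.normal(0.0, 0.001, n // 3),
])
X = pd.DataFrame({
    "redshift_like": informative,
    "noise": rng.uniform(0.0, 360.0, n),  # unrelated to the class
})

# One score per feature; higher means the feature tells more about the target
mi = pd.Series(mutual_info_classif(X, labels, random_state=0), index=X.columns)
print(mi.sort_values(ascending=False).round(3))
```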

`class` Feature and Sky Longitude and Latitude

Sky longitude is represented by the alpha feature, and sky latitude is represented by the delta feature. The position of each stellar object category in the sky, based on sky longitude and latitude, is shown in the strip plots below. The two strip plots above describe the position of stellar objects in the sky by category. The conclusions are as follows.

alpha feature
- The category which scatters most evenly across all alpha degrees is STAR. The density of STAR data is quite low in the alpha range 250 to 300, but there are still a lot of stars at those coordinates.
- GALAXY and QSO have nearly identical scattering patterns on alpha. Both categories have a dense number of objects all along the coordinates. However, the density decreases for both categories in the ranges 50 to 100 and 250 to 300.
delta feature
- The category with the longest scattering range along the delta coordinates is STAR, from -20 up to 80.
- For STAR, the density is stable from -20 to 70. At around 70 degrees there are no star objects. A few star objects then appear again slightly above 70 degrees, up to 80.
- Again, GALAXY and QSO have nearly identical scattering patterns on delta. Both are concentrated in the range 10 to 70. A few stellar objects appear below the minimum and above the maximum of this range for both categories.

`class` Feature and Photometric System

The range of photometric values of each stellar object category is shown in the strip plots below. From the photometric strip plots above, some conclusions are as follows.

Of all stellar objects, QSO has the shortest range of values in all colors.
The QSO minimum is higher than the GALAXY and STAR minimums, and its maximum is lower than the GALAXY and STAR maximums, which makes the QSO range shorter than, and contained within, the GALAXY and STAR ranges.
The GALAXY and STAR classes have a similar range of values in all colors.
The minimum and maximum values of GALAXY and STAR are also similar to each other.
In short, GALAXY, STAR, and QSO are similar in terms of photometry and are quite hard to distinguish. Telling GALAXY and STAR apart requires other features which record more significant differences between these two objects.

The histograms above give more details about the data density at each value of the photometric system. From the histograms of the photometric features above, the conclusions are as follows.

In all colors, the GALAXY class distribution tends to be left-skewed, which means most of the data has high values in the photometric features.
For the QSO class, the distribution tends to be left-skewed as well, except in the u color, where it is right-skewed. It means that QSO in the u color has more data with low values.
The STAR class is different. STAR tends to have a nearly normal distribution in all colors, which means that the majority of STAR data has photometric values in the middle of the range.

`class` Feature and `redshift` Feature

Based on the feature relationship analysis above, the redshift feature has high values in both the correlation matrix and the mutual information bar plot. The strip plot and histogram above explain these high values further. Both plots show significant differences between the three classes of stellar object. The conclusions are as follows.

GALAXY class
- On the strip plot, the redshift value starts at 0.0 and stops at around 1.70. The density is stable from the starting point until 1.0. From slightly above 1.0 until around 1.70, the density decreases.
- On the histogram, the density is visualized better. The redshift distribution for the GALAXY class shows a bimodal pattern, with two peaks where the data is most concentrated. The majority of the data is in the range 0.0 to 0.75, and it decreases significantly after 0.75 until around 1.70.
QSO class
- Similar to the GALAXY class, on the strip plot the redshift value of QSO starts at 0.0 and stops at around 1.70. The difference between the two classes lies in the density. The data density is low from the starting point until 0.5. It then gets denser after 0.5 until around 1.7.
- On the histogram, the redshift distribution for the QSO class shows a left-skewed pattern, where most of the data has a high redshift value.
STAR class
- On the strip plot, the redshift value is only a small strip at 0.0. This shows that practically all STAR data has a redshift of about zero.
- On the histogram, the redshift distribution for the STAR class shows a single long bar at 0.0. All data is concentrated at one point.

`redshift` Feature and Photometric System

The redshift feature has a moderate to fairly strong correlation with all photometric features, with correlation scores of around 0.33 to 0.68. A scatter plot can show the correlation between two numerical features in detail. Because of the relationship between redshift and the photometric values, presenting the data in scatter plots gives more information to explore. The conclusions from the scatter plots above are as follows.

There is no significant difference in data patterns across the scatter plots.
For the GALAXY class, an increasing photometric value comes with an increasing redshift value for most of the data. However, an increasing photometric value also comes with a wider range of redshift values. For example, on the scatter plot of the z feature against redshift, when the z value is 14, the redshift value ranges from 0.00 to around 0.10. When the z value is 16, the range widens to between 0.00 and around 0.25. The maximum of the redshift range keeps increasing as the z value increases. A higher z value does not affect the minimum of the redshift range.
For the QSO class, the redshift value is already relatively high regardless of the photometric value. However, an increasing photometric value comes with a narrower range of redshift values. For example, on the scatter plot of the z feature against redshift, when the z value is 18, the redshift value ranges from around 0.75 to 1.75. When the z value is 20, the range narrows to between around 1.00 and 1.75. The minimum of the redshift range keeps increasing as the z value increases. A higher z value does not affect the maximum of the redshift range.
For the STAR class, the redshift feature and the photometric features are unrelated. For example, on the scatter plot of the z feature against redshift, no matter how low or high the z value is, the redshift value is always about zero.

Data Preparation

Dropping Features

In this phase, the relevant features are selected for building the machine learning model. Feature selection helps to remove unrelated features from the dataset while preserving the most important information, such as major trends or patterns [6].
Since redshift is the strongest feature and is related to many other features in the Stellar Object dataset, the redshift feature is kept, along with all of the photometric features since they correlate strongly with redshift. The remaining features, except the class feature, are dropped and not included in machine learning model building.
The features consist of the redshift feature and all photometric values. The target is the class feature.
After dropping features, seven columns remain in total: six features and the target.

Training and Test Data Splitting

Dividing the data into training data and test data is important. Training data is used to let the model learn the patterns of the data during the training process, while test data is used to examine the accuracy of the model after training. Test data is separated from training data to avoid data leakage, which would make the measured accuracy misleading. Evaluating on separate test data also helps to reveal underfitting or overfitting when the model predicts new data. [7]
With more than eighty-five thousand rows in total, the data is sufficient for model training. The chosen ratio between training data and test data in this case is 90:10. After the split, the training data contains slightly more than seventy-seven thousand rows, while the test data contains more than eight thousand rows.
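A 90:10 split can be sketched with scikit-learn's `train_test_split`. The report does not state which function, random seed, or stratification setting was used, so `random_state=42` and `stratify=y` here are assumptions for illustration, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 6))  # six hypothetical feature columns
y = np.repeat(["GALAXY", "STAR", "QSO"], [650, 240, 110])

# test_size=0.1 gives the 90:10 ratio; stratify keeps class shares equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

Stratifying is one design choice for imbalanced classes: without it, a small test set could by chance contain very few rows of the rarest class.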

Handling Imbalance Data

Based on the earlier bar plot of the amount of data in each category, there is far more data in one category than in the others. In the Stellar Object dataset, more than half of the data belongs to the GALAXY class, while the small remainder belongs to the STAR and QSO classes. Data is called imbalanced when it is not evenly distributed across the categories in the dataset. Imbalanced data can lower the quality of the model and lead to poor prediction performance. Therefore, the imbalanced data needs to be handled [8].
There are two ways of handling imbalanced data, which are undersampling and oversampling. Undersampling keeps the data in the minority class and reduces the data in the majority class. Conversely, oversampling keeps the data in the majority class and adds synthetic data to the minority class. The method used in this case is a combination of both, using the SMOTE-Tomek technique [8].
The bar plots above show the comparison of the amount of data before and after handling the imbalance. SMOTE-Tomek has successfully balanced the amount of data in each class. The GALAXY class, as the majority class, is undersampled slightly from more than fifty thousand rows to more than forty-nine thousand rows. The QSO and STAR classes, as the minority classes, are both oversampled to more than forty-nine thousand rows.
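To make the oversampling half of this idea concrete, the sketch below shows the interpolation step that SMOTE is built on, written in plain NumPy: a synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbors. This is an illustration only, not the implementation used in the project, and it does not include the Tomek-link cleaning step. The function name `smote_like_samples` and the data are hypothetical. In practice, libraries such as imbalanced-learn provide a combined `SMOTETomek` resampler.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest minority neighbors
    base = rng.integers(0, len(minority), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                  # position along the segment
    return minority[base] + gap * (minority[picked] - minority[base])

# Hypothetical two-feature minority class with five rows
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
synthetic = smote_like_samples(minority, n_new=10)
print(synthetic.shape)
```

Because every synthetic point lies between two existing minority points, the new rows stay inside the region the minority class already occupies.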

Standardization

Standardization is a data transformation which gives each feature a mean of zero and a standard deviation of one. Standardization is important to ensure that no feature dominates the others because of its high or low magnitude. Standardization is applied to both training and test data. [9]
After the standardization of the training data, the mean of every feature equals zero and the standard deviation of every feature equals one.
After the standardization of the test data, the mean of every feature is nearly zero and the standard deviation of every feature is nearly one.
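The fact that the test statistics are only nearly zero and one is what would be expected if the scaler learns its mean and standard deviation from the training data and then reuses them on the test data. The report does not show this step, so the following is a sketch under that assumption, using scikit-learn's `StandardScaler` on synthetic data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X_train = rng.normal(loc=20.0, scale=2.0, size=(900, 3))  # hypothetical features
X_test = rng.normal(loc=20.0, scale=2.0, size=(100, 3))

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean and std from training data
X_test_std = scaler.transform(X_test)        # reuse the training statistics

print(X_train_std.mean(axis=0).round(3), X_train_std.std(axis=0).round(3))
print(X_test_std.mean(axis=0).round(3), X_test_std.std(axis=0).round(3))
```

Fitting on the training data alone keeps information about the test set out of the preprocessing step.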

Label Encoding

Machine learning models can only process numerical values. Hence, the categorical label for the stellar class needs to be converted from text to numbers. Label encoding is a process which converts categorical labels into numerical form. Both the labels in the features and in the target are converted by this method [10].
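Label encoding can be sketched with scikit-learn's `LabelEncoder`, assuming that is the tool used (the report does not name it). The encoder sorts the class names and assigns integers in that order; the small lists below are made up.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels
y_train = ["GALAXY", "STAR", "QSO", "GALAXY"]
y_test = ["QSO", "GALAXY"]

encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train)  # classes are sorted alphabetically
y_test_enc = encoder.transform(y_test)        # same mapping reused for test labels

print(list(encoder.classes_))
print(y_train_enc, y_test_enc)
```

Under this mapping GALAXY becomes 0, QSO becomes 1, and STAR becomes 2, and `encoder.inverse_transform` recovers the original names.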

Modeling

In this phase, machine learning models for classifying stellar objects are built by running the training process on the training data. Four machine learning algorithms are used for building the models:

Logistic Regression
K-Nearest Neighbor (KNN)
Random Forest
Extreme Gradient Boosting (XGBoost)

Each machine learning process follows these model building steps:

Searching for the best parameter values for the machine learning algorithm.
Model training. The error rate is calculated with the Mean Squared Error (MSE) metric.
Model prediction. The error rate is calculated with the Mean Squared Error (MSE) metric as well.
Calculating model accuracy. The accuracy rate is calculated with the Accuracy metric.
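When MSE is computed on integer-encoded class labels, it is not the same quantity as one minus the accuracy, because a mistake between labels 0 and 2 contributes a squared error of 4 while a mistake between adjacent labels contributes 1. The sketch below illustrates this on made-up labels, assuming the encoding 0 = GALAXY, 1 = QSO, 2 = STAR; it may help explain why the reported error rate and accuracy of a model need not sum to one.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical encoded labels: 0 = GALAXY, 1 = QSO, 2 = STAR
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 2, 1, 1, 0, 2, 2, 2])  # two mistakes

accuracy = accuracy_score(y_true, y_pred)   # share of correct predictions
misclassification_rate = 1.0 - accuracy     # share of wrong predictions
mse = mean_squared_error(y_true, y_pred)    # depends on the label distances

print(accuracy, misclassification_rate, mse)
```

Here two of ten predictions are wrong, so the misclassification rate is 0.2, while the MSE is (4 + 1) / 10 = 0.5.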

Logistic Regression

Model Definition

Logistic regression is a supervised machine learning classification algorithm which predicts the probability of each class based on the feature values. Logistic regression analyzes the relationship between variables, then assigns probabilities using the sigmoid function, which converts a numerical result into a probability between 0.0 and 1.0. [11]

Model Advantages

The advantages of Logistic Regression are as follows [12].

It can easily be extended from binomial classification (two classes) to multinomial classification (more than two classes).
It achieves good accuracy on simple datasets, and it performs well when the dataset is linearly separable.
It is fast at classifying unknown records.

Model Disadvantages

The disadvantages of Logistic Regression are as follows [12].

If the number of observations is smaller than the number of features, Logistic Regression is not recommended since it will lead to overfitting.
It requires average or no multicollinearity between independent variables.
It can only be used to predict discrete functions.

Parameters Details

C: Represents the inverse of the regularization strength, which determines how much the model is penalized for increasing the magnitude of its parameter values. A smaller C means stronger regularization. Its purpose is to prevent overfitting.
penalty: Represents the regularization type. The penalty prevents the model from assigning too much importance to any single feature, which makes the model more general.
solver: Represents the optimization algorithm. The solver's job is to find the weights or parameter values that minimize the loss function.
multi_class: Determines whether the classification is treated as binomial or multinomial.
max_iter: Represents the maximum number of iterations that the solver is allowed to run before it stops. If the solver has not converged to a solution within the specified number of iterations, it terminates the process.

Implemented Parameters Value

The best Logistic Regression parameter values, which are implemented in the model, are as follows.

C : 10
penalty : 'l2'
solver : 'lbfgs'
multi_class : 'multinomial'
max_iter : 2000
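These values can be passed to scikit-learn's `LogisticRegression` as sketched below, assuming scikit-learn is the library in use (the parameter names match its API). The data is synthetic. Two listed arguments are deliberately left at their defaults: `penalty='l2'` is already the default, and `multi_class` is deprecated in newer scikit-learn releases, where the lbfgs solver fits a multinomial model for multi-class targets by default.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

# Hypothetical standardized features for three well-separated classes
centers = np.array([[-2.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
y = np.repeat([0, 1, 2], 100)
X = centers[y] + rng.normal(scale=0.7, size=(300, 2))

# C, solver and max_iter follow the values listed above;
# penalty and multi_class are left at their defaults (see the note above)
model = LogisticRegression(C=10, solver="lbfgs", max_iter=2000)
model.fit(X, y)

train_accuracy = model.score(X, y)
probabilities = model.predict_proba(X[:1])  # one probability per class
print(round(train_accuracy, 3), probabilities.round(3))
```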

Model Overview

After training with the Logistic Regression algorithm, the model summary is as follows.

Error rate of predicting training data is around 0.0733, or 7.33%.
Error rate of predicting testing data is around 0.0853, or 8.53%.
Model accuracy is around 0.9370, or 93.70%.

K-Nearest Neighbor (KNN)

Model Definition

K-Nearest Neighbor (KNN) is a supervised machine learning algorithm which makes predictions based on the distance between one data point and the others. KNN works by finding the 'k' nearest neighbors of a data point, based on a distance metric such as Euclidean distance. [13]

Model Advantages

The advantages of KNN are as follows [14].

Its low complexity makes KNN easy to implement.
When KNN is working, all data points are stored in memory. Whenever new data points are added, the algorithm easily adjusts itself, which makes it adaptable.
It requires only two parameters, which are the distance metric and the value of 'k'.

Model Disadvantages

The disadvantages of KNN are listed below [14].

KNN is a lazy algorithm: it stores the whole training set and defers computation to prediction time, so it consumes considerable memory and computing resources and can be slow.
Curse of dimensionality. This is the phenomenon in which the algorithm has difficulty classifying data correctly when the number of dimensions is too high.
When the algorithm is affected by the curse of dimensionality, it also becomes prone to overfitting.

Parameters Details

metric: Represents the type of distance metric. The options include Euclidean, Manhattan, and Minkowski distance.
n_neighbors: Represents the 'k' value, which is the number of neighbors considered for each data point.
weights: Represents the weighting scheme applied to the neighbors, for example uniform weights or weights proportional to the inverse of the distance.

Implemented Parameters Value

The best values found for the KNN parameters, as implemented, are listed below.

metric : 'euclidean'
n_neighbors : 3
weights : 'distance'
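
To make these three settings concrete, the following is a minimal pure-Python sketch of a distance-weighted KNN vote with Euclidean distance and k = 3. It is an illustration under simplifying assumptions, not the code used in this project; the function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by a distance-weighted vote of its k nearest neighbors."""
    # Euclidean distance from the query to every stored training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]

    # Simplification: an exact match (distance 0) decides the vote on its own,
    # which also avoids dividing by zero below.
    exact = [label for dist, label in nearest if dist == 0.0]
    if exact:
        return exact[0]

    votes = defaultdict(float)
    for dist, label in nearest:
        votes[label] += 1.0 / dist  # closer neighbors count more
    return max(votes, key=votes.get)


# Hypothetical two-feature toy data, not taken from the SDSS dataset.
points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
labels = ["STAR", "STAR", "GALAXY", "GALAXY", "GALAXY"]

prediction = knn_predict(points, labels, query=(0.2, 0.1), k=3)
```

With inverse-distance weighting, a training point queried against its own training set is its own nearest neighbor at distance 0. This is one plausible reason a distance-weighted KNN can show a training error of 0% while still making errors on test data.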

Model Overview

The results of training the KNN model are summarized below.

The error rate on the training data is 0.00, or 0%.
The error rate on the testing data is about 0.0967, or 9.67%.
The model accuracy is about 0.9461, or 94.61%.

Random Forest

Model Definition

Random Forest is an ensemble learning algorithm built on the Decision Tree algorithm. It aggregates multiple decision trees to obtain a better result and is designed to enhance the accuracy and robustness of classification tasks. Each decision tree in the Random Forest is constructed using a subset of the training data and a random subset of features, which introduces diversity among the trees and makes the model more robust and less prone to overfitting [15].
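
The three ingredients described above (a random subset of rows per tree, a random subset of features per split, and aggregation of the trees' outputs) can be sketched in pure Python. This is only an illustration: the helper names are hypothetical, tree fitting itself is omitted, and a hard majority vote is used for simplicity, whereas some libraries average per-tree class probabilities instead.

```python
import math
import random
from collections import Counter


def bootstrap_indices(n_rows, rng):
    """Sample row indices with replacement, one bootstrap sample per tree."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct feature indices for one split."""
    size = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), size))


def majority_vote(tree_predictions):
    """Combine the class predicted by each tree into one forest prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)  # fixed seed so the sampling is reproducible

rows_for_tree = bootstrap_indices(n_rows=10, rng=rng)
features_for_split = feature_subset(n_features=9, rng=rng)

# Hypothetical predictions from five trees for a single object.
forest_prediction = majority_vote(["GALAXY", "QSO", "GALAXY", "STAR", "GALAXY"])
```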

Model Advantages

The advantages of Random Forest are listed below [16].

It tends to achieve higher accuracy because it aggregates several decision trees.
It can handle both numerical and categorical data without feature engineering.
It provides its own estimate of feature importance.

Model Disadvantages

The disadvantages of Random Forest are listed below [16].

Using a large number of trees in Random Forest is computationally expensive.
Random Forest takes longer to train than many other algorithms.
Random Forest can suffer from overfitting when the model captures noise in the training data.

Parameters Details

random_state: Seed for the random number generator, which controls the bootstrap sampling and feature selection so that results are reproducible.
n_estimators: Represents the number of trees in the forest.
min_samples_split: Minimum number of samples required to split a node. A higher value helps prevent overfitting, but a value that is too high can keep the model from learning enough complexity.
min_samples_leaf: Minimum number of samples required at a leaf node.
max_features: Number of features considered when looking for the best split at each node.
max_depth: Maximum depth of each tree. Deeper trees can capture more complex patterns, but also risk overfitting.

Implemented Parameters Value

The best values found for the Random Forest parameters, as implemented, are listed below.

random_state : 42
n_estimators : 200
min_samples_split : 2
min_samples_leaf : 1
max_features : 'sqrt'
max_depth : None

Model Overview

The results of training the Random Forest model are summarized below.

The error rate on the training data is 0.00, or 0%.
The error rate on the testing data is about 0.0303, or 3.03%.
The model accuracy is about 0.9736, or 97.36%.

Extreme Gradient Boosting (XGBoost)

Model Definition

XGBoost is an optimized distributed gradient boosting library designed for efficient and scalable training of machine learning models. It is an ensemble learning method that combines the predictions of multiple weak models to produce a stronger prediction. XGBoost stands for “Extreme Gradient Boosting” and it has become one of the most popular and widely used machine learning algorithms due to its ability to handle large datasets and its ability to achieve state-of-the-art performance in many machine learning tasks such as classification and regression [17].

Model Advantages

The advantages of XGBoost are listed below [17].

XGBoost is designed for efficient and scalable model training, which makes it a good choice for large datasets.
It has many parameters, which makes it highly customizable.
XGBoost has built-in support for handling missing data.

Model Disadvantages

The disadvantages of XGBoost are listed below [17].

It can be computationally expensive, especially when training large models.
It is prone to overfitting, especially when trained on a small amount of data.
Finding the optimal set of parameters is time-consuming and requires expertise.

Parameters Details

random_state: Seed for the random number generator, which makes random steps such as row subsampling reproducible.
eval_metric: Evaluation metric for validation data. If it is not set, a default metric is assigned according to the objective.
learning_rate: Scales how much each tree contributes to the final prediction. Smaller values often result in more accurate models, but they require more trees.
max_depth: Controls the depth of every tree. It is essential for controlling the model's complexity and helps avoid overfitting.
n_estimators: Specifies the number of boosting rounds.
subsample: Sets the fraction of the training data sampled at random to grow each tree, which lowers variance and improves generalization. Setting it too low, though, can result in underfitting.
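
The interplay between `learning_rate` and `n_estimators` can be sketched with plain gradient boosting on a toy regression problem. This is a simplified stand-in and not XGBoost itself: it assumes squared-error loss, one-feature decision stumps, no regularization, and no subsampling. All function names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators, learning_rate):
    """Fit stumps to the residuals round by round; return a prediction function."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Each new stump only contributes a fraction (learning_rate) of its output.
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def training_mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

few_rounds = training_mse(boost(xs, ys, n_estimators=5, learning_rate=0.2), xs, ys)
many_rounds = training_mse(boost(xs, ys, n_estimators=50, learning_rate=0.2), xs, ys)
```

With a small learning rate, each round corrects only part of the remaining error, so more rounds are needed to reach a low training error.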

Implemented Parameters Value

The best values found for the XGBoost parameters, as implemented, are listed below.

random_state : 42
eval_metric : 'logloss'
learning_rate : 0.2
max_depth : 7
n_estimators : 200
subsample : 0.8

Model Overview

The results of training the XGBoost model are summarized below.

The error rate on the training data is about 0.0093, or 0.93%.
The error rate on the testing data is about 0.0416, or 4.16%.
The model accuracy is about 0.9696, or 96.96%.

Selecting the Best Model

Based on the model overviews, the best model in this project is Random Forest. Random Forest achieves the lowest test error rate, shares the lowest training error rate with KNN, and has the highest accuracy among the models. Random Forest is therefore selected as the solution for stellar object classification. The selection process is explained further in the 'Evaluation' phase.

Evaluation

In this final phase, the best model is chosen for prediction based on several criteria:

Overfitting rate.
Error rate of training data prediction.
Error rate of test data prediction.
Model accuracy score.

Evaluation Metrics Explanation

Mean Squared Error (MSE)

Mean squared error (MSE) is a metric used to measure the average squared difference between the predicted values and the actual values in the dataset. It is calculated by taking the average of the squared residuals, where the residual is the difference between predicted value and the actual value for each data point. The MSE value provides a way to analyze the accuracy of the model [18].

The formula of MSE is as follows.

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2$$

The terms of the formula are interpreted below.

n is the number of observations in the dataset.
$Y_i$ is the actual value of the 'i' observation.
$\hat{Y}_i$ is the predicted value of the 'i' observation.

A lower MSE value indicates better accuracy. A higher MSE value indicates that the predicted values deviate more from the true values. In this project, MSE is used to calculate the error rate on the training data and test data during the training process.
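
As a minimal sketch of the MSE calculation described above, the function below averages the squared differences between actual and predicted values. The example assumes label-encoded classes (0, 1, 2); the values are hypothetical and are not results from this project.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical label-encoded classes, with one wrong prediction out of five.
actual = [0, 1, 2, 2, 0]
predicted = [0, 1, 2, 0, 0]

error = mean_squared_error(actual, predicted)
misclassification_rate = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
```

When MSE is applied to encoded class labels, a single mistake between labels 2 and 0 contributes a squared difference of 4, so the MSE is not the same as the share of misclassified samples. This may explain why an MSE-based error rate and the accuracy score do not add up to exactly 1.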

Confusion Matrix

A confusion matrix is a matrix that summarizes the performance of a machine learning model on a set of test data. It is a means of displaying the number of accurate and inaccurate instances based on the model’s predictions. It is often used to measure the performance of classification models, which aim to predict a categorical label for each input instance [19].

The matrix displays the number of instances produced by the model on the test data.

True Positive (TP): The model correctly predicted a positive outcome (the actual outcome was positive).
True Negative (TN): The model correctly predicted a negative outcome (the actual outcome was negative).
False Positive (FP): The model incorrectly predicted a positive outcome (the actual outcome was negative). Also known as a Type I error.
False Negative (FN): The model incorrectly predicted a negative outcome (the actual outcome was positive). Also known as a Type II error.

Metrics based on Confusion Matrix are as follows:

Accuracy. It measures the overall performance of the model. It is the ratio of the total correct instances to the total instances.
Precision. It is a measure of how accurate a model’s positive predictions are. It is defined as the ratio of true positive predictions to the total number of positive predictions made by the model.
Recall. It measures the effectiveness of a classification model in identifying all relevant instances from a dataset. It is the ratio of the number of true positive (TP) instances to the sum of true positive and false negative (FN) instances.
F1-score. It is used to evaluate the overall performance of a classification model. It is the harmonic mean of precision and recall.
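
The metrics above can be computed directly from a confusion matrix. The sketch below is a pure-Python illustration for a three-class problem, treating each class in turn as the positive class. The function names and the small label lists are hypothetical and are not results from this project.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def class_metrics(matrix, i):
    """Precision, recall, and F1-score with class i as the positive class."""
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp                 # actual i, predicted as another class
    fp = sum(row[i] for row in matrix) - tp  # another class, predicted as i
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


classes = ["GALAXY", "QSO", "STAR"]
actual = ["GALAXY", "GALAXY", "GALAXY", "QSO", "QSO", "STAR", "STAR", "STAR"]
predicted = ["GALAXY", "GALAXY", "QSO", "QSO", "GALAXY", "STAR", "STAR", "STAR"]

matrix = confusion_matrix(actual, predicted, classes)
overall_accuracy = accuracy(matrix)
galaxy_precision, galaxy_recall, galaxy_f1 = class_metrics(matrix, 0)
```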

For this project, the accuracy metric is used to measure model performance.

Project Result

Logistic Regression

Confusion Matrix

Classification Report

For the Logistic Regression model, the number of misclassified samples per class is listed below.

GALAXY: 365 + 64 = 429
QSO: 110 + 0 = 110
STAR: 0 + 0 = 0

The total number of misclassified samples for Logistic Regression is 539. The model accuracy in the classification report is 0.94, or 94%.

K-Nearest Neighbor (KNN)

Confusion Matrix

Classification Report

For the KNN model, the number of misclassified samples per class is listed below.

GALAXY: 242 + 109 = 351
QSO: 93 + 3 = 96
STAR: 13 + 1 = 14

The total number of misclassified samples for KNN is 461. The model accuracy in the classification report is 0.95, or 95%.

Random Forest

Confusion Matrix

Classification Report

For the Random Forest model, the number of misclassified samples per class is listed below.

GALAXY: 121 + 11 = 132
QSO: 94 + 0 = 94
STAR: 0 + 0 = 0

The total number of misclassified samples for Random Forest is 226. The model accuracy in the classification report is 0.97, or 97%.

Extreme Gradient Boosting (XGBoost)

Confusion Matrix

Classification Report

For the XGBoost model, the number of misclassified samples per class is listed below.

GALAXY: 142 + 20 = 162
QSO: 86 + 0 = 86
STAR: 12 + 0 = 12

The total number of misclassified samples for XGBoost is 260. The model accuracy in the classification report is 0.97, or 97%.

Result

Based on all the metrics above, from both the confusion matrices and the bar plots, the best-scoring model in each category is listed below.

Confusion Matrix: the model with the fewest misclassified samples is Random Forest, with 226.
Training Error Rate: the models with the lowest error rate are KNN and Random Forest, with 0%.
Test Error Rate: the model with the lowest error rate is Random Forest, with about 3.03%.
Model Accuracy: the model with the highest accuracy is Random Forest, with about 97.36%.
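
The comparison can be reproduced with a few lines of Python using only the figures reported in the Model Overview and Project Result sections. The train/test gap computed here is one simple indicator of overfitting and is an assumption of this sketch; the report does not state how its overfitting rate was measured.

```python
# Figures copied from the Model Overview and Project Result sections.
# "misclassified" lists the per-class counts for GALAXY, QSO, and STAR.
reported = {
    "Logistic Regression": {
        "train_error": 0.0733, "test_error": 0.0853,
        "accuracy": 0.9370, "misclassified": [429, 110, 0],
    },
    "KNN": {
        "train_error": 0.0, "test_error": 0.0967,
        "accuracy": 0.9461, "misclassified": [351, 96, 14],
    },
    "Random Forest": {
        "train_error": 0.0, "test_error": 0.0303,
        "accuracy": 0.9736, "misclassified": [132, 94, 0],
    },
    "XGBoost": {
        "train_error": 0.0093, "test_error": 0.0416,
        "accuracy": 0.9696, "misclassified": [162, 86, 12],
    },
}

# Total misclassified samples per model.
totals = {name: sum(m["misclassified"]) for name, m in reported.items()}

# Gap between test and training error (assumed overfitting indicator).
gaps = {
    name: round(m["test_error"] - m["train_error"], 4)
    for name, m in reported.items()
}

fewest_errors = min(totals, key=totals.get)
lowest_test_error = min(reported, key=lambda name: reported[name]["test_error"])
highest_accuracy = max(reported, key=lambda name: reported[name]["accuracy"])
```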

Based on the results above, the best model for predicting stellar objects is Random Forest. This model shows the lowest test error rate, shares the lowest training error rate with KNN, and has the highest accuracy among all models. However, the gap between its training error rate (0%) and test error rate (3.03%) indicates some overfitting, so the model can be tuned further to reduce it.

Final Conclusions

The majority of stellar objects in this project's dataset belong to the Galaxy category (64.95%), followed by the Star category (23.85%). The remaining objects belong to the Quasar category (11.20%).
Stellar objects can be distinguished quite easily by their redshift value. Redshift affects the values of most features, and it is a strong predictor of the stellar object category.
Stars can be seen across the entire range of sky longitude and latitude, while galaxies and quasars can be seen across most of it. Only a few galaxies and quasars are seen at these sky coordinates:
- Longitude, or alpha: in the ranges 50-100 degrees and 250-300 degrees
- Latitude, or delta: in the ranges -20 to -10 degrees and 70-80 degrees
Among all categories, quasars have the shortest spectrum length in all photometric system colors. Galaxies and stars have very similar spectrum lengths in all photometric systems, which makes these two categories difficult to distinguish by photometric system alone.
The photometric system relates to redshift differently for each stellar object category.
- Stars: the photometric system does not affect redshift at all.
- Galaxies: for most of the data, redshift increases as the photometric values increase. However, the range of the redshift distribution also widens as the photometric values increase.
- Quasars: for most of the data, redshift also increases as the photometric values increase. However, the range of the redshift distribution narrows as the photometric values increase.
The best machine learning model for classifying stellar objects is the Random Forest model.

References

[1] Sloan Digital Sky Survey, “The wide net cast by the SDSS telescope,” https://skyserver.sdss.org/dr7/en/sdss/#:~:text=The%20wide%20net%20cast%20by,history%20of%20our%20solar%20system. Accessed on: Nov. 23, 2024.
[2] SDSS Voyages, “The SDSS Telescope,” https://voyages.sdss.org/preflight/capturing-recording-light/sdss-telescope/. Accessed on: Nov. 23, 2024.
[3] K. Abdurro’uf, C. Accetta, V. Aerts, R. Silva Aguirre, R. Ahumada, and N. Ajgaonkar, et al., “The seventeenth data release of the Sloan Digital Sky Surveys: Complete release of MaNGA, MaStar, and APOGEE-2 data,” The Astrophysical Journal Supplement Series, 259(35), pp. 1–39, 2022. DOI. Accessed on: Nov. 23, 2024.
[4] Suresh Kumar Mukhiya and Usman Ahmed, Hands-On Exploratory Data Analysis with Python, Birmingham: Packt Publishing Ltd., 2020.
[5] Kaggle Learn, "Mutual Information: Local features with the most potential", https://www.kaggle.com/code/ryanholbrook/mutual-information. Accessed on: Nov. 18, 2024.
[6] SimpliLearn, "Feature Selection in Machine Learning: All You Need to Know," https://www.simplilearn.com/tutorials/machine-learning-tutorial/feature-selection-in-machine-learning. Accessed on: Nov. 28, 2024.
[7] GeeksforGeeks, "Splitting Data for Machine Learning Models," https://www.geeksforgeeks.org/splitting-data-for-machine-learning-models/. Accessed on: Nov. 28, 2024.
[8] Analytics Vidhya, "SMOTE for Imbalanced Classification with Python," https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/. Accessed on: Nov. 22, 2024.
[9] DataCamp, "Normalization vs. Standardization: How to Know the Difference," https://www.datacamp.com/tutorial/normalization-vs-standardization. Accessed on: Nov. 22, 2024.
[10] Kaggle Learn, "Categorical Variables," https://www.kaggle.com/code/alexisbcook/categorical-variables. Accessed on: Nov. 28, 2024.
[11] KD Nuggets, “How Does Logistic Regression Work?”, https://www.kdnuggets.com/2022/07/logistic-regression-work.html. Accessed on: Nov. 25, 2024.
[12] GeeksforGeeks, “Advantages and Disadvantages of Logistic Regression,” https://www.geeksforgeeks.org/advantages-and-disadvantages-of-logistic-regression/. Accessed on: Nov. 25, 2024.
[13] J. Brownlee, “Multinomial Logistic Regression with Python,” Machine Learning Mastery, 2020. https://machinelearningmastery.com/multinomial-logistic-regression-with-python/. Accessed on: Nov. 25, 2024.
[14] GeeksforGeeks, “K-Nearest Neighbor (KNN) Algorithm,” https://www.geeksforgeeks.org/k-nearest-neighbours/. Accessed on: Nov. 25, 2024.
[15] GeeksforGeeks, “Random Forest Algorithm in Machine Learning,” https://www.geeksforgeeks.org/random-forest-algorithm-in-machine-learning/. Accessed on: Nov. 25, 2024.
[16] GeeksforGeeks, “What are the Advantages and Disadvantages of Random Forest?”, https://www.geeksforgeeks.org/what-are-the-advantages-and-disadvantages-of-random-forest/. Accessed on: Nov. 25, 2024.
[17] GeeksforGeeks, “XGBoost,” https://www.geeksforgeeks.org/xgboost/. Accessed on: Nov. 25, 2024.
[18] GeeksforGeeks, “Mean Squared Error,” https://www.geeksforgeeks.org/mean-squared-error/. Accessed on: Nov. 26, 2024.
[19] DataCamp, “What is A Confusion Matrix in Machine Learning? The Model Evaluation Tool Explained,” https://www.datacamp.com/tutorial/what-is-a-confusion-matrix-in-machine-learning. Accessed on: Nov. 26, 2024.
