The sky is closely tied to human life. For centuries it has attracted interest from humankind as a subject of research, and it has inspired many dreams and theories about the universe: theories about the creation of the universe, descriptions of the structure of the solar system, and missions to search for planets in other galaxies that resemble Earth. The Sloan Digital Sky Survey (SDSS) project was launched in 1998 to explore space and look for answers to long-standing questions about outer space.
One of the many tasks of SDSS is mapping the sky [1], [2]. The dataset of this classification project is the result of sky mapping in the seventeenth data release (DR17) of SDSS Stage 4. According to the paper "The Seventeenth Data Release of the Sloan Digital Sky Surveys: Complete Release of MaNGA, MaStar, and APOGEE-2 Data", the cumulative total observed by SDSS telescopes is 2,659,178 individual infrared spectra, which represent stellar objects in the sky [3]. Infrared spectra need to be observed and calculated in order to classify them into several categories. Given the huge number of stellar objects that need to be categorized, this task is time-consuming when done manually, considering the many mathematical calculations involved in the categorization process.
Machine learning is one of the fastest-developing technologies today. As it evolves, it has become a versatile technology that can be adapted to many fields, including astronomy. One of its capabilities is classification. Using the variables or features provided in the SDSS DR17 dataset, such as redshift and the photometric system, machine learning can classify a large number of stellar objects in a short time and with high reliability. Therefore, in this project, the author applies machine learning to classify stellar objects into three categories (galaxy, star, and quasar). Prediction uses four machine learning algorithms for classification: Logistic Regression, K-Nearest Neighbor (KNN), Random Forest, and Extreme Gradient Boosting (XGBoost).
Project journal reference: 'The Seventeenth Data Release of the Sloan Digital Sky Surveys: Complete Release of MaNGA, MaStar and APOGEE-2 Data'
Based on the project domain stated above, the problem statements of this project are as follows.
- How are stellar objects distributed across the categories?
- Which features most strongly show the differences between the categories of stellar objects?
- What is the position of each stellar object category in sky longitude and latitude?
- How are stellar objects distributed across the photometric system?
- What is the relationship between redshift and the photometric system in each stellar object category?
- What is the best machine learning model for classifying stellar objects?
Based on the problem statements, the goals of this project are as follows.
- To determine the number and percentage of stellar objects in each category.
- To analyze which variables or features of stellar objects have the highest correlation and mutual information values.
- To examine the position of each stellar object category in sky longitude and latitude.
- To analyze the distribution of stellar objects in each band of the photometric system, which consists of u, g, r, i, and z.
- To identify relationship patterns between the redshift value and each band of the photometric system in each stellar object category.
- To find the machine learning model with the highest accuracy for classifying stellar objects.
The solutions for achieving the goals are as follows.
- Carry out Exploratory Data Analysis (EDA) with suitable plots to accomplish the goals of:
  - determining the number and percentage of stellar objects in each category by using a bar plot,
  - analyzing which variables or features of stellar objects have the highest correlation and mutual information values by using a correlation heatmap and a mutual information bar plot,
  - examining the position of each stellar object category in sky longitude and latitude by using strip plots,
  - analyzing the distribution of stellar objects in each band of the photometric system, which consists of u, g, r, i, and z, by using strip plots and histograms,
  - identifying relationship patterns between the redshift value and each band of the photometric system in each stellar object category by using strip plots, histograms, and scatter plots.
- Build four machine learning models for classification by using the four algorithms below.
  - Logistic Regression,
  - K-Nearest Neighbor (KNN),
  - Random Forest,
  - Extreme Gradient Boosting (XGBoost).
- Determine the best machine learning model by the four metrics below.
  - Confusion matrix,
  - Error rate of training data prediction,
  - Error rate of test data prediction,
  - Accuracy score.
The dataset for this project is available on the Kaggle platform at this link. It was published three years ago by a user named FEDESORIANO. Its usability rating on Kaggle is 10.0. The dataset consists of a single CSV file.
The feature names and details are listed below.
| Feature Name | Feature Details |
|---|---|
| obj_ID | A unique identifier for each stellar object in the dataset |
| alpha | Right ascension angle, which is the celestial longitude line. The unit is in degree. |
| delta | Declination angle, which is the celestial latitude line. The unit is in degree. |
| u | Photometric system, representing ultraviolet color |
| g | Photometric system, representing green color |
| r | Photometric system, representing red color |
| i | Photometric system, representing near-infrared color |
| z | Photometric system, representing infrared color |
| run_ID | Run number, identify the specific scan |
| rerun_ID | Rerun number, to specify how the image was processed |
| cam_col | Camera column, to identify the scanline within the run |
| field_ID | Field number, to identify each field |
| spec_obj_ID | Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same spec_obj_ID must share the output class) |
| class | Stellar object class. GALAXY represents galaxy. STAR represents star. QSO represents quasar. |
| redshift | Redshift value based on the increase in wavelength |
| plate | Plate ID, identifies each plate in SDSS |
| MJD | Modified Julian Date, used to indicate when a given piece of SDSS data was taken |
| fiber_ID | Fiber ID, identifies the fiber that pointed the light at the focal plane in each observation |
There are 100,000 data rows in the dataset.
There are 18 features in the dataset. Based on the output of the info() function, the DataFrame feature data types are:
- Float type: 10 features,
- Integer type: 7 features,
- Object (or text) type: 1 feature.
There is no missing or empty data in the dataset.
There is no duplicated data in the dataset.
| | obj_ID | alpha | delta | u | g | r | i | z | run_ID | rerun_ID | cam_col | field_ID | spec_obj_ID | class | redshift | plate | MJD | fiber_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 |
| unique | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 3 | nan | nan | nan | nan |
| top | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | GALAXY | nan | nan | nan | nan |
| freq | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 59445 | nan | nan | nan | nan |
| mean | 1.23766e+18 | 177.629 | 24.1353 | 21.9805 | 20.5314 | 19.6458 | 19.0849 | 18.6688 | 4481.37 | 301 | 3.51161 | 186.131 | 5.78388e+18 | nan | 0.576661 | 5137.01 | 55588.6 | 449.313 |
| std | 8.43856e+12 | 96.5022 | 19.6447 | 31.7693 | 31.7503 | 1.85476 | 1.75789 | 31.7282 | 1964.76 | 0 | 1.58691 | 149.011 | 3.32402e+18 | nan | 0.730707 | 2952.3 | 1808.48 | 272.498 |
| min | 1.23765e+18 | 0.00552783 | -18.7853 | -9999 | -9999 | 9.82207 | 9.4699 | -9999 | 109 | 301 | 1 | 11 | 2.99519e+17 | nan | -0.00997067 | 266 | 51608 | 1 |
| 25% | 1.23766e+18 | 127.518 | 5.14677 | 20.3524 | 18.9652 | 18.1358 | 17.7323 | 17.4607 | 3187 | 301 | 2 | 82 | 2.84414e+18 | nan | 0.0545168 | 2526 | 54234 | 221 |
| 50% | 1.23766e+18 | 180.901 | 23.6459 | 22.1791 | 21.0998 | 20.1253 | 19.4051 | 19.0046 | 4188 | 301 | 4 | 146 | 5.61488e+18 | nan | 0.424173 | 4987 | 55868.5 | 433 |
| 75% | 1.23767e+18 | 233.895 | 39.9016 | 23.6874 | 22.1238 | 21.0448 | 20.3965 | 19.9211 | 5326 | 301 | 5 | 241 | 8.33214e+18 | nan | 0.704154 | 7400.25 | 56777 | 645 |
| max | 1.23768e+18 | 360 | 83.0005 | 32.7814 | 31.6022 | 29.5719 | 32.1415 | 29.3837 | 8162 | 301 | 6 | 989 | 1.41269e+19 | nan | 7.01124 | 12547 | 58932 | 1000 |
At this stage, the most readily interpretable summary is that of the class feature. The DataFrame contains three types of stellar object, and most rows belong to the GALAXY category, with more than fifty-nine thousand entries.
As for data abnormalities, some features are suspected of having negative outliers, based on a comparison between their mean and min values. All columns other than class have a positive mean value, yet some features have a negative min value:
- delta
- u
- g
- z
- redshift
Outliers are data points which diverge from the other observations for several reasons, such as natural anomalies. The main reason for detecting and filtering outliers is that they can severely distort statistics such as the mean. Distorted statistics like this can cause serious issues when building a machine learning model [4].
The boxplot of the delta feature above shows that none of the three stellar object classes has outliers in this feature. A delta value smaller than zero is not an anomaly.
In the boxplots above, the boxes cannot be read for any of the features, since the plots are compressed around zero. Each boxplot shows an outlier with an extremely small value compared to the position of the box. On inspection, the outlier of each feature points to one and the same data row, not to different rows.
From the boxplot of the redshift feature above, no outliers are identified where redshift is smaller than zero in any class. Outliers are identified where redshift is more than 1.2 for the GALAXY class and more than 3.9 for the QSO class. No outliers are identified for the STAR class. This is possibly because the largest redshift outliers are around 7.0, a scale far too large compared to the STAR class range.
The boxplot above for the redshift feature of the STAR class shows that the statistical range of this class is extremely narrow. With a max value slightly greater than 0.0006 and a min value slightly less than -0.0008, the statistical range of the STAR class is only 0.0014. This is tiny compared to the distribution of redshift values in the GALAXY and QSO classes. Quite a lot of outliers also exist above the max value and below the min value.
The conclusions based on the boxplots above are as follows.
- Three features have outliers in all classes.
  - r: Outliers exist above the max value and below the min value.
  - i: Outliers exist above the max value and below the min value.
  - field_ID: Outliers exist above the max value only.
- Three features have outliers in some classes.
  - spec_obj_ID: An outlier exists above the max value of the STAR class.
  - plate: An outlier exists above the max value of the STAR class.
  - MJD: Some outliers exist below the min value of the QSO class.
- Five features have no outliers.
  - obj_ID
  - alpha
  - run_ID
  - cam_col
  - fiber_ID
- Only one feature, rerun_ID, has no statistical range. Since this feature has a single value for the entire dataset, it will be dropped.
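```python
# Compute the IQR fences on the numeric columns only (class is a text column).
numeric_df = stellar_df.select_dtypes(include="number")
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1
# A row counts as an outlier if any numeric feature falls outside the fences.
outliers = stellar_df[((numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))).any(axis=1)]
print(f"Amount of outliers: {len(outliers)} rows")
```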
Output:
Amount of outliers: 14403 rows
Using the Interquartile Range (IQR) technique to detect outliers, as in the code above [4], more than fourteen thousand outlier rows are found in the Stellar Object DataFrame. That is more than one tenth of the whole dataset.
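The 1.5 × IQR rule can also be checked by hand on a small sample. The sketch below uses only the Python standard library; the magnitude values are hypothetical and are not taken from the dataset, apart from the -9999 placeholder that appears as a min value in the summary table.

```python
from statistics import quantiles

def iqr_fences(values):
    """Return the (lower, upper) fences of the 1.5 * IQR rule."""
    # method="inclusive" interpolates linearly between sorted data points.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical magnitudes for one photometric band, plus a -9999 placeholder.
sample = [18.1, 18.9, 19.4, 19.6, 20.1, 20.4, 21.0, -9999.0]
lower, upper = iqr_fences(sample)

# Values outside [lower, upper] are flagged as outliers.
outliers = [v for v in sample if v < lower or v > upper]
print(outliers)  # [-9999.0]
```

Only the placeholder value falls outside the fences here; the ordinary magnitudes all stay inside them.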
Since the amount of non-outlier data is sufficient, the outliers can be dropped.
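```python
# Keep only the rows in which no numeric feature falls outside the IQR fences.
numeric_df = stellar_df.select_dtypes(include="number")
stellar_df = stellar_df[~((numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))).any(axis=1)]
print(f"Amount of stellar_df (after dropping outliers): {len(stellar_df)} rows")
```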
Output:
Amount of stellar_df (after dropping outliers): 85596 rows
After outliers are dropped, the remaining data of the dataset is more than eighty five thousand rows.
The only categorical feature in the stellar objects dataset is the class feature. As mentioned, it consists of three classes: GALAXY, STAR, and QSO. The bar plot of the class feature is shown below.

The bar plot above shows that the majority of the data belongs to the GALAXY class, which holds more than half of the whole dataset (64.95%), followed by the STAR (23.85%) and QSO (11.20%) classes. QSO holds the smallest share. This shows that the stellar objects observed by the SDSS telescope in this dataset are mostly galaxies, while stars and quasars make up a smaller share. The bar plot also indicates that the data for machine learning is imbalanced. Thus, the data needs to be balanced to get good prediction performance later.
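The counts and percentages behind such a bar plot can be computed in a few lines. This is a minimal sketch with a hypothetical list of twenty labels, so its shares are only illustrative and differ from the percentages reported above.

```python
from collections import Counter

def class_shares(labels):
    """Map each class to (count, percentage of all rows), largest class first."""
    counts = Counter(labels)
    total = len(labels)
    return {name: (n, round(100 * n / total, 2)) for name, n in counts.most_common()}

# Hypothetical labels, not the real dataset.
labels = ["GALAXY"] * 13 + ["STAR"] * 5 + ["QSO"] * 2
shares = class_shares(labels)
print(shares)  # {'GALAXY': (13, 65.0), 'STAR': (5, 25.0), 'QSO': (2, 10.0)}

# Ratio between the largest and smallest class, a quick imbalance indicator.
imbalance_ratio = max(n for n, _ in shares.values()) / min(n for n, _ in shares.values())
print(imbalance_ratio)  # 6.5
```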
In this phase, every numerical feature in the DataFrame is plotted as a histogram in order to observe its data distribution. Sixteen numerical features are observed.
The histograms above describe the distribution of the numerical features of the stellar object dataset. The conclusions are as follows.
- Features with a nearly normal data distribution are obj_ID, alpha, u, g, r, i, z, run_ID, and cam_col. All photometric colors belong to this category.
- Features with a right-skewed data distribution are delta, field_ID, spec_obj_ID, redshift, plate, and fiber_ID.
- The only feature with a left-skewed data distribution is MJD.
In this phase, the relationships between all features are analyzed. This is important since the data was recorded at the same time, so one feature may affect the values of other features, either positively or negatively. The feature relationships are observed with two tools: a Correlation Heatmap and a Mutual Information Bar Plot.
Correlation Heatmap
A correlation heatmap is a heatmap which displays the strength of correlation between every pair of features [4]. There are three types of relationship in a correlation heatmap [4], where $x$ is the correlation value:
- Positive correlation, which means two features affect each other's values in the same direction. The $x$ value is $0 < x \le 1$.
- No correlation, which means two features don't affect each other's values. The $x$ value is equal to zero.
- Negative correlation, which means two features affect each other's values in opposite directions. The $x$ value is $-1 \le x < 0$.
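To make the sign and range of the correlation value concrete, here is a minimal sketch of the Pearson coefficient in plain Python. It assumes the heatmap is based on Pearson correlation, which the text does not state explicitly, and the three value lists are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values: r_band rises with g_band, falling moves the opposite way.
g_band = [18.0, 19.0, 20.0, 21.0, 22.0]
r_band = [17.5, 18.4, 19.6, 20.5, 21.5]
falling = [5.0, 4.0, 3.0, 2.0, 1.0]

positive = pearson(g_band, r_band)   # close to +1
negative = pearson(g_band, falling)  # -1 up to floating-point rounding
print(round(positive, 3), round(negative, 3))
```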
A correlation heatmap can only detect relationships between one numerical feature and another. The correlation heatmap is shown below.
According to the correlation heatmap above, the conclusions are as follows.
- The features with generally low positive or low negative correlations are:
  - obj_ID
  - alpha
  - delta
  - run_ID
  - cam_col
  - field_ID
  - fiber_ID
- The features with mostly strong positive correlations are:
  - All photometric features, which are u, g, r, i, and z.
  - spec_obj_ID
  - redshift
  - plate
  - MJD
Mutual Information Bar Plot
Mutual information is a univariate metric that measures how much certainty a feature provides about the target value when the feature is at a particular level. The higher the mutual information value of a feature, the more certain that feature is in determining the target value.
In contrast to the correlation heatmap, mutual information can measure the relationship between a feature and a target of different data types, which makes it easier to use. However, since mutual information is a univariate metric, it cannot measure the relationships among all features in the dataset as the correlation heatmap does [5]. The mutual information bar plot is shown below.
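The idea can be illustrated with the textbook definition of mutual information for discrete values. This is only a sketch: it assumes the feature has already been binned, whereas library estimators for continuous features typically work differently, and the bin names and labels are hypothetical.

```python
from collections import Counter
from math import log

def mutual_information(feature_bins, labels):
    """I(X; Y) in nats for two equally long sequences of discrete values."""
    n = len(labels)
    joint = Counter(zip(feature_bins, labels))
    px = Counter(feature_bins)
    py = Counter(labels)
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

labels = ["STAR", "STAR", "GALAXY", "GALAXY", "QSO", "QSO"]
# Hypothetical binned feature that fully determines the class.
redshift_bin = ["zero", "zero", "low", "low", "high", "high"]
# Hypothetical feature that is independent of the class.
sky_half = ["east", "west", "east", "west", "east", "west"]

mi_informative = mutual_information(redshift_bin, labels)  # about log(3)
mi_independent = mutual_information(sky_half, labels)      # about 0
print(round(mi_informative, 4), round(mi_independent, 4))
```

A feature that pins down the class reaches the entropy of the class distribution, while an independent feature scores zero.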
Based on the mutual information bar plot above, the conclusions are as follows.
- The highest mutual information value is held by the redshift feature. It means that the redshift value at every level gives strong certainty for defining or predicting the value of the class feature as the target. This is in line with the correlation heatmap.
- Surprisingly, obj_ID and run_ID, which have low correlation values, are quite certain in predicting the value of the class feature. This shows that certainty is not always in line with correlation.
- All photometric features have low certainty in predicting the target value, even though they are strongly correlated with each other.
- alpha, delta, fiber_ID, field_ID, and cam_col are low in both correlation and certainty in prediction.
- spec_obj_ID, plate, and MJD are quite strong in both correlation and certainty in prediction.
Sky longitude is represented by the alpha feature, and sky latitude by the delta feature. The position of each stellar object category in the sky, based on sky longitude and latitude, is shown in the strip plots below.
The two strip plots above describe the position of stellar objects in the sky by category. The conclusions are as follows.
- alpha feature
  - The stellar object category that scatters most evenly across alpha is STAR. The density of STAR data is fairly low in the alpha range 250 to 300, but there are still many stars at those coordinates.
  - GALAXY and QSO have nearly identical scattering patterns on alpha. Both categories are dense all along the coordinates, although the density decreases for both in the ranges 50 to 100 and 250 to 300.
- delta feature
  - The stellar object category with the longest scattering range along the delta coordinates is STAR, from -20 up to 80.
  - For STAR, the density is stable from -20 to 70. At 70 degrees there are no star objects. A few more star objects then appear from slightly above 70 degrees up to 80.
  - Again, GALAXY and QSO have nearly identical scattering patterns on delta. Both mostly span the range 10 to 70, with a few stellar objects appearing below and above that range.
The spectrum ranges of each stellar object category, based on the photometric system, are shown in the strip plots below.
From the photometric strip plots above, the conclusions are as follows.
- Of all stellar objects, QSO has the shortest range of spectrum values in all colors.
- The QSO minimum is higher than the GALAXY and STAR minimums, and its maximum is lower than the GALAXY and STAR maximums, so the QSO range is shorter and lies within the GALAXY and STAR ranges.
- The GALAXY and STAR classes have a similar range of spectrum values in all colors.
- The minimum and maximum values of GALAXY and STAR are also similar to each other.
- In short, GALAXY, STAR, and QSO are similar in photometric terms and quite hard to distinguish. Telling GALAXY and STAR apart requires other features that record more significant differences between these two objects.
The histograms above give more detail about the data density at each photometric value. From the histograms of the photometric features above, the conclusions are as follows.
- In all colors, the GALAXY class distribution tends to be left-skewed, which means most of the data has high values in the photometric features.
- For the QSO class, the distribution also tends to be left-skewed, except in the u color, where it is right-skewed. It means that QSO in the u color has more data with low values.
- The STAR class differs: in all colors it tends to have a nearly normal distribution. It means that the majority of STAR data has photometric values in the middle of the range.
Based on the earlier feature relationship analysis, the redshift feature scores high in both the correlation heatmap and the mutual information bar plot. The strip plot and histogram above explain these high values further: both plots show significant differences between the three classes of stellar object. The conclusions are as follows.
- GALAXY class
  - On the strip plot, redshift starts at 0.0 and stops at around 1.70. The density is stable from the start up to 1.0, then decreases from slightly above 1.0 to around 1.70.
  - The histogram visualizes the density better. The redshift distribution for the GALAXY class is bimodal, with two peaks. The majority of the data lies in the range 0.0 to 0.75, then decreases significantly from 0.75 to around 1.70.
- QSO class
  - As with the GALAXY class, on the strip plot the redshift of QSO starts at 0.0 and stops at around 1.70. The difference between the two classes lies in the density: it is low from the start up to 0.5, then gets denser from 0.5 to around 1.7.
  - On the histogram, the redshift distribution for the QSO class is left-skewed, meaning most of the data has a high redshift value.
- STAR class
  - On the strip plot, redshift is only a small strip at 0.0. This shows that STAR data has practically no redshift.
  - On the histogram, the redshift distribution for the STAR class is a single tall bar at 0.0. All data is concentrated at one point.
The redshift feature has a relatively strong correlation with all photometric features, with correlation scores of around 0.33 to 0.68. A scatter plot shows the correlation between two numerical features in detail. Because of this relationship between redshift and the photometric values, presenting the data in scatter plots gives more information to explore.
The conclusions from the scatter plots above are as follows.
- There is no significant difference in data patterns across the scatter plots.
- For the GALAXY class, an increasing photometric value goes with an increasing redshift value for most of the data. However, an increasing photometric value also widens the range of redshift values. For example, on the scatter plot of the z feature against redshift, when z is 14, redshift ranges from 0.00 to around 0.10. When z is 16, the range widens to 0.00 to around 0.25. The maximum of the redshift range keeps increasing as z increases, while a higher z does not affect the minimum of the range.
- For the QSO class, redshift is relatively high across the range of photometric values. However, an increasing photometric value narrows the range of redshift values. For example, on the scatter plot of the z feature against redshift, when z is 18, redshift ranges from around 0.75 to 1.75. When z is 20, the range narrows to around 1.00 to 1.75. The minimum of the redshift range keeps increasing as z increases, while a higher z does not affect the maximum of the range.
- For the STAR class, the redshift feature and the photometric features do not affect each other. For example, on the scatter plot of the z feature against redshift, no matter how low or high the value of z is, redshift is always around zero.
In this phase, the relevant features are selected for building the machine learning models. Feature selection helps to remove unrelated features from the dataset while preserving the most important information, such as major trends or patterns [6].
Since redshift is the strongest feature and is related to many other features in the Stellar Object dataset, the redshift feature is kept, along with all of the photometric features, since those correlate strongly with redshift. The remaining features, except the class feature, are dropped and not included in model building.
The features consist of the redshift feature and all photometric values. The target is the class feature.

The total number of remaining features is seven after dropping features.
Dividing the data into training data and test data is important. Training data lets the model learn the patterns in the data during the training process, while test data is used to examine the accuracy of the model after training. Test data is kept separate from training data to avoid data leakage, which would make the measured accuracy misleading. A held-out test set also makes it possible to detect underfitting or overfitting when the model predicts new data [7].

With more than eighty-five thousand rows in total, the data is sufficient for model training. The chosen ratio between training data and test data in this case is 90:10. After the split, the training data totals slightly more than seventy-seven thousand rows, while the test data totals more than eight thousand rows.
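The arithmetic of the 90:10 split can be sketched on row indices alone. This is a minimal illustration, not the project's actual splitting code: the seed value and the choice to round the test size up are assumptions.

```python
import math
import random

def split_indices(n_rows, test_ratio=0.1, seed=42):
    """Shuffle row indices and split them into (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # seeded, so the split is reproducible
    n_test = math.ceil(n_rows * test_ratio)   # assumption: test size is rounded up
    return indices[n_test:], indices[:n_test]

# 85596 is the row count reported after dropping outliers.
train_idx, test_idx = split_indices(85596)
print(len(train_idx), len(test_idx))  # 77036 8560
```

Under these assumptions the sizes agree with the totals described above: slightly more than seventy-seven thousand training rows and more than eight thousand test rows.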
The earlier bar plot of the number of rows per category shows that one category holds far more data than the others. In the Stellar Object dataset, more than half of the data belongs to the GALAXY class, while the remaining smaller share belongs to the STAR and QSO classes. Data is called imbalanced when it is not evenly distributed across the categories in a dataset. Imbalanced data can degrade model quality and lead to poor prediction performance, so handling it is necessary [8].
There are two ways of handling imbalanced data: undersampling and oversampling. Undersampling keeps the data in the minority class and reduces the data in the majority class. Conversely, oversampling keeps the data in the majority class and adds synthetic data to the minority class. The method used in this case is a combination of both, using the SMOTE-Tomek technique [8].
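The two building blocks of this combination can be sketched in plain Python. SMOTE creates a synthetic sample by interpolating between a minority sample and one of its same-class neighbors, and a Tomek link is a pair of mutual nearest neighbors with different labels, which the cleaning step then removes. The points and helper names below are hypothetical, and this is not the library implementation used in the project.

```python
import random
from math import dist

def smote_point(x, neighbor, rng):
    """Synthetic sample on the segment between x and a same-class neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

def nearest(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: dist(points[i], points[j]))

def tomek_links(points, labels):
    """Pairs (i, j) that are mutual nearest neighbors with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and nearest(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

rng = random.Random(0)
# Hypothetical 2-D points, e.g. a standardized redshift and one magnitude.
points = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 1.0], [3.0, 3.0]]
labels = ["GALAXY", "GALAXY", "GALAXY", "QSO", "QSO"]

synthetic = smote_point(points[3], points[4], rng)  # new QSO-like sample
links = tomek_links(points, labels)
print(links)  # [(2, 3)]
```

The GALAXY point at index 2 and the QSO point at index 3 sit right next to each other across the class boundary, so they form the only Tomek link in this toy set.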

The bar plots above show the amount of data before and after handling the imbalance. SMOTE-Tomek has successfully balanced the amount of data in each class. The GALAXY class, as the majority class, is undersampled slightly from more than fifty thousand rows to more than forty-nine thousand. The QSO and STAR classes, as the minority classes, are both oversampled to more than forty-nine thousand rows.
Standardization is a data transformation that gives each feature a mean of zero and a standard deviation of one. It is important for ensuring that no feature dominates the others because of a high or low magnitude. Standardization is applied to both the training and test data [9].

After the standardization of the training data, the mean of every feature equals zero and the standard deviation of every feature equals one.

After the standardization of the test data, the mean of every feature is nearly zero and the standard deviation of every feature is nearly one.
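One way to read the difference between the two results is that the scaling statistics are computed on the training data and then reused for the test data. The sketch below assumes exactly that, which the text does not state explicitly, and uses hypothetical values for a single feature.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(v - mean) / std for v in column]

# Hypothetical redshift-like values.
train = [0.1, 0.4, 0.5, 0.7, 1.3]
test = [0.2, 0.9]

mean, std = fit_scaler(train)            # statistics come from training data only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
print(round(fmean(test_scaled), 6))
```

The training column ends up with mean zero and standard deviation one by construction, while the test column only gets close to those values because it is scaled with the training statistics.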
Machine learning models can only process numerical values. Hence, the categorical stellar class label needs to be converted from text to numbers. Label encoding is the process of converting categorical labels to numerical form. Labels in both feature and target are converted with this method [10].
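Label encoding amounts to a lookup table from class names to integers. The sketch below assumes the classes are numbered in alphabetical order, which is how common label encoders behave; the actual codes used in the project are not shown in the text.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer code, in sorted order."""
    classes = sorted(set(labels))
    return {name: code for code, name in enumerate(classes)}

mapping = fit_label_encoder(["GALAXY", "STAR", "QSO", "GALAXY"])
print(mapping)  # {'GALAXY': 0, 'QSO': 1, 'STAR': 2}

# Encode new labels with the learned mapping.
encoded = [mapping[name] for name in ["QSO", "STAR", "GALAXY"]]
print(encoded)  # [1, 2, 0]
```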
In this phase, machine learning models for classifying stellar objects are built by running the training process on the training data. Four machine learning algorithms are used to build the models:
- Logistic Regression
- K-Nearest Neighbor (KNN)
- Random Forest
- Extreme Gradient Boosting (XGBoost)
Each machine learning process follows these model building steps:
- Searching for the best parameter values for the machine learning algorithm.
- Model training. The error rate is calculated with the Mean Squared Error (MSE) metric.
- Model prediction. The error rate is also calculated with the MSE metric.
- Calculating model accuracy. The accuracy rate is calculated with the Accuracy metric.
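The evaluation metrics named in the steps above can be sketched in plain Python. The example assumes the class labels are encoded as integers (0 = GALAXY, 1 = QSO, 2 = STAR is an assumption), and the predictions are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between encoded labels."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical encoded labels: 0 = GALAXY, 1 = QSO, 2 = STAR.
y_true = [0, 0, 1, 1, 2, 2, 2, 0]
y_pred = [0, 0, 1, 2, 2, 0, 2, 0]

cm = confusion_matrix(y_true, y_pred, 3)
acc = accuracy(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
print(cm)        # [[3, 0, 0], [0, 1, 1], [1, 0, 2]]
print(acc, mse)  # 0.75 0.625
```

On encoded class labels, MSE weights each mistake by the squared distance between the two codes, so it is not simply one minus the accuracy. This is why an error rate and an accuracy score computed this way do not have to add up to 100%.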
Logistic regression is a supervised machine learning classification algorithm used to predict the probability of certain classes based on feature values. It analyzes the relationship between variables, then assigns probabilities using the sigmoid function, which converts a numerical result into a probability between 0.0 and 1.0 [11].
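The sigmoid function, and its usual generalization to more than two classes (softmax), can be sketched as follows. The scores passed to softmax are hypothetical linear outputs for three classes, not values from the trained model.

```python
from math import exp

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    shift = max(scores)                      # subtract the max for numerical stability
    exps = [exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))  # 0.5

# Hypothetical scores for GALAXY, QSO, and STAR.
probs = softmax([2.0, 0.5, -1.0])
print([round(p, 3) for p in probs])
```

The class with the highest score receives the highest probability, and the predicted class is the one with the largest probability.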
The advantages of Logistic Regression are as follows [12].
- It can easily be extended from binomial classification (two classes) to multinomial classification (more than two classes).
- It achieves good accuracy on simple datasets and performs well when the dataset is linearly separable.
- It is fast at classifying unknown records.
The disadvantages of Logistic Regression are as follows [12].
- If the number of observations is smaller than the number of features, Logistic Regression is not recommended, since it tends to overfit.
- It requires moderate or no multicollinearity between independent variables.
- It can only be used to predict discrete functions.
- C: Represents the inverse of the regularization strength, which determines how much the model is penalized for increasing the magnitude of parameter values. Smaller values of C mean stronger regularization, which helps prevent overfitting.
- penalty: Represents the regularization type. The penalty prevents the model from assigning too much importance to any single feature, which makes the model more general.
- solver: Represents the optimization algorithm. The solver's job is to find the weights or parameter values that minimize the loss function.
- multi_class: Determines whether the multiclass problem is fitted as separate binary (one-vs-rest) problems or as a single multinomial problem.
- max_iter: Represents the maximum number of iterations the solver is allowed to run. If the solver has not converged to a solution within this number of iterations, it stops.
The best parameter values found and used for the Logistic Regression algorithm are as follows.
- C: 10
- penalty: 'l2'
- solver: 'lbfgs'
- multi_class: 'multinomial'
- max_iter: 2000
The results of training with the Logistic Regression algorithm are summarized below.
- Error rate of predicting training data is around 0.0733, or 7.33%.
- Error rate of predicting testing data is around 0.0853, or 8.53%.
- Model accuracy is around 0.9370, or 93.70%.
K-Nearest Neighbor (KNN) is a supervised machine learning algorithm which makes predictions based on the distance between one data point and the others. KNN works by finding the 'k' nearest neighbors of a data point, based on a distance metric such as Euclidean distance [13].
The advantages of KNN are as follows [14].
- Low complexity, which makes KNN easy to implement.
- KNN keeps all data points in memory. Whenever new data points are added, the algorithm adjusts to them easily, which makes it adaptable.
- It requires only two parameters: the distance metric and the value of 'k'.
The disadvantages of KNN are as follows [14].
- KNN is a lazy algorithm which consumes computing resources and data storage, so it is time-consuming and resource-intensive.
- Curse of dimensionality. This is the phenomenon where the algorithm has difficulty classifying the data properly when the dimensionality is too high.
- When the algorithm is affected by the curse of dimensionality, it also becomes prone to overfitting.
The KNN parameters used in this project are as follows.
- metric: Represents the type of distance metric. The options include Euclidean distance, Manhattan distance, and Minkowski distance.
- n_neighbors: Represents the 'k' value, which is the number of neighbors considered for each data point.
- weights: Represents the weighting scheme applied to the neighbors when they vote, for example uniform weights or weights based on inverse distance.
The best parameter values found for the KNN algorithm are as follows.
- metric: 'euclidean'
- n_neighbors: 3
- weights: 'distance'
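The following is a minimal pure-Python sketch of how these three settings interact: Euclidean distance, three neighbors, and votes weighted by inverse distance. The function `knn_predict`, the two-feature toy points, and the rule that a neighbor at distance 0 decides the vote alone are all assumptions made for illustration. It is not the implementation used in this project.

```python
import math
from collections import defaultdict


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label with Euclidean distance and inverse-distance votes."""
    # Distance from the query to every stored training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k nearest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]

    # Assumed convention: if a neighbour sits exactly on the query,
    # only such neighbours are allowed to vote.
    has_exact_match = any(dist == 0.0 for dist, _ in nearest)

    votes = defaultdict(float)
    for dist, label in nearest:
        if has_exact_match:
            weight = 1.0 if dist == 0.0 else 0.0
        else:
            weight = 1.0 / dist  # closer neighbours count more
        votes[label] += weight
    return max(votes, key=votes.get)


# Made-up two-feature points, two per class.
train_points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (1.1, 0.9), (3.0, 3.0), (3.2, 2.9)]
train_labels = ["STAR", "STAR", "GALAXY", "GALAXY", "QSO", "QSO"]

new_label = knn_predict(train_points, train_labels, (0.9, 1.1))
train_predictions = [knn_predict(train_points, train_labels, p) for p in train_points]

print(new_label)
print(train_predictions == train_labels)
```

With this weighting, every training point finds itself at distance 0 and receives its own label. This is one plausible reason for a training error of 0% with KNN, and it says little about how well the model generalizes.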
The results of training with the KNN algorithm are summarized below.
- The error rate on the training data is 0.00, or 0%.
- The error rate on the testing data is around 0.0967, or 9.67%.
- The model accuracy is around 0.9461, or 94.61%.
Random Forest is an ensemble learning algorithm built on the Decision Tree algorithm. It aggregates multiple decision trees to obtain a better result and is designed to enhance the accuracy and robustness of classification tasks. Each decision tree in the Random Forest is constructed using a subset of the training data and a random subset of features, which introduces diversity among the trees and makes the model more robust and less prone to overfitting [15].
The advantages of Random Forest are as follows [16].
- It tends to achieve higher accuracy because it aggregates several decision trees.
- Capable of handling both numerical and categorical data without feature engineering.
- Random Forest itself estimates the importance of features.
The disadvantages of Random Forest are as follows [16].
- Using a large number of trees in Random Forest is computationally expensive.
- Random Forest takes longer to train than other algorithms.
- Random Forest can suffer from overfitting when the model captures noise in the training data.
The Random Forest parameters used in this project are as follows.
- random_state: Seed for the random number generator. Fixing it makes the random sampling of data and features reproducible between runs.
- n_estimators: Represents the number of trees in the forest.
- min_samples_split: Minimum number of samples required to split a node. A higher value helps prevent overfitting, but too high a value can keep the model from learning enough detail.
- min_samples_leaf: Minimum number of samples required at a leaf node.
- max_features: Number of features considered for splitting at each node.
- max_depth: Maximum depth of each tree. Deeper trees can capture more complex patterns, but also risk overfitting.
The best parameter values found for the Random Forest algorithm are as follows.
- random_state: 42
- n_estimators: 200
- min_samples_split: 2
- min_samples_leaf: 1
- max_features: 'sqrt'
- max_depth: None
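The sketch below shows one way these values might be applied, assuming the model is scikit-learn's `RandomForestClassifier` (the parameter names match that class, but the text does not confirm it). The data is synthetic and stands in for the SDSS features, so the scores are not comparable to the ones reported in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class data (an assumption; not the SDSS DR17 dataset).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Parameter values reported in this section.
forest = RandomForestClassifier(
    random_state=42,
    n_estimators=200,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",
    max_depth=None,  # trees grow until their leaves are pure
)
forest.fit(X_train, y_train)

train_accuracy = forest.score(X_train, y_train)
test_accuracy = forest.score(X_test, y_test)
importances = forest.feature_importances_  # one value per feature, summing to 1

print(f"Train accuracy: {train_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
print("Feature importances:", importances.round(3))
```

Because `max_depth=None` lets every tree grow fully, a near-perfect score on the training data is expected and should be read together with the test score.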
The results of training with the Random Forest algorithm are summarized below.
- The error rate on the training data is 0.00, or 0%.
- The error rate on the testing data is around 0.0303, or 3.03%.
- The model accuracy is around 0.9736, or 97.36%.
XGBoost is an optimized distributed gradient boosting library designed for efficient and scalable training of machine learning models. It is an ensemble learning method that combines the predictions of multiple weak models to produce a stronger prediction. XGBoost stands for “Extreme Gradient Boosting” and it has become one of the most popular and widely used machine learning algorithms due to its ability to handle large datasets and its ability to achieve state-of-the-art performance in many machine learning tasks such as classification and regression [17].
The advantages of XGBoost are as follows [17].
- XGBoost is designed for efficient and scalable model training, which makes it a good choice for training on large datasets.
- It has many parameters, which makes it highly customizable.
- XGBoost has built-in support for handling missing data.
The disadvantages of XGBoost are as follows [17].
- Can be computationally expensive, especially for training large models.
- Prone to overfitting, especially when trained on a small amount of data.
- Finding the optimal set of parameters is time-consuming and requires expertise.
The XGBoost parameters used in this project are as follows.
- random_state: Seed for the random number generator. Fixing it makes the random sampling of data reproducible between runs.
- eval_metric: Evaluation metric for validation data. A default metric is assigned according to the objective.
- learning_rate: Controls how much each tree contributes to the final prediction. Smaller values frequently result in more accurate models, but require more trees.
- max_depth: Controls the depth of every tree. It is essential for controlling the model's complexity and avoiding overfitting.
- n_estimators: Specifies the number of boosting rounds.
- subsample: Sets the fraction of data that is sampled at random to grow each tree, which lowers variance and improves generalization. Setting it too low, however, can result in underfitting.
The best parameter values found for the XGBoost algorithm are as follows.
- random_state: 42
- eval_metric: 'logloss'
- learning_rate: 0.2
- max_depth: 7
- n_estimators: 200
- subsample: 0.8
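The interplay between `learning_rate` and `n_estimators` can be illustrated without the XGBoost library. The sketch below is plain gradient boosting for squared loss with depth-1 trees on made-up one-dimensional data. It is not XGBoost's regularized objective and does not use log loss; the helper names `fit_stump`, `boost`, and `mse` are hypothetical. It only shows that a smaller learning rate needs more boosting rounds to reach the same training error.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    # The largest value is skipped so the right side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators, learning_rate):
    """Gradient boosting for squared loss: each stump fits the residuals."""
    predictions = [0.0] * len(xs)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        # Shrinkage: only a fraction of each stump's output is added.
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return predictions


def mse(ys, predictions):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Made-up data: y = x^2 on a small grid.
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]

few_small_steps = mse(ys, boost(xs, ys, n_estimators=5, learning_rate=0.2))
few_large_steps = mse(ys, boost(xs, ys, n_estimators=5, learning_rate=1.0))
many_small_steps = mse(ys, boost(xs, ys, n_estimators=200, learning_rate=0.2))

print(f"5 rounds, rate 0.2:   {few_small_steps:.4f}")
print(f"5 rounds, rate 1.0:   {few_large_steps:.4f}")
print(f"200 rounds, rate 0.2: {many_small_steps:.4f}")
```

This covers training error only. Whether a smaller learning rate also generalizes better has to be checked on held-out data.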
The results of training with the XGBoost algorithm are summarized below.
- The error rate on the training data is around 0.0093, or 0.93%.
- The error rate on the testing data is around 0.0416, or 4.16%.
- The model accuracy is around 0.9696, or 96.96%.
Comparing the model summaries, the best model in this project is Random Forest. It achieves the lowest test error rate, shares the lowest training error rate with KNN, and has the highest accuracy among the models. Random Forest is therefore selected as the solution for stellar object classification. The selection process is explained further in the 'Evaluation' phase.
In this final phase, the best model is chosen for prediction based on several criteria:
- Overfitting rate.
- Error rate of training data prediction.
- Error rate of test data prediction.
- Model accuracy score.
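The text does not define how the overfitting rate is measured. One possible proxy, assumed here purely for illustration, is the gap between the test error and the training error. The sketch below applies that proxy to the error rates reported for the four models earlier in this document.

```python
# (training error, test error) as reported for each model in this document.
reported_errors = {
    "Logistic Regression": (0.0733, 0.0853),
    "KNN": (0.0, 0.0967),
    "Random Forest": (0.0, 0.0303),
    "XGBoost": (0.0093, 0.0416),
}

# Assumed proxy for overfitting: test error minus training error.
gaps = {
    name: round(test - train, 4)
    for name, (train, test) in reported_errors.items()
}

# Model with the lowest reported test error.
lowest_test_error = min(reported_errors, key=lambda name: reported_errors[name][1])

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.4f}")
print("Lowest test error:", lowest_test_error)
```

Under this assumed proxy the gap is smallest for Logistic Regression, while Random Forest keeps the lowest test error. A different definition of overfitting may rank the models differently.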
Mean squared error (MSE) is a metric used to measure the average squared difference between the predicted values and the actual values in the dataset. It is calculated by taking the average of the squared residuals, where the residual is the difference between predicted value and the actual value for each data point. The MSE value provides a way to analyze the accuracy of the model [18].
The formula of MSE is as follows.

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2$$

The terms of the formula are interpreted as follows.
- $n$ is the number of observations in the dataset.
- $Y_i$ is the actual value of the $i$-th observation.
- $\hat{Y}_i$ is the predicted value of the $i$-th observation.
A lower MSE value indicates better accuracy, while a higher MSE value indicates that the predictions deviate more from the true values. In this project, MSE is used to calculate the error rate on the training data and test data during the training process.
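A small worked example may make the calculation concrete. The function `mean_squared_error` below is a hand-written sketch, and the label encoding (GALAXY=0, QSO=1, STAR=2) and the five sample values are assumptions chosen only for illustration. The text does not state how the classes were encoded.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n


# Assumed encoding for illustration: GALAXY=0, QSO=1, STAR=2.
actual = [0, 0, 1, 2, 2]
predicted = [0, 1, 1, 2, 0]

mse = mean_squared_error(actual, predicted)

# Share of wrong predictions, for comparison.
misclassification_rate = sum(
    y != y_hat for y, y_hat in zip(actual, predicted)
) / len(actual)

print(mse)                     # squared residuals 0, 1, 0, 0, 4 -> 5 / 5 = 1.0
print(misclassification_rate)  # 2 wrong out of 5 -> 0.4
```

On encoded class labels the MSE depends on the numeric codes: confusing STAR with GALAXY costs 4 here, while confusing GALAXY with QSO costs 1. It is therefore not the same quantity as the share of wrong predictions.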
A confusion matrix is a matrix that summarizes the performance of a machine learning model on a set of test data. It is a means of displaying the number of accurate and inaccurate instances based on the model’s predictions. It is often used to measure the performance of classification models, which aim to predict a categorical label for each input instance [19].
The matrix displays the number of instances produced by the model on the test data.
- True Positive (TP): The model correctly predicted a positive outcome (the actual outcome was positive).
- True Negative (TN): The model correctly predicted a negative outcome (the actual outcome was negative).
- False Positive (FP): The model incorrectly predicted a positive outcome (the actual outcome was negative). Also known as a Type I error.
- False Negative (FN): The model incorrectly predicted a negative outcome (the actual outcome was positive). Also known as a Type II error.

Metrics based on Confusion Matrix are as follows:
- Accuracy. It measures the overall performance of the model. It is the ratio of correctly classified instances to the total number of instances.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

- Precision. It is a measure of how accurate a model’s positive predictions are. It is defined as the ratio of true positive predictions to the total number of positive predictions made by the model.

$$\text{Precision} = \frac{TP}{TP + FP}$$

- Recall. It measures the effectiveness of a classification model in identifying all relevant instances from a dataset. It is the ratio of the number of true positive (TP) instances to the sum of true positive and false negative (FN) instances.

$$\text{Recall} = \frac{TP}{TP + FN}$$

- F1-score. It is used to evaluate the overall performance of a classification model. It is the harmonic mean of precision and recall.

$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

In this project, the accuracy metric is used to measure model performance.
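For a three-class problem such as this one, the metrics above are usually computed per class by treating that class as positive and the other two as negative. The sketch below does this for a small confusion matrix. The function `per_class_metrics` and the counts in `toy_matrix` are made up for illustration and are not the confusion matrices of this project.

```python
def per_class_metrics(matrix, labels):
    """Return overall accuracy and one-vs-rest precision, recall and F1.

    matrix[i][j] is the number of samples whose actual class is labels[i]
    and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))

    report = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                # actual class i, predicted otherwise
        fp = sum(row[i] for row in matrix) - tp  # predicted class i, actually otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return correct / total, report


labels = ["GALAXY", "QSO", "STAR"]
toy_matrix = [  # made-up counts, rows = actual, columns = predicted
    [8, 1, 1],
    [2, 6, 0],
    [0, 0, 10],
]

accuracy, report = per_class_metrics(toy_matrix, labels)

print(f"accuracy = {accuracy:.3f}")
for label, scores in report.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
```

True negatives do not appear explicitly because in the multi-class case accuracy reduces to the sum of the diagonal divided by the total count.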
Confusion Matrix

Classification Report

For the Logistic Regression model, the numbers of misclassified data points are as follows.
- GALAXY: 365 + 64 = 429
- QSO: 110 + 0 = 110
- STAR: 0 + 0 = 0
The total number of misclassified data points for Logistic Regression is 539. The model accuracy in the classification report is 0.94, or 94%.
Confusion Matrix

Classification Report

For the KNN model, the numbers of misclassified data points are as follows.
- GALAXY: 242 + 109 = 351
- QSO: 93 + 3 = 96
- STAR: 13 + 1 = 14
The total number of misclassified data points for KNN is 461. The model accuracy in the classification report is 0.95, or 95%.
Confusion Matrix

Classification Report

For the Random Forest model, the numbers of misclassified data points are as follows.
- GALAXY: 121 + 11 = 132
- QSO: 94 + 0 = 94
- STAR: 0 + 0 = 0
The total number of misclassified data points for Random Forest is 226. The model accuracy in the classification report is 0.97, or 97%.
Confusion Matrix

Classification Report

For the XGBoost model, the numbers of misclassified data points are as follows.
- GALAXY: 142 + 20 = 162
- QSO: 86 + 0 = 86
- STAR: 12 + 0 = 12
The total number of misclassified data points for XGBoost is 260. The model accuracy in the classification report is 0.97, or 97%.
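The per-class counts listed for the four models can be totalled and compared with a few lines of Python. The sketch below only repeats the arithmetic from the text. The dictionary layout and variable names are illustrative choices.

```python
# Misclassified counts per class, copied from the four model summaries above.
misclassified = {
    "Logistic Regression": {"GALAXY": (365, 64), "QSO": (110, 0), "STAR": (0, 0)},
    "KNN": {"GALAXY": (242, 109), "QSO": (93, 3), "STAR": (13, 1)},
    "Random Forest": {"GALAXY": (121, 11), "QSO": (94, 0), "STAR": (0, 0)},
    "XGBoost": {"GALAXY": (142, 20), "QSO": (86, 0), "STAR": (12, 0)},
}

# Total misclassified data points per model.
totals = {
    model: sum(sum(pair) for pair in per_class.values())
    for model, per_class in misclassified.items()
}

# Model with the fewest misclassified data points.
best_model = min(totals, key=totals.get)

for model, total in totals.items():
    print(f"{model}: {total}")
print("Fewest misclassifications:", best_model)
```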

Based on all the metrics above, from both the confusion matrices and the bar plots, the models with the best score in each category are as follows.
- Confusion Matrix: the model with the fewest misclassified data points is Random Forest, with 226.
- Training Error Rate: the models with the lowest error rate are KNN and Random Forest, both at 0%.
- Test Error Rate: the model with the lowest error rate is Random Forest, at around 3.03%.
- Model Accuracy: the model with the highest accuracy is Random Forest, at around 97.36%.
Based on the comparison above, the best model for predicting stellar objects is Random Forest. It has the lowest test error rate, shares the lowest training error rate with KNN, and has the highest accuracy among all models. Although it is the best of the four, the gap between its training error (0%) and its test error (3.03%) indicates some overfitting, so the model could be tuned further to perform better.
- The majority of stellar objects in this project's dataset belong to the Galaxy category (64.95%), followed by the Star category (23.85%). The remaining objects belong to the Quasar category (11.20%).
- Stellar objects can be distinguished quite easily by their redshift value. Redshift is related to the values of most other features, and it is also a strong predictor of the stellar object category.
- Stars can be seen across the entire range of sky longitude and latitude. Galaxies and quasars can be seen across most of it. Only a few galaxies and quasars are seen at these sky coordinates:
  - Longitude, or alpha: in the ranges 50 to 100 degrees and 250 to 300 degrees
  - Latitude, or delta: in the ranges -20 to -10 degrees and 70 to 80 degrees
- Among all categories, quasars have the shortest spectrum length in all photometric system colors. Galaxies and stars have very similar spectrum lengths in all photometric systems, which makes these two categories difficult to distinguish by photometric system alone.
- The photometric system relates to redshift differently for each stellar object category.
  - Stars: the photometric system does not affect redshift at all.
  - Galaxies: for most of the data, redshift increases as the photometric values increase. The spread of the redshift distribution also widens as the photometric values increase.
  - Quasars: for most of the data, redshift also increases as the photometric values increase. However, the spread of the redshift distribution narrows as the photometric values increase.
- The best machine learning model for classifying stellar objects is the Random Forest model.
[1] Sloan Digital Sky Survey, “The wide net cast by the SDSS telescope,” https://skyserver.sdss.org/dr7/en/sdss/#:~:text=The%20wide%20net%20cast%20by,history%20of%20our%20solar%20system. Accessed on: Nov. 23, 2024.
[2] SDSS Voyages, “The SDSS Telescope,” https://voyages.sdss.org/preflight/capturing-recording-light/sdss-telescope/. Accessed on: Nov. 23, 2024.
[3] K. Abdurro’uf, C. Accetta, V. Aerts, R. Silva Aguirre, R. Ahumada, and N. Ajgaonkar, et al., “The seventeenth data release of the Sloan Digital Sky Surveys: Complete release of MaNGA, MaStar, and APOGEE-2 data,” The Astrophysical Journal Supplement Series, 259(35), pp. 1–39, 2022. DOI. Accessed on: Nov. 23, 2024.
[4] Suresh Kumar Mukhiya and Usman Ahmed, Hands-On Exploratory Data Analysis with Python, Birmingham: Packt Publishing Ltd., 2020.
[5] Kaggle Learn, "Mutual Information: Local features with the most potential," https://www.kaggle.com/code/ryanholbrook/mutual-information. Accessed on: Nov. 18, 2024.
[6] SimpliLearn, "Feature Selection in Machine Learning: All You Need to Know," https://www.simplilearn.com/tutorials/machine-learning-tutorial/feature-selection-in-machine-learning. Accessed on: Nov. 28, 2024.
[7] GeeksforGeeks, "Splitting Data for Machine Learning Models," https://www.geeksforgeeks.org/splitting-data-for-machine-learning-models/. Accessed on: Nov. 28, 2024.
[8] Analytics Vidhya, "SMOTE for Imbalanced Classification with Python," https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/. Accessed on: Nov. 22, 2024.
[9] DataCamp, "Normalization vs. Standardization: How to Know the Difference," https://www.datacamp.com/tutorial/normalization-vs-standardization. Accessed on: Nov. 22, 2024.
[10] Kaggle Learn, "Categorical Variables," https://www.kaggle.com/code/alexisbcook/categorical-variables. Accessed on: Nov. 28, 2024.
[11] KD Nuggets, “How Does Logistic Regression Work?”, https://www.kdnuggets.com/2022/07/logistic-regression-work.html. Accessed on: Nov. 25, 2024.
[12] GeeksforGeeks, “Advantages and Disadvantages of Logistic Regression,” https://www.geeksforgeeks.org/advantages-and-disadvantages-of-logistic-regression/. Accessed on: Nov. 25, 2024.
[13] J. Brownlee, “Multinomial Logistic Regression with Python,” Machine Learning Mastery, 2020. https://machinelearningmastery.com/multinomial-logistic-regression-with-python/. Accessed on: Nov. 25, 2024.
[14] GeeksforGeeks, “K-Nearest Neighbor (KNN) Algorithm,” https://www.geeksforgeeks.org/k-nearest-neighbours/. Accessed on: Nov. 25, 2024.
[15] GeeksforGeeks, “Random Forest Algorithm in Machine Learning,” https://www.geeksforgeeks.org/random-forest-algorithm-in-machine-learning/. Accessed on: Nov. 25, 2024.
[16] GeeksforGeeks, “What are the Advantages and Disadvantages of Random Forest?”, https://www.geeksforgeeks.org/what-are-the-advantages-and-disadvantages-of-random-forest/. Accessed on: Nov. 25, 2024.
[17] GeeksforGeeks, “XGBoost,” https://www.geeksforgeeks.org/xgboost/. Accessed on: Nov. 25, 2024.
[18] GeeksforGeeks, “Mean Squared Error,” https://www.geeksforgeeks.org/mean-squared-error/. Accessed on: Nov. 26, 2024.
[19] DataCamp, “What is A Confusion Matrix in Machine Learning? The Model Evaluation Tool Explained,” https://www.datacamp.com/tutorial/what-is-a-confusion-matrix-in-machine-learning. Accessed on: Nov. 26, 2024.
