The purpose of this project is to automatically generate labels for clusters of AADL (Architecture Analysis and Design Language) models based on their content. The program processes a series of AADL models, which have been organized into different clusters, to analyze their components, features, and connections. Using natural language processing (NLP) techniques such as TF-IDF and LDA, the system identifies key terms from the models and assigns appropriate labels to each cluster. This labeling process is validated by comparing the generated labels to ground truth labels.
The project is divided into the following main .py files:
- AADL_manager.py - Manages the scanning, parsing, and processing of AADL models. It checks if AADL models are suitable based on their components and features, generates CSV files for suitable models, and associates models with clusters.
- labelling.py - Handles text preprocessing for AADL model data, including tokenization, stopword removal, and lemmatization. It also applies TF-IDF and LDA for label generation and creates reports and plots for the results.
- validation.py - Implements the validation process, including comparing the generated labels with ground truth labels. It uses cosine similarity, precision, recall, and F1 scores to validate the quality of the labels generated by TF-IDF and LDA algorithms.
- main.py - The main entry point of the program. It orchestrates the entire process, from scanning AADL files and preprocessing the data, to generating labels with TF-IDF and LDA, and validating the results.
- utility.py - Contains utility functions for loading configurations, file handling, and directory management. It helps with creating directories, deleting files, listing files, copying files, and getting timestamps.
- Python 3.x
- Required Python libraries: lxml and the other dependencies listed in the installation step below. The collections, os, and json modules ship with the Python standard library and need no separate installation.
- NLTK: Used for tokenization, stopword removal, and lemmatization in labelling.py. Make sure to download the required NLTK resources (punkt, stopwords, and wordnet).
- Word2Vec Model: A pre-trained Word2Vec model (Google News vectors) is required in validation.py for text vectorization. Ensure the model is downloaded and correctly loaded.
- Clone the repository to your local machine:
      git clone https://github.com/MarcoDiCapua/AADL_Labelling.git
      cd AADL_Labelling
- Install the required dependencies (collections, os, and json are standard-library modules and must not be passed to pip):
      pip install pandas numpy matplotlib seaborn scikit-learn nltk gensim lxml
Before running the script, ensure that your config.json is properly configured. The configuration file contains paths to the necessary folders, including:
- xmi_folder: Path to the folder containing the AADL .aaxl2 files.
- xmi_suitable_models: Path to the folder where suitable models will be stored.
- output_folder: Path where the .txt report will be saved.
- ground_truth: Path where the .xlsx ground truth is saved.
Example config.json:
{
"input_matrix": "input/average_sim.csv",
"clusters": "input/average_sim/clusters.csv",
"xmi_folder": "input/xmi",
"xmi_suitable_models": "output/AADL/xmi_suitable_models",
"output_folder": "output",
"ground_truth": "input/ground_truth_cluster_labels.xlsx"
}
Run the main.py script to start the scanning and analysis process:
      python main.py
After the script completes, the results will be saved in the specified output_folder.
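Before a run, it can help to confirm that config.json contains every expected key and that the input paths exist. The following is a minimal sketch, not the loader in utility.py: the function names `load_config` and `missing_inputs` are hypothetical, the key list is copied from the example config above, and treating `input_matrix`, `clusters`, `xmi_folder`, and `ground_truth` as inputs is an assumption based on their example paths.

```python
import json
from pathlib import Path

# Keys copied from the example config.json; adjust if the project adds more.
REQUIRED_KEYS = (
    "input_matrix",
    "clusters",
    "xmi_folder",
    "xmi_suitable_models",
    "output_folder",
    "ground_truth",
)

# Assumption: these keys point at files or folders that must exist before a run.
INPUT_KEYS = ("input_matrix", "clusters", "xmi_folder", "ground_truth")


def load_config(path="config.json"):
    """Load the JSON configuration and fail early if a key is missing."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing keys: {', '.join(missing)}")
    return config


def missing_inputs(config):
    """Return the configured input paths that do not exist on disk."""
    return [config[key] for key in INPUT_KEYS if not Path(config[key]).exists()]
```

Calling `missing_inputs(load_config())` from the repository root returns an empty list when every input path is in place.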
Running main.py generates the following output files at each step:
Files Generated:
- suitable_models_data.csv: Contains data about suitable AADL models, including components and features.
- suitable_models_cluster.csv: Maps each suitable AADL model to its respective cluster.
Reports:
- aadl_scan_results.txt: A text file detailing the results of the AADL file scan, including counts of components, features, connection instances, mode instances, and flow specifications.
Plots:
- total_vs_suitable_models.png: A bar chart showing the total number of AADL models versus suitable models.
- cluster_distribution.png: A bar chart showing the distribution of AADL models across clusters.
- suitable_cluster_distribution.png: A bar chart showing the distribution of suitable models across clusters.
- top_25_components.png: A bar chart displaying the top 25 most common components across suitable AADL models.
- top_25_features.png: A bar chart displaying the top 25 most common features across suitable AADL models.
- top_25_connections.png: A bar chart displaying the top 25 most common connection instances.
Preprocessed Data Files:
- preprocessed_clusters.csv: Contains the processed cluster data after tokenization, stopword removal, and lemmatization.
- preprocessed_suitable_models_data.csv: Contains the processed data for suitable models, including components and features, after preprocessing steps.
Intermediate Preprocessing Files:
- tokenized_clusters.csv: Contains tokenized model names from the clusters data.
- non_alphanumeric_removed_clusters.csv: Contains the clusters data with non-alphanumeric characters removed from the model names.
- stopwords_removed_clusters.csv: Contains the clusters data with stopwords removed from the model names.
- lemmatized_clusters.csv: Contains the clusters data after lemmatization of model names.
- tokenized_suitable_models_data.csv: Contains tokenized data for suitable models (model names, components, features).
- non_alphanumeric_removed_suitable_models_data.csv: Contains suitable models data with non-alphanumeric characters removed.
- stopwords_removed_suitable_models_data.csv: Contains suitable models data with stopwords removed.
- lemmatized_suitable_models_data.csv: Contains suitable models data after lemmatization.
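The four intermediate files correspond to four preprocessing stages applied in order. The sketch below illustrates that order using only the standard library; it is not the code in labelling.py. The whitespace tokenizer, the tiny stopword set, and the suffix-stripping `naive_lemmatize` are stand-ins for NLTK's tokenizer, stopword list, and WordNet lemmatizer, and splitting on non-alphanumeric characters (rather than deleting them) is an assumption.

```python
import re

# Tiny illustrative list; the project uses NLTK's English stopwords instead.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "for", "to", "is"}


def tokenize(text):
    """Stage 1: lowercase and split on whitespace (stand-in for NLTK tokenization)."""
    return text.lower().split()


def remove_non_alphanumeric(tokens):
    """Stage 2: split tokens on underscores, dots, dashes, etc. and drop empty pieces."""
    cleaned = []
    for token in tokens:
        cleaned.extend(part for part in re.split(r"[^a-z0-9]+", token) if part)
    return cleaned


def remove_stopwords(tokens):
    """Stage 3: drop words that carry no labelling information."""
    return [token for token in tokens if token not in STOPWORDS]


def naive_lemmatize(token):
    """Stage 4: crude plural stripping (stand-in for WordNet lemmatization)."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def preprocess(text):
    """Run the four stages in the order the intermediate files suggest."""
    tokens = remove_stopwords(remove_non_alphanumeric(tokenize(text)))
    return [naive_lemmatize(token) for token in tokens]


print(preprocess("The flight_control-systems of the UAV"))
# prints ['flight', 'control', 'system', 'uav']
```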
Top 25 Word Reports and Plots:
- Top 25 Words Plots: One plot for each preprocessing step (e.g., Top 25 Words in tokenized_clusters.csv, Top 25 Words in non_alphanumeric_removed_clusters.csv, etc.), saved as PNG files in the Top25Report folder.
- Top 25 Words CSV Reports: A CSV file containing the top 25 words for each preprocessing step (e.g., top25_tokenized_clusters.csv, top25_non_alphanumeric_removed_clusters.csv, etc.).
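A top-25 report boils down to counting tokens across all rows of a preprocessing step. This is a minimal sketch with a hypothetical helper name, `top_words`; how the project reads the CSV files and which columns it counts are not shown here.

```python
from collections import Counter


def top_words(token_lists, n=25):
    """Count every token across all rows and return the n most frequent."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return counts.most_common(n)


rows = [["flight", "control"], ["flight", "sensor"], ["control", "flight"]]
print(top_words(rows, n=2))
# prints [('flight', 3), ('control', 2)]
```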
TF-IDF Outputs:
- Clusters_Top_10_TFIDF.csv: Contains the top 10 TF-IDF words for each cluster.
- Combined_Top_10_TFIDF.csv: Contains the top 10 TF-IDF words for each cluster, considering components and features together.
- Total_Top_10_TFIDF.csv: Merged file containing the top 10 TF-IDF words for both clusters and the combined columns (model names and components/features).
- TFIDF_Labels.csv: Contains the final labels generated using TF-IDF for each cluster, including the top 5 words with their respective scores.
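To show how top TF-IDF words per cluster can be derived, here is a standard-library sketch that treats each cluster's tokens as one document. It is an illustration, not the project's implementation: the function name `top_tfidf_words` is hypothetical, and the scoring uses relative term frequency with a smoothed IDF, `ln((1 + N) / (1 + df)) + 1`, without any vector normalization, so the scores will differ from those in the CSV files.

```python
import math
from collections import Counter


def top_tfidf_words(cluster_tokens, k=10):
    """Rank words per cluster; cluster_tokens maps cluster id -> list of tokens."""
    n_clusters = len(cluster_tokens)

    # Document frequency: the number of clusters in which each word appears.
    doc_freq = Counter()
    for tokens in cluster_tokens.values():
        doc_freq.update(set(tokens))

    result = {}
    for cluster, tokens in cluster_tokens.items():
        scores = {}
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log((1 + n_clusters) / (1 + doc_freq[word])) + 1
            scores[word] = tf * idf
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        result[cluster] = ranked[:k]
    return result


clusters = {
    0: ["flight", "control", "flight", "sensor"],
    1: ["engine", "control", "engine", "pump"],
}
for cluster, ranked in top_tfidf_words(clusters, k=2).items():
    print(cluster, [word for word, _ in ranked])
# prints:
# 0 ['flight', 'sensor']
# 1 ['engine', 'pump']
```

In this toy input, "control" appears in both clusters, so its IDF is the lowest and it drops out of the top 2. With k=5 the same ranking yields a five-word label of the kind stored in TFIDF_Labels.csv.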
TF-IDF Reports:
- TFIDF_Summary_Report.csv: Summary report containing the average and standard deviation of TF-IDF scores for each cluster.
TF-IDF Plots:
- TFIDF_Scores_for_Clusters_Top_10_Words.png: A plot showing the TF-IDF scores for the top 10 words by cluster.
- TFIDF_Scores_for_Combined_Top_10_Words.png: A plot showing the TF-IDF scores for the combined words by cluster.
- Label_Distribution_TFIDF_Bars_Affiliated.png: A bar chart showing the distribution of labels for clusters based on TF-IDF.
LDA Outputs:
- Clusters_Top_LDA.csv: Contains the top words from the LDA analysis for each cluster.
- Combined_Top_LDA.csv: Contains the top words from the LDA analysis for the combined model names, components, and features.
- LDA_Labels.csv: Contains the final labels generated using LDA for each cluster, including the top 5 words.
LDA Reports:
- Clusters_Perplexity.csv: Contains the perplexity scores for each cluster and the number of topics.
- Combined_Perplexity.csv: Contains the perplexity scores for each cluster and the number of topics for the combined data.
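Perplexity is the exponential of the negative average log-likelihood per token, so lower values indicate a model that fits the text better. The sketch below shows that definition and one simple way to choose a topic count from it. The helper names and the perplexity values in the example are hypothetical, and selecting the minimum is an assumption about how the scores might be used, not a description of the project's selection rule.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-total log-likelihood / total token count), using natural logs."""
    total_tokens = sum(doc_lengths)
    if total_tokens == 0:
        raise ValueError("cannot compute perplexity for an empty corpus")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)


def best_topic_count(perplexity_by_topics):
    """Return the number of topics with the lowest perplexity."""
    return min(perplexity_by_topics, key=perplexity_by_topics.get)


# A model that assigns probability 1/4 to every token has perplexity 4.
print(round(perplexity([10 * math.log(0.25)], [10]), 6))
# prints 4.0

# Hypothetical scores for 2, 3, and 4 topics.
print(best_topic_count({2: 120.5, 3: 98.1, 4: 101.7}))
# prints 3
```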
LDA Plots:
- Topics_Distribution_by_Cluster.png: A stacked bar chart showing the distribution of topics for each cluster.
- Perplexity_vs_Number_of_Topics.png: A plot showing the perplexity values for different numbers of topics.
TF-IDF Validation:
- TF-IDF_Validation_Results.csv: Contains cosine similarity, precision, recall, and F1 scores for each cluster's TF-IDF labels compared to the ground truth.
Plots:
- TF-IDF_Cosine_Similarity_Plot.png: A bar chart showing the cosine similarity scores for each cluster.
- TF-IDF_Precision_Recall_F1_Boxplot.png: A boxplot showing the distribution of precision, recall, and F1 scores across all clusters.
- TF-IDF_Precision_Recall_Bar.png: A bar chart comparing precision and recall for each cluster.
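The cosine similarity score compares a vector for the generated label with a vector for the ground-truth label. The sketch below assumes each label is represented by the average of its word vectors, which is a common choice but not confirmed for validation.py. The three-dimensional `embeddings` dictionary is a toy stand-in for the pre-trained Word2Vec model, and the function names are hypothetical.

```python
import math


def average_vector(words, embeddings):
    """Average the vectors of the words found in the embedding table."""
    vectors = [embeddings[word] for word in words if word in embeddings]
    if not vectors:
        return None  # none of the words is in the vocabulary
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only; real Word2Vec vectors have 300 dimensions.
embeddings = {
    "flight": [1.0, 0.0, 0.0],
    "control": [0.0, 1.0, 0.0],
    "engine": [0.0, 0.0, 1.0],
}

generated = average_vector(["flight", "control"], embeddings)
truth = average_vector(["flight"], embeddings)
print(round(cosine_similarity(generated, truth), 4))
# prints 0.7071
```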
LDA Validation:
- LDA_Validation_Results.csv: Contains cosine similarity, precision, recall, and F1 scores for each cluster's LDA labels compared to the ground truth.
LDA Plots:
- LDA_Cosine_Similarity_Plot.png: A bar chart showing the cosine similarity scores for each cluster.
- LDA_Precision_Recall_F1_Boxplot.png: A boxplot showing the distribution of precision, recall, and F1 scores across all clusters.
- LDA_Precision_Recall_Bar.png: A bar chart comparing precision and recall for each cluster.
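The precision, recall, and F1 columns in both validation result files can be illustrated with a word-overlap comparison between a generated label and its ground-truth label. This is a sketch under the assumption that words are matched exactly after preprocessing; validation.py may match words differently. The function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(generated, ground_truth):
    """Compare two word collections by exact overlap."""
    generated, ground_truth = set(generated), set(ground_truth)
    overlap = len(generated & ground_truth)
    # Precision: share of generated words that are correct.
    precision = overlap / len(generated) if generated else 0.0
    # Recall: share of ground-truth words that were found.
    recall = overlap / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


generated_label = ["flight", "control", "sensor", "bus", "thread"]
truth_label = ["flight", "control", "system"]
precision, recall, f1 = precision_recall_f1(generated_label, truth_label)
print(round(precision, 3), round(recall, 3), round(f1, 3))
# prints 0.4 0.667 0.5
```

In the example, two of the five generated words appear in the ground truth (precision 0.4) and two of the three ground-truth words are recovered (recall 0.667), which gives an F1 of 0.5.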