This program is meant to search for a word in a news dataset.
- Python 3+
- Python modules:
- pandas
pip install pandas - nltk
pip install nltk
- pandas
- Python modules:
- Various nodes:
- This project is meant to be a work for TET subject in EAFIT University, for Systems Engineering students. We are given access to a cluster and we have two slave nodes and a master node which we are using in the code. If you have you're own nodes, you can set their ip adresses in file mpi/hosts_mpi.
First, you have to know for solution there is a batch part, whose idea is to create a table with each word's occurrency in the content of a new. To do so, a table is created where each word is paired to the id of the new it was in, how many times it was found (frequency) and the title of the new.
After the table is completely created, you have to enter the word you are looking for and the algorithm accesses to the mentioned table to search for that word. Saves registers where it was found and organizes those registers descending to finally show top 10 of registers (biggest frequency). It shows frequency, id of the new and its title. This is the online part of code.
Conventions: nThreads: number of threads you want the code runs with.
To use this program, you have to follow the next steps:
-
To clean files
Note: Make sure you have files in /opt/datasets so code can access to clean them. Otherwise, go to code and change path in cleaning.py, line 42.
$ cd serial/ $ python cleaning.py -
To execute serial algorithm
Note: After cleaning, files are saved automatically in this folder. If you want to save it somewhere else, you'll have to change all .cpp files to indicate where they are stored.
$ cd serial/ $ g++ serial.cpp -std=c++11 -o serial $ time ./serial -
To execute OpenMP algorithm
$ cd openmp/ $ g++ wordCounterOMP.cpp -std=c++11 -o omp -fopenmp $ export OMP_NUM_THREADS=<nThreads> $ export OMP_DISPLAY_ENV='true' $ time ./ompNote: an inverted time index is going to be displayed. Divide that time on the number of threads you set and you'll get the real time it took to create batch part (table with the occurrency of each word).
-
To execute MPI with OpenMP algorithm
Note: np refers to number of nodes the algorithm is going to run in. It's specifically 3 because we count on 2 slaves nodes and 1 master node, as explained before.
$ cd mpi $ mpic++ wordCounterMPI.cpp -o mpi -std=c++11 -fopenmp $ export OMP_NUM_THREADS=<nThreads> $ export OMP_DISPLAY_ENV='true' $ mpirun -f hosts_mpi -np 3 ./mpi
Serial
It took 203.18s (3:23:18m) to create inverted index in serial algorithm.
This algorithm is obviously slow, given the fact creating the initial table is O(n), with n equals to number of words in all culumn content of the whole dataset.
Accessing to the created table in the online part is log(n), which gives a little efficiency to it.
OpenMP
Implementing threads helped in execution. However, as explained in OpenMP design, there were two problems with threads: they couldn't either access to the same new nor access to the same table entry to set its value. What threads could do was, given a line (new), they could start counting each word, and different lines could be given to various threads at the same time.
As shown in the following charts, we can conclude that using 3 threads is the most optimum. The more threads, the worst is going to become performance.
MPI with OpenMP
Doing a distributed serial algorithm improves efficiency in a very palpable way. Each node took afile, created the table with all words existing in each new and after master sends the word to search, they made their table with top 10 frequencies. After that, master collects those tables and organizes them to show the final table that evidences top 10 in the whole dataset.
The following table shows that, even though difference with 3 threas is really small, it keeps being more omptimum to use two threads per node.
- Wendy María Arango Chavarría warango4@eafit.edu.co -- Explanation
- Manuela Carrasco Pinzón mcarras1@eafit.edu.co




