Delta Lab | Twitter | LinkedIn


Tabelog Scrapy with Google Compute Engine (GCE)

  • Omar Trejo
  • February, 2017

This repository contains scraping code for Tabelog (www.tabelog.com/en/) that runs either in a local environment or on a Google Compute Engine (GCE) instance. It's implemented in Python 2.7 and designed to work on Ubuntu 16.04. There are 14 branches in total, where each branch covers a different subset of the pages to be scraped, so that Tabelog's blocking mechanism is not triggered.

The idea is to avoid depending on various local laptops to scrape results from Tabelog without getting blocked (due to the number of requests). To achieve this, the code for each prefecture whose data we want is run in a new GCE instance. Each new GCE instance gets a different IP, which is how Tabelog's IP blocks are handled. Each instance is automatically turned off after its scraping task finishes; to keep the data even after the instance no longer exists, it is sent to a Google Cloud Storage (GCS) bucket within the same Google Cloud project.
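
To illustrate that last step, the sketch below shows how an instance could upload its results to GCS and then delete itself. This is a minimal illustration, not the repository's actual code: it assumes the google-cloud-storage and google-api-python-client libraries, and the bucket, file, and instance names are hypothetical.

from google.cloud import storage
from googleapiclient import discovery

def upload_results_and_shut_down(project, zone, instance,
                                 bucket_name, result_file):
    # Copy the scraped results to GCS so they survive instance deletion.
    client = storage.Client(project=project)
    bucket = client.bucket(bucket_name)
    bucket.blob(result_file).upload_from_filename(result_file)
    # Delete this very instance; the default service account needs
    # compute permissions (hence "full access to all Cloud APIs" below).
    compute = discovery.build('compute', 'v1')
    compute.instances().delete(
        project=project, zone=zone, instance=instance).execute()

# Hypothetical values; main.py receives these through its arguments.
upload_results_and_shut_down('my-project', 'us-central1-a', 'scraper-1',
                             'tabelog-results', 'tokyo.json')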

The code can also run locally. If only a few prefectures are needed, this may be the best approach. If for some reason the laptop's IP gets blocked, you can fall back to the GCE approach.

Instructions

  1. Create a GCE instance with defaults, but change to:
    • Ubuntu 16.04 LTS with a 10 GB disk
      • The code should work on any Linux OS. However, the setup.sh script is written for Ubuntu 16.04 LTS, so select this OS if you want the easiest setup.
    • Allow full access to all Cloud APIs
      • This is not strictly required, but it's easier than turning on the specific APIs we need by hand, and there's no security risk in doing so because these instances run only for brief periods of time.
    • Allow HTTP/HTTPS traffic
      • This is required to be able to do the scraping.
    • NOTE: If you don't know how to do this, take a look at the resources provided below for GCE, or see the gcloud sketch after this list.
    • NOTE: Keep track of the project, zone, and instance names, as we'll need them later to shut down the instance automatically.
  2. Set up the environment in the GCE instance's terminal:
    • Clone repository
      • $ git clone https://github.com/otrenav/scrapy-tabelog
    • Change directory to repository
      • $ cd scrapy-tabelog
    • Execute setup script
      • $ bash setup.sh
      • If this doesn't seem to work, you can execute each of its lines manually in the terminal; the script just makes this easier.
  3. Execute main function:
    • $ python main.py ... (see the Usage section below)
  4. Process results:
    • $ cd process_results/
    • $ bash process_results.sh
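
If you prefer creating the instance (step 1) from the command line, a hypothetical gcloud command covering the options above is sketched here. This is not part of the repository; the instance name is a placeholder, and the flags map to the Ubuntu 16.04 LTS image, the 10 GB boot disk, full Cloud API access, and the default HTTP/HTTPS firewall rules:

$ gcloud compute instances create scraper-1 \
      --project=<google_cloud_project> \
      --zone=<google_cloud_zone> \
      --image-family=ubuntu-1604-lts \
      --image-project=ubuntu-os-cloud \
      --boot-disk-size=10GB \
      --scopes=cloud-platform \
      --tags=http-server,https-server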

Usage

Local environment

$ python main.py --prefectures=<prefecture>,...,<prefecture> --category_groups=1,2,3,4

The prefectures argument must receive a comma-separated list (no spaces) of the desired prefecture names, in lower case. Most of the time these should be taken from the groups in the List of prefectures below. The optional category_groups parameter must receive a comma-separated list (no spaces) of the group numbers defined in the constants.py file. The purpose of these groups is to split the results to be retrieved into batches to avoid being blocked by Tabelog. This is useful in the case of Tokyo, where there are more than 70,000 restaurants (which seems to be the limit of results that can be scraped from a single IP). If the category_groups parameter is not provided, all groups will be scraped for the given prefectures.
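
For example, these hypothetical invocations (with prefecture names taken from the list below) scrape category groups 1 and 2 for Tokyo, and all category groups for the three Group 1 prefectures:

$ python main.py --prefectures=tokyo --category_groups=1,2
$ python main.py --prefectures=aichi,akita,aomori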

Google Compute Engine

To turn off the instance automatically after scraping finishes, we need to specify the project, zone, and instance names chosen when creating the instance. These are passed to the main script as follows:

$ python main.py --prefectures=<prefecture>,...,<prefecture>
                 --category_groups=1,2,3,4
                 --project=<google_cloud_project>
                 --zone=<google_cloud_zone>
                 --instance=<google_cloud_instance>

The prefectures argument is optional here. If it's not supplied, all prefectures will be scraped automatically. The same naming conventions as in the local environment case apply. However, the data will be saved to a GCS bucket, stored under the same project in Google Cloud.

The project, zone, and instance names are optional (even when working inside a GCE instance), but if one of them is supplied, all must be supplied. If we're in a GCE instance and these parameters are not supplied, the code will execute as if it were in a local environment and no GCE instance deletion will occur: the data will still be saved within the same Google Cloud project, but the instance will remain turned on until manually deleted. This may increase billing and is not recommended.

Warning: the Google Cloud arguments are not validated by the code, so you must make sure they are correct when using them.
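
As a concrete (hypothetical) example, with placeholder names for the project, zone, and instance:

$ python main.py --prefectures=osaka,saga
                 --category_groups=1,2,3,4
                 --project=my-project
                 --zone=us-central1-a
                 --instance=scraper-1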

Process the results

Processing the results will convert all the JSON files in the ./inputs/json/ directory into CSVs in ./inputs/csv/raw/. Then it will process each of them and save its clean version in ./inputs/csv/clean/. Finally, it will take all of those clean CSVs, join them, and save them as a single file in ./outputs/data.csv. An analysis script will also be executed, which creates the final two files with diagnostics about the data.csv file.
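
The final joining step can be pictured with the following sketch. It's an illustration only, not the repository's actual implementation, and it assumes the pandas library is available:

import glob

import pandas as pd

# Stack every clean per-prefecture CSV into a single table and save it.
paths = sorted(glob.glob('./inputs/csv/clean/*.csv'))
frames = [pd.read_csv(path) for path in paths]
pd.concat(frames, ignore_index=True).to_csv('./outputs/data.csv', index=False)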

Making the repository private

This repository is currently public so that we can clone it within the GCE instance without having to configure SSH keys. If we wanted to keep the repository private, it could be done, but it would require an extra setup step for each GCE instance. Since we want to keep the code as easy to use as possible, this is not currently implemented, and the repository is kept public.

Resources

List of prefectures

The number of allowed requests before getting blocked seems to be around 70,000 per IP. Try to group the prefectures assigned to each instance so that no group requires more than that amount.
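
A quick way to sanity-check a grouping is to sum the listed restaurant counts per group. A minimal sketch, with counts copied from the tables below:

# Restaurant counts per prefecture, taken from the group tables below.
groups = {
    1: {'aichi': 47758, 'akita': 6350, 'aomori': 8250},
    2: {'chiba': 30764, 'ehime': 9424, 'fukui': 6184},
}
LIMIT = 70000  # Approximate requests allowed from a single IP.
for number, prefectures in sorted(groups.items()):
    total = sum(prefectures.values())
    status = 'OK' if total <= LIMIT else 'too large'
    print('Group %d: %s restaurants (%s)' % (number, format(total, ','), status))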

Group 1

Prefecture   Listed restaurants
Aichi        47,758
Akita        6,350
Aomori       8,250

Group 2

Prefecture   Listed restaurants
Chiba        30,764
Ehime        9,424
Fukui        6,184

Group 3

Prefecture   Listed restaurants
Fukuoka      34,045
Fukushima    11,539
Gifu         14,145

Group 4

Prefecture   Listed restaurants
Gunma        13,968
Hiroshima    17,812
Hokkaido     39,100

Group 5

Prefecture   Listed restaurants
Hyogo        36,096
Ibaraki      15,317
Ishikawa     8,694
Iwate        7,567

Group 6

Prefecture   Listed restaurants
Kagawa       7,147
Kagoshima    9,912
Kanagawa     46,685

Group 7

Prefecture   Listed restaurants
Kochi        5,487
Kumamoto     10,160
Kyoto        21,112
Mie          11,257
Miyagi       14,176
Miyazaki     7,121

Group 8

Prefecture   Listed restaurants
Nagano       17,366
Nagasaki     8,395
Nara         7,199
Niigata      13,762
Oita         7,931
Okayama      11,216
Okinawa      13,252

Group 9

Prefecture   Listed restaurants
Osaka        66,526
Saga         4,929

Group 10

Prefecture   Listed restaurants
Saitama      33,718
Shiga        6,732
Shimane      4,054
Shizuoka     25,124

Group 11

Prefecture   Listed restaurants
Tochigi      13,360
Tokushima    5,254

Groups 12 and 13

Prefecture   Listed restaurants
Tokyo        130,632

Group 14

Prefecture   Listed restaurants
Tottori      3,638
Toyama       6,550
Wakayama     7,130
Yamagata     7,642
Yamaguchi    8,180
Yamanashi    7,228

"We are the people we have been waiting for."
