Spidey Web Crawler

A robust, modular asynchronous web crawler built in Python that crawls websites and downloads files with specified extensions.

Features

Parallel crawling - Multiple async workers for URL fetching and file downloads
SHA256 deduplication - Files saved as {checksum}.{ext}, no duplicates
Full control - Pause, resume, stop, and monitor crawl progress in real-time
Events system - Subscribe to progress, page crawled, file saved events
Rate limiting - Token bucket rate limiter to avoid overwhelming servers
Retry with backoff - Exponential backoff for failed requests
Modular architecture - Separate components for fetch, parse, storage, queue
Thread-safe queues - Safe concurrent access for URL and file queues
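
To make the rate-limiting feature concrete, here is a minimal token bucket sketch. It is not Spidey's internal implementation; the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions chosen for illustration.

```python
import time


class TokenBucket:
    """Illustrative token bucket: refills `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self._clock = clock           # injectable clock (hypothetical, eases testing)
        self._tokens = capacity       # start with a full bucket
        self._last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, capped at capacity.
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now

    def try_acquire(self, tokens=1.0):
        """Take `tokens` if available; return False instead of blocking."""
        self._refill()
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

A caller would check `try_acquire()` before each request and wait briefly when it returns `False`, which allows short bursts while holding the long-run request rate near `rate` per second.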

Installation

uv pip install spidey

Or from source:

cd spidey
uv sync

Quick Start

from spidey import Spidey

crawler = Spidey.from_args(
    urls=["https://example.com"],
    extensions=[".svg", ".png", ".jpg"],
    max_pages=100,
    folder="data"
)

crawler.crawl()

Full Control Example

from spidey import Spidey
import threading
import time

# Create crawler
crawler = Spidey.from_args(
    urls=["https://example.com"],
    extensions=[".svg", ".png", ".jpg"],
    num_workers=5,
    max_pages=50,
    folder="data"
)

# Subscribe to events
crawler.on("progress", lambda e: print(f"Progress: {e.data}"))
crawler.on("file_saved", lambda e: print(f"Saved: {e.data['checksum'][:16]}..."))
crawler.on("crawl_complete", lambda e: print(f"Done! {e.data}"))

# Run in background thread for external control
t = threading.Thread(target=crawler.crawl)
t.start()

# Control while running
time.sleep(5)
crawler.pause()
print("Paused...")

time.sleep(2)
crawler.resume()
print("Resumed...")

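# Wait for the background crawl to finish
t.join()

# Alternative: skip the thread and call crawler.crawl() directly;
# it blocks until the crawl completes.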

Configuration Options

Parameter	Type	Default	Description
`urls`	List[str]	Required	Starting URLs to crawl
`extensions`	List[str]	Required	File extensions to download (e.g., `.svg`, `.png`)
`limited_to_domains`	bool	False	Limit crawling to initial domains only
`max_pages`	int	1000	Maximum number of pages to crawl
`sleep_time`	float	0.0	Delay between requests in seconds
`restricted_domains`	List[str]	[]	Domains to exclude from crawling
`folder`	str	""	Output folder for downloaded files
`unique_file_name`	bool	True	Generate unique filenames
`num_workers`	int	10	Number of concurrent workers
`max_retries`	int	3	Maximum retry attempts for failed requests
`retry_delay`	float	1.0	Initial retry delay in seconds
`request_timeout`	float	30.0	HTTP request timeout in seconds
`max_concurrent_requests`	int	50	Maximum concurrent HTTP requests
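
The interaction between `max_retries` and `retry_delay` can be sketched as follows. This is an illustration under stated assumptions, not Spidey's code: the doubling `factor`, the helper names `backoff_delays` and `call_with_retry`, and the reading of `max_retries` as "retries after the first attempt" are all hypothetical.

```python
import time


def backoff_delays(max_retries=3, retry_delay=1.0, factor=2.0):
    """Delay before each retry: retry_delay, then multiplied by factor each time."""
    return [retry_delay * factor ** attempt for attempt in range(max_retries)]


def call_with_retry(func, max_retries=3, retry_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call func(); on failure, wait with exponential backoff and try again.

    Assumes one initial attempt plus up to `max_retries` retries.
    The last exception is re-raised once retries are exhausted.
    """
    delays = backoff_delays(max_retries, retry_delay, factor)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(delays[attempt])
```

Under these assumptions the default settings would wait 1, 2, and 4 seconds between attempts before giving up.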

Output Structure

Files are organized by extension and named with their SHA256 checksum. The names below are shortened placeholders; a real SHA256 hex digest is 64 hexadecimal characters:

data/
├── svg/
│   ├── a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6.svg
│   └── f7g8h9i0j1k2l3m4n5o6p7q8r9s0t1u2.svg
├── png/
│   ├── l3m4n5o6p7q8r9s0t1u2v3w4x5y6z7.png
│   └── a1b2c3d4e5e6f7g8h9i0j1k2l3m4n5o6.png
├── jpg/
├── html/
└── css/

This approach ensures:

No duplicate files are saved
Easy deduplication across runs
Quick file identification by checksum
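
The naming scheme above can be sketched with the standard library. This is a minimal illustration, assuming the `{folder}/{ext}/{checksum}.{ext}` layout shown in the tree; the function name `save_deduplicated` and its return value are hypothetical and not part of Spidey's API.

```python
import hashlib
from pathlib import Path


def save_deduplicated(content, extension, folder="data"):
    """Write `content` to {folder}/{ext}/{sha256}.{ext} unless it already exists.

    Returns (path, written), where `written` is False for a duplicate.
    """
    checksum = hashlib.sha256(content).hexdigest()  # 64 hex characters
    ext = extension.lstrip(".").lower()             # ".SVG" -> "svg"
    target = Path(folder) / ext / f"{checksum}.{ext}"
    if target.exists():
        # Same bytes were saved earlier (in this run or a previous one).
        return target, False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return target, True
```

Because the filename is derived only from the file's bytes, the same content fetched from two different URLs maps to the same path, which is what makes deduplication across runs straightforward.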

Events

Subscribe to events for real-time monitoring:

crawler.on("state_changed", lambda e: print(f"State: {e.data['new']}"))
crawler.on("progress", lambda e: print(f"Pages: {e.data['pages_visited']}"))
crawler.on("page_crawled", lambda e: print(f"URL: {e.data['url']}"))
crawler.on("file_saved", lambda e: print(f"File: {e.data['checksum'][:16]}..."))
crawler.on("crawl_complete", lambda e: print(f"Complete: {e.data}"))

Event	Data
`state_changed`	`{old: str, new: str}`
`progress`	`{pages_visited, urls_queued, files_saved, files_skipped}`
`page_crawled`	`{url, new_urls, files}`
`file_saved`	`{url, checksum, size}`
`crawl_complete`	`{stats: {...}}`
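
For anything beyond printing, handlers can be grouped in a small object that accumulates state. The sketch below relies only on the `data` keys listed in the table; the class name `CrawlSummary` is hypothetical, and treating `size` as a byte count is an assumption.

```python
from types import SimpleNamespace


class CrawlSummary:
    """Collects a few totals from crawler events (illustrative only)."""

    def __init__(self):
        self.pages = []          # URLs reported by "page_crawled"
        self.checksums = set()   # distinct checksums reported by "file_saved"
        self.bytes_saved = 0     # sum of "size" values (assumed to be bytes)

    def on_page_crawled(self, event):
        self.pages.append(event.data["url"])

    def on_file_saved(self, event):
        self.checksums.add(event.data["checksum"])
        self.bytes_saved += event.data["size"]


# Stand-in events for demonstration; a real crawler would supply these.
summary = CrawlSummary()
summary.on_page_crawled(SimpleNamespace(data={"url": "https://example.com"}))
summary.on_file_saved(SimpleNamespace(
    data={"url": "https://example.com/a.svg", "checksum": "ab" * 32, "size": 120}
))
```

With a real crawler, the methods would be registered in the same way as the lambdas above, for example `crawler.on("file_saved", summary.on_file_saved)`.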

Crawler States

from spidey import CrawlerState

# Check current state
print(crawler.state)  # e.g. CrawlerState.RUNNING while a crawl is in progress

# States: IDLE, RUNNING, PAUSED, STOPPED, COMPLETED

Architecture

Spidey (Main Orchestrator)
    │
    ├── Controller (State, Stats, Events, Pause/Resume/Stop)
    │
    ├── URLQueue (Thread-safe URL batching)
    │       │
    │       └── URL Workers (async)
    │               │
    │               ├── Fetcher (HTTP + retry + rate limit)
    │               └── Parser (extract URLs and files)
    │
    ├── FileQueue (Thread-safe file download queue)
    │       │
    │       └── File Workers (async)
    │               │
    │               ├── Fetcher (download bytes)
    │               └── Storage (SHA256 dedup + write)
    │
    └── Storage (SHA256 deduplication)

Components

Component	Responsibility
`Config`	All settings with validation
`Controller`	State management, stats, events
`URLQueue`	Thread-safe URL batching
`FileQueue`	Thread-safe file download queue
`Fetcher`	HTTP client with retry & rate limiting
`Parser`	Extract links and files from HTML
`Storage`	SHA256 deduplication & file writes
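
The two-stage flow in the diagram (URL workers feeding file workers) can be sketched with `asyncio`. This is a simplified illustration, not Spidey's implementation: it uses plain `asyncio.Queue` objects, which are not thread-safe, in place of `URLQueue` and `FileQueue`, and `fake_fetch_page`, `run_pipeline`, and the worker functions are hypothetical names.

```python
import asyncio


async def url_worker(url_queue, file_queue, fetch_page):
    # Stage 1: take a page URL, "fetch and parse" it, enqueue discovered files.
    while True:
        url = await url_queue.get()
        try:
            for file_url in await fetch_page(url):
                await file_queue.put(file_url)
        finally:
            url_queue.task_done()


async def file_worker(file_queue, saved):
    # Stage 2: take a file URL; appending stands in for download + storage.
    while True:
        file_url = await file_queue.get()
        try:
            saved.append(file_url)
        finally:
            file_queue.task_done()


async def run_pipeline(urls, fetch_page, num_workers=2):
    url_queue, file_queue = asyncio.Queue(), asyncio.Queue()
    saved = []
    for url in urls:
        url_queue.put_nowait(url)
    workers = [asyncio.create_task(url_worker(url_queue, file_queue, fetch_page))
               for _ in range(num_workers)]
    workers += [asyncio.create_task(file_worker(file_queue, saved))
                for _ in range(num_workers)]
    await url_queue.join()    # all pages processed, so all files are enqueued
    await file_queue.join()   # all files processed
    for worker in workers:
        worker.cancel()       # workers loop forever; stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return saved


async def fake_fetch_page(url):
    # Hypothetical stub standing in for Fetcher + Parser.
    return [f"{url}/logo.svg", f"{url}/photo.png"]
```

Running `asyncio.run(run_pipeline(["https://example.com"], fake_fetch_page))` drives both stages to completion. Waiting on the URL queue before the file queue ensures no file work is still pending when the workers are cancelled.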

License

MIT License - See LICENSE file
