wesleysantana/WebScrapingCSharp

📚 Books Scraper API (C# + ASP.NET Core + AngleSharp)

A Web API in C# (.NET 8) that performs web scraping on the Books to Scrape sandbox site.
Users can query books by category (e.g., Travel, Mystery, Fiction) with optional filters such as maximum price, minimum rating, and item limits.

This project demonstrates clean architecture, scraping best practices, and a production-style API while being entirely educational and safe for portfolio use.


🚀 Features

  • REST endpoint (/api/books) to retrieve books.
  • Select multiple categories in one request.
  • Optional filters:
    • Minimum rating (minRating)
    • Maximum price (maxPrice)
    • Limit of items per category (maxItemsPerCategory)
  • Automatic pagination (scrapes all pages of a category).
  • Random delay between requests (polite scraping).
  • Safe cancellation with CancellationToken.
  • Returns clean JSON (BookDto).
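
The optional filters listed above could be applied along these lines. This is a minimal sketch, not the project's actual code: the `BookDto` shape is inferred from the JSON example in this README, `BookFilters.Apply` is a hypothetical helper, and treating `maxPrice` as inclusive is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of BookDto, inferred from the JSON example in this README.
public record BookDto(
    string Title, string Category, decimal Price,
    bool InStock, int Rating, string DetailUrl, string ImageUrl);

public static class BookFilters
{
    // Applies the optional filters; a null argument means "no restriction".
    // Assumption: maxPrice is inclusive and the item limit applies per category.
    public static IEnumerable<BookDto> Apply(
        IEnumerable<BookDto> books,
        int? minRating = null,
        decimal? maxPrice = null,
        int? maxItemsPerCategory = null)
    {
        var filtered = books
            .Where(b => minRating is null || b.Rating >= minRating)
            .Where(b => maxPrice is null || b.Price <= maxPrice);

        // Group by category so the limit is enforced for each category separately.
        return filtered
            .GroupBy(b => b.Category)
            .SelectMany(g => maxItemsPerCategory is int n ? g.Take(n) : g);
    }
}
```

For example, `BookFilters.Apply(books, minRating: 3, maxItemsPerCategory: 5)` would keep at most five books rated 3 or higher from each category.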

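🛠️ Tech Stack

  • C# on .NET 8
  • ASP.NET Core Web API
  • AngleSharp (HTML parsing)
  • Swagger UI (interactive API documentation)
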
📡 Example Usage

Endpoint

GET /api/books?categories=Travel,Mystery&minRating=3&maxItemsPerCategory=5

JSON Response

[
  {
    "title": "Sharp Objects",
    "category": "Mystery",
    "price": 47.82,
    "inStock": true,
    "rating": 4,
    "detailUrl": "https://books.toscrape.com/catalogue/sharp-objects_997/index.html",
    "imageUrl": "https://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg"
  },
  {
    "title": "In the Woods",
    "category": "Mystery",
    "price": 36.95,
    "inStock": true,
    "rating": 3,
    "detailUrl": "https://books.toscrape.com/catalogue/in-the-woods_979/index.html",
    "imageUrl": "https://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg"
  }
]
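
A client could read this response into typed objects with `System.Text.Json`. The sketch below is illustrative only: the `BookDto` record is inferred from the fields shown above, and `BookJson.Parse` is a hypothetical helper, not part of the project.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical shape of BookDto, inferred from the JSON fields shown above.
public record BookDto(
    string Title, string Category, decimal Price,
    bool InStock, int Rating, string DetailUrl, string ImageUrl);

public static class BookJson
{
    // Web defaults map camelCase JSON names onto the PascalCase record members.
    private static readonly JsonSerializerOptions Options =
        new(JsonSerializerDefaults.Web);

    // Parses the array returned by GET /api/books; a JSON "null" yields an empty list.
    public static List<BookDto> Parse(string json) =>
        JsonSerializer.Deserialize<List<BookDto>>(json, Options) ?? new List<BookDto>();
}
```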

▶️ Getting Started

Prerequisites

  • .NET 8 SDK

Run locally

# clone the repo
git clone https://github.com/wesleysantana/WebScrapingCSharp.git
cd WebScrapingCSharp

# restore packages
dotnet restore

# run the API
dotnet run

Open Swagger UI at:
👉 http://localhost:5087/swagger


⚙️ Swagger Examples

  • GET /api/books?categories=Travel → all Travel books
  • GET /api/books?categories=Fiction&maxPrice=20 → Fiction books under £20
  • GET /api/books?categories=Poetry,Classics&minRating=4 → Poetry & Classics with rating ≥ 4

🧑‍💻 Architecture

  • Controllers → HTTP endpoints (BooksController)
  • Services → Scraping logic (BooksScraper)
  • Models → DTOs (BookDto, ScrapeRequest)
  • Infrastructure → Configurable scraping options (delays, concurrency)
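
One way the layers above might fit together is sketched here. Only the names `BookDto` and `ScrapeRequest` come from this README; their members, the `IBooksScraper` interface, and the `BooksEndpoint` class are assumptions. The ASP.NET Core attributes are left out so the sketch compiles on its own; in the real project this logic would sit in `BooksController`.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Models: hypothetical member lists; only the type names appear in this README.
public record BookDto(
    string Title, string Category, decimal Price,
    bool InStock, int Rating, string DetailUrl, string ImageUrl);

public record ScrapeRequest(
    IReadOnlyList<string> Categories,
    int? MinRating = null,
    decimal? MaxPrice = null,
    int? MaxItemsPerCategory = null);

// Services: the scraping contract (interface name is an assumption).
public interface IBooksScraper
{
    Task<IReadOnlyList<BookDto>> ScrapeAsync(ScrapeRequest request, CancellationToken ct);
}

// Controller-side logic: turn raw query values into a ScrapeRequest and delegate.
public class BooksEndpoint
{
    private readonly IBooksScraper _scraper;

    public BooksEndpoint(IBooksScraper scraper) => _scraper = scraper;

    public Task<IReadOnlyList<BookDto>> GetAsync(
        string categories, int? minRating, decimal? maxPrice,
        int? maxItemsPerCategory, CancellationToken ct)
    {
        // "Travel, Mystery" -> ["Travel", "Mystery"]; blank entries are dropped.
        var names = (categories ?? string.Empty).Split(
            ',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);

        if (names.Length == 0)
            throw new ArgumentException("At least one category is required.", nameof(categories));

        return _scraper.ScrapeAsync(
            new ScrapeRequest(names, minRating, maxPrice, maxItemsPerCategory), ct);
    }
}
```

Keeping the endpoint dependent on an interface rather than a concrete scraper makes it straightforward to swap in a fake scraper for tests.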

🔍 Best Practices Applied

  • Concurrency limit: prevents too many parallel requests.
  • Random delays: mimics human browsing and avoids stressing the server.
  • Cancellation support: aborts long operations cleanly.
  • Configurable filters: user controls what to scrape.
  • Separation of concerns: controllers, services, and models clearly separated.
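
The first three practices can be combined in a small fetch helper. This is a sketch under stated assumptions rather than the project's implementation: `PoliteFetcher` and its parameters are hypothetical names, and it relies on `Random.Shared`, which requires .NET 6 or later.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: caps parallel requests, waits a random delay before
// each one, and honors cancellation at every await.
public sealed class PoliteFetcher
{
    private readonly HttpClient _http;
    private readonly SemaphoreSlim _gate;
    private readonly TimeSpan _minDelay;
    private readonly TimeSpan _maxDelay;

    public PoliteFetcher(HttpClient http, int maxConcurrency, TimeSpan minDelay, TimeSpan maxDelay)
    {
        _http = http;
        _gate = new SemaphoreSlim(maxConcurrency, maxConcurrency);
        _minDelay = minDelay;
        _maxDelay = maxDelay;
    }

    public async Task<string> GetHtmlAsync(string url, CancellationToken ct)
    {
        // Concurrency limit: at most maxConcurrency callers pass this point at once.
        await _gate.WaitAsync(ct);
        try
        {
            // Random delay in [minDelay, maxDelay], in whole milliseconds.
            var delayMs = Random.Shared.Next(
                (int)_minDelay.TotalMilliseconds,
                (int)_maxDelay.TotalMilliseconds + 1);
            await Task.Delay(delayMs, ct);

            return await _http.GetStringAsync(url, ct);
        }
        finally
        {
            // Always free the slot, even on failure or cancellation.
            _gate.Release();
        }
    }
}
```

If the token is cancelled while a caller is waiting for a slot or sleeping through the delay, the method throws `OperationCanceledException` without sending the request.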

🎯 Purpose

This project was built for educational and portfolio purposes.
Books to Scrape is a public sandbox site created specifically for practicing scraping — no real data is involved.


📌 Next Steps

  • Persist scraped data in SQLite/PostgreSQL.
  • Add caching to avoid repeated requests.
  • Write unit tests for the scraping service.
  • Create a Go version scraping Scrape This Site.

📜 License

This project is intended for educational use only.
Do not use scraping techniques against websites without permission.
