Releases: raintree-technology/docpull

docpull 4.3.1

15 Jun 17:58
fceb86a

DocPull 4.3.1 tightens the project's positioning, turning public web pages into agent-ready Markdown, across PyPI, GitHub, the README, and the website.

Highlights:

  • Updated package, repository, README, website, Open Graph, and structured metadata around public static/server-rendered web pages, Python, CLI/SDK, MCP, RAG, and agent context workflows.
  • Added launch copy, comparison guidance, and marketing visibility research for developer, Python, MCP, and RAG discovery channels.
  • Added launch assets: square logo variants, desktop/mobile screenshots, full-page capture, and a short website demo video.

Release artifacts were published to PyPI via trusted publishing: https://pypi.org/project/docpull/4.3.1/

docpull 4.0.1

06 Jun 20:48
1f75b01

A release-readiness patch that tightens the public product boundary. No runtime
API changes and no migration needed.

Changed

  • Make the Python docpull mcp server the only documented supported MCP path
    for agents, plugins, Claude Code, Cursor, and Claude Desktop.
  • Mark the root TypeScript/Bun mcp/ tree as an internal lab, make its package
    metadata private, and remove end-user install instructions for that path.
  • Replace stale YAML example files with current CLI recipes so docs no longer
    advertise removed options such as --sources-file, TOON output,
    keep_variant, language, or create_index.
  • Update website examples and performance copy to match the current CLI and
    benchmark results.

v4.0.0 — security audit + breaking cleanup

04 Jun 17:55
a3a288e

A security + cleanup major release. A multi-agent security audit closed a high-severity SSRF and nine further findings; it ships alongside a tech-debt cleanup that removes several unused public APIs — those removals are what make this a major version.

⚠️ Breaking changes (removed unused public API)

Every removed symbol had zero callers in the library and test suite.

  • CacheManager — removed has_changed, is_fetched, is_failed, get_failed_urls, get_cache_stats, clear_state, has_resume_data. Incremental fetch and resume are unaffected (they use the retained update_cache, mark_fetched, mark_failed, get_fetched_urls, get_pending_urls, save_/load_/clear_discovered_urls, evict_expired).
  • StreamingDeduplicator.is_duplicate → use check_and_register (its first return value reports whether the content was new).
  • DocpullConfig.from_yaml_file → use DocpullConfig.from_yaml(path.read_text()).

🔒 Security

  • DNS-rebinding TOCTOU in the URL validator (high). resolve_allowed_addresses() resolved the hostname a second time and dialed that unscreened result, so a TTL-0 attacker could pass validation with a public IP and have the socket connect to an internal one (e.g. cloud metadata). It now resolves once and returns exactly the addresses it screened.
  • Wider SSRF coverage. Blocks CGNAT shared space (100.64.0.0/10) and IPv4-mapped IPv6, and strips the trailing DNS root dot (localhost.) before localhost/suffix checks — in both the Python validator and the TypeScript MCP source gate. The MCP gate also denies wildcard rebinding hosts (*.nip.io, *.sslip.io, *.xip.io).
  • robots.txt memory-exhaustion DoS. Body read capped at 512 KB.
  • YAML frontmatter injection. Tag/keyword frontmatter (from page JSON-LD / OpenGraph) is quoted, escaped, and stripped of CR/LF so a hostile page can't inject top-level keys.
  • Conditional-request header injection. Cached ETag / Last-Modified are stripped of CR/LF/NUL before reuse as If-None-Match / If-Modified-Since.
  • Supply chain. Pinned release tooling (pip/build/twine); dropped six unused MCP dependencies; bumped aiohttp to >=3.14.0 (CVE-2026-34993, CVE-2026-47265).
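
The address-screening items above can be illustrated with a short sketch. This is not docpull's validator: the function names `is_public_address`, `normalize_host`, and `resolve_screened` are hypothetical, and the block only shows the general idea of resolving once, screening every address (including CGNAT space and IPv4-mapped IPv6), and handing back exactly the screened set.

```python
import ipaddress
import socket

# Shared address space named in the release note; Python's is_private
# does not cover it, so it is checked explicitly.
CGNAT = ipaddress.ip_network("100.64.0.0/10")


def is_public_address(raw: str) -> bool:
    """Return True only for addresses that look safe to dial (illustrative rules)."""
    addr = ipaddress.ip_address(raw)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so the IPv4 rules apply.
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    if addr.version == 4 and addr in CGNAT:
        return False
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


def normalize_host(host: str) -> str:
    """Lowercase and drop the trailing DNS root dot ("localhost." -> "localhost")."""
    return host.rstrip(".").lower()


def resolve_screened(host: str, port: int) -> list[str]:
    """Resolve once and return exactly the addresses that passed screening."""
    host = normalize_host(host)
    if host == "localhost" or host.endswith(".localhost"):
        raise ValueError(f"blocked host: {host}")
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    blocked = [a for a in addresses if not is_public_address(a)]
    if blocked:
        raise ValueError(f"blocked addresses for {host}: {blocked}")
    # The caller should connect to these addresses directly and never
    # resolve the hostname a second time.
    return addresses
```

The key design point is the return value: because the caller dials the returned addresses rather than the hostname, a second DNS answer cannot differ from the one that was screened.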

🧹 Changed / Fixed

  • Internal cleanup with no behaviour change: removed the unused concurrency package, logging_config, and dead private methods; de-duplicated the discovery HTML-fetch helper and the HTTP GET/HEAD redirect re-validation path.
  • MCP: pgvector embedding inserts are batched under PostgreSQL's 32767 bind-parameter ceiling, so libraries with thousands of chunks index in one transaction instead of failing.
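
The batching arithmetic behind the pgvector fix can be sketched as follows. The helper name `batch_rows` and the column count in the usage note are assumptions for illustration; only the 32767 ceiling comes from the note above.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

# Bind-parameter ceiling quoted in the release note.
MAX_BIND_PARAMS = 32767


def batch_rows(rows: Sequence[T], columns_per_row: int) -> Iterator[Sequence[T]]:
    """Yield slices of rows whose total bind parameters stay under the ceiling."""
    if columns_per_row < 1:
        raise ValueError("columns_per_row must be at least 1")
    rows_per_batch = MAX_BIND_PARAMS // columns_per_row
    if rows_per_batch == 0:
        raise ValueError("a single row exceeds the bind-parameter ceiling")
    for start in range(0, len(rows), rows_per_batch):
        yield rows[start:start + rows_per_batch]
```

With a hypothetical five bound columns per chunk, each INSERT statement can carry at most 32767 // 5 = 6553 rows, so 10,000 chunks would go out as two statements inside the same transaction.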

Install

```bash
pip install --upgrade docpull # 4.0.0
```

Full changelog: docs/CHANGELOG.md · v3.0.2...v4.0.0

v2.3.0 — Framework-aware extraction, LLM chunking, Python MCP, agent fast path

24 Apr 21:47

[2.3.0] - 2026-04-24

Sharpened positioning around the agent / RAG use case, plus real bug fixes
surfaced by validation against Next.js, Supabase, Anthropic, FastAPI, Tailwind,
and Drizzle documentation sites.

Added

  • Framework-specific fast extractors: Next.js __NEXT_DATA__, Mintlify,
    OpenAPI / Swagger JSON rendered directly to Markdown, plus source-type
    tagging for Docusaurus and Sphinx. Runs before the generic extractor.
  • Next.js App Router detection via self.__next_f.push, router state tree,
    and /_next/static/ path markers — no longer relies on __NEXT_DATA__,
    which is absent on modern App Router pages.
  • SPA detection (pre- and post-conversion): pages that produce only
    Loading... shells are skipped with a clear reason. --strict-js-required
    turns this into a hard error for agents that want to route elsewhere.
  • Trafilatura extractor as an optional alternative content extractor
    (pip install docpull[trafilatura], then --extractor trafilatura).
  • Token-aware Markdown chunking: --max-tokens-per-file N splits pages
    on heading then paragraph boundaries. Exact counts with tiktoken,
    character-estimate fallback otherwise.
  • NDJSON output format (--format ndjson) for streaming one record per
    page or per chunk. --stream writes to stdout for live pipeline consumption.
  • llm profile: bundles NDJSON + 4k-token chunks + rich metadata + dedup.
  • --single / fetch_one(url): fast single-page path with no discovery,
    designed for AI-agent tool loops.
  • Python MCP server (docpull mcp): exposes fetch_url, ensure_docs,
    list_sources, list_indexed, and grep_docs tools over stdio. Install
    via pip install docpull[mcp].
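
To make the chunking and NDJSON items above concrete, here is a minimal sketch of splitting Markdown on headings first and paragraphs second, using a character-based token estimate, then emitting one JSON record per chunk. It is not docpull's implementation: the 4-characters-per-token ratio and the record fields (`url`, `chunk_index`, `content`) are assumptions.

```python
import json
import re
from typing import Iterator


def estimate_tokens(text: str) -> int:
    """Character-based estimate (about 4 characters per token, an assumption)."""
    return max(1, len(text) // 4)


def split_markdown(text: str, max_tokens: int) -> Iterator[str]:
    """Split on heading boundaries first, then on blank-line paragraph boundaries."""
    for section in re.split(r"(?m)^(?=#{1,6} )", text):
        section = section.strip()
        if not section:
            continue
        if estimate_tokens(section) <= max_tokens:
            yield section
            continue
        # Section is too large: pack paragraphs greedily into chunks.
        # A single paragraph over the budget is still emitted whole.
        current: list[str] = []
        for paragraph in re.split(r"\n\s*\n", section):
            candidate = "\n\n".join(current + [paragraph])
            if current and estimate_tokens(candidate) > max_tokens:
                yield "\n\n".join(current)
                current = [paragraph]
            else:
                current.append(paragraph)
        if current:
            yield "\n\n".join(current)


def to_ndjson(url: str, text: str, max_tokens: int) -> Iterator[str]:
    """Emit one JSON object per line, one line per chunk (hypothetical fields)."""
    for index, chunk in enumerate(split_markdown(text, max_tokens)):
        yield json.dumps({"url": url, "chunk_index": index, "content": chunk})
```

Because each record is a complete JSON object on its own line, a downstream consumer can process chunks as they arrive instead of waiting for the whole crawl to finish.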

Fixed

  • robots.txt redirect handling: Cloudflare/HTTP-2 responses send
    lowercase header names, but the Location lookup was case-sensitive,
    causing 301/308 redirects to be treated as errors. This blocked
    docs.anthropic.com and any other site whose robots.txt was redirected.
  • html2text link escape artifacts: cleaned up mangled links of the form
    [text](prefix/<https:/real.url>) in the post-processing pass; handles
    both text and image-only (empty-text) links.
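
The robots.txt redirect bug above comes down to comparing header names case-sensitively. A minimal sketch of a case-insensitive lookup, with hypothetical helper names (`get_header`, `redirect_target`) and a plain mapping standing in for the HTTP client's header object:

```python
from typing import Mapping, Optional

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def get_header(headers: Mapping[str, str], name: str) -> Optional[str]:
    """Look up a header by name, ignoring case as HTTP requires."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


def redirect_target(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return the redirect URL for a redirect status, whatever the header casing."""
    if status not in REDIRECT_STATUSES:
        return None
    return get_header(headers, "Location")
```

HTTP/2 transmits header names in lowercase, so a lookup for the literal key `Location` misses `location` unless the comparison is case-insensitive.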

Removed

  • Dead dependencies: requests (replaced by aiohttp in v2.0) and
    gitpython (never used in v2+).

Changed

  • ContentFilterConfig gains extractor, enable_special_cases, and
    strict_js_required fields. OutputConfig gains max_tokens_per_file,
    tokenizer, emit_chunks, and ndjson_filename.

v2.2.1 - Security Hardening

15 Apr 21:40

Security Fixes

  • ILIKE wildcard DoS: % and _ metacharacters in grep_docs MCP tool input are now escaped, preventing expensive full-table scans
  • CRLF header injection: --user-agent and --auth-header now reject CR, LF, and null bytes at both the Pydantic config layer and the HTTP client transport layer
  • Dead code removal — Removed IntegrationConfig (containing post_process_hook: Path, a command-injection sink if ever wired up), plus unused ARCHIVE_CREATED and GIT_COMMITTED event types
  • Proxy SSRF warning — Logs a warning when proxy mode bypasses the DNS-pinning resolver
  • .gitignore hardening — Added patterns for .env.*, *.pem, *.key, *.p12, *.pfx, *.crt
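
The first two fixes above follow well-known patterns, sketched here with hypothetical helpers (`escape_like`, `validate_header_value`); the actual docpull code is not shown in these notes and may differ.

```python
def escape_like(term: str, escape: str = "\\") -> str:
    """Escape LIKE/ILIKE metacharacters so user input matches literally."""
    return (
        term.replace(escape, escape + escape)  # escape the escape character first
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def validate_header_value(value: str) -> str:
    """Reject header values that could split or smuggle HTTP headers."""
    if any(ch in value for ch in ("\r", "\n", "\x00")):
        raise ValueError("header value contains CR, LF, or NUL")
    return value
```

The escaped term would then be wrapped as `%<term>%` and passed as a bound parameter to a query such as `... ILIKE %s`, relying on PostgreSQL's default backslash escape character.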

Breaking Changes

  • IntegrationConfig has been removed from the public API. The fields git_commit, git_message, archive, archive_format, and post_process_hook are no longer accepted in configuration. These were never implemented (dead code).
  • YAML config files containing an integration: block will now fail validation.

Testing

  • 12 new regression tests for CRLF injection and dead code removal
  • All 157 tests pass

Audit Report

Full attack surface map available at security/01-attack-surface.md.

v2.2.0: Resume, Auth, JSON/SQLite output

15 Dec 21:00

New Features

  • Resume capability (--resume): Continue interrupted fetches
  • URL preview mode (--preview-urls): See discovered URLs before fetching
  • Authentication support: --auth-bearer, --auth-basic, --auth-cookie, --auth-header
  • Env var expansion for auth tokens ($VAR and ${VAR} syntax)
  • Adaptive rate limiting (--adaptive-rate-limit): Auto-adjust based on 429 responses
  • JSON output (--format json): Stream documents to single JSON file
  • SQLite output (--format sqlite): Save to SQLite database
  • Skip reason tracking: Better progress feedback
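
The `$VAR` and `${VAR}` expansion mentioned above can be sketched as below. The helper `expand_env` is hypothetical, and raising on an unset variable is an assumption of this sketch, not documented docpull behaviour.

```python
import os
import re
from typing import Mapping

# Matches ${NAME} or $NAME.
_VAR = re.compile(r"\$(?:\{(\w+)\}|(\w+))")


def expand_env(value: str, env: Mapping[str, str] = os.environ) -> str:
    """Replace $VAR and ${VAR} references with values from the environment."""

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1) or match.group(2)
        if name not in env:
            # Assumption: fail loudly rather than send a literal "$VAR" as a token.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _VAR.sub(replace, value)
```

Keeping the token in an environment variable means the secret does not need to appear in shell history or in a committed config file.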

Breaking Changes

  • Requires Python 3.10+ (dropped 3.9 support)

Install

```bash
pip install docpull --upgrade
```

v2.0.0 - Complete Architecture Rewrite

29 Nov 23:26

Breaking Changes

  • New Python API: Fetcher class with async context manager and streaming events
  • src/ layout: PEP 517/518 compliant package structure
  • Pydantic models: Configuration via DocpullConfig instead of dictionaries
  • Removed v1.x modules: All deprecated code removed

New Features

  • Streaming Event API: AsyncIterator[FetchEvent] for real-time progress
  • Pipeline Architecture: Composable steps (Validate, Fetch, Convert, Dedup, Save)
  • CacheManager: O(1) lookups with batched writes and TTL eviction
  • StreamingDeduplicator: Real-time content deduplication via SHA-256
  • JavaScript Rendering: Browser-based fetching via Playwright
  • Profile Presets: RAG, MIRROR, QUICK for common use cases
  • Rate Limiting: Per-host concurrent request limits
  • Security: robots.txt respect and URL validation
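
As an illustration of hash-based streaming deduplication, here is a minimal sketch. The class `HashDeduplicator` is hypothetical; it borrows the `check_and_register` name and "first return value reports whether the content was new" convention from the v4.0.0 notes, but the real signature is not shown in these release notes.

```python
import hashlib
from typing import Dict, Optional, Tuple


class HashDeduplicator:
    """Track SHA-256 digests of page content as pages stream through."""

    def __init__(self) -> None:
        # Maps content digest -> URL of the first page that produced it.
        self._seen: Dict[str, str] = {}

    def check_and_register(self, url: str, content: str) -> Tuple[bool, Optional[str]]:
        """Return (is_new, original_url) and remember new content."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        original = self._seen.get(digest)
        if original is not None:
            return False, original
        self._seen[digest] = url
        return True, None
```

Checking and registering in a single call avoids the gap between a separate "is it a duplicate?" query and a later "record it" step.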

Quick Start

```bash
# CLI
docpull https://docs.example.com --profile rag
```

```python
# Python API
import asyncio

from docpull import Fetcher, DocpullConfig, ProfileName, EventType


async def main() -> None:
    async with Fetcher(DocpullConfig(url="https://docs.example.com", profile=ProfileName.RAG)) as f:
        # Stream events as they arrive and print fetch progress.
        async for event in f.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}")


asyncio.run(main())
```

Full Changelog

See CHANGELOG.md

v1.5.0

29 Nov 03:55

Release v1.5.0: Major Simplification and Modernization

Breaking Changes

  • Removed legacy profile system (stripe-specific profiles)
  • Removed deprecated requirements.txt (use pyproject.toml instead)

Changes

  • Simplified architecture: Consolidated utils into main package
  • Reorganized documentation: Moved CONTRIBUTING.md and SECURITY.md to .github/
  • Added GitHub issue templates configuration
  • Cleaner fetcher architecture: Removed stripe-specific fetcher
  • Updated tests for new structure

Removed Files

  • CHANGELOG.md - Deprecated in favor of GitHub releases
  • MANIFEST.in - No longer needed with modern packaging
  • TROUBLESHOOTING.md - Content moved to README
  • requirements.txt - Dependencies now in pyproject.toml
  • Legacy profile system files
  • Legacy utils directory

Installation

```bash
pip install docpull
```

Or install from source:

```bash
pip install git+https://github.com/raintree-technology/docpull.git
```

v1.3.0: Rich Metadata Extraction & Simplified Profiles

20 Nov 19:30

Highlights

docpull v1.3.0 adds rich structured metadata extraction for enhanced AI/RAG integration and simplifies the profile system by focusing on the excellent generic fetcher.

New Features

Rich Metadata Extraction

  • Structured Metadata: Extract Open Graph, JSON-LD, and microdata during fetch
  • Enhanced Frontmatter: Adds author, description, keywords, images, publish dates, and more
  • AI/RAG Ready: Richer context for embeddings and retrieval systems
  • Opt-in Feature: Enabled with --rich-metadata flag or rich_metadata: true in config
  • Powered by extruct: Uses the battle-tested extruct library for extraction

Simplified Profile System

  • Streamlined Architecture: Removed 7 built-in profiles (React, Next.js, D3, Plaid, Tailwind, Bun, Turborepo)
  • Kept Stripe: Retained as reference implementation for custom profiles
  • Generic Fetcher: Handles the documentation sites that previously used the removed profiles
  • Reduced Complexity: Less maintenance burden, simpler codebase
  • Easy Customization: Users can create custom profiles as needed

Technical Details

New Dependencies

  • Added extruct>=0.15.0 for structured metadata extraction

New Files

  • docpull/metadata_extractor.py - Rich metadata extraction module
  • tests/test_metadata_extractor.py - Comprehensive test suite (13 tests)

Updated Files

  • docpull/fetchers/base.py - Integrated rich metadata extraction
  • docpull/fetchers/generic_async.py - Added use_rich_metadata parameter
  • docpull/config.py - Added rich_metadata configuration option
  • docpull/sources_config.py - Added rich_metadata field
  • docpull/cli.py - Added --rich-metadata CLI flag
  • docpull/profiles/__init__.py - Simplified to single Stripe profile

Removed Files

  • 7 profile files (react.py, nextjs.py, d3.py, plaid.py, tailwind.py, bun.py, turborepo.py)
  • 7 fetcher implementation files (same names)

Version & Testing

  • Bumped version from 1.2.1 to 1.3.0
  • All 107 tests passing ✅
  • Zero mypy type errors ✅
  • All lint checks passing ✅

Example Usage

Rich Metadata Extraction

```bash
# Extract rich metadata during fetch
docpull https://docs.anthropic.com --rich-metadata

# Combine with other features
docpull https://stripe.com/docs --rich-metadata --create-index --language en

# Multi-source configuration
docpull --sources-file config.yaml
```

Enhanced Frontmatter Output

```yaml
---
url: https://docs.example.com/guide
fetched: 2025-11-20
title: Getting Started Guide
description: Learn the basics of our platform
author: John Doe
keywords: [tutorial, guide, api]
image: https://docs.example.com/og-image.png
type: article
site_name: Example Docs
published_time: 2024-01-15T10:00:00Z
modified_time: 2024-01-20T15:30:00Z
---
```
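
Frontmatter like this is built from page-supplied values (Open Graph, JSON-LD), which is why the v4.0.0 notes describe quoting, escaping, and stripping CR/LF. A minimal sketch of that idea, using hypothetical helpers (`yaml_scalar`, `render_frontmatter`) and relying on JSON string syntax being valid as a YAML double-quoted scalar; it is not docpull's serializer.

```python
import json
import re
from typing import Dict, List, Union

FieldValue = Union[str, List[str]]


def yaml_scalar(value: str) -> str:
    """Collapse CR/LF to a space and emit a double-quoted, escaped scalar."""
    cleaned = re.sub(r"[\r\n]+", " ", value)
    return json.dumps(cleaned, ensure_ascii=False)


def render_frontmatter(fields: Dict[str, FieldValue]) -> str:
    """Render a frontmatter block in which every value is quoted."""
    lines = ["---"]
    for key, value in fields.items():
        if isinstance(value, list):
            items = ", ".join(yaml_scalar(str(item)) for item in value)
            lines.append(f"{key}: [{items}]")
        else:
            lines.append(f"{key}: {yaml_scalar(str(value))}")
    lines.append("---")
    return "\n".join(lines)
```

A hostile title such as `"Guide\nadmin: true"` ends up as the single quoted value `"Guide admin: true"` instead of a new top-level key.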

Multi-Source Configuration with Rich Metadata

```yaml
sources:
  anthropic:
    url: https://docs.anthropic.com
    rich_metadata: true  # Enable rich metadata extraction
    language: en
    create_index: true

  stripe:
    url: https://stripe.com/docs
    rich_metadata: true
    max_file_size: 200kb
```

Backward Compatibility

All existing workflows continue to work unchanged. Rich metadata extraction is opt-in, and the generic fetcher handles all documentation sites that previously used specific profiles.

Installation

```bash
pip install --upgrade docpull
```


Stats: 30 files changed, +765/-867 lines

v1.2.1 - Critical Bug Fixes & Type Checking

17 Nov 01:19

🐛 Bug Fixes

This patch release fixes critical issues found in v1.2.0:

Type Checking & Code Quality

  • Fixed all 60 mypy type errors - achieved zero type errors ✅
  • Added proper type annotations throughout the codebase
  • Improved type safety in processors, formatters, and orchestrator modules
  • All lint checks now passing (mypy, ruff, black)

Test Fixes

  • Fixed test failure in test_orchestrator.py (archive_format parameter)
  • Fixed 9 SourcesConfiguration test failures
  • All 101 tests now passing ✅

Code Cleanup

  • Removed deprecated files (EMOJI_CLEANUP.md)
  • Fixed Black formatting issues
  • Added specific error codes to type: ignore comments

📝 Technical Details

Files Updated

  • docpull/processors/content_filter.py: More specific return types
  • docpull/formatters/: Proper type annotations for nested functions
  • docpull/orchestrator.py: Correct parameter naming and type hints
  • docpull/cli.py: Better handling of Optional[str] types
  • docpull/processors/language_filter.py: Fixed config type assignments
  • docpull/processors/deduplicator.py: Fixed config type assignments

CI/CD

This release ensures the codebase passes all CI checks and maintains high code quality standards.

📦 Installation

```bash
pip install --upgrade docpull
```
