Releases: raintree-technology/docpull
docpull 4.3.1
DocPull 4.3.1 tightens the public-web to agent-ready Markdown positioning across PyPI, GitHub, README, and the website.
Highlights:
- Updated package, repository, README, website, Open Graph, and structured metadata around public static/server-rendered web pages, Python, CLI/SDK, MCP, RAG, and agent context workflows.
- Added launch copy, comparison guidance, and marketing visibility research for developer, Python, MCP, and RAG discovery channels.
- Added launch assets: square logo variants, desktop/mobile screenshots, full-page capture, and a short website demo video.
Release artifacts were published to PyPI via trusted publishing: https://pypi.org/project/docpull/4.3.1/
docpull 4.0.1
A release-readiness patch that tightens the public product boundary. No runtime
API changes and no migration needed.
Changed
- Make the Python `docpull mcp` server the only documented supported MCP path for agents, plugins, Claude Code, Cursor, and Claude Desktop.
- Mark the root TypeScript/Bun `mcp/` tree as an internal lab, make its package metadata private, and remove end-user install instructions for that path.
- Replace stale YAML example files with current CLI recipes so docs no longer advertise removed options such as `--sources-file`, TOON output, `keep_variant`, `language`, or `create_index`.
- Update website examples and performance copy to match the current CLI and benchmark results.
v4.0.0 — security audit + breaking cleanup
A security + cleanup major release. A multi-agent security audit closed a high-severity SSRF and nine further findings; it ships alongside a tech-debt cleanup that removes several unused public APIs — those removals are what make this a major version.
⚠️ Breaking changes (removed unused public API)
Every removed symbol had zero callers in the library and test suite.
- `CacheManager`: removed `has_changed`, `is_fetched`, `is_failed`, `get_failed_urls`, `get_cache_stats`, `clear_state`, `has_resume_data`. Incremental fetch and resume are unaffected (they use the retained `update_cache`, `mark_fetched`, `mark_failed`, `get_fetched_urls`, `get_pending_urls`, `save_`/`load_`/`clear_discovered_urls`, `evict_expired`).
- `StreamingDeduplicator.is_duplicate` → use `check_and_register` (its first return value reports whether the content was new).
- `DocpullConfig.from_yaml_file` → use `DocpullConfig.from_yaml(path.read_text())`.
🔒 Security
- DNS-rebinding TOCTOU in the URL validator (high). `resolve_allowed_addresses()` resolved the hostname a second time and dialed that unscreened result, so a TTL-0 attacker could pass validation with a public IP and have the socket connect to an internal one (e.g. cloud metadata). It now resolves once and returns exactly the addresses it screened.
- Wider SSRF coverage. Blocks CGNAT shared space (`100.64.0.0/10`) and IPv4-mapped IPv6, and strips the trailing DNS root dot (`localhost.`) before localhost/suffix checks, in both the Python validator and the TypeScript MCP source gate. The MCP gate also denies wildcard rebinding hosts (`*.nip.io`, `*.sslip.io`, `*.xip.io`).
- robots.txt memory-exhaustion DoS. Body read capped at 512 KB.
- YAML frontmatter injection. Tag/keyword frontmatter (from page JSON-LD / OpenGraph) is quoted, escaped, and stripped of CR/LF so a hostile page can't inject top-level keys.
- Conditional-request header injection. Cached `ETag`/`Last-Modified` are stripped of CR/LF/NUL before reuse as `If-None-Match`/`If-Modified-Since`.
- Supply chain. Pinned release tooling (`pip`/`build`/`twine`); dropped six unused MCP dependencies; bumped `aiohttp` to `>=3.14.0` (CVE-2026-34993, CVE-2026-47265).
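The address screening described in the SSRF items can be sketched with the standard-library `ipaddress` module. This is an illustrative sketch, not docpull's validator: the function names `normalize_host`, `is_public_address`, and `screen_addresses` are hypothetical, and the policy of rejecting a host when any record is internal is an assumption.

```python
import ipaddress

# CGNAT shared space is not flagged by is_private, so it is checked explicitly.
CGNAT = ipaddress.ip_network("100.64.0.0/10")


def normalize_host(host: str) -> str:
    """Lowercase and strip the trailing DNS root dot ("localhost." -> "localhost")."""
    return host.rstrip(".").lower()


def is_public_address(raw: str) -> bool:
    """Return True only for addresses that look safe to dial."""
    addr = ipaddress.ip_address(raw)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so the IPv4 rules apply.
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    if addr.version == 4 and addr in CGNAT:
        return False
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


def screen_addresses(resolved: list[str]) -> list[str]:
    """Screen one resolution result; the caller dials only what is returned."""
    allowed = [a for a in resolved if is_public_address(a)]
    if len(allowed) != len(resolved):
        # Assumed policy: reject the host if any record points somewhere internal.
        raise ValueError("hostname resolves to a blocked address")
    return allowed
```

The point of returning the screened list is that the connection code never performs a second lookup, which is what closes the rebinding window described above.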
🧹 Changed / Fixed
- Internal cleanup with no behaviour change: removed the unused `concurrency` package, `logging_config`, and dead private methods; de-duplicated the discovery HTML-fetch helper and the HTTP GET/HEAD redirect re-validation path.
- MCP: pgvector embedding inserts are batched under PostgreSQL's 32767 bind-parameter ceiling, so libraries with thousands of chunks index in one transaction instead of failing.
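The batching arithmetic behind the pgvector fix can be illustrated in a few lines. The sketch below is written in Python for consistency with the rest of these notes and is not the MCP server's code; `batch_rows` is a hypothetical helper, and the 32767 ceiling is taken from the note above.

```python
from typing import Iterator, Sequence

MAX_BIND_PARAMS = 32767  # ceiling cited in the release note above


def batch_rows(
    rows: Sequence[tuple], max_params: int = MAX_BIND_PARAMS
) -> Iterator[Sequence[tuple]]:
    """Yield slices of rows whose total placeholder count stays within max_params."""
    if not rows:
        return
    params_per_row = len(rows[0])
    if params_per_row == 0 or params_per_row > max_params:
        raise ValueError("row width must be between 1 and max_params")
    rows_per_batch = max_params // params_per_row
    for start in range(0, len(rows), rows_per_batch):
        yield rows[start:start + rows_per_batch]
```

Each yielded slice can be sent as one multi-row `INSERT` inside a single transaction, so a large library is indexed atomically without any statement exceeding the limit.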
Install
```bash
pip install --upgrade docpull # 4.0.0
```
Full changelog: docs/CHANGELOG.md · v3.0.2...v4.0.0
v2.3.0 — Framework-aware extraction, LLM chunking, Python MCP, agent fast path
[2.3.0] - 2026-04-24
Sharpened positioning around the agent / RAG use case, plus real bug fixes
surfaced by validation against Next.js, Supabase, Anthropic, FastAPI, Tailwind,
and Drizzle documentation sites.
Added
- Framework-specific fast extractors: Next.js `__NEXT_DATA__`, Mintlify, OpenAPI / Swagger JSON rendered directly to Markdown, plus source-type tagging for Docusaurus and Sphinx. Runs before the generic extractor.
- Next.js App Router detection via `self.__next_f.push`, router state tree, and `/_next/static/` path markers; no longer relies on `__NEXT_DATA__`, which is absent on modern App Router pages.
- SPA detection (pre- and post-conversion): pages that produce only `Loading...` shells are skipped with a clear reason. `--strict-js-required` turns this into a hard error for agents that want to route elsewhere.
- Trafilatura extractor as an optional alternative content extractor (`pip install docpull[trafilatura]`, then `--extractor trafilatura`).
- Token-aware Markdown chunking: `--max-tokens-per-file N` splits pages on heading then paragraph boundaries. Exact counts with `tiktoken`, character-estimate fallback otherwise.
- NDJSON output format (`--format ndjson`) for streaming one record per page or per chunk. `--stream` writes to stdout for live pipeline consumption.
- `llm` profile: bundles NDJSON + 4k-token chunks + rich metadata + dedup.
- `--single` / `fetch_one(url)`: fast single-page path with no discovery, designed for AI-agent tool loops.
- Python MCP server (`docpull mcp`): exposes `fetch_url`, `ensure_docs`, `list_sources`, `list_indexed`, and `grep_docs` tools over stdio. Install via `pip install docpull[mcp]`.
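To make the "heading then paragraph" splitting order concrete, here is a minimal sketch of token-aware chunking using only the character-estimate fallback. It is not docpull's chunker: the function names are hypothetical, and the ratio of four characters per token is an assumed rule of thumb.

```python
import re


def estimate_tokens(text: str) -> int:
    """Character-estimate fallback; 4 chars per token is an assumed ratio."""
    return max(1, len(text) // 4)


def split_by_tokens(markdown: str, max_tokens: int) -> list[str]:
    """Split on headings first, then on paragraphs for oversized sections."""
    # Split before each ATX heading line, keeping the heading with its body.
    sections = [s for s in re.split(r"(?m)^(?=#{1,6} )", markdown) if s.strip()]
    chunks: list[str] = []
    for section in sections:
        if estimate_tokens(section) <= max_tokens:
            chunks.append(section.strip())
            continue
        current = ""
        for para in re.split(r"\n\s*\n", section):
            candidate = f"{current}\n\n{para}" if current else para
            if current and estimate_tokens(candidate) > max_tokens:
                # Adding this paragraph would overflow, so close the chunk.
                chunks.append(current.strip())
                current = para
            else:
                current = candidate
        if current.strip():
            chunks.append(current.strip())
    return chunks
```

A single paragraph larger than the budget is left as one oversized chunk in this sketch, and heading-like lines inside code fences are not treated specially; a production chunker would need to handle both.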
Fixed
- robots.txt redirect handling: Cloudflare/HTTP-2 responses send lowercase header names, but the `Location` lookup was case-sensitive, causing 301/308 redirects to be treated as errors. This blocked `docs.anthropic.com` and any other site whose robots.txt was redirected.
- html2text link escape artifacts: cleaned up mangled links of the form `[text](prefix/<https:/real.url>)` in the post-processing pass; handles both text and image-only (empty-text) links.
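The robots.txt redirect bug comes down to header names being case-insensitive. A minimal sketch of a lookup that tolerates any casing follows; `get_header` and `redirect_target` are hypothetical helpers operating on a plain mapping, not docpull's internals.

```python
from typing import Mapping, Optional


def get_header(headers: Mapping[str, str], name: str) -> Optional[str]:
    """Look up a header by name, ignoring case as HTTP requires."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


def redirect_target(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return the redirect URL for redirect responses that carry one, else None."""
    if status in (301, 302, 303, 307, 308):
        return get_header(headers, "Location")
    return None
```

With this lookup, a response that sends `location: https://example.com/robots.txt` in lowercase is followed like any other redirect.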
Removed
- Dead dependencies: `requests` (replaced by `aiohttp` in v2.0) and `gitpython` (never used in v2+).
Changed
- `ContentFilterConfig` gains `extractor`, `enable_special_cases`, and `strict_js_required` fields.
- `OutputConfig` gains `max_tokens_per_file`, `tokenizer`, `emit_chunks`, and `ndjson_filename`.
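NDJSON, the output format added in this release, is one JSON object per line. The sketch below shows a generic writer and is not docpull's implementation; the record fields (`url`, `chunk_index`, `content`) are illustrative assumptions.

```python
import io
import json
from typing import IO, Iterable


def write_ndjson(records: Iterable[dict], out: IO[str]) -> int:
    """Write one compact JSON object per line; return the number of records."""
    count = 0
    for record in records:
        # json.dumps escapes embedded newlines, so each record stays on one line.
        out.write(json.dumps(record) + "\n")
        count += 1
    return count


# Usage: two hypothetical chunk records written to an in-memory stream.
buffer = io.StringIO()
written = write_ndjson(
    [
        {"url": "https://docs.example.com/guide", "chunk_index": 0, "content": "Intro\ntext"},
        {"url": "https://docs.example.com/guide", "chunk_index": 1, "content": "More"},
    ],
    buffer,
)
```

Because every line parses on its own, a consumer can process records as they arrive instead of waiting for a closing bracket.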
v2.2.1 - Security Hardening
Security Fixes
- ILIKE wildcard DoS: `%` and `_` metacharacters in `grep_docs` MCP tool input are now escaped, preventing expensive full-table scans.
- CRLF header injection: `--user-agent` and `--auth-header` now reject CR, LF, and null bytes at both the Pydantic config layer and the HTTP client transport layer.
- Dead code removal: removed `IntegrationConfig` (containing `post_process_hook: Path`, a command-injection sink if ever wired up), plus unused `ARCHIVE_CREATED` and `GIT_COMMITTED` event types.
- Proxy SSRF warning: logs a warning when proxy mode bypasses the DNS-pinning resolver.
- `.gitignore` hardening: added patterns for `.env.*`, `*.pem`, `*.key`, `*.p12`, `*.pfx`, `*.crt`.
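The first two fixes are both input-sanitization steps, and a rough sketch may help show what they involve. The helpers below are hypothetical and are not docpull's code; the backslash escape character matches PostgreSQL's default for `LIKE`/`ILIKE`.

```python
def escape_like(term: str, escape: str = "\\") -> str:
    """Escape LIKE/ILIKE wildcards so user input matches literally."""
    return (
        term.replace(escape, escape + escape)  # escape the escape char first
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def reject_control_chars(value: str) -> str:
    """Raise if a header value contains CR, LF, or NUL; otherwise return it."""
    if any(ch in value for ch in ("\r", "\n", "\x00")):
        raise ValueError("header value contains CR, LF, or NUL")
    return value
```

The escaped term would still be passed as a bind parameter, for example as `f"%{escape_like(user_input)}%"`, so escaping complements parameterized queries and does not replace them.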
Breaking Changes
- `IntegrationConfig` has been removed from the public API. The fields `git_commit`, `git_message`, `archive`, `archive_format`, and `post_process_hook` are no longer accepted in configuration. These were never implemented (dead code).
- YAML config files containing an `integration:` block will now fail validation.
Testing
- 12 new regression tests for CRLF injection and dead code removal
- All 157 tests pass
Audit Report
Full attack surface map available at security/01-attack-surface.md.
v2.2.0: Resume, Auth, JSON/SQLite output
New Features
- Resume capability (`--resume`): Continue interrupted fetches
- URL preview mode (`--preview-urls`): See discovered URLs before fetching
- Authentication support: `--auth-bearer`, `--auth-basic`, `--auth-cookie`, `--auth-header`
- Env var expansion for auth tokens (`$VAR` and `${VAR}` syntax)
- Adaptive rate limiting (`--adaptive-rate-limit`): Auto-adjust based on 429 responses
- JSON output (`--format json`): Stream documents to single JSON file
- SQLite output (`--format sqlite`): Save to SQLite database
- Skip reason tracking: Better progress feedback
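The `$VAR` / `${VAR}` expansion for auth tokens can be approximated as follows. This is a sketch under stated assumptions, not docpull's implementation: `expand_env` is a hypothetical helper, and raising on an unset variable is an assumed policy.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} or $NAME, where NAME is letters, digits, or underscores.
_VAR = re.compile(r"\$(?:\{(\w+)\}|(\w+))")


def expand_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace $VAR and ${VAR} references with values from env (default: os.environ)."""
    source = os.environ if env is None else env

    def substitute(match: re.Match) -> str:
        name = match.group(1) or match.group(2)
        if name not in source:
            # Assumed policy: fail loudly rather than send a literal "$VAR".
            raise KeyError(f"environment variable {name} is not set")
        return source[name]

    return _VAR.sub(substitute, value)
```

Keeping the token in the environment means it does not appear in shell history or in a committed config file.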
Breaking Changes
- Requires Python 3.10+ (dropped 3.9 support)
Install
pip install docpull --upgrade
v2.0.0 - Complete Architecture Rewrite
Breaking Changes
- New Python API: `Fetcher` class with async context manager and streaming events
- src/ layout: PEP 517/518 compliant package structure
- Pydantic models: Configuration via `DocpullConfig` instead of dictionaries
- Removed v1.x modules: All deprecated code removed
New Features
- Streaming Event API: `AsyncIterator[FetchEvent]` for real-time progress
- Pipeline Architecture: Composable steps (Validate, Fetch, Convert, Dedup, Save)
- CacheManager: O(1) lookups with batched writes and TTL eviction
- StreamingDeduplicator: Real-time content deduplication via SHA-256
- JavaScript Rendering: Browser-based fetching via Playwright
- Profile Presets: RAG, MIRROR, QUICK for common use cases
- Rate Limiting: Per-host concurrent request limits
- Security: robots.txt respect and URL validation
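Hash-based streaming deduplication, as listed above, can be illustrated with a small stand-in. The class `SeenContent` and its `register` method are hypothetical and do not reflect the `StreamingDeduplicator` API; only the use of SHA-256 is taken from the notes.

```python
import hashlib


class SeenContent:
    """Hypothetical stand-in: remembers SHA-256 digests of content seen so far."""

    def __init__(self) -> None:
        self._first_url: dict[str, str] = {}  # digest -> first URL with that content

    def register(self, url: str, content: str) -> tuple[bool, str]:
        """Return (is_new, first_url) and record the content if it is new."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in self._first_url:
            return False, self._first_url[digest]
        self._first_url[digest] = url
        return True, url
```

Only fixed-size digests are stored, so memory grows with the number of distinct pages and not with their size, which is what lets the check run while pages are still streaming in.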
Quick Start
```bash
# CLI
docpull https://docs.example.com --profile rag
```

```python
# Python API
import asyncio

from docpull import Fetcher, DocpullConfig, ProfileName, EventType


async def main():
    async with Fetcher(DocpullConfig(url="https://docs.example.com", profile=ProfileName.RAG)) as f:
        async for event in f.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}")

asyncio.run(main())
```
Full Changelog
See CHANGELOG.md
v1.5.0
Release v1.5.0: Major Simplification and Modernization
Breaking Changes
- Removed legacy profile system (stripe-specific profiles)
- Removed deprecated `requirements.txt` (use `pyproject.toml` instead)
Changes
- Simplified architecture: Consolidated utils into main package
- Reorganized documentation: Moved CONTRIBUTING.md and SECURITY.md to `.github/`
- Added GitHub issue templates configuration
- Cleaner fetcher architecture: Removed stripe-specific fetcher
- Updated tests for new structure
Removed Files
- `CHANGELOG.md` - Deprecated in favor of GitHub releases
- `MANIFEST.in` - No longer needed with modern packaging
- `TROUBLESHOOTING.md` - Content moved to README
- `requirements.txt` - Dependencies now in pyproject.toml
- Legacy profile system files
- Legacy utils directory
Installation
pip install docpull
Or install from source:
pip install git+https://github.com/raintree-technology/docpull.git
v1.3.0: Rich Metadata Extraction & Simplified Profiles
Highlights
docpull v1.3.0 adds rich structured metadata extraction for enhanced AI/RAG integration and simplifies the profile system by focusing on the excellent generic fetcher.
New Features
Rich Metadata Extraction
- Structured Metadata: Extract Open Graph, JSON-LD, and microdata during fetch
- Enhanced Frontmatter: Adds author, description, keywords, images, publish dates, and more
- AI/RAG Ready: Richer context for embeddings and retrieval systems
- Opt-in Feature: Enabled with `--rich-metadata` flag or `rich_metadata: true` in config
- Powered by extruct: Uses the battle-tested extruct library for extraction
Simplified Profile System
- Streamlined Architecture: Removed 7 built-in profiles (React, Next.js, D3, Plaid, Tailwind, Bun, Turborepo)
- Kept Stripe: Retained as reference implementation for custom profiles
- Generic Fetcher Excellence: Works excellently for all documentation sites
- Reduced Complexity: Less maintenance burden, simpler codebase
- Easy Customization: Users can create custom profiles as needed
Technical Details
New Dependencies
- Added `extruct>=0.15.0` for structured metadata extraction
New Files
- `docpull/metadata_extractor.py` - Rich metadata extraction module
- `tests/test_metadata_extractor.py` - Comprehensive test suite (13 tests)
Updated Files
- `docpull/fetchers/base.py` - Integrated rich metadata extraction
- `docpull/fetchers/generic_async.py` - Added `use_rich_metadata` parameter
- `docpull/config.py` - Added `rich_metadata` configuration option
- `docpull/sources_config.py` - Added `rich_metadata` field
- `docpull/cli.py` - Added `--rich-metadata` CLI flag
- `docpull/profiles/__init__.py` - Simplified to single Stripe profile
Removed Files
- 7 profile files (react.py, nextjs.py, d3.py, plaid.py, tailwind.py, bun.py, turborepo.py)
- 7 fetcher implementation files (same names)
Version & Testing
- Bumped version from `1.2.1` to `1.3.0`
- All 107 tests passing ✅
- Zero mypy type errors ✅
- All lint checks passing ✅
Example Usage
Rich Metadata Extraction
# Extract rich metadata during fetch
docpull https://docs.anthropic.com --rich-metadata
# Combine with other features
docpull https://stripe.com/docs --rich-metadata --create-index --language en
# Multi-source configuration
docpull --sources-file config.yaml
Enhanced Frontmatter Output
---
url: https://docs.example.com/guide
fetched: 2025-11-20
title: Getting Started Guide
description: Learn the basics of our platform
author: John Doe
keywords: [tutorial, guide, api]
image: https://docs.example.com/og-image.png
type: article
site_name: Example Docs
published_time: 2024-01-15T10:00:00Z
modified_time: 2024-01-20T15:30:00Z
---
Multi-Source Configuration with Rich Metadata
sources:
  anthropic:
    url: https://docs.anthropic.com
    rich_metadata: true  # Enable rich metadata extraction
    language: en
    create_index: true
  stripe:
    url: https://stripe.com/docs
    rich_metadata: true
    max_file_size: 200kb
Backward Compatibility
All existing workflows continue to work unchanged. Rich metadata extraction is opt-in, and the generic fetcher handles all documentation sites that previously used specific profiles.
Installation
pip install --upgrade docpull
Links
Stats: 30 files changed, +765/-867 lines
v1.2.1 - Critical Bug Fixes & Type Checking
🐛 Bug Fixes
This patch release fixes critical issues found in v1.2.0:
Type Checking & Code Quality
- Fixed all 60 mypy type errors - achieved zero type errors ✅
- Added proper type annotations throughout the codebase
- Improved type safety in processors, formatters, and orchestrator modules
- All lint checks now passing (mypy, ruff, black)
Test Fixes
- Fixed test failure in `test_orchestrator.py` (archive_format parameter)
- Fixed 9 SourcesConfiguration test failures
- All 101 tests now passing ✅
Code Cleanup
- Removed deprecated files (EMOJI_CLEANUP.md)
- Fixed Black formatting issues
- Added specific error codes to type: ignore comments
📝 Technical Details
Files Updated
- `docpull/processors/content_filter.py`: More specific return types
- `docpull/formatters/`: Proper type annotations for nested functions
- `docpull/orchestrator.py`: Correct parameter naming and type hints
- `docpull/cli.py`: Better handling of Optional[str] types
- `docpull/processors/language_filter.py`: Fixed config type assignments
- `docpull/processors/deduplicator.py`: Fixed config type assignments
CI/CD
This release ensures the codebase passes all CI checks and maintains high code quality standards.
📦 Installation
pip install --upgrade docpull