15 Jun 17:58

admin-raintree

fceb86a

docpull 4.3.1 Latest

DocPull 4.3.1 tightens the project's positioning (public web pages to agent-ready Markdown) across PyPI, GitHub, the README, and the website.

Highlights:

Updated package, repository, README, website, Open Graph, and structured metadata around public static/server-rendered web pages, Python, CLI/SDK, MCP, RAG, and agent context workflows.
Added launch copy, comparison guidance, and marketing visibility research for developer, Python, MCP, and RAG discovery channels.
Added launch assets: square logo variants, desktop/mobile screenshots, full-page capture, and a short website demo video.

Release artifacts were published to PyPI via trusted publishing: https://pypi.org/project/docpull/4.3.1/

06 Jun 20:48

admin-raintree

v4.0.1

1f75b01

docpull 4.0.1

A release-readiness patch that tightens the public product boundary. No runtime
API changes and no migration needed.

Changed

Make the Python docpull mcp server the only documented, supported MCP path for agents, plugins, Claude Code, Cursor, and Claude Desktop.
Mark the root TypeScript/Bun mcp/ tree as an internal lab, make its package metadata private, and remove end-user install instructions for that path.
Replace stale YAML example files with current CLI recipes so docs no longer advertise removed options such as --sources-file, TOON output, keep_variant, language, or create_index.
Update website examples and performance copy to match the current CLI and benchmark results.

04 Jun 17:55

admin-raintree

v4.0.0

a3a288e

v4.0.0 — security audit + breaking cleanup

A security + cleanup major release. A multi-agent security audit closed a high-severity SSRF and nine further findings; it ships alongside a tech-debt cleanup that removes several unused public APIs — those removals are what make this a major version.

⚠️ Breaking changes (removed unused public API)

Every removed symbol had zero callers in the library and test suite.

CacheManager — removed has_changed, is_fetched, is_failed, get_failed_urls, get_cache_stats, clear_state, has_resume_data. Incremental fetch and resume are unaffected (they use the retained update_cache, mark_fetched, mark_failed, get_fetched_urls, get_pending_urls, save_/load_/clear_discovered_urls, evict_expired).
StreamingDeduplicator.is_duplicate → use check_and_register (its first return value reports whether the content was new).
DocpullConfig.from_yaml_file → use DocpullConfig.from_yaml(path.read_text()).

🔒 Security

DNS-rebinding TOCTOU in the URL validator (high). resolve_allowed_addresses() resolved the hostname a second time and dialed that unscreened result, so a TTL-0 attacker could pass validation with a public IP and have the socket connect to an internal one (e.g. cloud metadata). It now resolves once and returns exactly the addresses it screened.
Wider SSRF coverage. Blocks CGNAT shared space (100.64.0.0/10) and IPv4-mapped IPv6, and strips the trailing DNS root dot (localhost.) before localhost/suffix checks — in both the Python validator and the TypeScript MCP source gate. The MCP gate also denies wildcard rebinding hosts (*.nip.io, *.sslip.io, *.xip.io).
robots.txt memory-exhaustion DoS. Body read capped at 512 KB.
YAML frontmatter injection. Tag/keyword frontmatter (from page JSON-LD / OpenGraph) is quoted, escaped, and stripped of CR/LF so a hostile page can't inject top-level keys.
Conditional-request header injection. Cached ETag / Last-Modified are stripped of CR/LF/NUL before reuse as If-None-Match / If-Modified-Since.
Supply chain. Pinned release tooling (pip/build/twine); dropped six unused MCP dependencies; bumped aiohttp to >=3.14.0 (CVE-2026-34993, CVE-2026-47265).
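
The wider SSRF coverage described above can be illustrated with the standard library's ipaddress module. This is a minimal sketch, not docpull's validator: the function names are hypothetical, and it only screens IP literals and the localhost name. It shows why CGNAT space needs an explicit rule (the stdlib does not classify 100.64.0.0/10 as private) and why IPv4-mapped IPv6 addresses should be unwrapped before screening.

```python
import ipaddress

# CGNAT shared address space named in the release note. The stdlib flags
# used below do not treat it as private, so it gets an explicit check.
CGNAT = ipaddress.ip_network("100.64.0.0/10")


def normalize_host(host: str) -> str:
    """Lower-case the host and drop the trailing DNS root dot."""
    return host.rstrip(".").lower()


def is_blocked_host(host: str) -> bool:
    """Hypothetical helper: reject localhost and *.localhost names."""
    host = normalize_host(host)
    return host == "localhost" or host.endswith(".localhost")


def is_blocked_address(raw: str) -> bool:
    """Hypothetical helper: True if an IP literal is not a public address."""
    addr = ipaddress.ip_address(raw)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so the IPv4 rules apply.
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    if addr.version == 4 and addr in CGNAT:
        return True
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local  # includes 169.254.169.254 (cloud metadata)
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )
```

To avoid the rebinding race described in the first item, a caller would screen the addresses returned by a single DNS lookup and connect to exactly those addresses rather than resolving the hostname again.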

🧹 Changed / Fixed

Internal cleanup with no behaviour change: removed the unused concurrency package, logging_config, and dead private methods; de-duplicated the discovery HTML-fetch helper and the HTTP GET/HEAD redirect re-validation path.
MCP: pgvector embedding inserts are batched under PostgreSQL's 32767 bind-parameter ceiling, so libraries with thousands of chunks index in one transaction instead of failing.
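
The batching arithmetic behind the pgvector fix can be sketched as follows. This is an illustration under stated assumptions, not the MCP server's code: batch_rows is a hypothetical helper, each row is assumed to use one bind parameter per column, and the 32767 ceiling is taken from the note above.

```python
from typing import Iterator, Sequence

MAX_BIND_PARAMS = 32767  # ceiling quoted in the release note


def batch_rows(
    rows: Sequence[tuple], max_params: int = MAX_BIND_PARAMS
) -> Iterator[Sequence[tuple]]:
    """Yield slices of rows whose total placeholder count fits max_params."""
    if not rows:
        return
    columns = len(rows[0])
    if columns == 0 or columns > max_params:
        raise ValueError("row width must be between 1 and max_params")
    # Each row consumes `columns` placeholders in a multi-row INSERT.
    rows_per_batch = max_params // columns
    for start in range(0, len(rows), rows_per_batch):
        yield rows[start:start + rows_per_batch]
```

Each batch would become one multi-row INSERT, and all batches can run inside a single transaction so the library is indexed atomically.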

Install

```bash
pip install --upgrade docpull # 4.0.0
```

Full changelog: docs/CHANGELOG.md · v3.0.2...v4.0.0

24 Apr 21:47

admin-raintree

v2.3.0

47ee6be

v2.3.0 — Framework-aware extraction, LLM chunking, Python MCP, agent fast path

[2.3.0] - 2026-04-24

Sharpened positioning around the agent / RAG use case, plus real bug fixes
surfaced by validation against Next.js, Supabase, Anthropic, FastAPI, Tailwind,
and Drizzle documentation sites.

Added

Framework-specific fast extractors: Next.js __NEXT_DATA__, Mintlify, OpenAPI / Swagger JSON rendered directly to Markdown, plus source-type tagging for Docusaurus and Sphinx. These run before the generic extractor.
Next.js App Router detection via self.__next_f.push, router state tree, and /_next/static/ path markers. Detection no longer relies on __NEXT_DATA__, which is absent on modern App Router pages.
SPA detection (pre- and post-conversion): pages that produce only Loading... shells are skipped with a clear reason. --strict-js-required turns this into a hard error for agents that want to route elsewhere.
Trafilatura extractor as an optional alternative content extractor (pip install docpull[trafilatura], then --extractor trafilatura).
Token-aware Markdown chunking: --max-tokens-per-file N splits pages on heading boundaries first, then paragraph boundaries. Counts are exact when tiktoken is available and fall back to a character-based estimate otherwise.
NDJSON output format (--format ndjson) for streaming one record per page or per chunk. --stream writes to stdout for live pipeline consumption.
llm profile: bundles NDJSON + 4k-token chunks + rich metadata + dedup.
--single / fetch_one(url): fast single-page path with no discovery, designed for AI-agent tool loops.
Python MCP server (docpull mcp): exposes fetch_url, ensure_docs, list_sources, list_indexed, and grep_docs tools over stdio. Install via pip install docpull[mcp].
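
The heading-then-paragraph splitting described for --max-tokens-per-file can be sketched as below. This is an assumption-laden illustration rather than docpull's chunker: split_markdown and estimate_tokens are hypothetical names, and the four-characters-per-token ratio is a common rough estimate standing in for the character-based fallback.

```python
import re

CHARS_PER_TOKEN = 4  # rough estimate; an assumption, not docpull's constant


def estimate_tokens(text: str) -> int:
    """Character-based token estimate used when no tokenizer is available."""
    return len(text) // CHARS_PER_TOKEN


def split_markdown(text: str, max_tokens: int) -> list[str]:
    """Split on headings first, then on paragraphs for oversized sections."""
    # Split before each ATX heading so a heading stays with its section.
    sections = [s.strip() for s in re.split(r"(?m)^(?=#{1,6} )", text) if s.strip()]
    chunks: list[str] = []
    for section in sections:
        if estimate_tokens(section) <= max_tokens:
            chunks.append(section)
            continue
        current = ""
        for para in re.split(r"\n\s*\n", section):
            candidate = f"{current}\n\n{para}" if current else para
            if current and estimate_tokens(candidate) > max_tokens:
                chunks.append(current)  # close the chunk at a paragraph break
                current = para
            else:
                current = candidate
        if current:
            chunks.append(current)
    # A single paragraph longer than max_tokens is kept whole in this sketch.
    return chunks
```

Swapping estimate_tokens for an exact tokenizer count would leave the splitting logic unchanged.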

Fixed

robots.txt redirect handling: Cloudflare/HTTP-2 responses send lowercase header names, but the Location lookup was case-sensitive, causing 301/308 redirects to be treated as errors. This blocked docs.anthropic.com and any other site whose robots.txt was redirected.
html2text link escape artifacts: cleaned up mangled links of the form [text](prefix/<https:/real.url>) in the post-processing pass; handles both text and image-only (empty-text) links.
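
The robots.txt redirect bug comes down to comparing header names case-sensitively. A minimal sketch of a case-insensitive lookup follows; get_header and redirect_target are hypothetical helpers operating on a plain mapping, not docpull's HTTP client.

```python
from typing import Mapping, Optional

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def get_header(headers: Mapping[str, str], name: str) -> Optional[str]:
    """Look up a header by name, ignoring case as HTTP requires."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


def redirect_target(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return the redirect URL for a redirect status, else None."""
    if status not in REDIRECT_STATUSES:
        return None
    return get_header(headers, "Location")
```

With this lookup, a response carrying a lowercase location header resolves the same way as one carrying Location.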

Removed

Dead dependencies: requests (replaced by aiohttp in v2.0) and gitpython (never used in v2+).

Changed

ContentFilterConfig gains extractor, enable_special_cases, and strict_js_required fields. OutputConfig gains max_tokens_per_file, tokenizer, emit_chunks, and ndjson_filename.

15 Apr 21:40

zacharyr0th

v2.2.1

84e1e81

v2.2.1 - Security Hardening

Security Fixes

ILIKE wildcard DoS — % and _ metacharacters in grep_docs MCP tool input are now escaped, preventing expensive full-table scans
CRLF header injection — --user-agent and --auth-header now reject CR, LF, and null bytes at both the Pydantic config layer and the HTTP client transport layer
Dead code removal — Removed IntegrationConfig (containing post_process_hook: Path, a command-injection sink if ever wired up), plus unused ARCHIVE_CREATED and GIT_COMMITTED event types
Proxy SSRF warning — Logs a warning when proxy mode bypasses the DNS-pinning resolver
.gitignore hardening — Added patterns for .env.*, *.pem, *.key, *.p12, *.pfx, *.crt
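
The ILIKE fix in the first item can be illustrated with a small escaping helper. This is a sketch, not the grep_docs implementation: escape_like and contains_pattern are hypothetical names, and backslash is assumed as the escape character because it is PostgreSQL's default for LIKE and ILIKE.

```python
def escape_like(term: str, escape: str = "\\") -> str:
    """Escape LIKE/ILIKE metacharacters so user input matches literally."""
    return (
        term.replace(escape, escape + escape)  # escape the escape char first
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def contains_pattern(term: str) -> str:
    """Build a 'contains' pattern whose only wildcards are the outer ones."""
    return f"%{escape_like(term)}%"
```

The resulting pattern would still be passed as a bound query parameter rather than interpolated into SQL, since escaping wildcards does not replace parameter binding.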

Breaking Changes

IntegrationConfig has been removed from the public API. The fields git_commit, git_message, archive, archive_format, and post_process_hook are no longer accepted in configuration. These were never implemented (dead code).
YAML config files containing an integration: block will now fail validation.

Testing

12 new regression tests for CRLF injection and dead code removal
All 157 tests pass

Audit Report

Full attack surface map available at security/01-attack-surface.md.

15 Dec 21:00

zacharyr0th

v2.2.0

44391bb

v2.2.0: Resume, Auth, JSON/SQLite output

New Features

Resume capability (--resume): Continue interrupted fetches
URL preview mode (--preview-urls): See discovered URLs before fetching
Authentication support: --auth-bearer, --auth-basic, --auth-cookie, --auth-header
Env var expansion for auth tokens ($VAR and ${VAR} syntax)
Adaptive rate limiting (--adaptive-rate-limit): Auto-adjust based on 429 responses
JSON output (--format json): Stream documents to single JSON file
SQLite output (--format sqlite): Save to SQLite database
Skip reason tracking: Better progress feedback
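
The $VAR and ${VAR} expansion for auth tokens can be sketched with a regular expression. This is an illustration only: expand_env is a hypothetical helper, and raising on an unset variable is an assumed design choice, not documented docpull behaviour.

```python
import re
from typing import Mapping

# Matches ${NAME} (group 1) or $NAME (group 2).
_VAR = re.compile(r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))")


def expand_env(value: str, env: Mapping[str, str]) -> str:
    """Replace $VAR and ${VAR} references with values from env."""

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1) or match.group(2)
        if name not in env:
            # Fail loudly so a missing token is not sent as a literal string.
            raise KeyError(f"environment variable {name!r} is not set")
        return env[name]

    return _VAR.sub(replace, value)
```

In practice the mapping would be os.environ, for example expand_env("Bearer $API_TOKEN", os.environ), which keeps the secret out of shell history and config files.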

Breaking Changes

Requires Python 3.10+ (dropped 3.9 support)

Install

pip install docpull --upgrade

29 Nov 23:26

zacharyr0th

v2.0.0

a81b33c

v2.0.0 - Complete Architecture Rewrite

Breaking Changes

New Python API: Fetcher class with async context manager and streaming events
src/ layout: PEP 517/518 compliant package structure
Pydantic models: Configuration via DocpullConfig instead of dictionaries
Removed v1.x modules: All deprecated code removed

New Features

Streaming Event API: AsyncIterator[FetchEvent] for real-time progress
Pipeline Architecture: Composable steps (Validate, Fetch, Convert, Dedup, Save)
CacheManager: O(1) lookups with batched writes and TTL eviction
StreamingDeduplicator: Real-time content deduplication via SHA-256
JavaScript Rendering: Browser-based fetching via Playwright
Profile Presets: RAG, MIRROR, QUICK for common use cases
Rate Limiting: Per-host concurrent request limits
Security: robots.txt respect and URL validation
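
Content deduplication via SHA-256, as listed for StreamingDeduplicator, can be sketched in a few lines. SeenContent and register are hypothetical names for illustration; the real class may normalize content or expose a different interface.

```python
import hashlib
from typing import Optional


class SeenContent:
    """Track content digests so repeated pages can be skipped while streaming."""

    def __init__(self) -> None:
        self._first_url_by_digest: dict[str, str] = {}

    def register(self, url: str, content: str) -> tuple[bool, Optional[str]]:
        """Return (is_new, url_of_first_copy) and record new content."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        first_url = self._first_url_by_digest.get(digest)
        if first_url is not None:
            return False, first_url
        self._first_url_by_digest[digest] = url
        return True, None
```

Checking and recording in one call avoids a gap between the two steps, and memory stays proportional to the number of distinct pages because only digests are stored.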

Quick Start

```bash
# CLI
docpull https://docs.example.com --profile rag
```

```python
# Python API
import asyncio

from docpull import Fetcher, DocpullConfig, ProfileName, EventType


async def main() -> None:
    config = DocpullConfig(url="https://docs.example.com", profile=ProfileName.RAG)
    # Fetcher is an async context manager that streams events as it runs.
    async with Fetcher(config) as f:
        async for event in f.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}")


asyncio.run(main())
```

Full Changelog

See CHANGELOG.md

29 Nov 03:55

zacharyr0th

v1.5.0

6d7e4c9

v1.5.0

Release v1.5.0: Major Simplification and Modernization

Breaking Changes

Removed legacy profile system (stripe-specific profiles)
Removed deprecated requirements.txt (use pyproject.toml instead)

Changes

Simplified architecture: Consolidated utils into main package
Reorganized documentation: Moved CONTRIBUTING.md and SECURITY.md to .github/
Added GitHub issue templates configuration
Cleaner fetcher architecture: Removed stripe-specific fetcher
Updated tests for new structure

Removed Files

CHANGELOG.md - Deprecated in favor of GitHub releases
MANIFEST.in - No longer needed with modern packaging
TROUBLESHOOTING.md - Content moved to README
requirements.txt - Dependencies now in pyproject.toml
Legacy profile system files
Legacy utils directory

Installation

pip install docpull

Or install from source:

pip install git+https://github.com/raintree-technology/docpull.git

20 Nov 19:30

zacharyr0th

v1.3.0

2e3fcc1

v1.3.0: Rich Metadata Extraction & Simplified Profiles

Highlights

docpull v1.3.0 adds rich structured metadata extraction for enhanced AI/RAG integration and simplifies the profile system by focusing on the excellent generic fetcher.

New Features

Rich Metadata Extraction

Structured Metadata: Extract Open Graph, JSON-LD, and microdata during fetch
Enhanced Frontmatter: Adds author, description, keywords, images, publish dates, and more
AI/RAG Ready: Richer context for embeddings and retrieval systems
Opt-in Feature: Enabled with --rich-metadata flag or rich_metadata: true in config
Powered by extruct: Uses the battle-tested extruct library for extraction

Simplified Profile System

Streamlined Architecture: Removed 7 built-in profiles (React, Next.js, D3, Plaid, Tailwind, Bun, Turborepo)
Kept Stripe: Retained as reference implementation for custom profiles
Generic Fetcher: Handles the documentation sites that previously used dedicated profiles
Reduced Complexity: Less maintenance burden, simpler codebase
Easy Customization: Users can create custom profiles as needed

Technical Details

New Dependencies

Added extruct>=0.15.0 for structured metadata extraction

New Files

docpull/metadata_extractor.py - Rich metadata extraction module
tests/test_metadata_extractor.py - Comprehensive test suite (13 tests)

Updated Files

docpull/fetchers/base.py - Integrated rich metadata extraction
docpull/fetchers/generic_async.py - Added use_rich_metadata parameter
docpull/config.py - Added rich_metadata configuration option
docpull/sources_config.py - Added rich_metadata field
docpull/cli.py - Added --rich-metadata CLI flag
docpull/profiles/__init__.py - Simplified to single Stripe profile

Removed Files

7 profile files (react.py, nextjs.py, d3.py, plaid.py, tailwind.py, bun.py, turborepo.py)
7 fetcher implementation files (same names)

Version & Testing

Bumped version from 1.2.1 to 1.3.0
All 107 tests passing ✅
Zero mypy type errors ✅
All lint checks passing ✅

Example Usage

Rich Metadata Extraction

# Extract rich metadata during fetch
docpull https://docs.anthropic.com --rich-metadata

# Combine with other features
docpull https://stripe.com/docs --rich-metadata --create-index --language en

# Multi-source configuration
docpull --sources-file config.yaml

Enhanced Frontmatter Output

---
url: https://docs.example.com/guide
fetched: 2025-11-20
title: Getting Started Guide
description: Learn the basics of our platform
author: John Doe
keywords: [tutorial, guide, api]
image: https://docs.example.com/og-image.png
type: article
site_name: Example Docs
published_time: 2024-01-15T10:00:00Z
modified_time: 2024-01-20T15:30:00Z
---

Multi-Source Configuration with Rich Metadata

sources:
  anthropic:
    url: https://docs.anthropic.com
    rich_metadata: true  # Enable rich metadata extraction
    language: en
    create_index: true

  stripe:
    url: https://stripe.com/docs
    rich_metadata: true
    max_file_size: 200kb

Backward Compatibility

All existing workflows continue to work unchanged. Rich metadata extraction is opt-in, and the generic fetcher handles all documentation sites that previously used specific profiles.

Installation

pip install --upgrade docpull

Stats: 30 files changed, +765/-867 lines

17 Nov 01:19

zacharyr0th

v1.2.1

7ac9efe

v1.2.1 - Critical Bug Fixes & Type Checking

🐛 Bug Fixes

This patch release fixes critical issues found in v1.2.0:

Type Checking & Code Quality

Fixed all 60 mypy type errors - achieved zero type errors ✅
Added proper type annotations throughout the codebase
Improved type safety in processors, formatters, and orchestrator modules
All lint checks now passing (mypy, ruff, black)

Test Fixes

Fixed test failure in test_orchestrator.py (archive_format parameter)
Fixed 9 SourcesConfiguration test failures
All 101 tests now passing ✅

Code Cleanup

Removed deprecated files (EMOJI_CLEANUP.md)
Fixed Black formatting issues
Added specific error codes to type: ignore comments

📝 Technical Details

Files Updated

docpull/processors/content_filter.py: More specific return types
docpull/formatters/: Proper type annotations for nested functions
docpull/orchestrator.py: Correct parameter naming and type hints
docpull/cli.py: Better handling of Optional[str] types
docpull/processors/language_filter.py: Fixed config type assignments
docpull/processors/deduplicator.py: Fixed config type assignments

CI/CD

This release ensures the codebase passes all CI checks and maintains high code quality standards.

📦 Installation

pip install --upgrade docpull


Releases: raintree-technology/docpull
