clone: route Cloudflare-challenged sites through NOVA_PROXY fallback #4

Open
TippyFlitsUK wants to merge 1 commit into main from clone-cloudflare-proxy-fallback

@TippyFlitsUK (Collaborator)

Problem

Some sites (e.g. operationbroadway.com) serve a Cloudflare managed challenge that a headless browser on a datacenter IP cannot pass. On the dealbot, vanilla headless, playwright-extra + stealth, and a real headful browser via xvfb all failed to clear it, so the block is based on IP reputation, not browser fingerprint. As a result, the cloner captured the "Just a moment..." interstitial instead of the real site, partly because the challenge-wait regex did not match the "Just a moment" title.

Change

  • Before launching Playwright, probe the start URL for a Cloudflare challenge (cf-mitigated: challenge header, or a challenge body on a 403/429/503).
  • A clone routes through the proxy only when NOVA_PROXY is set and the site is actually challenged. Every other site launches direct, byte-for-byte unchanged. With NOVA_PROXY unset, behaviour is identical to today.
  • NOVA_PROXY accepts a full http://user:pass@host:port URL, parsed into Playwright's { server, username, password }.
  • Added "just a moment" to the challenge-wait regex as a backstop.
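
The probe and routing rule described above could be sketched roughly as follows. This is an illustration, not the code in this PR: the names `ProbeResult`, `isCloudflareChallenge`, and `shouldUseProxy` are hypothetical, and the body markers are assumptions drawn from the "Just a moment" title and the cf_chl marker mentioned in this description.

```typescript
// Hypothetical shape for the result of probing the start URL.
interface ProbeResult {
  status: number;
  // Header names are assumed to be lower-cased by the caller.
  headers: Record<string, string | undefined>;
  body: string;
}

// Statuses on which a challenge body is checked, per the PR description.
const CHALLENGE_STATUSES = new Set([403, 429, 503]);

// Assumed body markers; the real regex in the cloner may differ.
const CHALLENGE_BODY = /just a moment|cf_chl/i;

// True when the response looks like a Cloudflare challenge:
// either the cf-mitigated header says so, or a 403/429/503
// response carries a challenge-looking body.
function isCloudflareChallenge(probe: ProbeResult): boolean {
  const mitigated = (probe.headers["cf-mitigated"] ?? "").toLowerCase();
  if (mitigated === "challenge") {
    return true;
  }
  return CHALLENGE_STATUSES.has(probe.status) && CHALLENGE_BODY.test(probe.body);
}

// The proxy is used only when NOVA_PROXY is set AND the site is challenged.
function shouldUseProxy(novaProxy: string | undefined, challenged: boolean): boolean {
  return Boolean(novaProxy) && challenged;
}
```

Keeping the check a pure function of status, headers, and body means it can be exercised without a network call, and the direct path stays untouched whenever either condition is false.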

What does NOT work (and was rejected)

A residential proxy is required. Datacenter VPNs (PIA) and playwright-extra + puppeteer-extra-plugin-stealth do not beat this challenge — confirmed empirically. This PR is deliberately minimal: no stealth swap, no engine change, no always-on proxy.

Verification

  • Challenged site through a residential proxy → real content (correct title, 333 KB, zero cf_chl/challenge markers), and "routing clone through proxy" was logged.
  • Non-challenged site (example.com) with NOVA_PROXY set → goes direct, no proxy.
  • Full nova demo via focify-me's live endpoint → success, 11 pages, real content.

Some sites (e.g. operationbroadway.com) serve a Cloudflare managed
challenge that a datacenter IP cannot pass headless -- even a real
headful browser fails, so it is IP reputation, not browser fingerprint.
The cloner captured the "Just a moment..." interstitial instead of the
real site.

Before launching the browser, probe the start URL for a Cloudflare
challenge (cf-mitigated: challenge header, or a challenge body on a
403/429/503). Only when NOVA_PROXY is set AND the site is actually
challenged is that clone routed through the proxy; every other site
launches direct, unchanged. Also match "just a moment" in the
challenge-wait regex.

NOVA_PROXY accepts a full http://user:pass@host:port URL, parsed into
Playwright's { server, username, password }.
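
One way the NOVA_PROXY parsing could look, as a sketch only: `parseProxyUrl` and `ProxySettings` are hypothetical names, and the percent-decoding of credentials is an assumption rather than something this PR states. The returned object matches the `server`, `username`, and `password` fields of Playwright's `proxy` launch option.

```typescript
// Subset of Playwright's proxy launch option used here.
interface ProxySettings {
  server: string;
  username?: string;
  password?: string;
}

// Split a full proxy URL such as http://user:pass@host:port into
// the server address and optional credentials.
function parseProxyUrl(raw: string): ProxySettings {
  const url = new URL(raw); // throws on a malformed URL
  const settings: ProxySettings = { server: `${url.protocol}//${url.host}` };
  // URL keeps credentials percent-encoded, so decode them (assumed behaviour).
  if (url.username) {
    settings.username = decodeURIComponent(url.username);
  }
  if (url.password) {
    settings.password = decodeURIComponent(url.password);
  }
  return settings;
}
```

The result could then be passed as `chromium.launch({ proxy: parseProxyUrl(process.env.NOVA_PROXY) })` on the challenged path only.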