clone: route Cloudflare-challenged sites through NOVA_PROXY fallback #4
Open
TippyFlitsUK wants to merge 1 commit into
Some sites (e.g. operationbroadway.com) serve a Cloudflare managed
challenge that a datacenter IP cannot pass headless -- even a real
headful browser fails, so it is IP reputation, not browser fingerprint.
The cloner captured the "Just a moment..." interstitial instead of the
real site.
Before launching the browser, probe the start URL for a Cloudflare
challenge (cf-mitigated: challenge header, or a challenge body on a
403/429/503). Only when NOVA_PROXY is set AND the site is actually
challenged is that clone routed through the proxy; every other site
launches direct, unchanged. Also match "just a moment" in the
challenge-wait regex.
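
As an illustration only, the detection and routing rule described above could be sketched as follows. The names `ProbeResult`, `isCloudflareChallenge` and `shouldUseProxy` are hypothetical and not taken from this commit, and the body markers are an assumption based on the "Just a moment" title and the `cf_chl` markers mentioned in this PR. The HTTP request that produces the probe is left out.

```typescript
// Hypothetical shape of a probe of the start URL (not the PR's actual type).
interface ProbeResult {
  status: number;
  // Header names are assumed to be lower-cased by whatever performs the probe.
  headers: Record<string, string | undefined>;
  body: string;
}

// Assumed body markers: the interstitial title and Cloudflare's cf_chl tokens.
const CHALLENGE_BODY = /just a moment|cf_chl/i;

// Status codes on which the body is inspected, per the commit message.
const CHALLENGE_STATUSES = new Set([403, 429, 503]);

function isCloudflareChallenge(probe: ProbeResult): boolean {
  // Signal 1: the `cf-mitigated: challenge` response header.
  const mitigated = (probe.headers["cf-mitigated"] ?? "").trim().toLowerCase();
  if (mitigated === "challenge") {
    return true;
  }
  // Signal 2: a challenge-looking body on a 403/429/503 response.
  return CHALLENGE_STATUSES.has(probe.status) && CHALLENGE_BODY.test(probe.body);
}

// Route through the proxy only when NOVA_PROXY is set AND the site is challenged.
function shouldUseProxy(probe: ProbeResult, novaProxy: string | undefined): boolean {
  return Boolean(novaProxy) && isCloudflareChallenge(probe);
}
```

With this sketch, a 503 whose body contains `<title>Just a moment...</title>` is routed through the proxy only if `NOVA_PROXY` is non-empty. A 200 response without the header always goes direct.
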
NOVA_PROXY accepts a full http://user:pass@host:port URL, parsed into
Playwright's { server, username, password }.
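
A minimal sketch of that parsing step, assuming the standard WHATWG `URL` class is used. The helper name `parseProxyUrl` and the `ProxySettings` type are hypothetical; only the `{ server, username, password }` field names come from the text above.

```typescript
// Field names mirror the { server, username, password } shape named above.
interface ProxySettings {
  server: string;
  username?: string;
  password?: string;
}

// Hypothetical helper: split a full proxy URL into server and credentials.
function parseProxyUrl(raw: string): ProxySettings {
  // The URL constructor throws a TypeError if `raw` is not a valid absolute URL.
  const url = new URL(raw);

  // Keep scheme, host and port; credentials are carried in separate fields.
  const settings: ProxySettings = { server: `${url.protocol}//${url.host}` };

  // URL keeps userinfo percent-encoded, so decode it before handing it on.
  if (url.username) {
    settings.username = decodeURIComponent(url.username);
  }
  if (url.password) {
    settings.password = decodeURIComponent(url.password);
  }
  return settings;
}
```

The returned object has the shape Playwright documents for the `proxy` launch option, so it could be passed along as `launch({ proxy: parseProxyUrl(process.env.NOVA_PROXY) })` once `NOVA_PROXY` is known to be set. How the cloner actually wires this in is not shown here.
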
Problem
Some sites (e.g. operationbroadway.com) serve a Cloudflare managed challenge that a datacenter IP cannot pass headless. Verified on the dealbot that vanilla headless, `playwright-extra` + stealth, and a real headful browser via xvfb all fail to clear it, so this is IP reputation, not browser fingerprint. The cloner captured the "Just a moment..." interstitial instead of the real site, partly because the challenge-wait regex did not even match the "Just a moment" title.
Change
- Before launching the browser, probe the start URL for a Cloudflare challenge (`cf-mitigated: challenge` header, or a challenge body on a 403/429/503).
- Only when `NOVA_PROXY` is set and the site is actually challenged does that clone route through the proxy. Every other site launches direct, byte-for-byte unchanged. With `NOVA_PROXY` unset, behaviour is identical to today.
- `NOVA_PROXY` accepts a full `http://user:pass@host:port` URL, parsed into Playwright's `{ server, username, password }`.
What does NOT work (and was rejected)
A residential proxy is required. Datacenter VPNs (PIA) and `playwright-extra` + `puppeteer-extra-plugin-stealth` do not beat this challenge, confirmed empirically. This PR is deliberately minimal: no stealth swap, no engine change, no always-on proxy.
Verification
- … `cf_chl`/challenge markers), `routing clone through proxy` logged.
- … `NOVA_PROXY` set → goes direct, no proxy.
- `nova demo` via focify-me's live endpoint → success, 11 pages, real content.