
feat: add configurable retry-on-failure for proxy requests #6

Open

Mrzqd wants to merge 1 commit into Resinat:master from Mrzqd:feat/retry-on-failure

Conversation

Mrzqd commented Mar 3, 2026

Summary

  • Add per-platform max_retries configuration option that automatically retries transient upstream failures (connect failed, timeout, request failed) using a different node
  • Record retry attempts and per-attempt failure details (node hash/tag, error type, error message) in the request log database
  • Display retry information in the WebUI: badge in log list, summary callout in log detail, and per-attempt failure cards in diagnostics section

Changes

Backend

  • Platform config: max_retries field across model, runtime struct, codec, env config (RESIN_DEFAULT_PLATFORM_MAX_RETRIES), DB migration (v5), state repo, and API service
  • Proxy retry logic: retry loops in forward HTTP, forward CONNECT, and reverse proxy handlers; only retries on ErrUpstreamConnectFailed, ErrUpstreamTimeout, ErrUpstreamRequestFailed
  • Request body replay: buffers request body when retries > 0 so it can be replayed on each attempt
  • Request log schema: retry_attempts (INTEGER) and retry_details (JSON TEXT) columns; backward-compatible migrateSchema() for existing rolling DBs
  • API response: retry_attempts and retry_details array in log list/detail endpoints

Frontend

  • Platform forms: max_retries input on create and edit pages
  • Request log list: ↻N warning badge in the network column when retries > 0
  • Request log detail: colored callout in the log summary ("请求经过 N 次重试后成功/失败", i.e. "request succeeded/failed after N retries"), per-attempt failure cards in diagnostics showing node and error info, and a "最终结果" ("final result") card for the final error

Test plan

  • `go build ./...` passes
  • `go test ./...` passes (all packages)
  • `npm run build` passes (webui)
  • Manual test: set `max_retries` > 0 on a platform, trigger an upstream failure, verify the retry badge and detail cards in the request log

Made with Cursor

Add per-platform MaxRetries configuration that automatically retries
transient upstream failures (connect failed, timeout, request failed)
on a different node. Retry attempts and per-attempt failure details
(node used, error type/message) are recorded in request logs.

Backend:
- Platform model: add max_retries field + DB migration (v5)
- Environment: add RESIN_DEFAULT_PLATFORM_MAX_RETRIES
- Proxy layer: retry loops in forward HTTP/CONNECT and reverse proxy
- Request log: add retry_attempts column + retry_details JSON column
- Rolling DB backward compat: migrateSchema() for old DBs
- API: expose retry_attempts and retry_details in log responses

Frontend:
- Platform create/edit forms: max_retries input field
- Request log list: retry badge (↻N) in network column
- Request log detail: retry summary in log overview section,
  per-attempt failure cards in diagnostics with node and error info

Made-with: Cursor
Resinat (Owner) commented Mar 8, 2026

Thanks for submitting the PR! Elevating max_retries to a platform-level configuration is a great design choice. We can keep that part of the configuration code exactly as is.

Error retries improve the user experience, but they also introduce risks at the underlying implementation level of the gateway. To ensure we are aligned on the architecture and to avoid disagreements or rework on implementation details later, we first need to reach a consensus on the underlying design principles of the retry mechanism.

Our core considerations for the retry mechanism are primarily twofold (which is also why we must strictly limit the scope of retries):

  1. Memory safety baseline: To prevent out-of-memory (OOM) risk from large requests in production, the gateway layer must not fully buffer the request body (e.g. via io.ReadAll) merely to support retries.
  2. Preventing side effects (strict at-most-once semantics): If a request has already been sent to the upstream, blindly retrying it could cause duplicate writes or duplicate execution of non-idempotent requests. Therefore, a retry may only happen at the very early stage where it is confirmed that the request has not yet been written to the network (best-effort principle).

Based on the above principles, and combined with the existing request tracking (httptrace) and lifecycle stages (upstream_stage) in the main branch, we have defined the retry boundaries (merge criteria) for this PR as follows:

I. Lifecycle Boundaries for Triggering Retries

| Failure stage (`upstream_stage`) | Typical error causes | Retry? | Architectural consideration (why) |
| --- | --- | --- | --- |
| `connect_dial` | Dial failure, connection timeout | Yes | Still in the very early connection-establishment phase; the request has not entered transmission at all. |
| `forward_roundtrip` / `reverse_roundtrip`, request NOT written | `dns_error`, `dial_error`, `connection_refused`, timeout, TLS handshake failure, etc. | Yes | Although routed, the underlying layer confirms a connection failure; fits the "best effort" retry scope. |
| `forward_roundtrip` / `reverse_roundtrip`, request written or status uncertain | Any error | No | The request may have reached the network and triggered upstream side effects; retrying breaks at-most-once semantics. |
| `forward_upstream_to_client_copy` | Response body transmission interrupted | No | Already in the response-transmission phase; no longer a "connection failure". |
| CONNECT tunnel handshake & data-plane phases (`connect_hijack`, `*_copy`, etc.) | IO failure, tunnel interrupted | No | The tunnel is established or data is flowing; falls outside "connection failure" semantics. |
| Any stage | `context.Canceled` (client actively canceled) | No | This is client-side behavior, not an upstream availability issue; should neither retry nor penalize the node. |
| Routing/auth/platform errors | `NO_AVAILABLE_NODES`, `AUTH_*`, etc. | No | Not transient network/connection issues; retries won't change the outcome. |
| Upstream responded | HTTP status code 4xx/5xx | No | Does not belong to "connection failure" semantics; retrying usually doesn't improve the outcome and amplifies the risk of side effects. |

II. Core Implementation Requirements

To enforce these boundaries in the code, the current implementation must satisfy the following requirements:

  1. Configuration Effectiveness: If max_retries <= 0, the retry logic must be bypassed entirely.
  2. State Determination Based on Real Signals: Determining whether a "request is unwritten" must rely on existing request-written signals in the main branch (e.g., the WroteRequest hook in httptrace). It cannot rely solely on guessing based on error types. If the written status cannot be definitively confirmed, it should be treated as "unsafe to retry" (i.e., do not retry).
  3. No Payload Caching: The current logic in the PR that uses ReadAll (or similar) to cache the request body must be removed. The gateway data plane must not introduce any mechanisms that actively read, temporarily store, or replay the Request Body.
  4. Best Effort: The positioning of this retry mechanism is Best Effort. We are not aiming to cover 100% of failure scenarios; the absolute prerequisite is to guarantee system safety.

III. Test Cases (Acceptance Criteria)

To guarantee the stability of the core logic, please include the following 6 core test cases when updating the code. These will serve as the baseline for subsequent reviews and the final merge:

  • Positive: connect_dial failures can be retried normally based on max_retries.
  • Negative: In the forward/reverse path, simulate a failure after the request has been written, and assert that retries are NOT triggered.
  • Negative: Simulate a failure during the *_copy phase (e.g., data streaming), and assert that retries are NOT triggered.
  • Regression Prevention: Verify that in scenarios with large request bodies, the code does not trigger any full caching of the Request Body.
  • Config Boundary: When max_retries <= 0, assert that the retry logic is completely skipped.
  • Semantic Boundary: context.Canceled, NO_AVAILABLE_NODES, AUTH_*, and upstream 4xx/5xx responses do not trigger retries.

The retry logic at the gateway layer does require meticulous fine-tuning when handling edge cases, but this is crucial for the overall robustness of the system. Thank you for your hard work in adjusting the code approach according to these principles! If you have any questions about integrating this with the existing architecture, please feel free to leave a comment in the PR to discuss!
