
feat: add configurable retry-on-failure for proxy requests #6

Open

Mrzqd wants to merge 1 commit into Resinat:master from Mrzqd:feat/retry-on-failure

Conversation

Mrzqd commented Mar 3, 2026

Summary

  • Add per-platform max_retries configuration option that automatically retries transient upstream failures (connect failed, timeout, request failed) using a different node
  • Record retry attempts and per-attempt failure details (node hash/tag, error type, error message) in the request log database
  • Display retry information in the WebUI: badge in log list, summary callout in log detail, and per-attempt failure cards in diagnostics section

Changes

Backend

  • Platform config: max_retries field across model, runtime struct, codec, env config (RESIN_DEFAULT_PLATFORM_MAX_RETRIES), DB migration (v5), state repo, and API service
  • Proxy retry logic: retry loops in forward HTTP, forward CONNECT, and reverse proxy handlers; only retries on ErrUpstreamConnectFailed, ErrUpstreamTimeout, ErrUpstreamRequestFailed
  • Request body replay: buffers request body when retries > 0 so it can be replayed on each attempt
  • Request log schema: retry_attempts (INTEGER) and retry_details (JSON TEXT) columns; backward-compatible migrateSchema() for existing rolling DBs
  • API response: retry_attempts and retry_details array in log list/detail endpoints

Frontend

  • Platform forms: max_retries input on create and edit pages
  • Request log list: ↻N warning badge in the network column when retries > 0
  • Request log detail: colored callout in the log summary ("请求经过 N 次重试后成功/失败", i.e. "request succeeded/failed after N retries"), per-attempt failure cards in diagnostics showing node and error info, and a "最终结果" ("final result") card for the final error

Test plan

  • `go build ./...` passes
  • `go test ./...` passes (all packages)
  • `npm run build` passes (webui)
  • Manual test: set `max_retries` > 0 on a platform, trigger an upstream failure, verify the retry badge and detail cards in the request log

Made with Cursor

Add per-platform MaxRetries configuration that automatically retries
transient upstream failures (connect failed, timeout, request failed)
on a different node. Retry attempts and per-attempt failure details
(node used, error type/message) are recorded in request logs.

Backend:
- Platform model: add max_retries field + DB migration (v5)
- Environment: add RESIN_DEFAULT_PLATFORM_MAX_RETRIES
- Proxy layer: retry loops in forward HTTP/CONNECT and reverse proxy
- Request log: add retry_attempts column + retry_details JSON column
- Rolling DB backward compat: migrateSchema() for old DBs
- API: expose retry_attempts and retry_details in log responses

Frontend:
- Platform create/edit forms: max_retries input field
- Request log list: retry badge (↻N) in network column
- Request log detail: retry summary in log overview section,
  per-attempt failure cards in diagnostics with node and error info

Made-with: Cursor
Resinat (Owner) commented Mar 8, 2026

Thanks for submitting the PR! Elevating max_retries to a platform-level configuration is a great design choice. We can keep that part of the configuration code exactly as is.

Error retries improve the user experience, but they also introduce risks at the underlying implementation level of the gateway. To ensure we are aligned on the architecture and to avoid disagreements or rework on implementation details later, we first need to reach a consensus on the underlying design principles of the retry mechanism.

Our core considerations for the retry mechanism are primarily twofold (which is also why we must strictly limit the scope of retries):

  1. Memory safety baseline: To prevent out-of-memory (OOM) risk from large requests in production, the gateway layer must not fully buffer the request body (e.g. via io.ReadAll) merely to support retries.
  2. Preventing side effects (strict at-most-once semantics): If a request has already been sent to the upstream, blindly retrying it could cause duplicate writes or duplicate execution of non-idempotent requests. Therefore, a retry may only happen at the very early stage where it is confirmed that the request has not yet been written to the network (best-effort principle).

Based on the above principles, and combined with the existing request tracking (httptrace) and lifecycle stages (upstream_stage) in the main branch, we have defined the retry boundaries (merge criteria) for this PR as follows:

I. Lifecycle Boundaries for Triggering Retries

| Failure stage (`upstream_stage`) | Typical error causes | Retry? | Architectural consideration (why) |
| --- | --- | --- | --- |
| `connect_dial` | Dial failure, connection timeout | Yes | Still in the very early connection-establishment phase; the request has not entered transmission at all. |
| `forward_roundtrip` / `reverse_roundtrip`, request NOT written | `dns_error`, `dial_error`, `connection_refused`, timeout, TLS handshake failure, etc. | Yes | Although routed, the underlying layer confirms a connection failure; fits the "best effort" retry scope. |
| `forward_roundtrip` / `reverse_roundtrip`, request written or status uncertain | Any error | No | The request may have reached the network and triggered upstream side effects; retrying breaks at-most-once semantics. |
| `forward_upstream_to_client_copy` | Response body transmission interrupted | No | Already in the response-transmission phase; no longer a "connection failure". |
| CONNECT tunnel handshake & data-plane phases (`connect_hijack`, `*_copy`, etc.) | IO failure, tunnel interrupted | No | The tunnel is established or data is flowing; falls outside "connection failure" semantics. |
| Any stage | `context.Canceled` (client actively canceled) | No | This is client-side behavior, not an upstream availability issue; should neither retry nor penalize the node. |
| Routing/auth/platform errors | `NO_AVAILABLE_NODES`, `AUTH_*`, etc. | No | Not transient network/connection issues; retries won't change the outcome. |
| Upstream responded | HTTP status code 4xx/5xx | No | Does not belong to "connection failure" semantics; retrying usually doesn't improve the outcome and amplifies the risk of side effects. |

II. Core Implementation Requirements

To enforce these boundaries in the code, the current implementation must satisfy the following requirements:

  1. Configuration Effectiveness: If max_retries <= 0, the retry logic must be bypassed entirely.
  2. State Determination Based on Real Signals: Determining whether a "request is unwritten" must rely on existing request-written signals in the main branch (e.g., the WroteRequest hook in httptrace). It cannot rely solely on guessing based on error types. If the written status cannot be definitively confirmed, it should be treated as "unsafe to retry" (i.e., do not retry).
  3. No Payload Caching: The current logic in the PR that uses ReadAll (or similar) to cache the request body must be removed. The gateway data plane must not introduce any mechanisms that actively read, temporarily store, or replay the Request Body.
  4. Best Effort: The positioning of this retry mechanism is Best Effort. We are not aiming to cover 100% of failure scenarios; the absolute prerequisite is to guarantee system safety.

III. Test Cases (Acceptance Criteria)

To guarantee the stability of the core logic, please include the following 6 core test cases when updating the code. These will serve as the baseline for subsequent reviews and the final merge:

  • Positive: connect_dial failures can be retried normally based on max_retries.
  • Negative: In the forward/reverse path, simulate a failure after the request has been written, and assert that retries are NOT triggered.
  • Negative: Simulate a failure during the *_copy phase (e.g., data streaming), and assert that retries are NOT triggered.
  • Regression Prevention: Verify that in scenarios with large request bodies, the code does not trigger any full caching of the Request Body.
  • Config Boundary: When max_retries <= 0, assert that the retry logic is completely skipped.
  • Semantic Boundary: context.Canceled, NO_AVAILABLE_NODES, AUTH_*, and upstream 4xx/5xx responses do not trigger retries.

The retry logic at the gateway layer does require meticulous fine-tuning when handling edge cases, but this is crucial for the overall robustness of the system. Thank you for your hard work in adjusting the code approach according to these principles! If you have any questions about integrating this with the existing architecture, please feel free to leave a comment in the PR to discuss!
