feat: add configurable retry-on-failure for proxy requests #6
Mrzqd wants to merge 1 commit into Resinat:master
Conversation
Add per-platform `MaxRetries` configuration that automatically retries transient upstream failures (connect failed, timeout, request failed) on a different node. Retry attempts and per-attempt failure details (node used, error type/message) are recorded in request logs.

Backend:
- Platform model: add `max_retries` field + DB migration (v5)
- Environment: add `RESIN_DEFAULT_PLATFORM_MAX_RETRIES`
- Proxy layer: retry loops in forward HTTP/CONNECT and reverse proxy
- Request log: add `retry_attempts` column + `retry_details` JSON column
- Rolling DB backward compat: `migrateSchema()` for old DBs
- API: expose `retry_attempts` and `retry_details` in log responses

Frontend:
- Platform create/edit forms: `max_retries` input field
- Request log list: retry badge (`↻N`) in network column
- Request log detail: retry summary in log overview section, per-attempt failure cards in diagnostics with node and error info

Made-with: Cursor
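For reviewers, a minimal sketch of the retry flow this description outlines. The sentinel error names match the PR; `forwardWithRetries`, `retryDetail`, and the sequential node-selection strategy are illustrative assumptions, not the actual implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel errors named in the PR description; exact definitions in the
// codebase may differ.
var (
	ErrUpstreamConnectFailed = errors.New("upstream connect failed")
	ErrUpstreamTimeout       = errors.New("upstream timeout")
	ErrUpstreamRequestFailed = errors.New("upstream request failed")
)

// retryDetail mirrors the per-attempt failure info recorded in request logs.
type retryDetail struct {
	Node string `json:"node"`
	Err  string `json:"error"`
}

// isTransient reports whether an upstream error is eligible for a retry.
func isTransient(err error) bool {
	return errors.Is(err, ErrUpstreamConnectFailed) ||
		errors.Is(err, ErrUpstreamTimeout) ||
		errors.Is(err, ErrUpstreamRequestFailed)
}

// forwardWithRetries tries up to maxRetries+1 nodes, moving to a different
// node after each transient failure, and returns per-attempt failure
// details for the request log.
func forwardWithRetries(nodes []string, maxRetries int, send func(node string) error) (details []retryDetail, err error) {
	for attempt := 0; attempt <= maxRetries && attempt < len(nodes); attempt++ {
		node := nodes[attempt] // illustrative "different node each attempt"
		if err = send(node); err == nil || !isTransient(err) {
			return details, err // success, or a non-retryable failure
		}
		details = append(details, retryDetail{Node: node, Err: err.Error()})
	}
	return details, err
}

func main() {
	// Simulate: first node times out, second succeeds.
	details, err := forwardWithRetries([]string{"node-a", "node-b"}, 1, func(node string) error {
		if node == "node-a" {
			return ErrUpstreamTimeout
		}
		return nil
	})
	fmt.Println(len(details), err)
}
```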
Thanks for submitting the PR! Retrying on errors improves the user experience, but it also introduces risk at the underlying implementation level of the gateway. To make sure we are aligned on the architecture, and to avoid disagreements or rework on implementation details later, we first need to reach consensus on the underlying design principles of the retry mechanism. Our core considerations for the retry mechanism are twofold (which is also why we must strictly limit the scope of retries):
Based on the above principles, and combined with the existing request tracking:

I. Lifecycle Boundaries for Triggering Retries
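One such lifecycle boundary is that a retry is only safe while no response bytes have reached the client. As an illustration of how that boundary could be enforced, a sketch in which `trackingWriter` and `canRetry` are hypothetical names, not code from this PR:

```go
package main

import (
	"fmt"
	"net/http"
)

// trackingWriter records whether anything has already been sent downstream.
type trackingWriter struct {
	http.ResponseWriter
	wroteAny bool
}

func (w *trackingWriter) WriteHeader(code int) {
	w.wroteAny = true
	w.ResponseWriter.WriteHeader(code)
}

func (w *trackingWriter) Write(p []byte) (int, error) {
	w.wroteAny = true
	return w.ResponseWriter.Write(p)
}

// canRetry enforces the boundary: once any byte has been written to the
// client, the request is no longer retryable, regardless of budget.
func canRetry(w *trackingWriter, attempt, maxRetries int) bool {
	return !w.wroteAny && attempt < maxRetries
}

func main() {
	w := &trackingWriter{}
	fmt.Println(canRetry(w, 0, 2)) // nothing sent yet: retry allowed
	w.wroteAny = true
	fmt.Println(canRetry(w, 0, 2)) // response already started: no retry
}
```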
II. Core Implementation Requirements

To enforce these boundaries in the code, the current implementation must satisfy the following requirements:
III. Test Cases (Acceptance Criteria)

To guarantee the stability of the core logic, please include the following 6 core test cases when updating the code. These will serve as the baseline for subsequent reviews and the final merge:
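The concrete six cases belong to the reviewer's list above; as a shape such acceptance tests could take, a table-driven sketch whose `runScenario` harness and scenarios are purely illustrative:

```go
package main

import "fmt"

// runScenario simulates an upstream that fails `failures` times before
// succeeding, and reports how many attempts the gateway would make under
// the given retry budget. Non-transient failures stop immediately.
func runScenario(maxRetries, failures int, transient bool) (attempts int, ok bool) {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		attempts++
		if attempts > failures {
			return attempts, true // upstream finally succeeded
		}
		if !transient {
			return attempts, false // non-transient: never retried
		}
	}
	return attempts, false // retry budget exhausted
}

func main() {
	cases := []struct {
		name                 string
		maxRetries, failures int
		transient            bool
		wantAttempts         int
		wantOK               bool
	}{
		{"succeeds on first try", 2, 0, true, 1, true},
		{"recovers after one transient failure", 2, 1, true, 2, true},
		{"retry budget exhausted", 1, 5, true, 2, false},
		{"non-transient failure never retried", 3, 5, false, 1, false},
	}
	for _, c := range cases {
		got, ok := runScenario(c.maxRetries, c.failures, c.transient)
		fmt.Printf("%s: attempts=%d ok=%v (want %d/%v)\n",
			c.name, got, ok, c.wantAttempts, c.wantOK)
	}
}
```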
Retry logic at the gateway layer does require meticulous fine-tuning for edge cases, but this is crucial for the overall robustness of the system. Thank you for your hard work in adjusting the code according to these principles! If you have any questions about integrating this with the existing architecture, feel free to leave a comment on the PR to discuss.
Summary
- Adds a per-platform `max_retries` configuration option that automatically retries transient upstream failures (connect failed, timeout, request failed) using a different node

Changes
Backend
- `max_retries` field across model, runtime struct, codec, env config (`RESIN_DEFAULT_PLATFORM_MAX_RETRIES`), DB migration (v5), state repo, and API service
- Retries on `ErrUpstreamConnectFailed`, `ErrUpstreamTimeout`, `ErrUpstreamRequestFailed`
- `retry_attempts` (INTEGER) and `retry_details` (JSON TEXT) columns; backward-compatible `migrateSchema()` for existing rolling DBs
- `retry_attempts` and `retry_details` array in log list/detail endpoints

Frontend
- `max_retries` input on create and edit pages
- `↻N` warning badge in the network column when retries > 0

Test plan
- `go build ./...` passes
- `go test ./...` passes (all packages)
- `npm run build` passes (webui)
- Set `max_retries > 0` on a platform, trigger an upstream failure, verify the retry badge and detail cards in the request log

Made with Cursor
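As a sketch of the backward-compatible migration idea listed under Backend: new columns are added in place when an old rolling DB lacks them. The column names match the PR, but the table name `request_logs`, the DDL, and the `missingColumnDDL` helper are assumptions; the real `migrateSchema()` may differ:

```go
package main

import "fmt"

// missingColumnDDL returns the ALTER statements needed to bring an old
// request-log table up to the current schema, given the set of columns
// it already has.
func missingColumnDDL(existing map[string]bool) []string {
	wanted := []struct{ name, ddl string }{
		{"retry_attempts", "ALTER TABLE request_logs ADD COLUMN retry_attempts INTEGER NOT NULL DEFAULT 0"},
		{"retry_details", "ALTER TABLE request_logs ADD COLUMN retry_details TEXT"}, // JSON-encoded per-attempt details
	}
	var stmts []string
	for _, w := range wanted {
		if !existing[w.name] {
			stmts = append(stmts, w.ddl)
		}
	}
	return stmts
}

func main() {
	// An old rolling DB that predates the retry feature:
	old := map[string]bool{"id": true, "node": true}
	for _, s := range missingColumnDDL(old) {
		fmt.Println(s)
	}
}
```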