
[Bug]: Improve tunnel reliability with retries, validation, and failure visibility (closes #432) #433

Open
janhavitupe wants to merge 2 commits into GetBindu:main from janhavitupe:fix/tunnel-reliability

Conversation

@janhavitupe

@janhavitupe janhavitupe commented Apr 3, 2026

Summary

This PR improves the robustness and reliability of tunnel initialization by addressing silent failure cases and enhancing error handling, validation, and observability.

Changes

1. Retry + Backoff

  • Added retry logic with exponential backoff for tunnel creation
  • Prevents failures due to transient network issues

2. Tunnel Validation

  • Added HTTP health check after tunnel creation
  • Ensures tunnel is actually reachable before proceeding
  • Validation is integrated into retry flow
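
The retry and validation flow described above could compose roughly like this (a sketch only: `create_tunnel_with_retry` and its parameters are hypothetical names for illustration; of the identifiers below, only `TunnelError` is named in the PR, and the actual bindu implementation may differ):

```python
import time

class TunnelError(RuntimeError):
    """Raised when all tunnel attempts are exhausted (name from the PR description)."""

def create_tunnel_with_retry(create, health_check,
                             max_attempts=3, base_backoff=1.0, sleep=time.sleep):
    """Retry tunnel creation with exponential backoff (1s, 2s, 4s, ...).
    A failed health check counts as a failed attempt, so validation
    re-enters the same retry loop."""
    last_reason = None
    for attempt in range(1, max_attempts + 1):
        try:
            url = create()
            if not health_check(url):
                raise RuntimeError(f"health check failed for {url}")
            return url
        except Exception as exc:
            last_reason = str(exc)
            if attempt < max_attempts:
                # Exponential backoff: base * 2^(attempt-1)
                sleep(base_backoff * 2 ** (attempt - 1))
    raise TunnelError(last_reason)
```

Injecting `sleep` makes the backoff schedule testable without real delays, which matches the PR's test list ("backoff timing").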

3. Explicit Failure Handling

  • Introduced TunnelError for clear failure signaling
  • Default behavior now fails fast instead of silently continuing
  • Added fail_on_tunnel_error flag for optional soft fallback
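
The fail-fast default with optional soft fallback might look like this (a minimal sketch; `setup_tunnel` and `create` are hypothetical stand-ins, while `TunnelError` and `fail_on_tunnel_error` come from the PR description):

```python
class TunnelError(RuntimeError):
    """Signals an unrecoverable tunnel failure (name from the PR description)."""

def setup_tunnel(create, fail_on_tunnel_error=True):
    """Fail fast by default; optionally fall back to local-only mode.

    Returns a (tunnel_url, failure_reason) pair, exactly one of which is None.
    """
    try:
        return create(), None
    except Exception as exc:
        if fail_on_tunnel_error:
            # Hard failure: surface the error and abort startup.
            raise TunnelError(str(exc)) from exc
        # Soft fallback: continue local-only, keep the reason for the warning panel.
        return None, str(exc)
```

Returning the failure reason alongside a `None` URL is what lets the startup banner explain why the agent is not publicly accessible.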

4. Improved Logging

  • Structured logging:
    • info → retry attempts
    • warning → soft failures
    • error → hard failures
  • Includes attempt counts and failure reasons

5. Developer Feedback (DX)

  • Startup banner now shows a prominent warning panel when tunnel fails
  • Clearly indicates that the agent is NOT publicly accessible

6. Tests

  • Added tests covering:
    • Retry success after failure
    • Complete failure (hard and soft modes)
    • Health check failure triggering retry
    • Tunnel disabled scenario

Why this change?

Previously, tunnel creation failures were silently ignored, leading to misleading states where the agent appeared publicly available but was not.

This PR ensures:

  • Failures are visible and actionable
  • System behaves predictably under failure conditions
  • Developers get clear feedback during startup

Backward Compatibility

  • Default behavior is now fail-fast (safer)
  • Previous behavior can be restored using:
    fail_on_tunnel_error=False

closes #432

Summary by CodeRabbit

  • New Features

    • Tunnel creation now retries with exponential backoff, includes health checks, and can be configured to either fail the operation or continue local-only on tunnel failure.
    • Improved user-facing failure messaging and a visible “tunnel failure” notice during startup when applicable.
  • Tests

    • Added tests covering retry behavior, backoff timing, success-after-retry, and failure modes for tunnel setup.

@coderabbitai

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 76.92%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed — The title accurately and specifically describes the main changes: retry/validation logic and failure visibility for tunnel initialization, directly addressing issue #432.
  • Description check ✅ Passed — The description covers most key sections: summary of problem/solution, detailed changes, backward compatibility, and issue reference. However, it lacks the formal structure of the template (the Change Type, Scope, Security Impact, Verification, and Human Verification sections are missing).
  • Linked Issues check ✅ Passed — All coding requirements from issue #432 are fully met: prominent failure warning panel added to startup output, fail_on_tunnel_error parameter implemented, retry logic with exponential backoff added, health validation integrated, and explicit TunnelError exception introduced.
  • Out of Scope Changes check ✅ Passed — All changes are directly scoped to tunnel reliability improvements: configuration fields for retry/backoff, tunnel health validation, failure signaling, display enhancements, and comprehensive tests. No unrelated modifications detected.



Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bindu/penguin/bindufy.py`:
- Around line 653-655: The tunnel creation currently calls _setup_tunnel (which
performs live httpx.get() checks) before start_uvicorn_server, causing premature
health checks; change the flow so _setup_tunnel only establishes/returns the
tunnel endpoint without doing live HTTP probes (e.g., add a parameter
validate=False or split out a new validate_tunnel(tunnel_url) function), then
call the validation step only after start_uvicorn_server has started accepting
requests (or when run_server is False but launch=True handle validation
accordingly); update callers around the existing _setup_tunnel call and the code
paths noted (including the block covering lines 657-686) to perform validation
after server start and respect fail_on_tunnel_error semantics.
- Around line 211-213: The three module-level constants (_TUNNEL_MAX_ATTEMPTS,
_TUNNEL_BASE_BACKOFF, _TUNNEL_HEALTH_TIMEOUT) must be removed and read from
bindu.settings.app_settings instead; add corresponding settings (e.g.
TUNNEL_MAX_ATTEMPTS, TUNNEL_BASE_BACKOFF, TUNNEL_HEALTH_TIMEOUT) into
app_settings with the current defaults (3, 1, 3) and then replace uses of the
constants in bindufy.py with app_settings.TUNNEL_MAX_ATTEMPTS,
app_settings.TUNNEL_BASE_BACKOFF, and app_settings.TUNNEL_HEALTH_TIMEOUT (import
app_settings at top if not already present) so environments can override them
via settings.

In `@bindu/utils/display.py`:
- Around line 149-152: The warning text in bindu/utils/display.py is misleading:
update the message appended to warning_lines (the two append calls shown) to
instruct users they can "continue locally after tunnel failure" by passing
fail_on_tunnel_error=False to bindufy(), and update the matching log banner
elsewhere in the module to use the same phrasing; locate uses of
warning_lines.append and the log banner string in this file and replace "to
continue locally without this warning" / "removes the warning" wording with "to
continue locally after tunnel failure" (keep the reference to bindufy() and
fail_on_tunnel_error as-is).

In `@tests/unit/penguin/test_setup_tunnel.py`:
- Line 30: The patch decorators in tests currently target the module-local
symbol "bindu.penguin.bindufy.TunnelManager" but _setup_tunnel() imports
TunnelManager inside the function with "from bindu.tunneling.manager import
TunnelManager", so update the patch targets in the four decorators to
"bindu.tunneling.manager.TunnelManager" (apply the change where
`@patch`("bindu.penguin.bindufy.TunnelManager") appears in the four test methods)
so the function-local import is intercepted correctly.
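
The underlying rule is "patch where the name is looked up." For a function-local import, the name is resolved from the source module at call time, so the patch must target that module. A self-contained illustration using the standard library (not bindu code):

```python
from unittest import mock

def use_local_import():
    # Function-local import: the name is fetched from the json module
    # at call time, so a patch must target "json.dumps", not an
    # attribute of the calling module.
    from json import dumps
    return dumps({"ok": True})

# Patching the source module intercepts the function-local import.
with mock.patch("json.dumps", return_value="patched"):
    print(use_local_import())  # "patched"
```

This is why the four decorators must target `bindu.tunneling.manager.TunnelManager`: patching `bindu.penguin.bindufy.TunnelManager` would replace an attribute the function-local import never reads.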

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0a8b5a2-4e35-454e-8af5-bd3c318e5ea8

📥 Commits

Reviewing files that changed from the base of the PR and between 45656ee and e5841e1.

📒 Files selected for processing (3)
  • bindu/penguin/bindufy.py
  • bindu/utils/display.py
  • tests/unit/penguin/test_setup_tunnel.py

Comment thread bindu/penguin/bindufy.py Outdated
Comment thread bindu/penguin/bindufy.py
Comment on lines 653 to 655
  # Create tunnel if enabled
- tunnel_url = _setup_tunnel(tunnel_config, port, _manifest, bindu_app)
+ tunnel_url, tunnel_failure_reason = _setup_tunnel(tunnel_config, port, _manifest, bindu_app, fail_on_tunnel_error)


⚠️ Potential issue | 🔴 Critical

Tunnel validation runs before the server exists.

_setup_tunnel() now performs live httpx.get() checks, but this call still happens before start_uvicorn_server() is invoked. With launch=True, the health check is probing a server that isn't listening yet, so otherwise-valid tunnels will be retried as failures and the default fail_on_tunnel_error=True path will abort startup. This also makes run_server=False + launch=True unsatisfiable. Split tunnel creation from validation, or defer validation until the HTTP server is accepting requests.

Also applies to: 657-686

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bindu/penguin/bindufy.py` around lines 653 - 655, The tunnel creation
currently calls _setup_tunnel (which performs live httpx.get() checks) before
start_uvicorn_server, causing premature health checks; change the flow so
_setup_tunnel only establishes/returns the tunnel endpoint without doing live
HTTP probes (e.g., add a parameter validate=False or split out a new
validate_tunnel(tunnel_url) function), then call the validation step only after
start_uvicorn_server has started accepting requests (or when run_server is False
but launch=True handle validation accordingly); update callers around the
existing _setup_tunnel call and the code paths noted (including the block
covering lines 657-686) to perform validation after server start and respect
fail_on_tunnel_error semantics.
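
The ordering the review asks for — allocate the tunnel, start the server, and only then probe — can be sketched as follows (hypothetical names throughout; `start_with_tunnel`, `create_tunnel`, `start_server`, and `validate` are illustrative stand-ins, not the actual bindu API):

```python
def start_with_tunnel(create_tunnel, start_server, validate,
                      fail_on_tunnel_error=True):
    """Order matters: establish the tunnel endpoint without any live HTTP
    probe, start the HTTP server, and only then run the health check, so
    validation never races server startup."""
    url = create_tunnel()   # allocation only, no httpx.get() here
    start_server()          # server begins accepting requests
    if not validate(url):   # deferred health check
        if fail_on_tunnel_error:
            raise RuntimeError(f"tunnel {url} failed validation")
        return None         # soft fallback: continue local-only
    return url
```

With validation before `start_server()`, the probe hits a port nothing is listening on, and a perfectly good tunnel is retried to exhaustion — exactly the failure mode described above.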

Comment thread bindu/utils/display.py Outdated
Comment thread tests/unit/penguin/test_setup_tunnel.py Outdated
…heck timing, update messaging, and correct test patching

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bindu/penguin/bindufy.py`:
- Around line 220-223: _bindufy_core() currently never invokes
_validate_tunnel_health() after _setup_tunnel(), so a dead or 503 tunnel can be
treated as public; update _bindufy_core() to call _validate_tunnel_health() once
the HTTP server reports readiness and before calling prepare_server_display() or
returning the tunnel URL, and ensure that any TunnelError or retry logic from
_validate_tunnel_health() is handled (falling back to setting tunnel_url=None
and tunnel_failure_reason) so prepare_server_display() will render the warning
panel; apply the same insertion point for the other similar paths that call
_setup_tunnel() (the blocks referenced around the other ranges) so health
validation always runs after server readiness and before exposing the tunnel
URL.
- Around line 256-286: The try-except currently covers both tunnel allocation
and the post-allocation mutations, causing a new tunnel to be created if a
mutation (e.g., setting manifest.url or bindu_app.url or resetting
bindu_app._agent_card_json_schema) fails; refactor so
tunnel_manager.create_tunnel(...) is the only call inside the retry loop (the
block that retries on transient create failures), then after a successful
create_tunnel returns (tunnel_url) perform the mutations (manifest.url =
tunnel_url, bindu_app.url = tunnel_url, bindu_app._agent_card_json_schema =
None) outside that retry loop or inside a separate try-except that does not
re-enter the create_tunnel retry logic, and ensure only create_tunnel failures
trigger the exponential backoff / retry logic.

In `@bindu/settings.py`:
- Around line 151-165: The three tunnel-related settings max_attempts,
base_backoff_seconds, and health_check_timeout_seconds must be validated at
settings parse time: enforce max_attempts >= 1, base_backoff_seconds >= 0, and
health_check_timeout_seconds > 0 so bad env values raise immediately. Fix by
adding Pydantic constraints/validators on those fields in the settings model
(either use conint/ge/gt or add `@validator` methods for the model) to validate
the values and raise a clear ValidationError during parsing; update the
Field(...) declarations for max_attempts, base_backoff_seconds, and
health_check_timeout_seconds (or add validators named for those fields)
accordingly so invalid config is rejected early.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 41c17b52-b240-499a-b93e-8026863f161f

📥 Commits

Reviewing files that changed from the base of the PR and between e5841e1 and ebfff26.

📒 Files selected for processing (4)
  • bindu/penguin/bindufy.py
  • bindu/settings.py
  • bindu/utils/display.py
  • tests/unit/penguin/test_setup_tunnel.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • bindu/utils/display.py
  • tests/unit/penguin/test_setup_tunnel.py

Comment thread bindu/penguin/bindufy.py
Comment on lines +220 to +223
Retries tunnel creation up to app_settings.tunnel.max_attempts times with
exponential backoff before giving up. Health validation is intentionally
deferred — it is performed separately after the HTTP server is accepting
requests (see _validate_tunnel_health).

⚠️ Potential issue | 🔴 Critical

The deferred health check never actually runs.

The docstring says validation happens later, but _bindufy_core() never calls _validate_tunnel_health() after _setup_tunnel() returns. An unreachable or 503 tunnel is therefore still passed to prepare_server_display() and treated as public, so health-check-triggered retries, TunnelError, and the fallback warning never execute. Downstream, prepare_server_display() only renders the warning panel when tunnel_url is None and tunnel_failure_reason is set (bindu/utils/display.py:136-161), so this path still presents the agent as publicly reachable. This also makes run_server=False + launch=True look successful even though nothing is listening behind the tunnel. Call the validator after the HTTP server reaches readiness and before displaying or returning the public URL.

Also applies to: 308-333, 651-685

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bindu/penguin/bindufy.py` around lines 220 - 223, _bindufy_core() currently
never invokes _validate_tunnel_health() after _setup_tunnel(), so a dead or 503
tunnel can be treated as public; update _bindufy_core() to call
_validate_tunnel_health() once the HTTP server reports readiness and before
calling prepare_server_display() or returning the tunnel URL, and ensure that
any TunnelError or retry logic from _validate_tunnel_health() is handled
(falling back to setting tunnel_url=None and tunnel_failure_reason) so
prepare_server_display() will render the warning panel; apply the same insertion
point for the other similar paths that call _setup_tunnel() (the blocks
referenced around the other ranges) so health validation always runs after
server readiness and before exposing the tunnel URL.

Comment thread bindu/penguin/bindufy.py
Comment on lines +256 to +286
for attempt in range(1, max_attempts + 1):
    try:
        tunnel_url = tunnel_manager.create_tunnel(
            local_port=port,
            config=tunnel_config,
            subdomain=tunnel_config.subdomain,
        )
        logger.info(
            f"✅ Tunnel created (attempt {attempt}/{max_attempts}): {tunnel_url}"
        )

        manifest.url = tunnel_url
        bindu_app.url = tunnel_url
        bindu_app._agent_card_json_schema = None

        return tunnel_url, None

    except Exception as e:
        failure_reason = str(e)
        if attempt < max_attempts:
            backoff = base_backoff * (2 ** (attempt - 1))
            logger.info(
                f"Tunnel attempt {attempt}/{max_attempts} failed "
                f"({failure_reason}). Retrying in {backoff}s..."
            )
            time.sleep(backoff)
        else:
            logger.error(
                f"Tunnel attempt {attempt}/{max_attempts} failed "
                f"({failure_reason}). All retries exhausted."
            )

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "bindufy.py" -type f

Repository: GetBindu/Bindu

Length of output: 84


🏁 Script executed:

cat -n ./bindu/penguin/bindufy.py | sed -n '250,290p'

Repository: GetBindu/Bindu

Length of output: 1797


🏁 Script executed:

cat -n ./bindu/penguin/bindufy.py | sed -n '200,260p'

Repository: GetBindu/Bindu

Length of output: 2482


🏁 Script executed:

rg -t py "class AgentManifest" --max-count=5

Repository: GetBindu/Bindu

Length of output: 101


🏁 Script executed:

cat -n ./bindu/common/models.py | grep -A 50 "class AgentManifest:"

Repository: GetBindu/Bindu

Length of output: 2076


🏁 Script executed:

cat -n ./bindu/common/models.py | sed -n '145,175p'

Repository: GetBindu/Bindu

Length of output: 1104


Restructure exception handling to avoid creating duplicate tunnels.

The current code wraps both TunnelManager.create_tunnel() and the subsequent manifest/app mutations in the same try-except block. If the tunnel is successfully allocated but a mutation fails (e.g., manifest.url = tunnel_url), the retry loop creates another tunnel and misattributes a local bug to a transient tunnel failure. Move the mutations outside the retry block or into a separate try-except that doesn't trigger another tunnel creation.

🧰 Tools
🪛 Ruff (0.15.7)

[warning] 273-273: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bindu/penguin/bindufy.py` around lines 256 - 286, The try-except currently
covers both tunnel allocation and the post-allocation mutations, causing a new
tunnel to be created if a mutation (e.g., setting manifest.url or bindu_app.url
or resetting bindu_app._agent_card_json_schema) fails; refactor so
tunnel_manager.create_tunnel(...) is the only call inside the retry loop (the
block that retries on transient create failures), then after a successful
create_tunnel returns (tunnel_url) perform the mutations (manifest.url =
tunnel_url, bindu_app.url = tunnel_url, bindu_app._agent_card_json_schema =
None) outside that retry loop or inside a separate try-except that does not
re-enter the create_tunnel retry logic, and ensure only create_tunnel failures
trigger the exponential backoff / retry logic.
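
The restructuring the review suggests — only the allocation call inside the retry loop, mutations applied once afterwards — could look like this sketch (hypothetical names; `allocate_tunnel` and `apply_tunnel_url` are illustrative, and plain dicts stand in for the manifest and app objects):

```python
import time

def allocate_tunnel(create, max_attempts=3, base_backoff=1.0, sleep=time.sleep):
    """Retry ONLY the allocation step; transient create failures back off."""
    for attempt in range(1, max_attempts + 1):
        try:
            return create()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_backoff * 2 ** (attempt - 1))

def apply_tunnel_url(url, manifest, app):
    """Post-allocation mutations live outside the retry loop: a failure
    here is a local bug and must not trigger another tunnel allocation."""
    manifest["url"] = url
    app["url"] = url
    return url
```

Separating the two also keeps the backoff semantics honest: only `create_tunnel` failures are treated as transient.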

Comment thread bindu/settings.py
Comment on lines +151 to +165
max_attempts: int = Field(
    default=3,
    validation_alias=AliasChoices("TUNNEL__MAX_ATTEMPTS"),
    description="Maximum tunnel creation attempts before giving up",
)
base_backoff_seconds: int = Field(
    default=1,
    validation_alias=AliasChoices("TUNNEL__BASE_BACKOFF_SECONDS"),
    description="Base backoff in seconds; doubles each retry (1s, 2s, 4s)",
)
health_check_timeout_seconds: int = Field(
    default=3,
    validation_alias=AliasChoices("TUNNEL__HEALTH_CHECK_TIMEOUT_SECONDS"),
    description="Timeout in seconds for the post-creation health check request",
)

⚠️ Potential issue | 🟠 Major

Validate the retry knobs when settings load.

These values come from env, but invalid inputs fail late inside the tunnel path: max_attempts=0 skips all attempts, base_backoff_seconds<0 makes time.sleep() raise, and health_check_timeout_seconds<=0 turns the health check into a config/runtime error instead of a clean tunnel failure. Add bounds here so bad config is rejected during settings parsing.

Suggested bounds
     max_attempts: int = Field(
         default=3,
+        ge=1,
         validation_alias=AliasChoices("TUNNEL__MAX_ATTEMPTS"),
         description="Maximum tunnel creation attempts before giving up",
     )
     base_backoff_seconds: int = Field(
         default=1,
+        ge=0,
         validation_alias=AliasChoices("TUNNEL__BASE_BACKOFF_SECONDS"),
         description="Base backoff in seconds; doubles each retry (1s, 2s, 4s)",
     )
     health_check_timeout_seconds: int = Field(
         default=3,
+        gt=0,
         validation_alias=AliasChoices("TUNNEL__HEALTH_CHECK_TIMEOUT_SECONDS"),
         description="Timeout in seconds for the post-creation health check request",
     )

As per coding guidelines, bindu/**/*.py: "Validate all external input and use type hints for input validation in Python files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bindu/settings.py` around lines 151 - 165, The three tunnel-related settings
max_attempts, base_backoff_seconds, and health_check_timeout_seconds must be
validated at settings parse time: enforce max_attempts >= 1,
base_backoff_seconds >= 0, and health_check_timeout_seconds > 0 so bad env
values raise immediately. Fix by adding Pydantic constraints/validators on those
fields in the settings model (either use conint/ge/gt or add `@validator` methods
for the model) to validate the values and raise a clear ValidationError during
parsing; update the Field(...) declarations for max_attempts,
base_backoff_seconds, and health_check_timeout_seconds (or add validators named
for those fields) accordingly so invalid config is rejected early.



Development

Successfully merging this pull request may close these issues.

[Bug]: Tunnel creation failure is silently swallowed — agent starts without public URL
