Skip to content

fix: routing overhaul — reconnection guards, owner_id proto, /etc/hosts DNS, deployment subnets#230

Merged
jhaynie merged 4 commits intomainfrom
fix/routing-overhaul-v2
Apr 19, 2026
Merged

fix: routing overhaul — reconnection guards, owner_id proto, /etc/hosts DNS, deployment subnets#230
jhaynie merged 4 commits intomainfrom
fix/routing-overhaul-v2

Conversation

@jhaynie
Copy link
Copy Markdown
Member

@jhaynie jhaynie commented Apr 19, 2026

Summary

Comprehensive routing reliability improvements from the US Central debugging session. Fixes reconnection panics, adds unprovision race guards via owner_id, improves DNS resolution in containerized environments, and adds deterministic deployment VIP computation.

Changes

Reconnection Stability

  • Panic guards: Bounds checks on controlStreams, streams, connections slices during reconnection
  • Stack trace logging on reconnection panic recovery for diagnostics
  • Discovery refresh after repeated endpoint failures (10 consecutive) to recover from stale endpoint lists

Proto Changes (gravity_session.proto)

  • owner_id added to RouteDeploymentRequest, RouteSandboxRequest, UnprovisionRequest, and ExistingDeployment — prevents stale unprovision from removing newly re-provisioned endpoints (race guard)
  • SNIHostname added to EndpointMapping gossip type for deployment d-hash TLS routing

DNS Resolution

  • /etc/hosts preference over DNS-over-HTTPS in LookupMulti — critical for environments where hosts file has correct mapping but external DNS returns different address (Lima VMs, containers with custom host entries)
  • Cached hosts file parsing with 30s TTL (avoids reading /etc/hosts on every lookup)
  • FailIfLocal respected for hosts file lookups (private/loopback IPs rejected when configured)
  • Gravity client's parseGRPCURL also resolves via /etc/hosts before falling back to DNS

Deployment Subnets (network/subnet.go)

  • ComputeDeploymentSubnet — deterministic /96 IPv6 subnet per machine for deployment VIPs
  • ComputeDeploymentVIP — deterministic IPv6 address per deployment within its machine's subnet

Testing

  • All unit tests pass (go test ./...)
  • Tested in local dev environment with 3 ions + 2 hadrons
  • Deploy success rate improved from 1/3 to 9/10
  • Cross-ion routing verified (VXLAN failover, multi-machine health checks)

Summary by CodeRabbit

  • Bug Fixes

    • Corrected peer IP selection to prioritize IPv6 addresses over IPv4 in cloud discovery.
  • New Features

    • Automatic endpoint re-discovery after repeated reconnection failures to restore connectivity.
    • Deterministic deployment subnet and VIP computation for managed networks.
    • Improved TLS server name handling for multi-endpoint connections.
  • Tests

    • Added tests for endpoint reconnection/discovery refresh and TLS server name behavior.
    • Added tests for deployment subnet and VIP computation.

…ts DNS, deployment subnets

- Reconnection panic guards: bounds checks on controlStreams, streams, connections slices
- Stack trace logging on reconnection panic recovery
- SNIHostname field on EndpointMapping gossip type
- owner_id on RouteDeployment/RouteSandbox/Unprovision/ExistingDeployment protos
- /etc/hosts preference over DNS-over-HTTPS (cached with 30s TTL)
- Respect FailIfLocal for hosts file lookups
- ComputeDeploymentSubnet/ComputeDeploymentVIP for deterministic deployment VIPs
- Discovery refresh after repeated endpoint failures
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 19, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3bb19afe-7e2f-4cae-8f8d-67b4aebd1ffe

📥 Commits

Reviewing files that changed from the base of the PR and between bcf4381 and 7d81e52.

📒 Files selected for processing (2)
  • gravity/grpc_client.go
  • gravity/tls_servername_test.go

📝 Walkthrough

Walkthrough

Removed an unused GCE response wrapper, added per-endpoint reconnection failure tracking with discovery-refresh replacement and TLS SNI handling (plus tests), and introduced deterministic IPv6 deployment subnet/VIP computation with unit tests.

Changes

Cohort / File(s) Summary
GCE discovery struct removal
gravity/discovery/gce.go
Removed the gceInstanceList struct type; no logic or public API changes.
Endpoint reconnection & discovery refresh logic
gravity/grpc_client.go
Added per-endpoint endpointFailCount tracking, endpointDiscoveryRefreshFailureThreshold, preferredTLSServerName helper, failure increment/reset helpers, and logic to refresh/replace endpoint URLs (and connectionURLs/TLSServerName) after repeated reconnection failures. Updated endpoint creation to set TLSServerName consistently.
Endpoint reconnection tests
gravity/reconnect_discovery_refresh_test.go
Added tests and helpers validating re-resolution behavior after repeated failures, covering single failing endpoint and mixed-endpoint scenarios; injects discovery resolver for assertions.
Deployment subnet & VIP utilities
network/subnet.go
Added NetworkDeploymentSubnet constant and functions ComputeDeploymentSubnet(region, machineID) netip.Prefix (deterministic /96 machine subnet) and ComputeDeploymentVIP(subnet, deploymentID) netip.Addr (deterministic VIP within /96, forces odd final byte; panics if subnet not /96).
Deployment subnet & VIP tests
network/subnet_test.go
Added unit tests for determinism, non-collision, structural constraints (prefix length and encoded type), and panic-on-invalid-prefix for the new subnet/VIP functions.
TLS SNI tests
gravity/tls_servername_test.go
Added TestPreferredTLSServerName_HostsFileDoesNotAffectSNI to validate preferredTLSServerName behavior across hostnames, IP literals, localhost, ports, and empty inputs.
🚥 Pre-merge checks | ✅ 1
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@gravity/discovery/gce.go`:
- Around line 116-123: Current loop appends the first matching address
per-interface and breaks, which picks an IPv4 if the first NIC is IPv4-only even
when a later NIC has IPv6; change the logic to first scan inst.NetworkInterfaces
for any iface.IPv6Address and append that (to peers) if found, and only if none
found then scan (or pick) an iface.NetworkIP; modify the code around the
existing loop over inst.NetworkInterfaces (references: inst.NetworkInterfaces,
iface.IPv6Address, iface.NetworkIP, peers) so you don't break out on the first
found address but instead prefer IPv6 across all NICs before falling back to
IPv4.

In `@gravity/grpc_client.go`:
- Around line 3724-3739: preferredTLSServerName currently calls parseGRPCURL
which resolves /etc/hosts and can replace the original hostname with an IP;
change preferredTLSServerName to extract the hostname directly from the original
endpointURL (use url.Parse or equivalent) and derive SNI from that raw host
(strip port with net.SplitHostPort or url.Hostname()), returning
defaultServerName on parse errors or if the host is an IP; keep the existing
TrimSpace check and only fall back to parseGRPCURL or defaultServerName as a
last resort.

In `@network/subnet.go`:
- Around line 87-103: ComputeDeploymentSubnet currently masks region into 5 bits
which silently truncates values >31; add an explicit guard at the top of
ComputeDeploymentSubnet to validate the Region fits in 5 bits (<= 31) and reject
otherwise by returning an error (change signature to
ComputeDeploymentSubnet(region Region, machineID string) (netip.Prefix, error)
and return a non-nil error when region > 31, or alternatively panic if your
codebase prefers); update all callers to handle the new error return and use a
clear constant (e.g. maxRegion := 0x1F or 31) in the check before the existing
byte packing logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bf73ae46-6cf7-436e-8fb9-3be24d6b19a0

📥 Commits

Reviewing files that changed from the base of the PR and between 526b6b6 and f122ec1.

📒 Files selected for processing (5)
  • gravity/discovery/gce.go
  • gravity/grpc_client.go
  • gravity/reconnect_discovery_refresh_test.go
  • network/subnet.go
  • network/subnet_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (go)
🔇 Additional comments (4)
network/subnet.go (2)

12-15: Network type addition looks correct.

NetworkDeploymentSubnet at 0x06 is consistent with the existing packed network-type scheme and avoids overlap with sandbox subnets.


108-122: VIP derivation and /96 contract enforcement are solid.

The function enforces prefix size and deterministically derives host bytes within the deployment subnet as expected.

network/subnet_test.go (1)

366-473: Good targeted coverage for the new deployment path.

These tests validate determinism, prefix shape, network tag encoding, containment, and panic behavior for invalid prefix sizes, which aligns well with the new APIs.

gravity/reconnect_discovery_refresh_test.go (1)

26-82: Good regression coverage for the refresh threshold path.

These tests lock in both key invariants: no refresh before the threshold, and only the failing endpoint being replaced once the threshold is hit.

Comment thread gravity/discovery/gce.go
Comment on lines 116 to 123
for _, iface := range inst.NetworkInterfaces {
if iface.IPv6Address != "" {
peers = append(peers, iface.IPv6Address)
break
}
if iface.NetworkIP != "" {
peers = append(peers, iface.NetworkIP)
break
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Scan all NICs before falling back to IPv4.

This still returns IPv4 when the first interface is IPv4-only, even if a later interface has IPv6. On multi-NIC instances that defeats the new IPv6-preferred behavior and can keep routing peers over the wrong family.

Suggested fix
-		for _, iface := range inst.NetworkInterfaces {
-			if iface.IPv6Address != "" {
-				peers = append(peers, iface.IPv6Address)
-				break
-			}
-			if iface.NetworkIP != "" {
-				peers = append(peers, iface.NetworkIP)
-				break
-			}
-		}
+		var ipv4Fallback string
+		for _, iface := range inst.NetworkInterfaces {
+			if iface.IPv6Address != "" {
+				peers = append(peers, iface.IPv6Address)
+				ipv4Fallback = ""
+				break
+			}
+			if ipv4Fallback == "" && iface.NetworkIP != "" {
+				ipv4Fallback = iface.NetworkIP
+			}
+		}
+		if ipv4Fallback != "" {
+			peers = append(peers, ipv4Fallback)
+		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for _, iface := range inst.NetworkInterfaces {
if iface.IPv6Address != "" {
peers = append(peers, iface.IPv6Address)
break
}
if iface.NetworkIP != "" {
peers = append(peers, iface.NetworkIP)
break
var ipv4Fallback string
for _, iface := range inst.NetworkInterfaces {
if iface.IPv6Address != "" {
peers = append(peers, iface.IPv6Address)
ipv4Fallback = ""
break
}
if ipv4Fallback == "" && iface.NetworkIP != "" {
ipv4Fallback = iface.NetworkIP
}
}
if ipv4Fallback != "" {
peers = append(peers, ipv4Fallback)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gravity/discovery/gce.go` around lines 116 - 123, Current loop appends the
first matching address per-interface and breaks, which picks an IPv4 if the
first NIC is IPv4-only even when a later NIC has IPv6; change the logic to first
scan inst.NetworkInterfaces for any iface.IPv6Address and append that (to peers)
if found, and only if none found then scan (or pick) an iface.NetworkIP; modify
the code around the existing loop over inst.NetworkInterfaces (references:
inst.NetworkInterfaces, iface.IPv6Address, iface.NetworkIP, peers) so you don't
break out on the first found address but instead prefer IPv6 across all NICs
before falling back to IPv4.

Comment thread gravity/grpc_client.go
Comment thread network/subnet.go
Comment on lines +87 to +103
func ComputeDeploymentSubnet(region Region, machineID string) netip.Prefix {
machineHash := hashTo32Bits(machineID)

b := make([]byte, 16)
b[0] = 0xfd
b[1] = 0x15
b[2] = 0xd7
b[3] = 0x10
b[4] = (byte(region) << 3) | (byte(NetworkDeploymentSubnet) & 0x07)
b[5] = byte((machineHash >> 16) & 0xff)
b[6] = byte((machineHash >> 8) & 0xff)
b[7] = byte(machineHash & 0xff)
// Bytes 8-15 are zero for the prefix

addr, _ := netip.AddrFromSlice(b)
return netip.PrefixFrom(addr, 96)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add a guard for region overflow before byte packing.

Line 95 packs region into 5 bits, but values > 31 will wrap/truncate and can silently collide with other region encodings. Please reject invalid input explicitly.

Suggested fix
 func ComputeDeploymentSubnet(region Region, machineID string) netip.Prefix {
+	if byte(region) > 31 {
+		panic(fmt.Sprintf("ComputeDeploymentSubnet requires region <= 31, got %d", region))
+	}
 	machineHash := hashTo32Bits(machineID)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func ComputeDeploymentSubnet(region Region, machineID string) netip.Prefix {
machineHash := hashTo32Bits(machineID)
b := make([]byte, 16)
b[0] = 0xfd
b[1] = 0x15
b[2] = 0xd7
b[3] = 0x10
b[4] = (byte(region) << 3) | (byte(NetworkDeploymentSubnet) & 0x07)
b[5] = byte((machineHash >> 16) & 0xff)
b[6] = byte((machineHash >> 8) & 0xff)
b[7] = byte(machineHash & 0xff)
// Bytes 8-15 are zero for the prefix
addr, _ := netip.AddrFromSlice(b)
return netip.PrefixFrom(addr, 96)
}
func ComputeDeploymentSubnet(region Region, machineID string) netip.Prefix {
if byte(region) > 31 {
panic(fmt.Sprintf("ComputeDeploymentSubnet requires region <= 31, got %d", region))
}
machineHash := hashTo32Bits(machineID)
b := make([]byte, 16)
b[0] = 0xfd
b[1] = 0x15
b[2] = 0xd7
b[3] = 0x10
b[4] = (byte(region) << 3) | (byte(NetworkDeploymentSubnet) & 0x07)
b[5] = byte((machineHash >> 16) & 0xff)
b[6] = byte((machineHash >> 8) & 0xff)
b[7] = byte(machineHash & 0xff)
// Bytes 8-15 are zero for the prefix
addr, _ := netip.AddrFromSlice(b)
return netip.PrefixFrom(addr, 96)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@network/subnet.go` around lines 87 - 103, ComputeDeploymentSubnet currently
masks region into 5 bits which silently truncates values >31; add an explicit
guard at the top of ComputeDeploymentSubnet to validate the Region fits in 5
bits (<= 31) and reject otherwise by returning an error (change signature to
ComputeDeploymentSubnet(region Region, machineID string) (netip.Prefix, error)
and return a non-nil error when region > 31, or alternatively panic if your
codebase prefers); update all callers to handle the new error return and use a
clear constant (e.g. maxRegion := 0x1F or 31) in the check before the existing
byte packing logic.

@jhaynie jhaynie merged commit 1481637 into main Apr 19, 2026
4 checks passed
@jhaynie jhaynie deleted the fix/routing-overhaul-v2 branch April 19, 2026 13:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant