Skip to content

fix(ble): Harden BLE connection lifecycle#5795

Merged
jamesarich merged 23 commits into
meshtastic:release/2.8.0from
jeremiah-k:bugfix/ble-refactor
Jun 16, 2026
Merged

fix(ble): Harden BLE connection lifecycle#5795
jamesarich merged 23 commits into
meshtastic:release/2.8.0from
jeremiah-k:bugfix/ble-refactor

Conversation

@jeremiah-k

@jeremiah-k jeremiah-k commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Overview

This PR improves BLE connection stability in long-running sessions by tightening exception classification, session cleanup, reconnect triggering, selected-node discovery, reconnect UX, and liveness recovery. The changes focus on cases where BLE operations can fail after an established connection has been active for some time, leaving stale GATT/profile state behind and requiring manual node reselection. Fatal BLE conditions now trigger bounded cleanup and reconnect paths, selected bonded nodes prefer fresh advertisements when available, and recoverable BLE failures avoid user-facing modal errors while automatic reconnect continues.

Key Changes

Fixes & Stability Improvements

  • Session-fatal BLE classification: Added isSessionFatalBleException() to identify unrecoverable BLE session failures, including NotConnectedException and fatal GATT statuses 8, 19, 22, 62, 129, and 133, with bounded cause-chain traversal.
  • BLE profile exception propagation: Updated KableMeshtasticRadioProfile so session-fatal exceptions from fromRadio and logRadio propagate to the transport layer and can trigger reconnect, while non-fatal diagnostic/log failures remain suppressed.
  • FROMNUM subscription readiness: Ensured pre-readiness FROMNUM observe failures complete subscriptionReady exceptionally, and added bounded subscription readiness handling so setup does not continue with half-initialized BLE state.
  • Session failure deduplication: Added atomic session-failure guards in BleRadioTransport to ensure racing read/write/profile failures only trigger one disconnect callback and one cleanup path per failed session.
  • Stale BLE state cleanup: Fatal session failures now clear stale radioService / fully-connected state and force GATT disconnect so the reconnect loop can proceed instead of waiting on stale state.
  • Write-path hardening: Reworked BLE write handling to re-check the active profile under the write mutex, ignore stale-session writes, and treat post-retry write failures as fatal to the current BLE session.
  • Connection gate timeout: Added bounded handling for connection-state synchronization so missing Connected state transitions clean up stale GATT state and retry instead of hanging.
  • Selected-node discovery: Added a short targeted scan for the selected BLE address before falling back to bonded address-only connection data, allowing fresh advertisement-backed connects when available without adding periodic/background scanning.
  • Reconnect UX noise reduction: Suppressed user-facing modal errors for recoverable BLE reconnect attempts and non-permanent session failures while preserving logs, state updates, reconnect behavior, and permanent-failure reporting.
  • BLE-only liveness recovery: Added guarded BLE transport restart on liveness timeout, using non-permanent recovery semantics and skipping polite-disconnect frames when the transport may already be dead.
  • Restart deduplication: Added atomic restart guarding in SharedRadioInterfaceService so overlapping liveness checks do not stack transport restarts.

Test Coverage

  • Exception classification tests: Added coverage for fatal GATT status detection, NotConnectedException, wrapped causes, and non-fatal conditions.
  • Profile exception tests: Added coverage for fatal fromRadio / logRadio propagation, transient error suppression, and FROMNUM subscription readiness failure behavior.
  • Transport reconnect tests: Expanded reconnect coverage for write failures, read/profile failures, stale service cleanup, CancellationException suppression, FROMNUM pre-readiness failures, subscription-readiness timeout, connected-gate timeout cleanup, and disconnect callback deduplication.
  • Reconnect UX tests: Added/updated coverage to ensure transient reconnect attempts and non-permanent session failures do not emit user-facing error messages while automatic retry continues.
  • Selected-node scan tests: Added coverage for preferring a freshly scanned advertisement-backed BLE device over a bonded address-only device, while preserving bonded fallback when targeted scan finds nothing.
  • Liveness recovery tests: Added coverage for BLE-only liveness restart, non-permanent recovery behavior, suppression of polite zombie writes, suppression of user-facing liveness recovery errors, non-BLE timeout isolation, inbound-data timer reset, and overlapping restart prevention.
  • Failure-injection tests: Added test coverage for read/write/observe failure injection, characteristic-specific observation failures, pre-subscription failures, and never-subscribe characteristics.

Test Infrastructure

  • FakeBleService failure controls: Added configurable read/write/observe failures, characteristic-specific observe failures, pre-subscription observe failures, and never-subscribe behavior for targeted BLE recovery tests.
  • FakeRadioTransport counters: Added close-count tracking to support liveness restart assertions.
  • Deterministic service liveness timing: Added an internal clock seam for SharedRadioInterfaceService tests without changing production injection behavior.
  • Atomic state handling: Added kotlinx.atomicfu usage where needed for session/restart guards and concurrent counters.

Dependencies

  • Added kotlinx.atomicfu to core/network for atomic session state management.
  • Added explicit BLE/Kable test dependencies where needed for failure-injection exception coverage.
  • Added direct core.testing test dependency for service liveness coverage.

Breaking Changes

None. Changes are internal to BLE transport/profile handling, selected-node BLE discovery, reconnect UX, service liveness recovery, and targeted test infrastructure.

Tested and working on Pixel 8a, Pixel 6 Pro, Pixel 4a 5G, and TCL 20 A 5G (T768S) in various configurations.

Closes: #3589

@github-actions github-actions Bot added the bugfix PR tag label Jun 14, 2026
@jeremiah-k jeremiah-k changed the title Harden BLE connection lifecycle fix(ble): Harden BLE connection lifecycle Jun 14, 2026
@jeremiah-k jeremiah-k marked this pull request as draft June 15, 2026 02:56
@jeremiah-k jeremiah-k marked this pull request as ready for review June 15, 2026 05:06
@jamesarich

Copy link
Copy Markdown
Collaborator

Hey @jeremiah-k - before this gets much further, we'll want to investigate these changes vs release/2.8.0 branch and even further upstream in meshtastic-sdk https://github.com/meshtastic/meshtastic-sdk where we're working to abstract the "guts" (node connection and communications layer).

I've got claude looking at it currently vs the release and sdk - we'll see what shakes out.

@jeremiah-k

Copy link
Copy Markdown
Contributor Author

@jamesarich Sounds good

jamesarich and others added 16 commits June 16, 2026 09:31
Opening commit for the release/2.8.0 stabilization branch. versionCode
remains git-derived (offset 29314197 + commit count); only the base
version name is bumped here. CI overrides this with the release tag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Enables the full validate-and-build pipeline on PRs whose base is a
release branch (e.g. release/2.8.0), not just main, so milestone PRs
retargeted onto the release branch get per-PR CI before merge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tic#5586)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…shtastic#5701)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ings directory) (meshtastic#5714)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cher (meshtastic#5765)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…5675

meshtastic#5275 created :feature:discovery with implementation(projects.core.proto), but
main's meshtastic#5675 replaced the :core:proto submodule with the org.meshtastic:protobufs
Maven artifact. Rebasing left a stale module reference. Switched to
implementation(libs.meshtastic.protobufs), matching the convention (cf. feature/node).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eline for 2.8.0 (meshtastic#5775)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…AtStart

The google-flavor AppFunctionsModule registered AppFunctionStateSync
with createdAtStart = true. Eager creation needs the androidContext
binding and immediately spawns the prefs-observing sync coroutine —
so any Koin graph built outside a running app failed with
NoDefinitionFoundException for android.content.Context. That broke
KoinVerificationTest.verifyTypedBootstrapLoadsModuleGraph (the typed
koinApplication<AndroidKoinApp>() bootstrap instantiates eager
singletons), failing the shard-app CI job on this branch.

The definition is now a plain @single (the graph stays lazily
constructible) and GoogleMeshUtilApplication.onCreate resolves it once
after startKoin has bound androidContext — same production behavior,
explicit instead of implicit. It was the repo's only createdAtStart.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: James Rich <james.a.rich@gmail.com>
configureCommon() applied setMultipleConnectionPool(maxNumOfReaders = 4)
to every database, including the in-memory ones used by tests. A read on
a pooled reader connection can observe a snapshot older than the latest
write on the writer connection, so a read immediately after a write may
return stale rows.

DeviceLinkRepositoryImplTest.reconcilePrunesShortCodesNoLongerInCatalog
read [a, b] (the pre-prune state) instead of [a] after a deleteNotIn —
passing locally but flaking on CI depending on connection-assignment
timing (failed shard-core on meshtastic#5738; the identical code passed on meshtastic#5780).

In-memory builders now pass multiConnection = false so reads serialize
behind writes on one connection. Production/file databases keep the
multi-reader pool.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…uppression) (meshtastic#5793)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jamesarich

jamesarich commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Here's what claude came up with:

1. Retarget to release/2.8.0. 2.8.0 ships from there, not main (which is frozen on Renovate bumps) — as-is this misses the release. The rebase is clean except AndroidRadioControllerImpl.kt, which release/2.8.0 removed in the AIDL refactor (#5586): re-apply the MeshService.startService() line to RadioControllerImpl or drop it.

2. Scope/flag the non-BLE liveness change. checkLiveness() now returns for TCP/serial instead of signaling disconnect — a dead TCP session that goes quiet won't be caught anymore. Scope it out or confirm there's still a detection path (a non-BLE regression test would settle it).

3. Hardware soak. This is the most crash-sensitive path, validated only one device so far. Can you run against any other phone/node combos before it lands in the release branch?

@jamesarich

Copy link
Copy Markdown
Collaborator

@jeremiah-k - there's a bit of a conflict (not bad) when retargeting the release/2.8.0 branch. Let's go ahead and re-target this PR to the release and address the checkLiveness path for all transports.

If you can - test it against a few device types/phones.

@jeremiah-k jeremiah-k changed the base branch from main to release/2.8.0 June 16, 2026 19:06
…overy

Long-running BLE sessions would eventually fail writes to toRadioCharacteristic
with Kable NotConnectedException / "Disconnect detected" after hundreds of
successful writes, then report "Not connected" or surface a
JobCancellationException-like error. Manual node reselection was required to
recover.

Treat failed BLE session operations as fatal to the current session, tear down
stale GATT/profile state promptly, and force a GATT disconnect so the reconnect
loop wakes up. Internal forced cleanup is not classified as a user/manual
LocalDisconnect; duplicate disconnect callbacks are deduped via atomic CAS.

- BleExceptionClassifier: centralize fatal BLE session classification
  (NotConnectedException, fatal GATT statuses 8/19/22/62/129/133, bounded
  cause-chain recursion)
- KableMeshtasticRadioProfile: FROMNUM observe completes subscriptionReady
  only when subscription is active; pre-readiness failure completes
  exceptionally and rethrows; fatal fromRadio/logRadio exceptions propagate
- BleRadioTransport: per-attempt session failure tracking; sessionFailureCause
  is set before the sessionFailed CAS to eliminate a visibility window;
  handleFailure dedup; connected-gate timeout cleanup; fatal-write reconnect;
  FROMNUM subscription readiness timeout aborts setup; CancellationException is
  not user-facing
- SharedRadioInterfaceService: BLE-only liveness recovery for silent zombie
  sessions; skips permanent disconnect and polite ToRadio(disconnect=true);
  isRestarting atomic guard prevents overlapping restarts
- FakeBle/FakeRadioTransport: failure-injection seams for targeted tests
- Tests: exception classifier, profile exception propagation, transport
  reconnect crash scenarios, service liveness, fake failure injection
When a bonded device is found, attempt a 2-second targeted scan
(service UUID + address) before using the cached bond. This avoids
connecting to a stale BLE handle when the peripheral has restarted
or changed its GATT server.

- Add findTargetedDevice() with short timeout and graceful fallback
- Restructure findDevice() for early-return clarity
- Add tests for both scan-hit and scan-miss paths
- Update existing tests to emit scanned device for new flow
- Wrap test start/close in try/finally to prevent coroutine leaks
checkLiveness() was passing an errorMessage to onDisconnect, which
flows _connectionError → showAlert and pops a modal dialog. But
liveness recovery is self-healing — detect zombie, tear down, reconnect
— so by the time the user sees the dialog, we're already back.

Drop the errorMessage; keep the warning log and the DeviceSleep state
transition. Added a test asserting no connectionError is emitted.
Ensure that non-permanent BLE disconnects do not include an error message in the
onDisconnect callback. This prevents the UI from showing unnecessary modal
dialogs during automatic reconnection attempts, improving UX by only
surfacing errors for permanent failures.
@jeremiah-k jeremiah-k force-pushed the bugfix/ble-refactor branch from 477fc2a to 5273884 Compare June 16, 2026 19:22
… IDs, discovery abort, AQ zeros) (meshtastic#5813)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jeremiah-k

jeremiah-k commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author
  1. Retarget to release/2.8.0. 2.8.0 ships from there, not main

Done

  1. Scope/flag the non-BLE liveness change. checkLiveness() now returns for TCP/serial instead of signaling disconnect — a dead TCP session that goes quiet won't be caught anymore. Scope it out or confirm there's still a detection path (a non-BLE regression test would settle it).

I checked this path more closely, and I think we're covered here.

checkLiveness() is intentionally scoped to BLE because it is meant to handle the Android BLE/GATT "zombie" case where the stack can stop delivering data without reporting a disconnect.

TCP is handled separately in TcpTransport, not through checkLiveness(). It has its own quiet-session detection: SOCKET_TIMEOUT_MS = 5_000 and SOCKET_RETRIES = 18, so 90 seconds of socket read silence exits the read loop and disconnects. It also has keepalive, EOF handling, and write/flush error handling.

Serial similarly relies on USB/device-removal and serial I/O error paths.

There is also a non-BLE liveness regression test here:
https://github.com/jeremiah-k/Meshtastic-Android/blob/52738849817bf093a953041749ddc53cc67772a2/core/service/src/commonTest/kotlin/org/meshtastic/core/service/SharedRadioInterfaceServiceLivenessTest.kt#L476-L497

The test verifies the BLE liveness restart path does not close/restart a TCP-style transport or mutate its connection state.

  1. Hardware soak. This is the most crash-sensitive path, validated only one device so far. Can you run against any other phone/node combos before it lands in the release branch?

I've been testing the fdroid debug version on a Pixel 8a, Pixel 6 Pro, and Pixel 4a 5G for 24 hrs+ (I updated the summary now). The last few commits were ironing out UX issues from unnecessary dialogs shown during reconnection. I just installed the google debug version on a TCL 20 A 5G (T768S) and everything runs as expected.

I haven't run into any issues in any of my testing. So everything looks good from my side.

@jamesarich jamesarich force-pushed the release/2.8.0 branch 2 times, most recently from 69014e9 to 8874352 Compare June 16, 2026 20:51
…-refactor

# Conflicts:
#	core/repository/src/commonMain/kotlin/org/meshtastic/core/repository/AdminController.kt
#	core/service/src/commonMain/kotlin/org/meshtastic/core/service/AdminControllerImpl.kt
#	core/service/src/commonTest/kotlin/org/meshtastic/core/service/RadioControllerImplTest.kt

@jamesarich jamesarich left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good on my end, appreciate it!

@jamesarich jamesarich merged commit 35fba4d into meshtastic:release/2.8.0 Jun 16, 2026
15 checks passed
@jeremiah-k jeremiah-k deleted the bugfix/ble-refactor branch June 16, 2026 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix PR tag

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: App sometimes refuses to reconnect to node over BLE after node reboot

2 participants