fix(ble): Harden BLE connection lifecycle by jeremiah-k · Pull Request #5795 · meshtastic/Meshtastic-Android

jeremiah-k · 2026-06-14T21:06:39Z

Overview

This PR improves BLE connection stability in long-running sessions by tightening exception classification, session cleanup, reconnect triggering, selected-node discovery, reconnect UX, and liveness recovery. The changes focus on cases where BLE operations can fail after an established connection has been active for some time, leaving stale GATT/profile state behind and requiring manual node reselection. Fatal BLE conditions now trigger bounded cleanup and reconnect paths, selected bonded nodes prefer fresh advertisements when available, and recoverable BLE failures avoid user-facing modal errors while automatic reconnect continues.

Key Changes

Fixes & Stability Improvements

Session-fatal BLE classification: Added isSessionFatalBleException() to identify unrecoverable BLE session failures, including NotConnectedException and fatal GATT statuses 8, 19, 22, 62, 129, and 133, with bounded cause-chain traversal.
BLE profile exception propagation: Updated KableMeshtasticRadioProfile so session-fatal exceptions from fromRadio and logRadio propagate to the transport layer and can trigger reconnect, while non-fatal diagnostic/log failures remain suppressed.
FROMNUM subscription readiness: Ensured pre-readiness FROMNUM observe failures complete subscriptionReady exceptionally, and added bounded subscription readiness handling so setup does not continue with half-initialized BLE state.
Session failure deduplication: Added atomic session-failure guards in BleRadioTransport to ensure racing read/write/profile failures only trigger one disconnect callback and one cleanup path per failed session.
Stale BLE state cleanup: Fatal session failures now clear stale radioService / fully-connected state and force GATT disconnect so the reconnect loop can proceed instead of waiting on stale state.
Write-path hardening: Reworked BLE write handling to re-check the active profile under the write mutex, ignore stale-session writes, and treat post-retry write failures as fatal to the current BLE session.
Connection gate timeout: Added bounded handling for connection-state synchronization so missing Connected state transitions clean up stale GATT state and retry instead of hanging.
Selected-node discovery: Added a short targeted scan for the selected BLE address before falling back to bonded address-only connection data, allowing fresh advertisement-backed connects when available without adding periodic/background scanning.
Reconnect UX noise reduction: Suppressed user-facing modal errors for recoverable BLE reconnect attempts and non-permanent session failures while preserving logs, state updates, reconnect behavior, and permanent-failure reporting.
BLE-only liveness recovery: Added guarded BLE transport restart on liveness timeout, using non-permanent recovery semantics and skipping polite-disconnect frames when the transport may already be dead.
Restart deduplication: Added atomic restart guarding in SharedRadioInterfaceService so overlapping liveness checks do not stack transport restarts.

Test Coverage

Exception classification tests: Added coverage for fatal GATT status detection, NotConnectedException, wrapped causes, and non-fatal conditions.
Profile exception tests: Added coverage for fatal fromRadio / logRadio propagation, transient error suppression, and FROMNUM subscription readiness failure behavior.
Transport reconnect tests: Expanded reconnect coverage for write failures, read/profile failures, stale service cleanup, CancellationException suppression, FROMNUM pre-readiness failures, subscription-readiness timeout, connected-gate timeout cleanup, and disconnect callback deduplication.
Reconnect UX tests: Added/updated coverage to ensure transient reconnect attempts and non-permanent session failures do not emit user-facing error messages while automatic retry continues.
Selected-node scan tests: Added coverage for preferring a freshly scanned advertisement-backed BLE device over a bonded address-only device, while preserving bonded fallback when targeted scan finds nothing.
Liveness recovery tests: Added coverage for BLE-only liveness restart, non-permanent recovery behavior, suppression of polite zombie writes, suppression of user-facing liveness recovery errors, non-BLE timeout isolation, inbound-data timer reset, and overlapping restart prevention.
Failure-injection tests: Added test coverage for read/write/observe failure injection, characteristic-specific observation failures, pre-subscription failures, and never-subscribe characteristics.

Test Infrastructure

FakeBleService failure controls: Added configurable read/write/observe failures, characteristic-specific observe failures, pre-subscription observe failures, and never-subscribe behavior for targeted BLE recovery tests.
FakeRadioTransport counters: Added close-count tracking to support liveness restart assertions.
Deterministic service liveness timing: Added an internal clock seam for SharedRadioInterfaceService tests without changing production injection behavior.
Atomic state handling: Added kotlinx.atomicfu usage where needed for session/restart guards and concurrent counters.

Dependencies

Added kotlinx.atomicfu to core/network for atomic session state management.
Added explicit BLE/Kable test dependencies where needed for failure-injection exception coverage.
Added direct core.testing test dependency for service liveness coverage.

Breaking Changes

None. Changes are internal to BLE transport/profile handling, selected-node BLE discovery, reconnect UX, service liveness recovery, and targeted test infrastructure.

Tested and working on Pixel 8a, Pixel 6 Pro, Pixel 4a 5G, and TCL 20 A 5G (T768S) in various configurations.

Closes: #3589

jamesarich · 2026-06-16T14:17:05Z

Hey @jeremiah-k - before this gets much further, we'll want to investigate these changes vs release/2.8.0 branch and even further upstream in meshtastic-sdk https://github.com/meshtastic/meshtastic-sdk where we're working to abstract the "guts" (node connection and communications layer).

I've got claude looking at it currently vs the release and sdk - we'll see what shakes out.

jeremiah-k · 2026-06-16T14:20:43Z

@jamesarich Sounds good

Opening commit for the release/2.8.0 stabilization branch. versionCode remains git-derived (offset 29314197 + commit count); only the base version name is bumped here. CI overrides this with the release tag. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Enables the full validate-and-build pipeline on PRs whose base is a release branch (e.g. release/2.8.0), not just main, so milestone PRs retargeted onto the release branch get per-PR CI before merge. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…tic#5586) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…shtastic#5701) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ings directory) (meshtastic#5714) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…cher (meshtastic#5765) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…5675 meshtastic#5275 created :feature:discovery with implementation(projects.core.proto), but main's meshtastic#5675 replaced the :core:proto submodule with the org.meshtastic:protobufs Maven artifact. Rebasing left a stale module reference. Switched to implementation(libs.meshtastic.protobufs), matching the convention (cf. feature/node). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…eline for 2.8.0 (meshtastic#5775) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@single

…AtStart The google-flavor AppFunctionsModule registered AppFunctionStateSync with createdAtStart = true. Eager creation needs the androidContext binding and immediately spawns the prefs-observing sync coroutine — so any Koin graph built outside a running app failed with NoDefinitionFoundException for android.content.Context. That broke KoinVerificationTest.verifyTypedBootstrapLoadsModuleGraph (the typed koinApplication<AndroidKoinApp>() bootstrap instantiates eager singletons), failing the shard-app CI job on this branch. The definition is now a plain @single (the graph stays lazily constructible) and GoogleMeshUtilApplication.onCreate resolves it once after startKoin has bound androidContext — same production behavior, explicit instead of implicit. It was the repo's only createdAtStart. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: James Rich <james.a.rich@gmail.com>

configureCommon() applied setMultipleConnectionPool(maxNumOfReaders = 4) to every database, including the in-memory ones used by tests. A read on a pooled reader connection can observe a snapshot older than the latest write on the writer connection, so a read immediately after a write may return stale rows. DeviceLinkRepositoryImplTest.reconcilePrunesShortCodesNoLongerInCatalog read [a, b] (the pre-prune state) instead of [a] after a deleteNotIn — passing locally but flaking on CI depending on connection-assignment timing (failed shard-core on meshtastic#5738; the identical code passed on meshtastic#5780). In-memory builders now pass multiConnection = false so reads serialize behind writes on one connection. Production/file databases keep the multi-reader pool. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…uppression) (meshtastic#5793) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jamesarich · 2026-06-16T18:53:29Z

Here's what claude came up with:

1. Retarget to release/2.8.0. 2.8.0 ships from there, not main (which is frozen on Renovate bumps) — as-is this misses the release. The rebase is clean except AndroidRadioControllerImpl.kt, which release/2.8.0 removed in the AIDL refactor (#5586): re-apply the MeshService.startService() line to RadioControllerImpl or drop it.

2. Scope/flag the non-BLE liveness change. checkLiveness() now returns for TCP/serial instead of signaling disconnect — a dead TCP session that goes quiet won't be caught anymore. Scope it out or confirm there's still a detection path (a non-BLE regression test would settle it).

3. Hardware soak. This is the most crash-sensitive path, validated only one device so far. Can you run against any other phone/node combos before it lands in the release branch?

jamesarich · 2026-06-16T18:55:14Z

@jeremiah-k - there's a bit of a conflict (not bad) when retargeting the release/2.8.0 branch. Let's go ahead and re-target this PR to the release and address the checkLiveness path for all transports.

If you can - test it against a few device types/phones.

…overy Long-running BLE sessions would eventually fail writes to toRadioCharacteristic with Kable NotConnectedException / "Disconnect detected" after hundreds of successful writes, then report "Not connected" or surface a JobCancellationException-like error. Manual node reselection was required to recover. Treat failed BLE session operations as fatal to the current session, tear down stale GATT/profile state promptly, and force a GATT disconnect so the reconnect loop wakes up. Internal forced cleanup is not classified as a user/manual LocalDisconnect; duplicate disconnect callbacks are deduped via atomic CAS. - BleExceptionClassifier: centralize fatal BLE session classification (NotConnectedException, fatal GATT statuses 8/19/22/62/129/133, bounded cause-chain recursion) - KableMeshtasticRadioProfile: FROMNUM observe completes subscriptionReady only when subscription is active; pre-readiness failure completes exceptionally and rethrows; fatal fromRadio/logRadio exceptions propagate - BleRadioTransport: per-attempt session failure tracking; sessionFailureCause is set before the sessionFailed CAS to eliminate a visibility window; handleFailure dedup; connected-gate timeout cleanup; fatal-write reconnect; FROMNUM subscription readiness timeout aborts setup; CancellationException is not user-facing - SharedRadioInterfaceService: BLE-only liveness recovery for silent zombie sessions; skips permanent disconnect and polite ToRadio(disconnect=true); isRestarting atomic guard prevents overlapping restarts - FakeBle/FakeRadioTransport: failure-injection seams for targeted tests - Tests: exception classifier, profile exception propagation, transport reconnect crash scenarios, service liveness, fake failure injection

When a bonded device is found, attempt a 2-second targeted scan (service UUID + address) before using the cached bond. This avoids connecting to a stale BLE handle when the peripheral has restarted or changed its GATT server. - Add findTargetedDevice() with short timeout and graceful fallback - Restructure findDevice() for early-return clarity - Add tests for both scan-hit and scan-miss paths - Update existing tests to emit scanned device for new flow - Wrap test start/close in try/finally to prevent coroutine leaks

checkLiveness() was passing an errorMessage to onDisconnect, which flows _connectionError → showAlert and pops a modal dialog. But liveness recovery is self-healing — detect zombie, tear down, reconnect — so by the time the user sees the dialog, we're already back. Drop the errorMessage; keep the warning log and the DeviceSleep state transition. Added a test asserting no connectionError is emitted.

Ensure that non-permanent BLE disconnects do not include an error message in the onDisconnect callback. This prevents the UI from showing unnecessary modal dialogs during automatic reconnection attempts, improving UX by only surfacing errors for permanent failures.

… IDs, discovery abort, AQ zeros) (meshtastic#5813) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jeremiah-k · 2026-06-16T20:37:40Z

Retarget to release/2.8.0. 2.8.0 ships from there, not main

Done

Scope/flag the non-BLE liveness change. checkLiveness() now returns for TCP/serial instead of signaling disconnect — a dead TCP session that goes quiet won't be caught anymore. Scope it out or confirm there's still a detection path (a non-BLE regression test would settle it).

I checked this path more closely, and I think we're covered here.

checkLiveness() is intentionally scoped to BLE because it is meant to handle the Android BLE/GATT "zombie" case where the stack can stop delivering data without reporting a disconnect.

TCP is handled separately in TcpTransport, not through checkLiveness(). It has its own quiet-session detection: SOCKET_TIMEOUT_MS = 5_000 and SOCKET_RETRIES = 18, so 90 seconds of socket read silence exits the read loop and disconnects. It also has keepalive, EOF handling, and write/flush error handling.

Serial similarly relies on USB/device-removal and serial I/O error paths.

There is also a non-BLE liveness regression test here:
https://github.com/jeremiah-k/Meshtastic-Android/blob/52738849817bf093a953041749ddc53cc67772a2/core/service/src/commonTest/kotlin/org/meshtastic/core/service/SharedRadioInterfaceServiceLivenessTest.kt#L476-L497

The test verifies the BLE liveness restart path does not close/restart a TCP-style transport or mutate its connection state.

Hardware soak. This is the most crash-sensitive path, validated only one device so far. Can you run against any other phone/node combos before it lands in the release branch?

I've been testing the fdroid debug version on a Pixel 8a, Pixel 6 Pro, and Pixel 4a 5G for 24 hrs+ (I updated the summary now). The last few commits were ironing out UX issues from unnecessary dialogs shown during reconnection. I just installed the google debug version on a TCL 20 A 5G (T768S) and everything runs as expected.

I haven't run into any issues in any of my testing. So everything looks good from my side.

…-refactor # Conflicts: # core/repository/src/commonMain/kotlin/org/meshtastic/core/repository/AdminController.kt # core/service/src/commonMain/kotlin/org/meshtastic/core/service/AdminControllerImpl.kt # core/service/src/commonTest/kotlin/org/meshtastic/core/service/RadioControllerImplTest.kt

jamesarich

Looks good on my end, appreciate it!

github-actions Bot added the bugfix PR tag label Jun 14, 2026

jeremiah-k changed the title ~~Harden BLE connection lifecycle~~ fix(ble): Harden BLE connection lifecycle Jun 14, 2026

jeremiah-k marked this pull request as draft June 15, 2026 02:56

jeremiah-k marked this pull request as ready for review June 15, 2026 05:06

jamesarich and others added 16 commits June 16, 2026 09:31

feat(ai): Add App Functions for system AI integration (meshtastic#5585)

0794028

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat(car): Android Car App Library integration (meshtastic#5633)

cc9e86f

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat: FTS5 full-text message search (meshtastic#5373)

e6220c3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

refactor: Remove AIDL API and modernize service architecture (meshtas…

9d1478b

…tic#5586) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

feat: add air quality telemetry display (PM1.0, PM2.5, PM10, CO2) (me…

f1b250a

…shtastic#5701) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat(node): msh.to device hardware links ("I want one" section + Sett…

0519051

…ings directory) (meshtastic#5714) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(discovery): mesh network discovery (meshtastic#5275)

cef2c1f

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

perf: add Baseline Profile generation for :androidApp (meshtastic#5735)

1a63ed9

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

refactor(node): fetch device links from the API, drop the bundled mat…

5354cc2

…cher (meshtastic#5765) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: update repo/developer/in-app docs and repair the screenshot pip…

519507f

…eline for 2.8.0 (meshtastic#5775) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix(node): chart local-node air-quality telemetry (orphaning + zero-s…

8cffc9f

…uppression) (meshtastic#5793) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jeremiah-k changed the base branch from main to release/2.8.0 June 16, 2026 19:06

jeremiah-k added 4 commits June 16, 2026 14:21

jeremiah-k force-pushed the bugfix/ble-refactor branch from 477fc2a to 5273884 Compare June 16, 2026 19:22

fix: resolve release/2.8.0 branch-review findings (car hosts, AI node…

d49670a

… IDs, discovery abort, AQ zeros) (meshtastic#5813) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge branch 'release/2.8.0' into bugfix/ble-refactor

9cb506a

jamesarich force-pushed the release/2.8.0 branch 2 times, most recently from 69014e9 to 8874352 Compare June 16, 2026 20:51

jamesarich approved these changes Jun 16, 2026

View reviewed changes

jamesarich merged commit 35fba4d into meshtastic:release/2.8.0 Jun 16, 2026
15 checks passed

jeremiah-k deleted the bugfix/ble-refactor branch June 16, 2026 22:27

jeremiah-k mentioned this pull request Jun 17, 2026

fix(service): resolve selected-device startup race #5828

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(ble): Harden BLE connection lifecycle#5795

fix(ble): Harden BLE connection lifecycle#5795
jamesarich merged 23 commits into
meshtastic:release/2.8.0from
jeremiah-k:bugfix/ble-refactor

jeremiah-k commented Jun 14, 2026 •

edited

Loading

Uh oh!

jamesarich commented Jun 16, 2026

Uh oh!

jeremiah-k commented Jun 16, 2026

Uh oh!

jamesarich commented Jun 16, 2026 •

edited

Loading

Uh oh!

jamesarich commented Jun 16, 2026

Uh oh!

jeremiah-k commented Jun 16, 2026 •

edited

Loading

Uh oh!

jamesarich left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

jeremiah-k commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Key Changes

Fixes & Stability Improvements

Test Coverage

Test Infrastructure

Dependencies

Breaking Changes

Uh oh!

jamesarich commented Jun 16, 2026

Uh oh!

jeremiah-k commented Jun 16, 2026

Uh oh!

jamesarich commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jamesarich commented Jun 16, 2026

Uh oh!

jeremiah-k commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jamesarich left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jeremiah-k commented Jun 14, 2026 •

edited

Loading

jamesarich commented Jun 16, 2026 •

edited

Loading

jeremiah-k commented Jun 16, 2026 •

edited

Loading