fix(ble): Harden BLE connection lifecycle#5795
Conversation
|
Hey @jeremiah-k - before this gets much further, we'll want to investigate these changes vs I've got claude looking at it currently vs the release and sdk - we'll see what shakes out. |
|
@jamesarich Sounds good |
Opening commit for the release/2.8.0 stabilization branch. versionCode remains git-derived (offset 29314197 + commit count); only the base version name is bumped here. CI overrides this with the release tag. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Enables the full validate-and-build pipeline on PRs whose base is a release branch (e.g. release/2.8.0), not just main, so milestone PRs retargeted onto the release branch get per-PR CI before merge. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tic#5586) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…shtastic#5701) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ings directory) (meshtastic#5714) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cher (meshtastic#5765) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…5675 meshtastic#5275 created :feature:discovery with implementation(projects.core.proto), but main's meshtastic#5675 replaced the :core:proto submodule with the org.meshtastic:protobufs Maven artifact. Rebasing left a stale module reference. Switched to implementation(libs.meshtastic.protobufs), matching the convention (cf. feature/node). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eline for 2.8.0 (meshtastic#5775) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…AtStart The google-flavor AppFunctionsModule registered AppFunctionStateSync with createdAtStart = true. Eager creation needs the androidContext binding and immediately spawns the prefs-observing sync coroutine — so any Koin graph built outside a running app failed with NoDefinitionFoundException for android.content.Context. That broke KoinVerificationTest.verifyTypedBootstrapLoadsModuleGraph (the typed koinApplication<AndroidKoinApp>() bootstrap instantiates eager singletons), failing the shard-app CI job on this branch. The definition is now a plain @single (the graph stays lazily constructible) and GoogleMeshUtilApplication.onCreate resolves it once after startKoin has bound androidContext — same production behavior, explicit instead of implicit. It was the repo's only createdAtStart. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: James Rich <james.a.rich@gmail.com>
configureCommon() applied setMultipleConnectionPool(maxNumOfReaders = 4) to every database, including the in-memory ones used by tests. A read on a pooled reader connection can observe a snapshot older than the latest write on the writer connection, so a read immediately after a write may return stale rows. DeviceLinkRepositoryImplTest.reconcilePrunesShortCodesNoLongerInCatalog read [a, b] (the pre-prune state) instead of [a] after a deleteNotIn — passing locally but flaking on CI depending on connection-assignment timing (failed shard-core on meshtastic#5738; the identical code passed on meshtastic#5780). In-memory builders now pass multiConnection = false so reads serialize behind writes on one connection. Production/file databases keep the multi-reader pool. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…uppression) (meshtastic#5793) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Here's what claude came up with: 1. Retarget to 2. Scope/flag the non-BLE liveness change. 3. Hardware soak. This is the most crash-sensitive path, validated only one device so far. Can you run against any other phone/node combos before it lands in the release branch? |
|
@jeremiah-k - there's a bit of a conflict (not bad) when retargeting the If you can - test it against a few device types/phones. |
…overy Long-running BLE sessions would eventually fail writes to toRadioCharacteristic with Kable NotConnectedException / "Disconnect detected" after hundreds of successful writes, then report "Not connected" or surface a JobCancellationException-like error. Manual node reselection was required to recover. Treat failed BLE session operations as fatal to the current session, tear down stale GATT/profile state promptly, and force a GATT disconnect so the reconnect loop wakes up. Internal forced cleanup is not classified as a user/manual LocalDisconnect; duplicate disconnect callbacks are deduped via atomic CAS. - BleExceptionClassifier: centralize fatal BLE session classification (NotConnectedException, fatal GATT statuses 8/19/22/62/129/133, bounded cause-chain recursion) - KableMeshtasticRadioProfile: FROMNUM observe completes subscriptionReady only when subscription is active; pre-readiness failure completes exceptionally and rethrows; fatal fromRadio/logRadio exceptions propagate - BleRadioTransport: per-attempt session failure tracking; sessionFailureCause is set before the sessionFailed CAS to eliminate a visibility window; handleFailure dedup; connected-gate timeout cleanup; fatal-write reconnect; FROMNUM subscription readiness timeout aborts setup; CancellationException is not user-facing - SharedRadioInterfaceService: BLE-only liveness recovery for silent zombie sessions; skips permanent disconnect and polite ToRadio(disconnect=true); isRestarting atomic guard prevents overlapping restarts - FakeBle/FakeRadioTransport: failure-injection seams for targeted tests - Tests: exception classifier, profile exception propagation, transport reconnect crash scenarios, service liveness, fake failure injection
When a bonded device is found, attempt a 2-second targeted scan (service UUID + address) before using the cached bond. This avoids connecting to a stale BLE handle when the peripheral has restarted or changed its GATT server. - Add findTargetedDevice() with short timeout and graceful fallback - Restructure findDevice() for early-return clarity - Add tests for both scan-hit and scan-miss paths - Update existing tests to emit scanned device for new flow - Wrap test start/close in try/finally to prevent coroutine leaks
checkLiveness() was passing an errorMessage to onDisconnect, which flows _connectionError → showAlert and pops a modal dialog. But liveness recovery is self-healing — detect zombie, tear down, reconnect — so by the time the user sees the dialog, we're already back. Drop the errorMessage; keep the warning log and the DeviceSleep state transition. Added a test asserting no connectionError is emitted.
Ensure that non-permanent BLE disconnects do not include an error message in the onDisconnect callback. This prevents the UI from showing unnecessary modal dialogs during automatic reconnection attempts, improving UX by only surfacing errors for permanent failures.
477fc2a to
5273884
Compare
… IDs, discovery abort, AQ zeros) (meshtastic#5813) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Done
I checked this path more closely, and I think we're covered here.
TCP is handled separately in Serial similarly relies on USB/device-removal and serial I/O error paths. There is also a non-BLE liveness regression test here: The test verifies the BLE liveness restart path does not close/restart a TCP-style transport or mutate its connection state.
I've been testing the fdroid debug version on a Pixel 8a, Pixel 6 Pro, and Pixel 4a 5G for 24 hrs+ (I updated the summary now). The last few commits were ironing out UX issues from unnecessary dialogs shown during reconnection. I just installed the google debug version on a TCL 20 A 5G (T768S) and everything runs as expected. I haven't run into any issues in any of my testing. So everything looks good from my side. |
69014e9 to
8874352
Compare
…-refactor # Conflicts: # core/repository/src/commonMain/kotlin/org/meshtastic/core/repository/AdminController.kt # core/service/src/commonMain/kotlin/org/meshtastic/core/service/AdminControllerImpl.kt # core/service/src/commonTest/kotlin/org/meshtastic/core/service/RadioControllerImplTest.kt
jamesarich
left a comment
There was a problem hiding this comment.
Looks good on my end, appreciate it!
Overview
This PR improves BLE connection stability in long-running sessions by tightening exception classification, session cleanup, reconnect triggering, selected-node discovery, reconnect UX, and liveness recovery. The changes focus on cases where BLE operations can fail after an established connection has been active for some time, leaving stale GATT/profile state behind and requiring manual node reselection. Fatal BLE conditions now trigger bounded cleanup and reconnect paths, selected bonded nodes prefer fresh advertisements when available, and recoverable BLE failures avoid user-facing modal errors while automatic reconnect continues.
Key Changes
Fixes & Stability Improvements
isSessionFatalBleException()to identify unrecoverable BLE session failures, includingNotConnectedExceptionand fatal GATT statuses8,19,22,62,129, and133, with bounded cause-chain traversal.KableMeshtasticRadioProfileso session-fatal exceptions fromfromRadioandlogRadiopropagate to the transport layer and can trigger reconnect, while non-fatal diagnostic/log failures remain suppressed.subscriptionReadyexceptionally, and added bounded subscription readiness handling so setup does not continue with half-initialized BLE state.BleRadioTransportto ensure racing read/write/profile failures only trigger one disconnect callback and one cleanup path per failed session.radioService/ fully-connected state and force GATT disconnect so the reconnect loop can proceed instead of waiting on stale state.Connectedstate transitions clean up stale GATT state and retry instead of hanging.SharedRadioInterfaceServiceso overlapping liveness checks do not stack transport restarts.Test Coverage
NotConnectedException, wrapped causes, and non-fatal conditions.fromRadio/logRadiopropagation, transient error suppression, and FROMNUM subscription readiness failure behavior.CancellationExceptionsuppression, FROMNUM pre-readiness failures, subscription-readiness timeout, connected-gate timeout cleanup, and disconnect callback deduplication.Test Infrastructure
SharedRadioInterfaceServicetests without changing production injection behavior.kotlinx.atomicfuusage where needed for session/restart guards and concurrent counters.Dependencies
kotlinx.atomicfutocore/networkfor atomic session state management.core.testingtest dependency for service liveness coverage.Breaking Changes
None. Changes are internal to BLE transport/profile handling, selected-node BLE discovery, reconnect UX, service liveness recovery, and targeted test infrastructure.
Tested and working on Pixel 8a, Pixel 6 Pro, Pixel 4a 5G, and TCL 20 A 5G (T768S) in various configurations.
Closes: #3589