Skip to content

fix(service): resolve selected-device startup race#5828

Merged
jamesarich merged 4 commits into
meshtastic:mainfrom
jeremiah-k:bugfix/startup-lifecycle-race
Jun 17, 2026
Merged

fix(service): resolve selected-device startup race#5828
jamesarich merged 4 commits into
meshtastic:mainfrom
jeremiah-k:bugfix/startup-lifecycle-race

Conversation

@jeremiah-k

@jeremiah-k jeremiah-k commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Overview

This PR fixes a selected-device startup/lifecycle race where the app could show a connected or connecting radio state while NodeDB, channels, and config data failed to load.

The issue could happen when MeshService started before the persisted selected-device address had fully propagated through the service layer. In that window, the orchestrator could initialize against the default/null database or stop the service because no device appeared selected, while transport recovery listeners could still start radio connection work later.

This PR makes selected-device startup ordering explicit, centralizes no-device/sentinel address handling, and gates transport recovery so Bluetooth/network events cannot restart transport after explicit disconnect or device deselection. It also disables Android’s multi-reader Room pool after follow-up testing showed reader acquisition timeouts that could prevent version/config/NodeDB data from finishing loading even after BLE connected.

Background

These issues surfaced while retargeting the BLE hardening PR (#5795) to release/2.8.0 (branch has since been merged to main). After the BLE reconnect/liveness fixes were running on the release branch, manual testing exposed a separate startup/lifecycle problem: the radio transport could appear connected or connecting, but NodeDB, channels, and config data would not load because selected-device address resolution, per-device DB switching, and service/orchestrator startup were not consistently ordered.

A later test pass also surfaced an Android-specific Room reader-pool issue on one phone: BLE connected, but version/config/NodeDB state kept spinning and logs showed repeated reader acquisition timeouts with the reader pool reporting capacity=4, permits=0 while all reader connections were listed as Free.

This PR addresses those release-branch startup/database issues separately from the BLE transport hardening work.

Key Changes

Startup / Lifecycle Fixes

  • MeshService now defers the no-device stop decision when the current device address is not yet valid, waits briefly for a valid selected address from currentDeviceAddressFlow, and only proceeds once address resolution succeeds.
  • MeshServiceOrchestrator.start() now waits for a valid selected-device address, switches to that device’s database, loads cached NodeDB from that database, and only then connects the radio transport.
  • SharedRadioInterfaceService now mirrors selected-device address state independently of connect(), so startup code can observe the persisted selected address before transport startup begins.
  • Bluetooth/network recovery listeners are now gated by an explicit connection intent flag, preventing unintended transport restarts after explicit disconnect() or device deselection.
  • setDeviceAddress(null) / setDeviceAddress("n") now clears the active connection intent and stops transport without allowing later Bluetooth/network recovery events to resurrect the session.

Database / Address Handling

  • buildDbName now normalizes nullable input addresses before database-name generation.
  • No-device/sentinel address variants such as null, blank, "n", "null", ".n", and "default" are consistently mapped to DatabaseConstants.DEFAULT_DB_NAME.
  • No-device sentinel detection is now centralized through a shared predicate used by both isValidDeviceAddress() and buildDbName(), so foreground-service stay-alive decisions and DB-name resolution cannot drift.
  • Real selected-device addresses continue to be hashed under the standard DB_PREFIX_ per-device database naming scheme.
  • Added shared isValidDeviceAddress(address: String?) validation so service startup and database selection use the same no-device/sentinel rules.
  • Android production now uses single-connection Room behavior instead of Room KMP’s multi-reader connection pool, avoiding the observed reader-pool permit exhaustion where DB-backed version/config/NodeDB reads could time out indefinitely.

Runtime Safety

  • Active-service address updates are observed and used to switch database state for valid selected addresses, with failures logged so the observer remains alive.
  • Environmental recovery still works while a connection is desired: Bluetooth/network restoration can restart transport only when the service is actively maintaining a selected-device connection.
  • Explicit disconnect and device deselection now prevent transport recovery from starting a new session later.
  • New selected-device INFO logs in MeshService anonymize device addresses before logging.

Service Cleanup

  • MeshService now owns a dedicated service scope for deferred address resolution work.
  • Pending address-resolution work is cancelled during shutdown.
  • Wake lock / foreground-service cleanup remains tied to whether startup actually proceeds with a valid selected device.

Test Coverage

  • AddressUtilsTest

    • Directly verifies isValidDeviceAddress() accepts real device addresses.
    • Directly verifies isValidDeviceAddress() rejects no-device sentinel variants such as null, blank, "n", "null", ".n", and "default".
  • BuildDbNameTest

    • Covers sentinel-to-default database mapping.
    • Covers case/format normalization for BLE MAC-style addresses.
    • Verifies real device addresses produce non-default per-device DB names under DB_PREFIX_.
  • MeshServiceOrchestratorTest

    • Verifies cold-start ordering: switch active DB → load cached NodeDB → connect radio.
    • Verifies address changes during an active orchestrator session switch active DB.
    • Verifies a stopped orchestrator is not restarted just because connection state later reports connected.
  • SharedRadioInterfaceServiceLivenessTest

    • Verifies explicit disconnect() prevents later BLE recovery from restarting transport.
    • Verifies explicit disconnect() prevents later network recovery from restarting transport.
    • Verifies Bluetooth/network environmental recovery still restarts transport while a connection is desired.
    • Verifies setDeviceAddress("n") / device deselection clears connection intent and prevents later recovery restarts.

Breaking Changes / Migration Notes

No database migration, destructive cleanup, or user-data deletion is added.

The only database-selection behavior change is that no-device/sentinel address variants are now normalized consistently to the default database name. If older builds accidentally created separate DB names for sentinel variants such as "default" or ".n", those variants will now resolve to the default DB instead of preserving separate sentinel-specific database files.

Android production DB access now uses single-connection Room behavior for release reliability. This may reduce DB read concurrency on Android, but avoids the observed reader-pool permit exhaustion where NodeDB/version/config reads could time out indefinitely.

BLE reconnect/liveness behavior, selected-node scanning, and TCP/serial liveness behavior are not intentionally changed.


I've loaded this on multiple phones for testing and everything is working well for now. I will update the status later.

Update: After ~2 hrs of testing so far, everything is running well. Device selection seems more responsive too.

Update 2: One phone later connected over BLE but never finished loading version/config/NodeDB data. Logs showed repeated Room reader acquisition timeouts with the reader pool at capacity=4, permits=0 while all reader connections were reported Free. I added an Android-only DB reliability fix so production Android now uses single-connection Room behavior instead of the multi-reader pool. Testing has resumed with that fix applied.

Update 3: Addressed maintainer review nits: selected-device INFO logs are now anonymized, no-device sentinel detection is shared between isValidDeviceAddress() and buildDbName(), and direct isValidDeviceAddress() tests were added for real addresses and sentinel variants.

Continued in #5841

…covery

Three coupled cold-start defects caused 'connected but NodeDB/channels
don't load' failures:

1. MeshService.onStartCommand used drop(1) on the address StateFlow,
   which could skip an already-current valid address arriving between
   the synchronous read and collector subscription. The service would
   then time out after 5s and stop despite a valid device being selected.

2. MeshServiceOrchestrator.start() called connect() before a valid
   selected address was guaranteed, starting the BLE transport while
   the active DB was still set to default/null.

3. SharedRadioInterfaceService only synced currentDeviceAddressFlow
   from radioPrefs.devAddr inside initStateListeners(), which was only
   invoked from connect() — creating a circular dependency where the
   flow could not update until connect() ran.

Fixes:
- Remove drop(1); first(::isValidDeviceAddress) naturally suspends on
  invalid values and returns immediately if already valid.
- Add early init{} address-sync listener in SharedRadioInterfaceService
  that mirrors radioPrefs.devAddr into currentDeviceAddressFlow without
  starting transport.
- Restructure MeshServiceOrchestrator.start() to wait for a valid
  address via currentDeviceAddressFlow.first(), then switchActiveDatabase,
  then loadCachedNodeDB, then connect() — enforcing the invariant that
  transport must not start until the active DB matches the selected device.
- Add connectionRequested gate so Bluetooth/network state listeners and
  liveness recovery cannot restart a transport after explicit disconnect
  or device deselection (setDeviceAddress(null)).
- Extract isValidDeviceAddress to shared utility in AddressUtils.kt,
  delegating to normalizeAddress for single-source sentinel handling.
- Normalize no-device sentinels in buildDbName so null/blank/n/N/null/
  .n/default all collapse to DEFAULT_DB_NAME consistently.
- Wrap mid-session switchActiveDatabase in safeCatching to prevent
  collector death from transient DB errors.

Tests:
- Event recorder verifies startup ordering: DB switch before cached
  node load before radio connect.
- BLE and network recovery regression tests confirm transport does not
  restart after explicit disconnect or device deselection.
- Environmental recovery tests confirm transport restarts while
  connection is still desired.
- BuildDbNameTest covers all sentinel forms including default/DEFAULT.
@github-actions github-actions Bot added the bugfix PR tag label Jun 17, 2026
@jeremiah-k jeremiah-k marked this pull request as draft June 17, 2026 14:42
Android production Room database builder now uses single-connection mode
(multiConnection = false) instead of the default multi-reader pool.

Under coroutine cancellation churn — particularly during DB switches via
flatMapLatest in SwitchingNodeInfoReadDataSource — the Room KMP reader-pool
permit semaphore can wedge: all reader connections report Free but permits=0,
so every read acquisition times out indefinitely. This blocks all
NodeDB/MyNodeInfo/channel/config flows from loading even though BLE is connected.

Single-connection mode serializes reads and writes through one connection,
eliminating the separate reader permit pool that can exhaust. JVM/iOS
production behavior is unchanged. In-memory databases already used
single-connection mode for deterministic read-after-write.

@jamesarich jamesarich left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline suggestions from the review. One should-fix — anonymize the new device-address logs (PII at INFO) — plus two optional follow-ups: de-dupe the no-device sentinel set so isValidDeviceAddress and buildDbName can't drift, and add a direct isValidDeviceAddress test. The startup/lifecycle logic itself looks correct; these are hardening + maintainability nits. The de-dupe and its imports are interdependent (noted inline).

Comment thread core/service/src/androidMain/kotlin/org/meshtastic/core/service/MeshService.kt Outdated
Comment thread core/service/src/androidMain/kotlin/org/meshtastic/core/service/MeshService.kt Outdated
Comment thread core/service/src/androidMain/kotlin/org/meshtastic/core/service/MeshService.kt Outdated
Comment thread core/common/src/commonMain/kotlin/org/meshtastic/core/common/util/AddressUtils.kt Outdated
Anonymize selected-device addresses in MeshService INFO logs to avoid
exposing BLE MAC/IP addresses in logcat.

Extract a shared isNoDeviceSentinel predicate in AddressUtils.kt so
isValidDeviceAddress and buildDbName cannot drift on sentinel handling.

Add direct isValidDeviceAddress unit tests covering real device
addresses and all no-device sentinel forms.
@jeremiah-k

Copy link
Copy Markdown
Contributor Author

Inline suggestions from the review. One should-fix — anonymize the new device-address logs (PII at INFO) — plus two optional follow-ups: de-dupe the no-device sentinel set so isValidDeviceAddress and buildDbName can't drift, and add a direct isValidDeviceAddress test. The startup/lifecycle logic itself looks correct; these are hardening + maintainability nits. The de-dupe and its imports are interdependent (noted inline).

These have been addressed.

@jeremiah-k jeremiah-k marked this pull request as ready for review June 17, 2026 16:29
@jamesarich jamesarich added this pull request to the merge queue Jun 17, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 17, 2026
@jamesarich jamesarich added this pull request to the merge queue Jun 17, 2026
Merged via the queue into meshtastic:main with commit acf9ff9 Jun 17, 2026
18 checks passed
@jeremiah-k jeremiah-k deleted the bugfix/startup-lifecycle-race branch June 17, 2026 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix PR tag

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants