fix(service): resolve selected-device startup race by jeremiah-k · Pull Request #5828 · meshtastic/Meshtastic-Android

jeremiah-k · 2026-06-17T12:59:39Z

Overview

This PR fixes a selected-device startup/lifecycle race where the app could show a connected or connecting radio state while NodeDB, channels, and config data failed to load.

The issue could happen when MeshService started before the persisted selected-device address had fully propagated through the service layer. In that window, the orchestrator could initialize against the default/null database or stop the service because no device appeared selected, while transport recovery listeners could still start radio connection work later.

This PR makes selected-device startup ordering explicit, centralizes no-device/sentinel address handling, and gates transport recovery so Bluetooth/network events cannot restart transport after explicit disconnect or device deselection. It also disables Android’s multi-reader Room pool after follow-up testing showed reader acquisition timeouts that could prevent version/config/NodeDB data from finishing loading even after BLE connected.

Background

These issues surfaced while retargeting the BLE hardening PR (#5795) to release/2.8.0 (branch has since been merged to main). After the BLE reconnect/liveness fixes were running on the release branch, manual testing exposed a separate startup/lifecycle problem: the radio transport could appear connected or connecting, but NodeDB, channels, and config data would not load because selected-device address resolution, per-device DB switching, and service/orchestrator startup were not consistently ordered.

A later test pass also surfaced an Android-specific Room reader-pool issue on one phone: BLE connected, but version/config/NodeDB state kept spinning and logs showed repeated reader acquisition timeouts with the reader pool reporting capacity=4, permits=0 while all reader connections were listed as Free.

This PR addresses those release-branch startup/database issues separately from the BLE transport hardening work.

Key Changes

Startup / Lifecycle Fixes

MeshService now defers the no-device stop decision when the current device address is not yet valid, waits briefly for a valid selected address from currentDeviceAddressFlow, and only proceeds once address resolution succeeds.
MeshServiceOrchestrator.start() now waits for a valid selected-device address, switches to that device’s database, loads cached NodeDB from that database, and only then connects the radio transport.
SharedRadioInterfaceService now mirrors selected-device address state independently of connect(), so startup code can observe the persisted selected address before transport startup begins.
Bluetooth/network recovery listeners are now gated by an explicit connection intent flag, preventing unintended transport restarts after explicit disconnect() or device deselection.
setDeviceAddress(null) / setDeviceAddress("n") now clears the active connection intent and stops transport without allowing later Bluetooth/network recovery events to resurrect the session.

Database / Address Handling

buildDbName now normalizes nullable input addresses before database-name generation.
No-device/sentinel address variants such as null, blank, "n", "null", ".n", and "default" are consistently mapped to DatabaseConstants.DEFAULT_DB_NAME.
No-device sentinel detection is now centralized through a shared predicate used by both isValidDeviceAddress() and buildDbName(), so foreground-service stay-alive decisions and DB-name resolution cannot drift.
Real selected-device addresses continue to be hashed under the standard DB_PREFIX_ per-device database naming scheme.
Added shared isValidDeviceAddress(address: String?) validation so service startup and database selection use the same no-device/sentinel rules.
Android production now uses single-connection Room behavior instead of Room KMP’s multi-reader connection pool, avoiding the observed reader-pool permit exhaustion where DB-backed version/config/NodeDB reads could time out indefinitely.

Runtime Safety

Active-service address updates are observed and used to switch database state for valid selected addresses, with failures logged so the observer remains alive.
Environmental recovery still works while a connection is desired: Bluetooth/network restoration can restart transport only when the service is actively maintaining a selected-device connection.
Explicit disconnect and device deselection now prevent transport recovery from starting a new session later.
New selected-device INFO logs in MeshService anonymize device addresses before logging.

Service Cleanup

MeshService now owns a dedicated service scope for deferred address resolution work.
Pending address-resolution work is cancelled during shutdown.
Wake lock / foreground-service cleanup remains tied to whether startup actually proceeds with a valid selected device.

Test Coverage

AddressUtilsTest
- Directly verifies isValidDeviceAddress() accepts real device addresses.
- Directly verifies isValidDeviceAddress() rejects no-device sentinel variants such as null, blank, "n", "null", ".n", and "default".
BuildDbNameTest
- Covers sentinel-to-default database mapping.
- Covers case/format normalization for BLE MAC-style addresses.
- Verifies real device addresses produce non-default per-device DB names under DB_PREFIX_.
MeshServiceOrchestratorTest
- Verifies cold-start ordering: switch active DB → load cached NodeDB → connect radio.
- Verifies address changes during an active orchestrator session switch active DB.
- Verifies a stopped orchestrator is not restarted just because connection state later reports connected.
SharedRadioInterfaceServiceLivenessTest
- Verifies explicit disconnect() prevents later BLE recovery from restarting transport.
- Verifies explicit disconnect() prevents later network recovery from restarting transport.
- Verifies Bluetooth/network environmental recovery still restarts transport while a connection is desired.
- Verifies setDeviceAddress("n") / device deselection clears connection intent and prevents later recovery restarts.

Breaking Changes / Migration Notes

No database migration, destructive cleanup, or user-data deletion is added.

The only database-selection behavior change is that no-device/sentinel address variants are now normalized consistently to the default database name. If older builds accidentally created separate DB names for sentinel variants such as "default" or ".n", those variants will now resolve to the default DB instead of preserving separate sentinel-specific database files.

Android production DB access now uses single-connection Room behavior for release reliability. This may reduce DB read concurrency on Android, but avoids the observed reader-pool permit exhaustion where NodeDB/version/config reads could time out indefinitely.

BLE reconnect/liveness behavior, selected-node scanning, and TCP/serial liveness behavior are not intentionally changed.

I've loaded this on multiple phones for testing and everything is working well for now. I will update the status later.

Update: After ~2 hrs of testing so far, everything is running well. Device selection seems more responsive too.

Update 2: One phone later connected over BLE but never finished loading version/config/NodeDB data. Logs showed repeated Room reader acquisition timeouts with the reader pool at capacity=4, permits=0 while all reader connections were reported Free. I added an Android-only DB reliability fix so production Android now uses single-connection Room behavior instead of the multi-reader pool. Testing has resumed with that fix applied.

Update 3: Addressed maintainer review nits: selected-device INFO logs are now anonymized, no-device sentinel detection is shared between isValidDeviceAddress() and buildDbName(), and direct isValidDeviceAddress() tests were added for real addresses and sentinel variants.

Continued in #5841

…covery Three coupled cold-start defects caused 'connected but NodeDB/channels don't load' failures: 1. MeshService.onStartCommand used drop(1) on the address StateFlow, which could skip an already-current valid address arriving between the synchronous read and collector subscription. The service would then time out after 5s and stop despite a valid device being selected. 2. MeshServiceOrchestrator.start() called connect() before a valid selected address was guaranteed, starting the BLE transport while the active DB was still set to default/null. 3. SharedRadioInterfaceService only synced currentDeviceAddressFlow from radioPrefs.devAddr inside initStateListeners(), which was only invoked from connect() — creating a circular dependency where the flow could not update until connect() ran. Fixes: - Remove drop(1); first(::isValidDeviceAddress) naturally suspends on invalid values and returns immediately if already valid. - Add early init{} address-sync listener in SharedRadioInterfaceService that mirrors radioPrefs.devAddr into currentDeviceAddressFlow without starting transport. - Restructure MeshServiceOrchestrator.start() to wait for a valid address via currentDeviceAddressFlow.first(), then switchActiveDatabase, then loadCachedNodeDB, then connect() — enforcing the invariant that transport must not start until the active DB matches the selected device. - Add connectionRequested gate so Bluetooth/network state listeners and liveness recovery cannot restart a transport after explicit disconnect or device deselection (setDeviceAddress(null)). - Extract isValidDeviceAddress to shared utility in AddressUtils.kt, delegating to normalizeAddress for single-source sentinel handling. - Normalize no-device sentinels in buildDbName so null/blank/n/N/null/ .n/default all collapse to DEFAULT_DB_NAME consistently. - Wrap mid-session switchActiveDatabase in safeCatching to prevent collector death from transient DB errors. Tests: - Event recorder verifies startup ordering: DB switch before cached node load before radio connect. - BLE and network recovery regression tests confirm transport does not restart after explicit disconnect or device deselection. - Environmental recovery tests confirm transport restarts while connection is still desired. - BuildDbNameTest covers all sentinel forms including default/DEFAULT.

Android production Room database builder now uses single-connection mode (multiConnection = false) instead of the default multi-reader pool. Under coroutine cancellation churn — particularly during DB switches via flatMapLatest in SwitchingNodeInfoReadDataSource — the Room KMP reader-pool permit semaphore can wedge: all reader connections report Free but permits=0, so every read acquisition times out indefinitely. This blocks all NodeDB/MyNodeInfo/channel/config flows from loading even though BLE is connected. Single-connection mode serializes reads and writes through one connection, eliminating the separate reader permit pool that can exhaust. JVM/iOS production behavior is unchanged. In-memory databases already used single-connection mode for deterministic read-after-write.

jamesarich

Inline suggestions from the review. One should-fix — anonymize the new device-address logs (PII at INFO) — plus two optional follow-ups: de-dupe the no-device sentinel set so isValidDeviceAddress and buildDbName can't drift, and add a direct isValidDeviceAddress test. The startup/lifecycle logic itself looks correct; these are hardening + maintainability nits. The de-dupe and its imports are interdependent (noted inline).

Anonymize selected-device addresses in MeshService INFO logs to avoid exposing BLE MAC/IP addresses in logcat. Extract a shared isNoDeviceSentinel predicate in AddressUtils.kt so isValidDeviceAddress and buildDbName cannot drift on sentinel handling. Add direct isValidDeviceAddress unit tests covering real device addresses and all no-device sentinel forms.

jeremiah-k · 2026-06-17T16:29:00Z

Inline suggestions from the review. One should-fix — anonymize the new device-address logs (PII at INFO) — plus two optional follow-ups: de-dupe the no-device sentinel set so isValidDeviceAddress and buildDbName can't drift, and add a direct isValidDeviceAddress test. The startup/lifecycle logic itself looks correct; these are hardening + maintainability nits. The de-dupe and its imports are interdependent (noted inline).

These have been addressed.

github-actions Bot added the bugfix PR tag label Jun 17, 2026

jeremiah-k marked this pull request as draft June 17, 2026 14:42

jeremiah-k added 2 commits June 17, 2026 09:42

Merge branch 'main' into bugfix/startup-lifecycle-race

a919949

jamesarich reviewed Jun 17, 2026

View reviewed changes

jeremiah-k marked this pull request as ready for review June 17, 2026 16:29

jamesarich approved these changes Jun 17, 2026

View reviewed changes

jamesarich added this pull request to the merge queue Jun 17, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 17, 2026

jamesarich added this pull request to the merge queue Jun 17, 2026

Merged via the queue into meshtastic:main with commit acf9ff9 Jun 17, 2026
18 checks passed

jeremiah-k deleted the bugfix/startup-lifecycle-race branch June 17, 2026 18:05

jeremiah-k mentioned this pull request Jun 18, 2026

fix(database): defer FTS backfill on cold start and enforce single-connection pool #5841

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(service): resolve selected-device startup race#5828

fix(service): resolve selected-device startup race#5828
jamesarich merged 4 commits into
meshtastic:mainfrom
jeremiah-k:bugfix/startup-lifecycle-race

jeremiah-k commented Jun 17, 2026 •

edited

Loading

Uh oh!

jamesarich left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jeremiah-k commented Jun 17, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

jeremiah-k commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Background

Key Changes

Test Coverage

Breaking Changes / Migration Notes

Uh oh!

jamesarich left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jeremiah-k commented Jun 17, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jeremiah-k commented Jun 17, 2026 •

edited

Loading