fix(service): resolve selected-device startup race#5828
Conversation
…covery
Three coupled cold-start defects caused 'connected but NodeDB/channels
don't load' failures:
1. MeshService.onStartCommand used drop(1) on the address StateFlow,
which could skip an already-current valid address arriving between
the synchronous read and collector subscription. The service would
then time out after 5s and stop despite a valid device being selected.
2. MeshServiceOrchestrator.start() called connect() before a valid
selected address was guaranteed, starting the BLE transport while
the active DB was still set to default/null.
3. SharedRadioInterfaceService only synced currentDeviceAddressFlow
from radioPrefs.devAddr inside initStateListeners(), which was only
invoked from connect() — creating a circular dependency where the
flow could not update until connect() ran.
Fixes:
- Remove drop(1); first(::isValidDeviceAddress) naturally suspends on
invalid values and returns immediately if already valid.
- Add early init{} address-sync listener in SharedRadioInterfaceService
that mirrors radioPrefs.devAddr into currentDeviceAddressFlow without
starting transport.
- Restructure MeshServiceOrchestrator.start() to wait for a valid
address via currentDeviceAddressFlow.first(), then switchActiveDatabase,
then loadCachedNodeDB, then connect() — enforcing the invariant that
transport must not start until the active DB matches the selected device.
- Add connectionRequested gate so Bluetooth/network state listeners and
liveness recovery cannot restart a transport after explicit disconnect
or device deselection (setDeviceAddress(null)).
- Extract isValidDeviceAddress to shared utility in AddressUtils.kt,
delegating to normalizeAddress for single-source sentinel handling.
- Normalize no-device sentinels in buildDbName so null/blank/n/N/null/
.n/default all collapse to DEFAULT_DB_NAME consistently.
- Wrap mid-session switchActiveDatabase in safeCatching to prevent
collector death from transient DB errors.
Tests:
- Event recorder verifies startup ordering: DB switch before cached
node load before radio connect.
- BLE and network recovery regression tests confirm transport does not
restart after explicit disconnect or device deselection.
- Environmental recovery tests confirm transport restarts while
connection is still desired.
- BuildDbNameTest covers all sentinel forms including default/DEFAULT.
Android production Room database builder now uses single-connection mode (multiConnection = false) instead of the default multi-reader pool. Under coroutine cancellation churn — particularly during DB switches via flatMapLatest in SwitchingNodeInfoReadDataSource — the Room KMP reader-pool permit semaphore can wedge: all reader connections report Free but permits=0, so every read acquisition times out indefinitely. This blocks all NodeDB/MyNodeInfo/channel/config flows from loading even though BLE is connected. Single-connection mode serializes reads and writes through one connection, eliminating the separate reader permit pool that can exhaust. JVM/iOS production behavior is unchanged. In-memory databases already used single-connection mode for deterministic read-after-write.
jamesarich
left a comment
There was a problem hiding this comment.
Inline suggestions from the review. One should-fix — anonymize the new device-address logs (PII at INFO) — plus two optional follow-ups: de-dupe the no-device sentinel set so isValidDeviceAddress and buildDbName can't drift, and add a direct isValidDeviceAddress test. The startup/lifecycle logic itself looks correct; these are hardening + maintainability nits. The de-dupe and its imports are interdependent (noted inline).
Anonymize selected-device addresses in MeshService INFO logs to avoid exposing BLE MAC/IP addresses in logcat. Extract a shared isNoDeviceSentinel predicate in AddressUtils.kt so isValidDeviceAddress and buildDbName cannot drift on sentinel handling. Add direct isValidDeviceAddress unit tests covering real device addresses and all no-device sentinel forms.
These have been addressed. |
Overview
This PR fixes a selected-device startup/lifecycle race where the app could show a connected or connecting radio state while NodeDB, channels, and config data failed to load.
The issue could happen when
MeshServicestarted before the persisted selected-device address had fully propagated through the service layer. In that window, the orchestrator could initialize against the default/null database or stop the service because no device appeared selected, while transport recovery listeners could still start radio connection work later.This PR makes selected-device startup ordering explicit, centralizes no-device/sentinel address handling, and gates transport recovery so Bluetooth/network events cannot restart transport after explicit disconnect or device deselection. It also disables Android’s multi-reader Room pool after follow-up testing showed reader acquisition timeouts that could prevent version/config/NodeDB data from finishing loading even after BLE connected.
Background
These issues surfaced while retargeting the BLE hardening PR (#5795) to
release/2.8.0(branch has since been merged tomain). After the BLE reconnect/liveness fixes were running on the release branch, manual testing exposed a separate startup/lifecycle problem: the radio transport could appear connected or connecting, but NodeDB, channels, and config data would not load because selected-device address resolution, per-device DB switching, and service/orchestrator startup were not consistently ordered.A later test pass also surfaced an Android-specific Room reader-pool issue on one phone: BLE connected, but version/config/NodeDB state kept spinning and logs showed repeated reader acquisition timeouts with the reader pool reporting
capacity=4, permits=0while all reader connections were listed asFree.This PR addresses those release-branch startup/database issues separately from the BLE transport hardening work.
Key Changes
Startup / Lifecycle Fixes
MeshServicenow defers the no-device stop decision when the current device address is not yet valid, waits briefly for a valid selected address fromcurrentDeviceAddressFlow, and only proceeds once address resolution succeeds.MeshServiceOrchestrator.start()now waits for a valid selected-device address, switches to that device’s database, loads cached NodeDB from that database, and only then connects the radio transport.SharedRadioInterfaceServicenow mirrors selected-device address state independently ofconnect(), so startup code can observe the persisted selected address before transport startup begins.disconnect()or device deselection.setDeviceAddress(null)/setDeviceAddress("n")now clears the active connection intent and stops transport without allowing later Bluetooth/network recovery events to resurrect the session.Database / Address Handling
buildDbNamenow normalizes nullable input addresses before database-name generation.null, blank,"n","null",".n", and"default"are consistently mapped toDatabaseConstants.DEFAULT_DB_NAME.isValidDeviceAddress()andbuildDbName(), so foreground-service stay-alive decisions and DB-name resolution cannot drift.DB_PREFIX_per-device database naming scheme.isValidDeviceAddress(address: String?)validation so service startup and database selection use the same no-device/sentinel rules.Runtime Safety
MeshServiceanonymize device addresses before logging.Service Cleanup
MeshServicenow owns a dedicated service scope for deferred address resolution work.Test Coverage
AddressUtilsTestisValidDeviceAddress()accepts real device addresses.isValidDeviceAddress()rejects no-device sentinel variants such as null, blank,"n","null",".n", and"default".BuildDbNameTestDB_PREFIX_.MeshServiceOrchestratorTestSharedRadioInterfaceServiceLivenessTestdisconnect()prevents later BLE recovery from restarting transport.disconnect()prevents later network recovery from restarting transport.setDeviceAddress("n")/ device deselection clears connection intent and prevents later recovery restarts.Breaking Changes / Migration Notes
No database migration, destructive cleanup, or user-data deletion is added.
The only database-selection behavior change is that no-device/sentinel address variants are now normalized consistently to the default database name. If older builds accidentally created separate DB names for sentinel variants such as
"default"or".n", those variants will now resolve to the default DB instead of preserving separate sentinel-specific database files.Android production DB access now uses single-connection Room behavior for release reliability. This may reduce DB read concurrency on Android, but avoids the observed reader-pool permit exhaustion where NodeDB/version/config reads could time out indefinitely.
BLE reconnect/liveness behavior, selected-node scanning, and TCP/serial liveness behavior are not intentionally changed.
I've loaded this on multiple phones for testing and everything is working well for now. I will update the status later.
Update: After ~2 hrs of testing so far, everything is running well. Device selection seems more responsive too.
Update 2: One phone later connected over BLE but never finished loading version/config/NodeDB data. Logs showed repeated Room reader acquisition timeouts with the reader pool at
capacity=4, permits=0while all reader connections were reportedFree. I added an Android-only DB reliability fix so production Android now uses single-connection Room behavior instead of the multi-reader pool. Testing has resumed with that fix applied.Update 3: Addressed maintainer review nits: selected-device INFO logs are now anonymized, no-device sentinel detection is shared between
isValidDeviceAddress()andbuildDbName(), and directisValidDeviceAddress()tests were added for real addresses and sentinel variants.Continued in #5841