Fall back to single-DB mode when Azure master is inaccessible (#857)#858
Fall back to single-DB mode when Azure master is inaccessible (#857)#858erikdarlingdata merged 1 commit intodevfrom
Conversation
…857) On Azure SQL DB, some logins (e.g. Microsoft Dynamics 365 FO) are granted access only to a specific user database and not to master. The three collectors that enumerate databases via master — query_stats, database_size_stats, file_io_stats — would fail the first time and produce an empty screen. GetAzureDatabaseListAsync now catches known access-denied/login-failed errors from the master connection, caches the per-server decision, and returns the connection's InitialCatalog as a single-element list. The three callers already loop per-database, so single-DB mode works without further changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erikdarlingdata
left a comment
There was a problem hiding this comment.
Review: #858 — Azure master fallback
What this does
On Azure SQL DB, GetAzureDatabaseListAsync now catches specific SQL errors indicating master is inaccessible, caches that decision per server under a lock, and returns the connection's InitialCatalog as a single-element list. The three per-database callers (query_stats, database_size_stats, file_io_stats) already iterate per DB, so they work unchanged. Server-level collectors (which don't call this method) are unaffected.
What's good
- Scope is right: Lite-only change, single file, narrowly targeted at the issue.
- Base branch is
dev. No schema, install, upgrade, or workflow changes. - No PlanAnalyzer or Dashboard/Lite-duplicated code touched — sync rule N/A.
- Exception filter is specific (
when (IsMasterAccessDeniedError(ex))), so unrelatedSqlExceptions still bubble up rather than being swallowed. usingdisposal forSqlConnection/SqlCommand/SqlDataReaderis correct.- Cache is behind a lock; no
async void; no cross-thread UI work. - Log message distinguishes the info-level "collecting from X only" case from the warn-level "nothing to collect" case.
What needs attention
- Error list is too broad (
IsMasterAccessDeniedError, RemoteCollectorService.cs#L649-L656): 18456 and 40615 are generic auth/firewall failures, not "master-only" lockouts. When the connection is outright broken, the fallback cascades into failed per-DB opens but logs a misleading "master inaccessible" message first. Consider narrowing to 229/230/916/4060/40613. Inline comment has detail. - Cache never clears (RemoteCollectorService.cs#L105-L108): Regranting master requires a service restart to pick up. Intentional per the PR description; worth a one-line XML-doc note so it isn't a surprise later.
Dictionary<int, bool>reads as tri-state but isn't — value is alwaystruewhen present.HashSet<int>would be clearer.- Double-log on concurrent first failure — two racing cycles will each emit the Info/Warn line once. Cosmetic, but the fix is a single check-and-set under the lock.
- No tests added.
Lite.Testshas noRemoteCollectorServicecoverage at all, so this is consistent with existing patterns rather than a regression, but the fallback logic (error-code dispatch,SingleDbOrEmpty, cache behavior) is pure and testable without a live Azure endpoint. Worth adding at least unit coverage forIsMasterAccessDeniedErrorandSingleDbOrEmptyif you want a safety net for future tuning of the error list.
Test plan check
The manual test plan in the PR description covers the right matrix (master accessible / not × target DB set / not) and explicitly checks the cache, which is the non-obvious part. No automated coverage to run.
Nothing blocking from my side — maintainer's call.
Generated by Claude Code
| /// <summary> | ||
| /// Per-server flag indicating that master DB enumeration has failed with an access-denied | ||
| /// error and should not be retried. Used on Azure SQL DB where per-database logins may not | ||
| /// have master access (e.g. Microsoft Dynamics 365 FO). See issue #857. |
There was a problem hiding this comment.
Nit: the value is always true when the key is present — a HashSet<int> would express intent more clearly and avoid the misleading TryGetValue(..., out knownInaccessible) pattern where the false branch actually means "not yet determined" rather than "accessible". Not blocking.
Generated by Claude Code
| return ex.Number switch | ||
| { | ||
| 229 => true, // Permission denied on object | ||
| 230 => true, // Permission denied on column | ||
| 916 => true, // Server principal is not able to access the database under the current security context | ||
| 4060 => true, // Cannot open database requested by the login | ||
| 18456 => true, // Login failed for user | ||
| 40613 => true, // Database 'master' on server is not currently available |
There was a problem hiding this comment.
Two of these aren't really "master inaccessible" signals:
- 18456 (Login failed for user) — the login is wrong/disabled/locked. The per-DB connection opened later by
OpenAzureDatabaseConnectionAsyncwill fail for the exact same reason. - 40615 (Cannot open server — login denied, firewall/auth) — usually a firewall IP rule or tenant-auth issue. Again, per-DB connections will also fail.
Treating these as "fallback to single DB" masks a broken connection as "master inaccessible (SQL error 18456) — collecting from '…' only", and then every per-DB collector fails on the next line. Consider letting 18456/40615 bubble up as real connection errors, and reserving the fallback for errors that genuinely indicate master-only lockout (229/230/916/4060/40613).
Also worth noting: the error list only ever grows the inaccessibility cache; if a login is later granted master access, this process won't pick it up until restart. Intentional per the PR description, but worth a one-line doc note on _azureMasterInaccessible.
Generated by Claude Code
| private static List<string> SingleDbOrEmpty(string? targetDb) | ||
| { | ||
| if (string.IsNullOrEmpty(targetDb) || string.Equals(targetDb, "master", StringComparison.OrdinalIgnoreCase)) | ||
| return new List<string>(); | ||
| return new List<string> { targetDb }; |
There was a problem hiding this comment.
Minor: comparing targetDb against "master" case-insensitively is good, but tempdb/msdb/model would also be meaningless fallback targets on Azure SQL DB (not that they'd normally appear in InitialCatalog, but defensive). More practically: what if targetDb is quoted/bracketed like [mydb]? SqlConnectionStringBuilder.InitialCatalog strips the brackets so this should be fine — just noting the assumption.
Generated by Claude Code
| catch (SqlException ex) when (IsMasterAccessDeniedError(ex)) | ||
| { | ||
| lock (_azureMasterInaccessibleLock) | ||
| { | ||
| _azureMasterInaccessible[serverId] = true; | ||
| } |
There was a problem hiding this comment.
Two concurrent cycles racing here both open master, both fail, both grab the lock in turn, both set [serverId] = true, and both emit the AppLogger.Info/Warn line — so the first cycle after permission loss can double-log. Not a correctness problem, just cosmetic noise in the log on the transition. If you want to suppress the second log, check-and-set under a single lock and only log when the flag flips from unset to set.
Generated by Claude Code
On Azure SQL Database, logins without access to master can't resolve cross-database rows returned by sys.dm_exec_requests, which caused the Live Snapshot button and the query snapshots collector to error in D365FO-style environments (reported by @TrudAX in #857 after PR #858). BuildQuerySnapshotsQuery now takes an isAzureSqlDatabase flag and emits AND der.database_id = DB_ID() only when true. Boxed SQL Server, MI, and elastic pool behavior is unchanged. The Live Snapshot button path gets the flag through a new ServerTab constructor parameter wired from the cached ServerConnectionStatus.SqlEngineEdition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
GetAzureDatabaseListAsyncnow catches known access-denied/login-failed SQL errors from the master connection, caches the per-server decision, and returns the connection'sInitialCatalogas a single-element list.query_stats,database_size_stats,file_io_stats) already iterate per-database — single-DB mode works without further changes.Behavior matrix
Test plan
GetAzureDatabaseListAsync).🤖 Generated with Claude Code