### Describe the bug

`_cache_dbr_capabilities` unconditionally writes to the class-level `_dbr_capabilities_cache` dict, even when `_query_dbr_version` returns `None`. Once written, the `if http_path not in cls._dbr_capabilities_cache` guard prevents any subsequent call from overwriting it. This permanently disables all capability-gated features for that `http_path` for the lifetime of the dbt process.

The `_try_cache_dbr_capabilities` method introduced in #1355 correctly guards against this with `if dbr_version is not None`, but only for the eager path in `_create_fresh_connection()`. The authoritative path in `open()` still uses the unguarded `_cache_dbr_capabilities`.
### Root cause

`connections.py:200-209`:

```python
@classmethod
def _cache_dbr_capabilities(cls, creds, http_path):
    if http_path not in cls._dbr_capabilities_cache:
        is_cluster = is_cluster_http_path(http_path, creds.cluster_id)
        dbr_version = cls._query_dbr_version(creds, http_path)  # can return None
        cls._dbr_capabilities_cache[http_path] = DBRCapabilities(
            dbr_version=dbr_version,  # None written to cache
            is_sql_warehouse=not is_cluster,
        )
```
`DBRCapabilities(dbr_version=None)` makes `has_capability()` return `False` for all version-gated capabilities (`dbr_capabilities.py:104`). The check-then-write guard (`if http_path not in cls._dbr_capabilities_cache`) prevents subsequent calls from correcting the poisoned entry.
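The poisoning is easy to reproduce with a minimal standalone sketch. The class and names below are illustrative stand-ins, not the real dbt-databricks code: the first (failed) query caches `None`, and the check-then-write guard blocks every later attempt from fixing it.

```python
# Minimal repro of the check-then-write poisoning pattern.
# ConnMgr and Capabilities are hypothetical stand-ins for the real classes.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Capabilities:
    dbr_version: Optional[tuple] = None


class ConnMgr:
    _cache: dict = {}
    # Simulate: first version query fails (None), second would succeed.
    _query_results = iter([None, (14, 3)])

    @classmethod
    def _query_dbr_version(cls):
        return next(cls._query_results)

    @classmethod
    def cache_capabilities(cls, http_path: str) -> None:
        # Buggy pattern: writes to the cache even when the query returned None.
        if http_path not in cls._cache:
            cls._cache[http_path] = Capabilities(dbr_version=cls._query_dbr_version())


ConnMgr.cache_capabilities("/sql/1.0/endpoints/abc")  # query fails -> None cached
ConnMgr.cache_capabilities("/sql/1.0/endpoints/abc")  # guard blocks the retry
print(ConnMgr._cache["/sql/1.0/endpoints/abc"].dbr_version)  # None, permanently
```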
### Trigger conditions

`_query_dbr_version` returns `None` when:

- `credentials_manager` is `None` at call time (raises `DbtRuntimeError`, caught by `except Exception: pass` at `connections.py:195-196`)
- Cluster is starting up / not yet reachable
- Transient network error during the `SET spark.databricks.clusterUsageTags.sparkVersion` query
### Impact

All 7 capability-gated features are permanently disabled for the affected `http_path`:

| Capability | Min Version | Effect When Disabled |
| --- | --- | --- |
| `ICEBERG` | 14.3 | Hard error: "iceberg requires DBR 14.3+" |
| `INSERT_BY_NAME` | 12.2 | Silent fallback to positional insert (risk of wrong column mapping with schema drift) |
| `REPLACE_ON` | 17.1 | Silent fallback to non-`replace_on` strategy |
| `COMMENT_ON_COLUMN` | 16.1 | Column comments silently skipped |
| `JSON_COLUMN_METADATA` | 16.2 | Fallback to legacy column introspection |
| `TIMESTAMPDIFF` | 10.4 | Fallback to non-`timestampdiff` SQL |
| `STREAMING_TABLE_JSON_METADATA` | 17.1 | Fallback to legacy metadata path |

`ICEBERG` is the most visible: it raises a hard error. The others silently degrade.
### Contributing factor: silent exception swallowing

The `except Exception: pass` block at `connections.py:195-196` catches ALL errors in `_query_dbr_version` without any logging, making the cache poisoning invisible to users. There is no diagnostic output when the version query fails; the user only sees the downstream capability error.
### Suggested fix

Apply the same `None`-guard pattern that `_try_cache_dbr_capabilities` already uses:

```python
@classmethod
def _cache_dbr_capabilities(cls, creds, http_path):
    if http_path not in cls._dbr_capabilities_cache:
        is_cluster = is_cluster_http_path(http_path, creds.cluster_id)
        dbr_version = cls._query_dbr_version(creds, http_path)
        if dbr_version is not None:
            cls._dbr_capabilities_cache[http_path] = DBRCapabilities(
                dbr_version=dbr_version,
                is_sql_warehouse=not is_cluster,
            )
```
This allows subsequent calls (on retry or from another thread) to attempt the version query again rather than being blocked by a poisoned entry. The retry logic already exists in `open()` via `exponential_backoff`.
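With the same illustrative stand-in names as the repro above, a quick sketch of why the guard unblocks retries: a failed first query leaves no cache entry, so the next call repopulates it.

```python
# Sketch of the guarded pattern: a failed version query caches nothing,
# so a later retry can populate the entry (hypothetical names, not dbt code).
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Capabilities:
    dbr_version: Optional[tuple] = None


class GuardedConnMgr:
    _cache: dict = {}
    _query_results = iter([None, (14, 3)])  # first call fails, second succeeds

    @classmethod
    def _query_dbr_version(cls):
        return next(cls._query_results)

    @classmethod
    def cache_capabilities(cls, http_path: str) -> None:
        if http_path not in cls._cache:
            dbr_version = cls._query_dbr_version()
            if dbr_version is not None:  # the None-guard from the fix
                cls._cache[http_path] = Capabilities(dbr_version=dbr_version)


GuardedConnMgr.cache_capabilities("/sql/1.0/endpoints/abc")  # fails, nothing cached
GuardedConnMgr.cache_capabilities("/sql/1.0/endpoints/abc")  # retry succeeds
print(GuardedConnMgr._cache["/sql/1.0/endpoints/abc"].dbr_version)  # (14, 3)
```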
Additionally, adding debug-level logging to the `except Exception` block in `_query_dbr_version` would make failures diagnosable.
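One possible shape for that logging change, sketched with an assumed logger name and a simplified `run_query` callable (the real method signature and body in `connections.py` differ):

```python
# Hypothetical sketch of _query_dbr_version with debug logging instead of
# a bare `except Exception: pass`. Names here are assumptions, not dbt code.
import logging

logger = logging.getLogger("dbt.adapters.databricks")


def query_dbr_version_sketch(run_query):
    try:
        return run_query("SET spark.databricks.clusterUsageTags.sparkVersion")
    except Exception as exc:
        # Previously the error was swallowed silently; a debug line makes the
        # cache-poisoning path visible in dbt's debug logs.
        logger.debug("DBR version query failed; capabilities unknown: %s", exc)
        return None
```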
### Related

- #1355: introduced `_try_cache_dbr_capabilities` with the `None` guard, but only for the eager path
### System information

Identified via code review of `connections.py` on the main branch (post-#1355 merge). Affects all versions with the capability caching system.