bug: _cache_dbr_capabilities permanently poisons capability cache when version query fails #1398

@sd-db

Describe the bug

_cache_dbr_capabilities unconditionally writes to the class-level _dbr_capabilities_cache dict, even when _query_dbr_version returns None. Once written, the if http_path not in cls._dbr_capabilities_cache guard prevents any subsequent call from overwriting it. This permanently disables all capability-gated features for that http_path for the lifetime of the dbt process.

The _try_cache_dbr_capabilities method introduced in #1355 correctly guards against this with if dbr_version is not None, but only for the eager path in _create_fresh_connection(). The authoritative path in open() still uses the unguarded _cache_dbr_capabilities.

Root cause

connections.py:200-209:

@classmethod
def _cache_dbr_capabilities(cls, creds, http_path):
    if http_path not in cls._dbr_capabilities_cache:
        is_cluster = is_cluster_http_path(http_path, creds.cluster_id)
        dbr_version = cls._query_dbr_version(creds, http_path)  # can return None

        cls._dbr_capabilities_cache[http_path] = DBRCapabilities(
            dbr_version=dbr_version,        # None written to cache
            is_sql_warehouse=not is_cluster,
        )

DBRCapabilities(dbr_version=None) makes has_capability() return False for all version-gated capabilities (dbr_capabilities.py:104).

The check-then-write guard (if http_path not in cls._dbr_capabilities_cache) prevents subsequent calls from correcting the poisoned entry.
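The poisoning can be reproduced in isolation with a minimal sketch. Everything below is a hypothetical stand-in (FakeCapabilities, PoisonedCacheSketch, the simulated query results, and the example http_path are not part of the adapter); it only mirrors the check-then-write shape shown above and assumes that an unknown version fails every version gate.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Version = Tuple[int, int]


@dataclass
class FakeCapabilities:
    """Hypothetical stand-in for DBRCapabilities."""

    dbr_version: Optional[Version]

    def has_capability(self, min_version: Version) -> bool:
        # Assumed behavior: with no known version, every version gate fails.
        return self.dbr_version is not None and self.dbr_version >= min_version


class PoisonedCacheSketch:
    """Hypothetical stand-in mirroring the unguarded check-then-write."""

    _cache: Dict[str, FakeCapabilities] = {}
    # Simulated query results: the first attempt fails, later ones would succeed.
    _results: List[Optional[Version]] = [None, (17, 1)]
    attempts = 0

    @classmethod
    def _query_version(cls) -> Optional[Version]:
        result = cls._results[min(cls.attempts, len(cls._results) - 1)]
        cls.attempts += 1
        return result

    @classmethod
    def cache_capabilities(cls, http_path: str) -> None:
        if http_path not in cls._cache:
            # Unguarded write: a None version is cached like any other value.
            cls._cache[http_path] = FakeCapabilities(cls._query_version())


path = "/sql/1.0/warehouses/example"  # made-up http_path
PoisonedCacheSketch.cache_capabilities(path)  # version query fails, None is cached
PoisonedCacheSketch.cache_capabilities(path)  # guard short-circuits, no new query
poisoned = PoisonedCacheSketch._cache[path]
iceberg_ok = poisoned.has_capability((14, 3))
```

The second call never reaches the version query, so the entry stays at None even though a retry would have returned a usable version.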

Trigger conditions

_query_dbr_version returns None when:

  1. credentials_manager is None at call time (raises DbtRuntimeError, caught by except Exception: pass at connections.py:195-196)
  2. Cluster is starting up / not yet reachable
  3. Transient network error during the SET spark.databricks.clusterUsageTags.sparkVersion query

Impact

All 7 capability-gated features are permanently disabled for the affected http_path:

| Capability | Min Version | Effect When Disabled |
| --- | --- | --- |
| ICEBERG | 14.3 | Hard error: "iceberg requires DBR 14.3+" |
| INSERT_BY_NAME | 12.2 | Silent fallback to positional insert (risk of wrong column mapping with schema drift) |
| REPLACE_ON | 17.1 | Silent fallback to non-replace_on strategy |
| COMMENT_ON_COLUMN | 16.1 | Column comments silently skipped |
| JSON_COLUMN_METADATA | 16.2 | Fallback to legacy column introspection |
| TIMESTAMPDIFF | 10.4 | Fallback to non-timestampdiff SQL |
| STREAMING_TABLE_JSON_METADATA | 17.1 | Fallback to legacy metadata path |

ICEBERG is the most visible because it raises a hard error. The other six degrade silently, so the user gets no indication that a fallback path is in use.

Contributing factor: silent exception swallowing

connections.py:195-196:

except Exception:
    pass

This catches ALL errors in _query_dbr_version without any logging, making the cache poisoning invisible to users. There is no diagnostic output when the version query fails — the user only sees the downstream capability error.

Suggested fix

Apply the same None-guard pattern that _try_cache_dbr_capabilities already uses:

@classmethod
def _cache_dbr_capabilities(cls, creds, http_path):
    if http_path not in cls._dbr_capabilities_cache:
        is_cluster = is_cluster_http_path(http_path, creds.cluster_id)
        dbr_version = cls._query_dbr_version(creds, http_path)

        if dbr_version is not None:
            cls._dbr_capabilities_cache[http_path] = DBRCapabilities(
                dbr_version=dbr_version,
                is_sql_warehouse=not is_cluster,
            )

This allows subsequent calls (on retry or from another thread) to attempt the version query again rather than being blocked by a poisoned entry. The retry logic already exists in open() via exponential_backoff.
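A minimal sketch of the guarded behavior, assuming a version query that fails once and then succeeds. GuardedCacheSketch, its simulated results, and the example http_path are hypothetical stand-ins, not the adapter's real classes; the cache stores a bare version tuple to keep the example short.

```python
from typing import Dict, List, Optional, Tuple

Version = Tuple[int, int]


class GuardedCacheSketch:
    """Hypothetical stand-in; not the adapter's real connection manager."""

    _cache: Dict[str, Version] = {}
    # Simulated query results: the first attempt fails, the second succeeds.
    _results: List[Optional[Version]] = [None, (17, 1)]
    attempts = 0

    @classmethod
    def _query_version(cls) -> Optional[Version]:
        result = cls._results[min(cls.attempts, len(cls._results) - 1)]
        cls.attempts += 1
        return result

    @classmethod
    def cache_capabilities(cls, http_path: str) -> None:
        if http_path not in cls._cache:
            version = cls._query_version()
            # Only a successful query is cached; a failure leaves the key absent.
            if version is not None:
                cls._cache[http_path] = version


path = "/sql/1.0/warehouses/example"  # made-up http_path
GuardedCacheSketch.cache_capabilities(path)
cached_after_failure = path in GuardedCacheSketch._cache
GuardedCacheSketch.cache_capabilities(path)
cached_after_retry = GuardedCacheSketch._cache.get(path)
```

The trade-off is that a persistently failing query is re-issued on every call instead of being cached, which seems acceptable given that a cached failure disables features for the whole process.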

Additionally, adding debug-level logging to the except Exception block in _query_dbr_version would make failures diagnosable.
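One possible shape for that logging, sketched with the standard library logging module. The logger name, the query_dbr_version helper, and its run_query parameter are illustrative assumptions; the adapter's own logger and the real _query_dbr_version signature may differ.

```python
import logging
from typing import Callable, Optional, Tuple

Version = Tuple[int, int]

# Assumption: a stdlib logger stands in for whatever logger the adapter uses.
logger = logging.getLogger("dbr_version_sketch")


def query_dbr_version(
    run_query: Callable[[], Version], http_path: str
) -> Optional[Version]:
    """Return the version, or None on failure, logging the reason at debug level."""
    try:
        return run_query()
    except Exception as exc:
        # Still swallow the error, but leave a trace that explains the None.
        logger.debug("DBR version query failed for %s: %s", http_path, exc)
        return None
```

With debug logging enabled, a failed query then leaves a message naming the http_path and the underlying error, instead of surfacing only as a downstream capability error.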

System information

Identified via code review of connections.py on the main branch (after the #1355 merge). Affects all versions that include the capability caching system.
