fix: gemma-4 with exllamav3 #1672

Merged: AlpinDale merged 3 commits into main from fix/gemma4-exl3 on May 7, 2026
Conversation

@AlpinDale
Collaborator

No description provided.

AlpinDale merged commit c16f370 into main on May 7, 2026
1 check failed
AlpinDale deleted the fix/gemma4-exl3 branch on May 7, 2026, 02:12
Contributor

gemini-code-assist (bot) left a comment

Code Review

This pull request updates the EXL3 quantization logic to support models with K=V attention, such as Gemma 4, where the v_proj tensor may be absent and duplicated from k_proj. Feedback was provided to optimize the implementation by using short-circuiting and reducing redundant lookups to improve model initialization performance.

Comment on lines +464 to +470
has_q = self._is_exl3_prefix(f"{base}.q_proj")
has_k = self._is_exl3_prefix(f"{base}.k_proj")
has_v = self._is_exl3_prefix(f"{base}.v_proj")
# Gemma 4 full-attention layers can use K=V attention and store
# only q_proj/k_proj tensors. The model loader duplicates K into
# V, so the fused qkv_proj still needs EXL3 parameters.
return has_q and has_k and (has_v or self._storage_entry(f"{base}.v_proj") is None)
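As a rough illustration of the rule in this hunk, the sketch below models the tensor index as a plain dict and evaluates the same condition for three layer layouts. The `Storage` type, the free-function helpers, and the `fp16` marker are assumptions made for the example; only the `quant_format == "exl3"` check and the q/k/v prefix names come from the diff, and the real `_storage_entry` and `_is_exl3_prefix` may behave differently.

```python
from typing import Optional

# Hypothetical stand-in for the loader's tensor index: prefix -> metadata.
Storage = dict[str, dict[str, str]]


def storage_entry(storage: Storage, prefix: str) -> Optional[dict[str, str]]:
    """Return the metadata for a prefix, or None when the tensor is absent."""
    return storage.get(prefix)


def is_exl3_prefix(storage: Storage, prefix: str) -> bool:
    """True when the prefix exists and is marked as EXL3-quantized."""
    entry = storage_entry(storage, prefix)
    return entry is not None and entry.get("quant_format") == "exl3"


def fused_qkv_is_exl3(storage: Storage, base: str) -> bool:
    """q_proj and k_proj must be EXL3; v_proj must be EXL3 or absent (K=V)."""
    has_q = is_exl3_prefix(storage, f"{base}.q_proj")
    has_k = is_exl3_prefix(storage, f"{base}.k_proj")
    has_v = is_exl3_prefix(storage, f"{base}.v_proj")
    v_missing = storage_entry(storage, f"{base}.v_proj") is None
    return has_q and has_k and (has_v or v_missing)


exl3 = {"quant_format": "exl3"}
base = "model.layers.0.self_attn"  # illustrative prefix, not from the PR

# All three projections stored and quantized.
full = {f"{base}.{p}": exl3 for p in ("q_proj", "k_proj", "v_proj")}
# K=V layer: v_proj is not stored at all.
kv_shared = {f"{base}.{p}": exl3 for p in ("q_proj", "k_proj")}
# v_proj is stored but not EXL3, so the fused layer must not be treated as EXL3.
mixed = {**kv_shared, f"{base}.v_proj": {"quant_format": "fp16"}}
```

Under these assumptions, `full` and `kv_shared` are accepted and `mixed` is rejected, which is the distinction the `is None` check on `v_proj` is there to make.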
Contributor

Severity: high

The current implementation performs redundant lookups and lacks short-circuiting, which can negatively impact model initialization performance. Specifically, _is_exl3_prefix is called for all three projections regardless of the results, and _storage_entry is called twice for v_proj when it is missing from the storage. Since _storage_entry involves multiple string operations and dictionary lookups, it is more efficient to use short-circuiting and store the entry for v_proj to avoid redundant work.

Suggested change
- has_q = self._is_exl3_prefix(f"{base}.q_proj")
- has_k = self._is_exl3_prefix(f"{base}.k_proj")
- has_v = self._is_exl3_prefix(f"{base}.v_proj")
- # Gemma 4 full-attention layers can use K=V attention and store
- # only q_proj/k_proj tensors. The model loader duplicates K into
- # V, so the fused qkv_proj still needs EXL3 parameters.
- return has_q and has_k and (has_v or self._storage_entry(f"{base}.v_proj") is None)
+ if not (self._is_exl3_prefix(f"{base}.q_proj") and
+         self._is_exl3_prefix(f"{base}.k_proj")):
+     return False
+ v_entry = self._storage_entry(f"{base}.v_proj")
+ # Gemma 4 full-attention layers can use K=V attention and store
+ # only q_proj/k_proj tensors. The model loader duplicates K into
+ # V, so the fused qkv_proj still needs EXL3 parameters.
+ return v_entry is None or v_entry.get("quant_format") == "exl3"
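To make the lookup-count argument concrete, here is a minimal sketch that instruments a toy index and runs both forms of the check against it. It assumes `_is_exl3_prefix` is a thin wrapper around a single `_storage_entry` call, which the review comment implies but the diff does not show; `CountingStorage` and the function names are hypothetical.

```python
class CountingStorage:
    """Toy tensor index that counts how often entries are looked up."""

    def __init__(self, entries):
        self.entries = entries
        self.lookups = 0

    def storage_entry(self, prefix):
        self.lookups += 1
        return self.entries.get(prefix)

    def is_exl3_prefix(self, prefix):
        # Assumption: one storage lookup per call.
        entry = self.storage_entry(prefix)
        return entry is not None and entry.get("quant_format") == "exl3"


def eager_check(s, base):
    """Form merged in the PR: all three prefixes are probed up front."""
    has_q = s.is_exl3_prefix(f"{base}.q_proj")
    has_k = s.is_exl3_prefix(f"{base}.k_proj")
    has_v = s.is_exl3_prefix(f"{base}.v_proj")
    return has_q and has_k and (has_v or s.storage_entry(f"{base}.v_proj") is None)


def short_circuit_check(s, base):
    """Form from the suggested change: bail out early, fetch v_proj once."""
    if not (s.is_exl3_prefix(f"{base}.q_proj") and s.is_exl3_prefix(f"{base}.k_proj")):
        return False
    v_entry = s.storage_entry(f"{base}.v_proj")
    return v_entry is None or v_entry.get("quant_format") == "exl3"


def count_lookups(check, entries, base):
    """Run a check on a fresh index and return (result, number of lookups)."""
    s = CountingStorage(entries)
    return check(s, base), s.lookups


exl3 = {"quant_format": "exl3"}
base = "model.layers.0.self_attn"  # illustrative prefix, not from the PR
kv_shared = {f"{base}.q_proj": exl3, f"{base}.k_proj": exl3}
unquantized = {}
```

In this toy model both forms return the same result. A K=V layer costs 4 lookups with the eager form and 3 with the suggested one, and a layer with no EXL3 `q_proj` costs 3 versus 1. Whether that difference is measurable during real model initialization depends on how expensive the actual `_storage_entry` is, which the PR does not quantify.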
