bugfix: vlm models (language-only part) inference error without --enable_multimodal #1230

hiworldwzj merged 1 commit into main from …ble_multimodal
Conversation
Summary of Changes

This pull request addresses a bug affecting Vision-Language Models (VLMs) when they are configured to operate in language-only mode, without multimodal capabilities enabled. The fix ensures that inference can proceed by gracefully handling the absence of a multimodal embedding cache client, preventing runtime errors and improving robustness across operational configurations.
Code Review
This pull request correctly fixes a bug that occurs when running VLM models in language-only mode by handling the case where cpu_embed_cache_client is None, preventing a potential AttributeError. However, the same fix has been applied to three different files, resulting in code duplication. My review includes a suggestion to refactor this duplicated logic to improve maintainability.
```python
cpu_embed_cache_client = g_infer_context.cpu_embed_cache_client
cpu_embed_cache_tensor = (
    torch.empty((0, 0, hidden_size), dtype=dtype, device=device)
    if cpu_embed_cache_client is None
    else cpu_embed_cache_client.cpu_embed_cache_tensor
)
```
While this logic correctly handles the case where cpu_embed_cache_client is None, it is duplicated in the qwen3_vl and qwen_vl pre-layer inference files. To improve maintainability and avoid repeated code, consider extracting this logic into a shared helper method in the base class LlamaMultimodalPreLayerInfer.
```python
cpu_embed_cache_client = g_infer_context.cpu_embed_cache_client
cpu_embed_cache_tensor = (
    torch.empty((0, 0, hidden_size), dtype=dtype, device=device)
    if cpu_embed_cache_client is None
    else cpu_embed_cache_client.cpu_embed_cache_tensor
)
```
```python
cpu_embed_cache_client = g_infer_context.cpu_embed_cache_client
cpu_embed_cache_tensor = (
    torch.empty((0, 0, hidden_size), dtype=dtype, device=device)
    if cpu_embed_cache_client is None
    else cpu_embed_cache_client.cpu_embed_cache_tensor
)
```
This logic for safely initializing cpu_embed_cache_tensor is duplicated in the subclasses Gemma3PreLayerInfer and Qwen3VLMultimodalPreLayerInfer. Since this is the base class, you could define a protected helper method here (e.g., _get_cpu_embed_cache_tensor) to encapsulate this logic. The subclasses can then call this method, which would eliminate the code duplication and make future changes easier.
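A minimal sketch of the suggested refactor, assuming the helper is added to the base class as proposed (the method name `_get_cpu_embed_cache_tensor` is the reviewer's example, and the surrounding class body is illustrative, not the actual lightllm implementation):

```python
import torch


class LlamaMultimodalPreLayerInfer:
    """Illustrative base-class fragment; only the proposed helper is shown."""

    @staticmethod
    def _get_cpu_embed_cache_tensor(cpu_embed_cache_client, hidden_size, dtype, device):
        # When multimodal support is disabled, the cache client is None,
        # so fall back to an empty placeholder tensor instead of raising
        # an AttributeError on attribute access.
        if cpu_embed_cache_client is None:
            return torch.empty((0, 0, hidden_size), dtype=dtype, device=device)
        return cpu_embed_cache_client.cpu_embed_cache_tensor
```

Subclasses such as Gemma3PreLayerInfer and Qwen3VLMultimodalPreLayerInfer could then replace their duplicated conditional with a single call to this helper, so any future change to the fallback behavior happens in one place.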