fix: use patch count to decide whether to drop CLS token#206

Open
ccjolley wants to merge 1 commit into TinyLLaVA:main from ccjolley:fix/siglip-cls-token
Conversation

@ccjolley
Contributor

The existing code in tinyllava/model/vision_tower/base.py has the following in VisionTower.forward():

    if kwargs.get('vision_feature_select_strategy', 'patch') == 'patch':
        image_features = image_features[:, 1:]

This is correct for CLIP-style models (where the first token is a CLS), but not for SigLIP-style models that don't use a CLS token. The updated version has:

    if kwargs.get('vision_feature_select_strategy', 'patch') == 'patch':
        # Drop CLS prefix token(s) if present (CLIP-style).
        # SigLIP has no CLS token, so all tokens are patches — nothing to drop.
        n_patches = (self._vision_tower.config.image_size // self._vision_tower.config.patch_size) ** 2
        image_features = image_features[:, image_features.shape[1] - n_patches:]

This logic works for both CLIP (where image_features.shape[1] == n_patches + 1, so the leading CLS token is dropped) and SigLIP (where image_features.shape[1] == n_patches, so nothing is dropped).
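To illustrate the slicing rule, here is a minimal standalone sketch. The helper function and the specific model shapes (CLIP ViT-L/14 at 336px, SigLIP at 384px with patch size 14) are illustrative, not part of the PR itself:

```python
def first_patch_index(num_tokens: int, image_size: int, patch_size: int) -> int:
    """Index of the first patch token to keep, computed from the patch count.

    Any prefix tokens beyond the expected patch count (e.g. a CLIP-style
    CLS token) are dropped; if every token is a patch (SigLIP-style),
    nothing is dropped.
    """
    n_patches = (image_size // patch_size) ** 2
    return num_tokens - n_patches

# CLIP ViT-L/14 @ 336px: 24 * 24 = 576 patches plus one CLS token -> start at 1
print(first_patch_index(577, 336, 14))  # 1

# SigLIP @ 384px, patch size 14: 27 * 27 = 729 patches, no CLS token -> start at 0
print(first_patch_index(729, 384, 14))  # 0
```

Slicing `image_features[:, start:]` with this start index then yields exactly the patch tokens in both cases.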

@BobYue-01
Collaborator

Thanks for your contribution and the thoughtful fix!

You are right that the current implementation implicitly assumes the presence of a [CLS] token and performs image_features[:, 1:], which can lead to silent feature loss for models like SigLIP (as discussed in #203). Your fix addresses this issue in a reasonable way under the current structure.

At the same time, we are currently working on a relatively large-scale architectural refactor (v2) on the develop branch, aiming to systematically address several accumulated design issues in the codebase, including module structure, argument management, and model loading workflows. Some components will be reorganized during this process, though we will do our best to maintain compatibility.

For this PR:

  • The change is closely coupled with parts of the current implementation that are being actively refactored. Merging it at this stage would likely introduce additional maintenance overhead and potential conflicts during the transition.
  • The underlying issue is also related to argument handling (e.g., vision_feature_select_strategy not being consistently propagated or properly managed), as reflected in #180.
  • We plan to address this in a more systematic way as part of the new configuration design, instead of handling it via local fixes in individual modules.

That said, the fix itself is valid and helpful — we've noted it and will consider incorporating the idea into the new implementation or test suites to ensure it properly handles models without [CLS] tokens.

We will move forward with the v2 refactor and revisit these changes once the new structure stabilizes.

If you're interested, feel free to follow the progress on the develop branch. We'll also share a v2 roadmap in Discussions soon — feedback and contributions are very welcome!

Thanks again for your contribution 🤝
