fix: use patch count to decide whether to drop CLS token#206

Open
ccjolley wants to merge 1 commit into TinyLLaVA:main from ccjolley:fix/siglip-cls-token
Conversation

@ccjolley
Contributor

The existing code in tinyllava/model/vision_tower/base.py has the following in VisionTower.forward():

    if kwargs.get('vision_feature_select_strategy', 'patch') == 'patch':
        image_features = image_features[:, 1:]

This is correct for CLIP-style models (where the first token is a CLS), but not for SigLIP-style models that don't use a CLS token. The updated version has:

    if kwargs.get('vision_feature_select_strategy', 'patch') == 'patch':
        # Drop CLS prefix token(s) if present (CLIP-style).
        # SigLIP has no CLS token, so all tokens are patches — nothing to drop.
        n_patches = (self._vision_tower.config.image_size // self._vision_tower.config.patch_size) ** 2
        image_features = image_features[:, image_features.shape[1] - n_patches:]

This logic works for both CLIP (where image_features.shape[1] == n_patches + 1, so the leading CLS token is dropped) and SigLIP (where image_features.shape[1] == n_patches, so nothing is dropped).
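To illustrate the slicing rule, here is a minimal standalone sketch. The helper function and the specific model shapes (CLIP ViT-L/14 at 336px, SigLIP at 384px with patch size 14) are illustrative, not part of the PR itself:

```python
def first_patch_index(num_tokens: int, image_size: int, patch_size: int) -> int:
    """Index of the first patch token to keep, computed from the patch count.

    Any prefix tokens beyond the expected patch count (e.g. a CLIP-style
    CLS token) are dropped; if every token is a patch (SigLIP-style),
    nothing is dropped.
    """
    n_patches = (image_size // patch_size) ** 2
    return num_tokens - n_patches

# CLIP ViT-L/14 @ 336px: 24 * 24 = 576 patches plus one CLS token -> start at 1
print(first_patch_index(577, 336, 14))  # 1

# SigLIP @ 384px, patch size 14: 27 * 27 = 729 patches, no CLS token -> start at 0
print(first_patch_index(729, 384, 14))  # 0
```

Slicing `image_features[:, start:]` with this start index then yields exactly the patch tokens in both cases.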

@BobYue-01
Collaborator

Thanks for your contribution and the thoughtful fix!

You are right that the current implementation implicitly assumes the presence of a [CLS] token and performs image_features[:, 1:], which can lead to silent feature loss for models like SigLIP (as discussed in #203). Your fix addresses this issue in a reasonable way under the current structure.

At the same time, we are currently working on a relatively large-scale architectural refactor (v2) on the develop branch, aiming to systematically address several accumulated design issues in the codebase, including module structure, argument management, and model loading workflows. Some components will be reorganized during this process, though we will do our best to maintain compatibility.

For this PR:

  • The change is closely coupled with parts of the current implementation that are being actively refactored. Merging it at this stage would likely introduce additional maintenance overhead and potential conflicts during the transition.
  • The underlying issue is also related to argument handling (e.g., vision_feature_select_strategy not being consistently propagated or properly managed), as reflected in #180.
  • We plan to address this in a more systematic way as part of the new configuration design, instead of handling it via local fixes in individual modules.

That said, the fix itself is valid and helpful — we've noted it and will consider incorporating the idea into the new implementation or test suites to ensure it properly handles models without [CLS] tokens.

We will move forward with the v2 refactor and revisit these changes once the new structure stabilizes.

If you're interested, feel free to follow the progress on the develop branch. We'll also share a v2 roadmap in Discussions soon — feedback and contributions are very welcome!

Thanks again for your contribution 🤝
