fix: use patch count to decide whether to drop CLS token #206
ccjolley wants to merge 1 commit into TinyLLaVA:main
Conversation
Thanks for your contribution and the thoughtful fix! You are right that the current implementation implicitly assumes the presence of a CLS token. At the same time, we are currently working on a relatively large-scale architectural refactor (v2) of the codebase.

That said, the fix itself is valid and helpful. We've noted it and will consider incorporating the idea into the new implementation or test suites to ensure it properly handles models without a CLS token. We will move forward with the v2 refactor and revisit these changes once the new structure stabilizes. If you're interested, feel free to follow the progress. Thanks again for your contribution 🤝
The existing code in `tinyllava/model/vision_tower/base.py` has the following in `VisionTower.forward()`:
```python
if kwargs.get('vision_feature_select_strategy', 'patch') == 'patch':
    image_features = image_features[:, 1:]
```
This is correct for CLIP-style models (where the first token is a CLS token), but not for SigLIP-style models, which have no CLS token and therefore lose a real patch embedding when the first token is sliced off unconditionally. The updated version instead drops the first token only when the sequence length exceeds the expected patch count. This logic works for both CLIP (where `image_features.shape[1] == n_patches + 1`) and SigLIP (where `image_features.shape[1] == n_patches`).
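Since the PR's diff itself isn't reproduced above, here is a minimal sketch of the patch-count check it describes. The function name `select_patch_features` and the `num_patches` argument are illustrative, not the actual code from the PR, and a plain NumPy array stands in for the vision tower's output tensor:

```python
import numpy as np

def select_patch_features(image_features, num_patches):
    """Drop a leading CLS token only if one is actually present.

    image_features: array of shape (batch, seq_len, hidden).
    num_patches: number of patches the vision tower produces.
    (Both names are illustrative; the actual PR may differ.)
    """
    seq_len = image_features.shape[1]
    if seq_len == num_patches + 1:
        # CLIP-style output: first token is CLS, drop it.
        return image_features[:, 1:]
    # SigLIP-style output: no CLS token, keep every patch.
    return image_features

# CLIP ViT-L/14 at 336px: (336/14)^2 = 576 patches plus a CLS token.
clip_out = np.zeros((2, 577, 1024))
# SigLIP so400m/14 at 384px: 729 patch tokens, no CLS token.
siglip_out = np.zeros((2, 729, 1152))

print(select_patch_features(clip_out, 576).shape)    # CLS dropped
print(select_patch_features(siglip_out, 729).shape)  # unchanged
```

Either way the caller gets exactly `num_patches` tokens back, which is what the projector downstream expects.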