I'm a PhD student in Artificial Intelligence at MICC, University of Florence, working under the guidance of Prof. Andrew D. Bagdanov and Prof. Marco Bertini. With a background in Computer Engineering and AI, my research focuses on pushing the boundaries of Multimodal Vision-Language Models (like CLIP) and their real-world applications.
My work has been published in top-tier venues including CVPR 2026, ICLR 2026, ICLR 2025, ECCV 2024, and NeurIPS 2023 (workshop).
I recently completed an Applied Scientist Internship at Amazon (RufusX Team, London), where I worked on foundational research and development in Generative AI and Multimodal Large Language Models (MLLMs) as part of the Amazon Rufus initiative.
For more information, feel free to visit my website: marcomistretta.github.io
- IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment. CVPR 2026 (main conference). Authors: Magistri S., Goswami D., Mistretta M., Twardowski B., van de Weijer J., Bagdanov A. D.
- SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery. ICLR 2026 (main conference). Authors: Caselli L., Mistretta M., Magistri S., Bagdanov A. D. Code: GitHub Repository
- Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion. ICLR 2025 (main conference). Authors: Mistretta M.*, Baldrati A.*, Agnolucci L.*, Bertini M., Bagdanov A. D. Code: GitHub Repository
- Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation. ECCV 2024 (main conference). Authors: Mistretta M.*, Baldrati A.*, Bertini M., Bagdanov A. D. Code: GitHub Repository
- RE-tune: Incremental Fine Tuning of Biomedical Vision-Language Models for Multi-label Chest X-ray Classification. NeurIPS 2023, Medical Imaging meets NeurIPS Workshop. Authors: Mistretta M., Bagdanov A. D.
July 2025 – December 2025
- Worked on Generative AI and Multimodal Large Language Models (MLLMs) within the Amazon Rufus initiative.
- Fine-tuned, evaluated, and deployed large-scale multimodal models serving millions of customers.
- Collaborated with scientists and engineers to advance real-world multimodal reasoning and generation.
- Multimodal Learning: Combining visual and language data for richer model understanding.
- Prompt Learning: Tuning learnable parameters to maximize VLM performance.
- Contrastive Self-Supervised Learning: Finding patterns in unlabeled data.
- Incremental Learning: Allowing AI models to keep learning without forgetting.
- Few-Shot Adaptation: Quickly adapting AI to new tasks with minimal examples.
- Programming Languages: Python, Java, C++, MATLAB, R
- Frameworks & Tools: PyTorch, TensorFlow, NumPy, OpenCV, Git, Docker
