PhD Candidate in Computer Science (D-INFK), ETH Zurich
Research focus: Multimodal Large Language Models (MLLMs), Vision-Language Reasoning, and Efficient VLM Systems.
I build lightweight, reproducible tools for multimodal research workflows, with an emphasis on:
- retrieval and grounding for document-centric QA
- evaluation pipelines for VLM experiments
- efficiency-oriented methods for visual token reduction
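As a flavor of the token-reduction work, here is a toy sketch (names and the scoring rule are illustrative, not from any released tool): keep only the top-k visual tokens ranked by an importance score, with the token's L2 norm standing in for a learned or attention-based score.

```python
# Hypothetical sketch of score-based visual token reduction:
# keep the top `keep_ratio` fraction of tokens by a per-token score.
import numpy as np

def reduce_visual_tokens(tokens: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """tokens: (num_tokens, hidden_dim) visual token embeddings.

    Scores each token by its L2 norm (a stand-in for a learned score),
    keeps the top-k, and returns them in their original order.
    """
    scores = np.linalg.norm(tokens, axis=1)      # one importance score per token
    k = max(1, int(len(tokens) * keep_ratio))    # number of tokens to keep
    keep = np.sort(np.argsort(scores)[-k:])      # top-k indices, original order
    return tokens[keep]

# Example: 576 patch tokens (a 24x24 grid) reduced to 144.
tokens = np.random.default_rng(0).normal(size=(576, 768))
reduced = reduce_visual_tokens(tokens, keep_ratio=0.25)
print(reduced.shape)  # (144, 768)
```

Preserving the original token order matters because position information is often encoded implicitly in the sequence fed to the language model.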
- 2024: Started building compact research utilities for multimodal retrieval and evaluation.
- 2025: Expanded to reusable CLI tools, testable pipelines, and benchmark-style experimentation.
- 2026: Focusing on robust multimodal systems for long-context documents and efficient inference.
- multimodal-doc-rag: citation-aware multimodal retrieval and context-building toolkit.
- vlm-eval-lite: minimal, reproducible multimodal QA evaluation runner.
- sparse-vl: simulation toolkit for visual-token sparsification strategies in VLM inference.
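To give a sense of the evaluation-runner style (a minimal sketch in the spirit of vlm-eval-lite; the function names and normalization rule are illustrative, not the tool's actual API): normalized exact-match accuracy over (prediction, reference) pairs.

```python
# Hypothetical minimal QA evaluation loop: normalized exact-match accuracy.
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions matching the reference after normalization."""
    hits = sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers))
    return hits / len(answers)

preds = ["A red bus.", "three", "the Eiffel Tower"]
golds = ["a red bus", "3", "Eiffel Tower"]
print(exact_match_accuracy(preds, golds))  # 1 of 3 pairs matches after normalization
```

Keeping the metric this explicit (rather than hidden behind a framework) is what makes small-scale VLM experiments easy to reproduce and audit.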
- grounded multimodal RAG for PDFs and technical reports
- long-context VLM evaluation and failure analysis
- practical methods for reducing multimodal serving cost
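For the grounded-RAG direction, a toy sketch of citation-aware context building (everything here is hypothetical for illustration: the chunk format, the two-dimensional embeddings, and the page-citation prefix): rank page chunks by cosine similarity to the query embedding and emit context lines with page citations attached.

```python
# Hypothetical sketch: rank page chunks by cosine similarity, cite pages.
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def build_context(query_emb: list[float], chunks, top_k: int = 2) -> list[str]:
    """chunks: list of (page_number, text, embedding) triples.

    Returns the top_k most similar chunks as citation-prefixed lines.
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_emb, c[2]), reverse=True)
    return [f"[p.{page}] {text}" for page, text, _ in ranked[:top_k]]

chunks = [
    (1, "The model uses 576 visual tokens per image.", [1.0, 0.0]),
    (4, "Training ran for 3 epochs on 8 GPUs.", [0.0, 1.0]),
    (7, "Token pruning halves inference latency.", [0.9, 0.1]),
]
print(build_context([1.0, 0.0], chunks))  # pages 1 and 7 are most similar
```

Carrying the page number through to the generated context is what makes downstream answers attributable back to the source document.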
Last updated: February 2026.