Request on Open-source Evaluation Script #4

@ProvenceStar

Description

Hi,

Thanks for your great work! May I ask whether there are any plans to release the evaluation script?

I tried to reproduce the results from the paper, using Qwen3-VL as the VLM judge together with the curated prompt described in the paper, but the final scores I get are much lower. For the S_ad metric, for example, not a single sample received a score of 10; 51% of my samples scored 9.5 and 17.6% scored 8.5, so it is impossible to reach the 94.3 reported in the paper.
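To make the gap concrete, here is a quick back-of-the-envelope check. I am assuming S_ad is the mean per-sample judge score rescaled to 0-100 (which may not match your exact aggregation, since the script is not released), and the percentages are from my own run:

```python
# Upper bound on the achievable S_ad, assuming S_ad = mean judge score x 10.
# Observed in my run: 51% of samples scored 9.5, 17.6% scored 8.5,
# and no sample scored 10 -- so the remaining fraction caps out at 9.5.
frac_95 = 0.51
frac_85 = 0.176
frac_rest = 1.0 - frac_95 - frac_85  # 31.4%, at most 9.5 each (no 10s seen)

upper_bound = (frac_95 * 9.5 + frac_85 * 8.5 + frac_rest * 9.5) * 10
print(round(upper_bound, 2))  # best-case S_ad under these observations
```

Even in the best case this stays below 93.3, well under the reported 94.3, which is why I suspect my setup differs from yours somewhere.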

Could you please share the evaluation script, or point out where I went wrong? Thanks.
