
DOC: Scoring Evaluations Blog #1617

Open
jsong468 wants to merge 5 commits into microsoft:main from jsong468:scoring_blog

Conversation

@jsong468
Contributor

Description

This PR adds a blog documenting our scorer evaluation background, story, and process!

Tests and Documentation

N/A

@jsong468 jsong468 marked this pull request as ready for review April 15, 2026 19:34
Comment thread doc/blog/2026_04_14_scoring_scorers.md
Comment thread doc/blog/2026_04_14_scoring_scorers.md
Comment thread doc/blog/2026_04_14.md Outdated
Comment thread doc/blog/2026_04_14.md Outdated
Comment thread doc/blog/2026_04_14_scoring_scorers.md
Comment thread doc/blog/2026_04_14.md Outdated
Comment thread doc/blog/2026_04_14_scoring_scorers.md
Comment thread doc/blog/2026_04_14_scoring_scorers.md
Comment thread doc/blog/2026_04_14_scoring_scorers.md
Comment thread doc/blog/2026_04_14.md Outdated
Comment thread doc/blog/2026_04_14.md Outdated
Comment thread doc/blog/2026_04_14.md Outdated
## Viewing Scoring Metrics

There are a few different ways to view metrics for specific scoring configurations.

Contributor


Can you link to the docs here?

Before running, the framework checks the JSONL registry for an existing entry matching the scorer's evaluation hash. It re-runs the evaluation only if no entry exists, the dataset version changed, the harm definition version changed, or the requested number of trials exceeds what is stored. You can skip this registry check entirely with `update_registry_behavior=RegistryUpdateBehavior.NEVER_UPDATE` if you're experimenting and don't want to write to the registry. This should usually be the case, since metrics saved to the registry are managed directly by Microsoft's AI Red Team; please don't hesitate to reach out, however, if you'd like to add metrics for new scoring configurations to our registries. The snippet below shows a simple example of running an evaluation:

```python
metrics = await my_scorer.evaluate_async(num_scorer_trials=3)
```
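The registry check described in the paragraph above can be sketched in plain Python. Note that everything here (`RegistryEntry`, `needs_rerun`, and the field names) is hypothetical and only illustrates the decision logic as described; PyRIT's actual implementation may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegistryEntry:
    # Hypothetical shape of a JSONL registry entry (illustration only)
    evaluation_hash: str
    dataset_version: str
    harm_definition_version: str
    num_trials: int


def needs_rerun(
    entry: Optional[RegistryEntry],
    dataset_version: str,
    harm_definition_version: str,
    requested_trials: int,
) -> bool:
    """Re-run only when no matching entry exists, a version changed,
    or more trials were requested than are already stored."""
    if entry is None:
        return True
    if entry.dataset_version != dataset_version:
        return True
    if entry.harm_definition_version != harm_definition_version:
        return True
    return requested_trials > entry.num_trials
```

With this shape, a stored entry for the same dataset and harm-definition versions is reused as long as it already covers the requested number of trials.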
Contributor


I would take this further, because I think this could be a hook for people who have never used PyRIT.

Do you want to see how accurate your judge is compared to PyRIT or anything else? We collected human responses and have a framework you can use: just adapt your judge or evaluation into a PyRIT scorer and run it.


@rlundeen2 rlundeen2 left a comment


I'd wait for one more approval here, but looks good! I have two notes :)
