refine the benchmark eval UX #156

Merged
merged 6 commits into from
Feb 21, 2025

SLR722
Contributor

@SLR722 SLR722 commented Feb 21, 2025

What does this PR do?

Refine the benchmark eval CLI so that running evals on standard benchmarks is a simpler experience for users.

The benchmarks need to be defined as resources in the distro template.

Improvements include:

  • Users no longer need to pass in an arbitrary eval-task-config. They only need to pass the list of benchmarks to evaluate, the ID of the model to evaluate, and the output directory in which to store the eval results.
  • Aggregate results are written to the output file, since aggregate results are typically what users care about most.
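
As a rough illustration of the second point, aggregation could look something like the sketch below. This is not the PR's implementation: the per-example row shape (`{"score": ...}`), the helper names `aggregate_scores` and `write_aggregate`, and the output file name `<benchmark_id>_results.json` are all assumptions made for the example.

```python
import json
from pathlib import Path
from statistics import mean


def aggregate_scores(score_rows):
    """Average the numeric "score" of each row, per scoring function.

    `score_rows` maps a scoring-function name to a list of per-example
    dicts such as {"score": 1.0}. This shape is assumed for illustration.
    """
    aggregated = {}
    for scoring_fn, rows in score_rows.items():
        scores = [float(row["score"]) for row in rows]
        aggregated[scoring_fn] = {
            "num_examples": len(scores),
            # None signals "nothing to average" instead of raising on empty input.
            "mean_score": mean(scores) if scores else None,
        }
    return aggregated


def write_aggregate(benchmark_id, score_rows, output_dir):
    """Write aggregate results as JSON; the file name is hypothetical."""
    output_path = Path(output_dir) / f"{benchmark_id}_results.json"
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(json.dumps(aggregate_scores(score_rows), indent=2))
    return output_path
```

Keeping the aggregation separate from the file write means the summary can also be printed to the terminal without touching disk.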

Test Plan

Spin up a Llama Stack server with eval benchmarks defined.
Run llama-stack-client --endpoint xxxx eval run-benchmark "meta-reference-simpleqa" --model_id "meta-llama/Llama-3.1-8B-Instruct" --output_dir "/home/markchen1015/" --num_examples 5
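
For scripted runs, the same invocation can be assembled in Python. This is a minimal sketch that uses only the flags shown in the command above. The helper name `build_run_benchmark_command` is hypothetical, and passing several benchmark IDs as additional positional arguments is an assumption based on the description ("the list of benchmarks"), not something the test plan demonstrates.

```python
import shlex


def build_run_benchmark_command(endpoint, benchmark_ids, model_id, output_dir, num_examples=None):
    """Build the argv list for `llama-stack-client eval run-benchmark`.

    Hypothetical helper: only flags that appear in the test plan are used.
    """
    if not benchmark_ids:
        raise ValueError("at least one benchmark id is required")
    argv = [
        "llama-stack-client",
        "--endpoint", endpoint,
        "eval", "run-benchmark",
        *benchmark_ids,  # assumption: multiple ids are passed positionally
        "--model_id", model_id,
        "--output_dir", output_dir,
    ]
    if num_examples is not None:
        argv += ["--num_examples", str(num_examples)]
    return argv


argv = build_run_benchmark_command(
    endpoint="xxxx",  # placeholder, left elided as in the test plan
    benchmark_ids=["meta-reference-simpleqa"],
    model_id="meta-llama/Llama-3.1-8B-Instruct",
    output_dir="/home/markchen1015/",
    num_examples=5,
)
print(shlex.join(argv))  # shell-quoted form of the command, for logging
```

With a server running, the list can be handed to `subprocess.run(argv, check=True)`. Building a list rather than a shell string avoids quoting problems in model IDs and paths.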

Returned output:
[Screenshot 2025-02-20 at 4 29 35 PM]

Contents of the output file:

[Screenshot 2025-02-20 at 4 30 08 PM] [Screenshot 2025-02-20 at 4 17 05 PM]

@SLR722 SLR722 changed the title [WIP] refine the benchmark eva; UX [WIP] refine the benchmark eval UX Feb 21, 2025
@SLR722 SLR722 marked this pull request as ready for review February 21, 2025 00:23
@SLR722 SLR722 changed the title [WIP] refine the benchmark eval UX refine the benchmark eval UX Feb 21, 2025
Contributor

@yanxi0830 yanxi0830 left a comment

Thank you!

@SLR722 SLR722 merged commit c645726 into main Feb 21, 2025
2 checks passed
@SLR722 SLR722 deleted the open_benchmark branch February 21, 2025 01:14

4 participants