[Core][V0] Add guidance backend for structured output #14589

Open · wants to merge 1 commit into main from llguidance-v0-integration
Conversation

russellb
Member

This commit is based on PR #10217. It has been updated to be compatible with main.

Signed-off-by: Russell Bryant rbryant@redhat.com
Co-authored-by: Loc Huynh lohuynh@microsoft.com
Co-authored-by: Michal Moskal michal@moskal.me


I started looking at this after talking to @joerunde about some performance issues observed in production. While the ultimate goal is to move everyone to V1, where we expect to deliver much more substantial improvements, I wanted to see if we could do something to help in V0 in the meantime.

In my testing, the behavior I see is roughly:

  1. xgrammar is still the fastest, but it has more limited JSON schema support, so it doesn't solve the challenge here.

  2. guidance provides a significant improvement to TTFT, which is the biggest concern here. In production, large and complex JSON schemas were observed to take down the server for an excessively long time.

  3. guidance shows a hit to TPOT for structured output requests.

I'm posting this here so further testing can be done to validate whether the performance characteristics provide enough short-term benefit to justify including it in-tree.
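For anyone who wants to try this out, here is a minimal sketch of exercising the backend through vLLM's OpenAI-compatible server. The schema and prompt are purely illustrative, and it assumes the new backend registers under the name `guidance` and can be overridden per request; if not, start the server with `--guided-decoding-backend guidance` and drop the override.

```python
# Illustrative sketch, not part of this PR. Assumes a server is running, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --guided-decoding-backend guidance
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# A deliberately simple schema; the production cases that motivated this
# PR were much larger and more complex.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Describe a person as JSON."}],
    extra_body={
        "guided_json": schema,
        # Hypothetical per-request backend override; remove to use the
        # server default.
        "guided_decoding_backend": "guidance",
    },
)
# The guided decoder constrains output to the schema, so this should parse.
print(json.loads(completion.choices[0].message.content))
```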

@russellb russellb requested a review from mgoin as a code owner March 11, 2025 01:02

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀


mergify bot commented Mar 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @russellb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 11, 2025
@russellb russellb force-pushed the llguidance-v0-integration branch from 5244b7f to 2db1a0e Compare March 11, 2025 01:14
@mergify mergify bot removed the needs-rebase label Mar 11, 2025
@Harsha-Nori

Hey @russellb, this looks fantastic! On point 3 -- is it possible to share an example of a workload where TPOT regresses vs. xgrammar? I don't typically see this with JSON schemas, so it would be a helpful case to look at for performance improvements.

@russellb
Member Author

Hey @russellb, this looks fantastic! On point 3 -- is it possible to share an example of a workload where TPOT regresses vs. xgrammar? I don't typically see this with JSON schemas, so it would be a helpful case to look at for performance improvements.

I'm using the benchmark suite changes from #14567, applied on top of this PR.

I'm running the following command, varying the structured output backend across guidance, xgrammar, and outlines, and the request rate across 1, 5, and 10 (a sweep sketch follows the command).

python3 benchmarks/benchmark_serving_structured_output.py \
    --port 8432 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset json-unique \
    --structured-output-ratio 1.0 \
    --structured-output-backend guidance \
    --output-len 300 \
    --num-prompts 90 \
    --request-rate 1
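For reference, a small driver for the full sweep described above (three backends crossed with three request rates) could look like the following sketch. It reuses the same illustrative port and model as the command above and assumes a server is already listening.

```python
# Hypothetical sweep driver for the benchmark runs described above; each
# backend is benchmarked at request rates 1, 5, and 10 in sequence.
import subprocess

BACKENDS = ["guidance", "xgrammar", "outlines"]
REQUEST_RATES = [1, 5, 10]

for backend in BACKENDS:
    for rate in REQUEST_RATES:
        subprocess.run(
            [
                "python3", "benchmarks/benchmark_serving_structured_output.py",
                "--port", "8432",
                "--model", "meta-llama/Llama-3.1-8B-Instruct",
                "--dataset", "json-unique",
                "--structured-output-ratio", "1.0",
                "--structured-output-backend", backend,
                "--output-len", "300",
                "--num-prompts", "90",
                "--request-rate", str(rate),
            ],
            check=True,  # stop the sweep if any run fails
        )
```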

This commit is based on PR vllm-project#10217. It has been updated to be compatible with `main`.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Loc Huynh <lohuynh@microsoft.com>
Co-authored-by: Michal Moskal <michal@moskal.me>
@russellb russellb force-pushed the llguidance-v0-integration branch from 2db1a0e to d66d439 Compare March 11, 2025 13:40
@russellb
Member Author

I'm using the benchmark suite changes from #14567, applied on top of this PR.

That PR has since been merged, and I rebased this branch on top of main to include it.
