
About the metric of QA task #1

Open
tianzhangwu opened this issue Jan 21, 2025 · 3 comments

Comments

@tianzhangwu

```
Golden answer: ['Red, Blue and Green', 'red blue and green', 'red blue and green']
Model output: The three primary colors of light are red, green, and blue.
Correct: False
```

Just switching the order makes it wrong. Is this reasonable?

@tianzhangwu (Author)

And this one:

```
Golden answer: ['24,900 miles', '24 900 miles', '24 900 miles']
Model output: The approximate circumference of the Earth is about 24,901 miles or 40,075 kilometers.
Correct: False
```

I believe 24,901 miles is the more precise answer.
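
For reference, here is a minimal sketch of the kind of normalized substring match that could produce both results above. The helper names are hypothetical, and the actual UltraEval scorer may differ:

```python
import re


def normalize(text: str) -> str:
    # Lowercase and replace punctuation with spaces, so that
    # "Red, Blue and Green" and "red blue and green" normalize identically.
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def exact_match(golden_answers: list[str], model_output: str) -> bool:
    # Score True only if some normalized golden answer appears as a
    # contiguous substring of the normalized model output.
    output = normalize(model_output)
    return any(normalize(ans) in output for ans in golden_answers)


# Reordered colors: "red blue and green" is not a contiguous span of
# "... red green and blue", so the output is scored False.
print(exact_match(
    ["Red, Blue and Green", "red blue and green"],
    "The three primary colors of light are red, green, and blue.",
))  # False

# More precise figure: "24 900 miles" is not a substring of
# "... 24 901 miles or 40 075 kilometers", so it is scored False as well.
print(exact_match(
    ["24,900 miles", "24 900 miles"],
    "The approximate circumference of the Earth is about 24,901 miles "
    "or 40,075 kilometers.",
))  # False
```

Under a check like this, any phrasing that breaks the contiguous golden span, whether by reordering words or by giving a more precise figure, is scored as wrong.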

@UltraEval (Collaborator)

The first evaluation protocol for speech QA tasks was established by Nachmani, Eliya, et al. "Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM." arXiv preprint arXiv:2305.15255 (2023).


Later, Moshi and GLM-4-Voice followed this evaluation framework, and we have adopted the same convention.

Your observation is correct and important, and it highlights a challenge that is both difficult and common in QA evaluation. We acknowledge this and will address it as part of our future work!

@Allessyer

What is an ELO score in Audio Arena? I couldn't find any information about it on the GitHub page.
