About the metric of QA task #1

tianzhangwu · 2025-01-21T07:30:17Z

Golden answer: ['Red, Blue and Green', 'red blue and green', 'red blue and green']
Model output: The three primary colors of light are red, green, and blue.
Correct: False

Just switching the order makes it wrong. Is this reasonable?
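For illustration, here is a minimal sketch of why the order matters. It assumes the metric normalizes text and then checks whether any golden answer appears verbatim in the model output; that assumption reproduces the result above, but it is not confirmed to be the repository's actual implementation. The function names `normalize`, `substring_match`, and `token_set_match` are hypothetical.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def substring_match(golden: list[str], output: str) -> bool:
    """Assumed current behaviour: a normalized golden answer must appear verbatim."""
    out = normalize(output)
    return any(normalize(g) in out for g in golden)

def token_set_match(golden: list[str], output: str) -> bool:
    """Order-insensitive variant: every token of a golden answer appears in the output."""
    out_tokens = set(normalize(output).split())
    return any(set(normalize(g).split()) <= out_tokens for g in golden)

golden = ["Red, Blue and Green", "red blue and green", "red blue and green"]
output = "The three primary colors of light are red, green, and blue."

print(substring_match(golden, output))  # False: "red blue and green" is not a contiguous substring
print(token_set_match(golden, output))  # True: all four golden tokens occur in the output
```

The token-set variant is more permissive, so it could also accept wrong answers that merely contain the same words; it is one possible trade-off, not a recommended replacement.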

tianzhangwu · 2025-01-21T07:32:36Z

And this one:

Golden answer: ['24,900 miles', '24 900 miles', '24 900 miles']
Model output: The approximate circumference of the Earth is about 24,901 miles or 40,075 kilometers.
Correct: False

I believe 24,901 miles is the more precise answer.
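A numeric answer like this one could, in principle, be compared with a tolerance instead of string equality. The sketch below is only an illustration: the helper names and the 0.1% relative tolerance are assumptions, and it handles comma-separated thousands only (the space-separated golden form `24 900 miles` would need extra handling).

```python
import math
import re

def extract_numbers(text: str) -> list[float]:
    """Pull out numbers, treating commas as thousands separators."""
    return [float(m.replace(",", "")) for m in re.findall(r"\d[\d,]*(?:\.\d+)?", text)]

def numeric_match(golden: str, output: str, rel_tol: float = 0.001) -> bool:
    """True if any number in the output is within rel_tol of a number in the golden answer."""
    targets = extract_numbers(golden)
    candidates = extract_numbers(output)
    return any(math.isclose(c, t, rel_tol=rel_tol) for t in targets for c in candidates)

output = ("The approximate circumference of the Earth is about "
          "24,901 miles or 40,075 kilometers.")

print(numeric_match("24,900 miles", output))  # True: 24,901 is within 0.1% of 24,900
```

This check ignores units, so it would need to be combined with a unit comparison before it could be used as a real metric.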

UltraEval · 2025-01-23T03:24:34Z

The first evaluation for speech QA tasks was established by Nachmani, Eliya, et al., "Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM," arXiv preprint arXiv:2305.15255 (2023).

Later, Moshi and GLM-4-Voice followed this evaluation framework, and we have adopted it as well.

Your observation is correct and crucial, yet it highlights a challenge that is both difficult and common in QA evaluation. We acknowledge this and will address it as part of our future work!

Allessyer · 2025-02-12T13:17:48Z

What is an Elo score in Audio Arena? I couldn't find any information on the GitHub page.
