Golden answer: ['Red, Blue and Green', 'red blue and green', 'red blue and green']
Model output: The three primary colors of light are red, green, and blue.
Correct: False
Just switching the order makes it wrong. Is this reasonable?
Golden answer: ['24,900 miles', '24 900 miles', '24 900 miles']
Model output: The approximate circumference of the Earth is about 24,901 miles or 40,075 kilometers.
Correct: False
I believe 24,901 miles is the more precise answer.
The first evaluation for speech QA tasks was established by Nachmani, Eliya, et al., "Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM," arXiv preprint arXiv:2305.15255 (2023).
Later, Moshi and GLM-4-Voice followed this evaluation framework, and we have adopted the same convention.
Your observation is correct and important, and it highlights a challenge that is both difficult and common in QA evaluation. We acknowledge this and will address it as part of our future work!
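For illustration only, here is a minimal sketch of one way an order-insensitive check could look. It is not the scoring code used by this repository or by the cited paper. The function names `tokens` and `order_insensitive_match` are hypothetical, and the normalization rules (lowercasing, dropping punctuation, joining digit groups such as `24,900`) are assumptions.

```python
import re

def tokens(text):
    """Lowercase, join comma-separated digit groups, and return the set of word tokens."""
    text = text.lower()
    # Assumption: "24,900" should be treated as the single token "24900".
    text = re.sub(r"(?<=\d),(?=\d)", "", text)
    return set(re.findall(r"[a-z0-9]+", text))

def order_insensitive_match(golden_answers, model_output):
    """Return True if every token of at least one golden answer appears in the output."""
    out = tokens(model_output)
    for golden in golden_answers:
        gold = tokens(golden)
        # Skip empty golden answers so they cannot match trivially.
        if gold and gold <= out:
            return True
    return False

colors_gold = ['Red, Blue and Green', 'red blue and green', 'red blue and green']
colors_out = "The three primary colors of light are red, green, and blue."

earth_gold = ['24,900 miles', '24 900 miles', '24 900 miles']
earth_out = ("The approximate circumference of the Earth is about "
             "24,901 miles or 40,075 kilometers.")

print(order_insensitive_match(colors_gold, colors_out))
print(order_insensitive_match(earth_gold, earth_out))
```

Under these assumptions the reordered color answer is accepted, while the 24,901 miles answer is still rejected, because token matching cannot tell that 24,901 is close to 24,900. Numeric answers would need a separate tolerance rule, which is part of why this is hard to solve in general.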