BuilderConfig questions #13

Open
xiaos16 opened this issue Mar 14, 2025 · 9 comments

Comments

@xiaos16

xiaos16 commented Mar 14, 2025

Hello, when I test tedlium and fleurs-zh, I get these errors:

ERROR - load args is {'path': 'TwinkStart/tedlium', 'name': 'release1'}load dataset error: BuilderConfig 'release1' not found. Available: ['default']

ERROR - load args is {'path': 'google/fleurs', 'name': 'cmn_hans_cn', 'split': 'test'}load dataset error: BuilderConfig 'cmn_hans_cn' not found. Available: ['default']

How can I solve this? Thanks!

@UltraEval
Collaborator

UltraEval commented Mar 18, 2025

Unable to Reproduce the Bug

To verify the issue, please run the following code and share your results:

from datasets import get_dataset_config_names

configs = get_dataset_config_names("TwinkStart/tedlium")
print(configs)

from datasets import get_dataset_config_names

configs = get_dataset_config_names("google/fleurs")
print(configs)

@xiaos16
Author

xiaos16 commented Mar 18, 2025

from datasets import get_dataset_config_names
configs = get_dataset_config_names("TwinkStart/tedlium")
print(configs)
['default']
from datasets import get_dataset_config_names
configs = get_dataset_config_names("google/fleurs")
print(configs)
['default']

and I run these commands:
'''
python audio_evals/main.py --dataset fleurs-zh --prompt mini-cpm-omni-asr-zh --model MiniCPMo2_6-audio
python audio_evals/main.py --dataset tedlium-release1 --prompt mini-cpm-omni-asr-en --model MiniCPMo2_6-audio
'''

in /UltraEval-Audio-main/registry/dataset/fleurs.yaml, it is this:

fleurs-zh:
  class: audio_evals.dataset.huggingface.Huggingface
  args:
    subset: cmn_hans_cn
    default_task: asr-zh
    name: google/fleurs
    ref_col: raw_transcription
    split: test
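
For reference, the error log above reports the load arguments as {'path': 'google/fleurs', 'name': 'cmn_hans_cn', 'split': 'test'}, which suggests that the registry's `name` becomes the dataset path and `subset` becomes the config name. A minimal sketch of that mapping follows. The helper `to_load_args` is hypothetical and only illustrates the apparent translation; it is not the project's actual code.

```python
def to_load_args(registry_args: dict) -> dict:
    """Hypothetical helper: translate registry args into load_dataset kwargs."""
    # The registry's `name` is the dataset path / Hub repository ID.
    load_args = {"path": registry_args["name"]}
    # The registry's `subset` is what datasets calls the config `name`.
    if "subset" in registry_args:
        load_args["name"] = registry_args["subset"]
    if "split" in registry_args:
        load_args["split"] = registry_args["split"]
    return load_args


# Args copied from the fleurs-zh entry quoted in this thread.
fleurs_zh = {
    "subset": "cmn_hans_cn",
    "default_task": "asr-zh",
    "name": "google/fleurs",
    "ref_col": "raw_transcription",
    "split": "test",
}
print(to_load_args(fleurs_zh))
# {'path': 'google/fleurs', 'name': 'cmn_hans_cn', 'split': 'test'}
```

If this reading is right, the YAML entry itself is consistent with the logged arguments, and the mismatch lies in which configs the dataset reports.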

@UltraEval
Collaborator

I know this config. Please run the following code in your Python shell:

from datasets import get_dataset_config_names

configs = get_dataset_config_names("TwinkStart/tedlium")
print(configs)

from datasets import get_dataset_config_names

configs = get_dataset_config_names("google/fleurs")
print(configs)

and share your results.

@xiaos16
Author

xiaos16 commented Mar 18, 2025

yes, the results are both ['default']

from datasets import get_dataset_config_names
configs = get_dataset_config_names("TwinkStart/tedlium")
print(configs)
['default']
from datasets import get_dataset_config_names
configs = get_dataset_config_names("google/fleurs")
print(configs)
['default']

@xiaos16
Author

xiaos16 commented Mar 18, 2025

Image

@UltraEval
Collaborator

You need to check:

It should look like this:

Image

  • Upgrade the datasets package

Image
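
The suggestion to upgrade the datasets package is easier to act on when the installed versions are known. The sketch below uses only the standard library. Checking huggingface_hub alongside datasets is an assumption made here, not something requested in the thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# huggingface_hub is included as an assumption; the thread only names datasets.
for pkg in ("datasets", "huggingface_hub"):
    print(pkg, installed_version(pkg))
```

Posting this output together with the Python version makes it easier for maintainers to compare environments.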

@xiaos16
Author

xiaos16 commented Mar 18, 2025

Which version do you use? I use datasets==3.3.2. I also tried 3.4.1, but it did not work for me either.

I've downloaded the data locally.

./TwinkStart/tedlium/release1/test-00000-of-00001.parquet

./google/fleurs/data/cmn_hans_cn/audio/test.tar.gz
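
One thing that may be worth ruling out, given the local copies listed above: `load_dataset` accepts a local directory as well as a Hub repository ID, and a directory whose relative path matches the ID (for example ./google/fleurs under the working directory) is generally resolved as a local dataset before the Hub is consulted. A local folder that lacks the dataset's config metadata could then plausibly report only ['default']. This is a possible explanation, not a confirmed diagnosis of this issue. The helper name `shadowing_local_dir` below is hypothetical.

```python
from pathlib import Path
from typing import Optional


def shadowing_local_dir(repo_id: str, cwd: Optional[Path] = None) -> Optional[Path]:
    """Return the local directory that matches repo_id under cwd, if one exists."""
    base = Path.cwd() if cwd is None else Path(cwd)
    candidate = base / repo_id
    return candidate if candidate.is_dir() else None


# Run this from the same directory the evaluation command is launched from.
for repo_id in ("TwinkStart/tedlium", "google/fleurs"):
    hit = shadowing_local_dir(repo_id)
    if hit is not None:
        print(f"{repo_id}: local directory found at {hit}")
    else:
        print(f"{repo_id}: no local directory with this name")
```

If a matching directory is reported, running the same check from a different working directory, or renaming the local folder, would show whether the config list changes.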

@UltraEval
Collaborator

You can try downloading the HF data with the following code:

from datasets import load_dataset

save_path = 'xx'  # placeholder: directory to use as the download cache
dataset = load_dataset('TwinkStart/tedlium', name='release1', cache_dir=save_path)
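
Since a parquet file is already on disk (./TwinkStart/tedlium/release1/test-00000-of-00001.parquet), another option may be to point the generic parquet builder at it directly with `load_dataset("parquet", data_files=...)`. The sketch below groups local shards by split, assuming the file names follow the `<split>-NNNNN-of-NNNNN.parquet` pattern seen above. Both helper names are hypothetical, and whether the evaluation tool can consume a dataset loaded this way is not confirmed in this thread.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches shard names such as "test-00000-of-00001.parquet".
SPLIT_PATTERN = re.compile(r"^(?P<split>[A-Za-z_]+)-\d+-of-\d+\.parquet$")


def parquet_files_by_split(config_dir) -> dict:
    """Group the parquet shards in config_dir by the split name in the file name."""
    files = defaultdict(list)
    for path in sorted(Path(config_dir).glob("*.parquet")):
        match = SPLIT_PATTERN.match(path.name)
        if match:
            files[match.group("split")].append(str(path))
    return dict(files)


def load_local_parquet(config_dir):
    """Load the local shards with the generic parquet builder."""
    # Imported lazily so the grouping helper works without datasets installed.
    from datasets import load_dataset

    return load_dataset("parquet", data_files=parquet_files_by_split(config_dir))
```

For example, `load_local_parquet("./TwinkStart/tedlium/release1")` would return a dataset with one entry per split found in that folder.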

@xiaos16
Author

xiaos16 commented Mar 20, 2025

I will try it, thanks!
