Cannot load the checkpoint #782
Comments
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
When this message shows up, it usually means that one of the checkpoint files is incomplete (e.g. broken during transfer). Can you check the local files?
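One quick way to check the local files is to exploit the fact that modern PyTorch checkpoints are zip archives, so a shard that fails `zipfile.is_zipfile()` was almost certainly truncated in transfer. A minimal sketch (the directory layout and the `.pt` suffix are assumptions about your checkpoint folder):

```python
import os
import zipfile

def find_broken_shards(ckpt_dir):
    """Return the names of .pt files that are not valid zip archives.

    PyTorch >= 1.6 saves checkpoints in a zip-based format, so a file
    missing its central directory (the error in the traceback) will
    fail zipfile.is_zipfile(). This only catches truncation, not
    subtler corruption.
    """
    broken = []
    for name in sorted(os.listdir(ckpt_dir)):
        if name.endswith(".pt"):
            path = os.path.join(ckpt_dir, name)
            if not zipfile.is_zipfile(path):
                broken.append(name)
    return broken
```

Comparing file sizes against the source of the download is another cheap sanity check.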
I get this error:
How big is your GPU? You need a rather large GPU to load a 20B model, and it seems you simply don’t have enough VRAM.
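The arithmetic behind this is simple. A back-of-envelope sketch (fp16 weights only; this deliberately ignores activations, optimizer states, and framework overhead, and the helper name is mine, not part of GPT-NeoX):

```python
def min_weight_memory_gib(n_params, bytes_per_param=2):
    """Lower bound on memory for the weights alone, in GiB.

    bytes_per_param=2 assumes fp16/bf16 storage. Activations, KV
    cache, and optimizer states come on top of this, so the real
    requirement is higher.
    """
    return n_params * bytes_per_param / 2**30

# 20B parameters in fp16 need ~37 GiB just for the weights, so a
# single 23 GiB A10G cannot hold the model without model parallelism.
print(f"{min_weight_memory_gib(20e9):.1f} GiB")
```

This is why the 20B model only loads when its layers are sharded across several such GPUs.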
Hi @StellaAthena , I'm trying to run inference and fine-tuning using 20B with 8 x NVIDIA A10G 23GB VRAM and still got the
None of the below configs work:
I'm running Version 2.0 of GPT-NeoX
I was able to run using the HF version
Describe the bug
It generates the error when running the generate program
To Reproduce
Steps to reproduce the behavior:
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Traceback (most recent call last):
File "generate.py", line 91, in <module>
main()
File "generate.py", line 33, in main
model, neox_args = setup_for_inference_or_eval(use_cache=True)
File "/work/c272987/gpt-neox/megatron/utils.py", line 440, in setup_for_inference_or_eval
model, _, _ = setup_model_and_optimizer(
File "/work//gpt-neox/megatron/training.py", line 447, in setup_model_and_optimizer
neox_args.iteration = load_checkpoint(
File "/work//gpt-neox/megatron/checkpointing.py", line 239, in load_checkpoint
checkpoint_name, state_dict = model.load_checkpoint(
File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1523, in load_checkpoint
load_path, client_states = self._load_checkpoint(load_dir,
File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1558, in _load_checkpoint
self.load_module_state_dict(state_dict=checkpoint['module'],
File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1278, in load_module_state_dict
self.module.load_state_dir(load_dir=self._curr_ckpt_path, strict=strict)
File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 571, in load_state_dir
layer.load_state_dict(torch.load(model_ckpt_path,
File "/work//gpt-neox/venv/lib/python3.8/site-packages/torch/serialization.py", line 778, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "/work//gpt-neox/venv/lib/python3.8/site-packages/torch/serialization.py", line 282, in __init__
super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/work/c272987/gpt-neox/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 179, in <module>
main()
File "/work/c272987/gpt-neox/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 169, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/work/c272987/gpt-neox/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
Expected behavior
Runs smoothly.
Environment (please complete the following information):