Cannot load the checkpoint #782
Comments
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
When this message shows up, it usually means that one of the checkpoint files is incomplete (e.g. broken during transfer). Can you check the local files?
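One quick way to check the local files is to exploit the fact that modern PyTorch checkpoints are zip archives, so a shard that fails `zipfile.is_zipfile()` was almost certainly truncated in transfer. A minimal sketch (the directory layout and the `.pt` suffix are assumptions about your checkpoint folder):

```python
import os
import zipfile

def find_broken_shards(ckpt_dir):
    """Return the names of .pt files that are not valid zip archives.

    PyTorch >= 1.6 saves checkpoints in a zip-based format, so a file
    missing its central directory (the error in the traceback) will
    fail zipfile.is_zipfile(). This only catches truncation, not
    subtler corruption.
    """
    broken = []
    for name in sorted(os.listdir(ckpt_dir)):
        if name.endswith(".pt"):
            path = os.path.join(ckpt_dir, name)
            if not zipfile.is_zipfile(path):
                broken.append(name)
    return broken
```

Comparing file sizes against the source of the download is another cheap sanity check.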
I get this error:
How big is your GPU? You need a rather large GPU to load a 20B model, and it seems you simply don’t have enough VRAM.
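The arithmetic behind this is simple. A back-of-envelope sketch (fp16 weights only; this deliberately ignores activations, optimizer states, and framework overhead, and the helper name is mine, not part of GPT-NeoX):

```python
def min_weight_memory_gib(n_params, bytes_per_param=2):
    """Lower bound on memory for the weights alone, in GiB.

    bytes_per_param=2 assumes fp16/bf16 storage. Activations, KV
    cache, and optimizer states come on top of this, so the real
    requirement is higher.
    """
    return n_params * bytes_per_param / 2**30

# 20B parameters in fp16 need ~37 GiB just for the weights, so a
# single 23 GiB A10G cannot hold the model without model parallelism.
print(f"{min_weight_memory_gib(20e9):.1f} GiB")
```

This is why the 20B model only loads when its layers are sharded across several such GPUs.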
Hi @StellaAthena , I'm trying to run inference and fine-tuning using 20B with 8 x NVIDIA A10G 23GB VRAM and still got the
None of the below configs work:
I'm running Version 2.0 of GPT-NeoX
I was able to run using the HF version
Describe the bug
It generates the error when running the generate program
To Reproduce
Steps to reproduce the behavior:
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Traceback (most recent call last):
File "generate.py", line 91, in <module>
main()
File "generate.py", line 33, in main
model, neox_args = setup_for_inference_or_eval(use_cache=True)
File "/work/c272987/gpt-neox/megatron/utils.py", line 440, in setup_for_inference_or_eval
model, _, _ = setup_model_and_optimizer(
File "/work//gpt-neox/megatron/training.py", line 447, in setup_model_and_optimizer
neox_args.iteration = load_checkpoint(
File "/work//gpt-neox/megatron/checkpointing.py", line 239, in load_checkpoint
checkpoint_name, state_dict = model.load_checkpoint(
File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1523, in load_checkpoint
load_path, client_states = self._load_checkpoint(load_dir,
File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1558, in _load_checkpoint
self.load_module_state_dict(state_dict=checkpoint['module'],
File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1278, in load_module_state_dict
self.module.load_state_dir(load_dir=self._curr_ckpt_path, strict=strict)
File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 571, in load_state_dir
layer.load_state_dict(torch.load(model_ckpt_path,
File "/work//gpt-neox/venv/lib/python3.8/site-packages/torch/serialization.py", line 778, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "/work//gpt-neox/venv/lib/python3.8/site-packages/torch/serialization.py", line 282, in __init__
super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/work/c272987/gpt-neox/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 179, in <module>
main()
File "/work/c272987/gpt-neox/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 169, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/work/c272987/gpt-neox/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
Expected behavior
Runs smoothly.
Environment (please complete the following information):