[Bug]: Weight data type float16 produce error #402

Closed
seniorsolt opened this issue Jul 23, 2024 · 5 comments · Fixed by #409
Labels: bug (Something isn't working)

Comments

@seniorsolt
Contributor

seniorsolt commented Jul 23, 2024

What happened?

Weight data type float16 produces an error.

Config attached: default.json

Model used: majicmixRealistic_v7-inpainting.safetensors (fp16)
https://civitai.com/models/43331?modelVersionId=221343

I think it's because of some places where the dtype is explicitly set to .float().

[screenshot attached]

What did you expect would happen?

That float16 training would work.

Relevant log output

Traceback (most recent call last):
  File "C:\Users\Max\Desktop\OneTrainer_test\modules\ui\TrainUI.py", line 538, in __training_thread_function
    trainer.train()
  File "C:\Users\Max\Desktop\OneTrainer_test\modules\trainer\GenericTrainer.py", line 574, in train
    model_output_data = self.model_setup.predict(self.model, batch, self.config, train_progress)
  File "C:\Users\Max\Desktop\OneTrainer_test\modules\modelSetup\BaseStableDiffusionSetup.py", line 351, in predict
    predicted_latent_noise = model.unet(
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\src\diffusers\src\diffusers\models\unets\unet_2d_condition.py", line 1135, in forward
    emb = self.time_embedding(t_emb, timestep_cond)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\src\diffusers\src\diffusers\models\embeddings.py", line 376, in forward
    sample = self.linear_1(sample)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\lib\site-packages\torch\nn\modules\linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half
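For reference, a minimal standalone sketch (the layer sizes are illustrative, not taken from the model) that reproduces the same class of dtype mismatch: a float32 activation fed into a linear layer whose weights were cast to float16.

import torch

linear = torch.nn.Linear(320, 1280).half()  # fp16 weights, like the fp16 UNet
sample = torch.randn(1, 320)                # float32 input, e.g. after an explicit .float()

try:
    linear(sample)                          # raises a RuntimeError about mismatched dtypes (Float vs Half)
except RuntimeError as err:
    print(err)

# Casting the input to the layer's weight dtype resolves the mismatch:
_ = linear(sample.to(linear.weight.dtype))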

Output of pip freeze

No response

seniorsolt added the bug label on Jul 23, 2024
@mx
Collaborator

mx commented Jul 23, 2024

This is possibly a problem with using a cache that was created in fp32. I noticed your config has clear_cache_before_training: false. Try clearing the cache and see if that fixes your problem.

@seniorsolt
Contributor Author

Clearing the cache didn't help, but I noticed that the error goes away when I turn on masked training. It seems that without masking, batch['latent_mask'] has dtype float32, since it comes from the GenerateImageLike mgds node.

So

latent_input = torch.concat(
    [scaled_noisy_latent_image, batch['latent_mask'], scaled_latent_conditioning_image], 1
)

produces a float32 tensor, and the fp16 UNet raises the error.
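A minimal sketch of one possible workaround (not necessarily what the eventual fix does): cast the mask and the conditioning image to the latent dtype before concatenating, so the UNet input matches its fp16 weights. It reuses the variable names from the snippet above and assumes the surrounding predict() context.

latent_dtype = scaled_noisy_latent_image.dtype
latent_input = torch.concat(
    [
        scaled_noisy_latent_image,
        batch['latent_mask'].to(dtype=latent_dtype),              # fp32 mask from GenerateImageLike -> fp16
        scaled_latent_conditioning_image.to(dtype=latent_dtype),
    ],
    1,
)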

@djp3k05

djp3k05 commented Jul 23, 2024

I can also confirm this issue (with any model), but I ignored it as I usually go with bf16. Clearing the cache did not help either.

@seniorsolt
Contributor Author

seniorsolt commented Jul 23, 2024

Similar errors also arise in the debug process and in the sampling process. They can be seen after fixing the GenerateImageLike node.

@seniorsolt
Contributor Author

During sampling:

Traceback (most recent call last): | 0/1 [00:00<?, ?it/s]
  File "C:\Users\Max\Desktop\OneTrainer\modules\trainer\GenericTrainer.py", line 245, in __sample_loop
    self.model_sampler.sample(
  File "C:\Users\Max\Desktop\OneTrainer\modules\modelSampler\StableDiffusionSampler.py", line 454, in sample
    image = self.__sample_inpainting(
  File "C:\Users\Max\Desktop\OneTrainer\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\Max\Desktop\OneTrainer\modules\modelSampler\StableDiffusionSampler.py", line 272, in __sample_inpainting
    eroded_mask = erode_kernel(mask)
  .......
  File "C:\Users\Max\Desktop\OneTrainer\venv\lib\site-packages\torch\nn\modules\conv.py", line 453, in _conv_forward
    return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same
Error during sampling, proceeding without sampling

During the debug process:
Traceback (most recent call last):
  File "C:\Users\Max\Desktop\OneTrainer\modules\ui\TrainUI.py", line 538, in __training_thread_function
    trainer.train()
  File "C:\Users\Max\Desktop\OneTrainer\modules\trainer\GenericTrainer.py", line 574, in train
    model_output_data = self.model_setup.predict(self.model, batch, self.config, train_progress)
  File "C:\Users\Max\Desktop\OneTrainer\modules\modelSetup\BaseStableDiffusionSetup.py", line 446, in predict
    predicted_image = model.vae.decode(predicted_latent_image).sample
  .........
  File "C:\Users\Max\Desktop\OneTrainer\venv\lib\site-packages\torch\nn\modules\conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (float) and bias type (struct c10::Half) should be the same
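Both of these are the same pattern as the first traceback: the input tensor and the module weights end up in different dtypes. A self-contained sketch of the sampling case (not OneTrainer's actual code, just an illustration with made-up shapes):

import torch

# An "erode" kernel created with default float32 weights, applied to an fp16 mask,
# reproduces the Input type / weight type mismatch from the sampling traceback.
erode_kernel = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)  # float32 weights
mask = torch.ones(1, 1, 64, 64, dtype=torch.float16)                        # fp16 mask

# erode_kernel(mask)  # RuntimeError: Input type ... and weight type ... should be the same

# Casting one side to the other's dtype avoids the error:
eroded_mask = erode_kernel(mask.to(erode_kernel.weight.dtype))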

seniorsolt added a commit to seniorsolt/OneTrainer_fork that referenced this issue on Jul 24, 2024
Nerogar reopened this on Jul 24, 2024