[Bug]: Weight data type float16 produce error #402

Closed
seniorsolt opened this issue Jul 23, 2024 · 5 comments · Fixed by #409
Labels: bug (Something isn't working)

Comments

@seniorsolt
Contributor

seniorsolt commented Jul 23, 2024

What happened?

Weight data type float16 produces an error.

Config attached: default.json

Model used: majicmixRealistic_v7-inpainting.safetensors (fp16)
https://civitai.com/models/43331?modelVersionId=221343

I think it's because of some places where the dtype is explicitly set to .float().

[screenshot attached]

What did you expect would happen?

That float16 training would work.

Relevant log output

Traceback (most recent call last):
  File "C:\Users\Max\Desktop\OneTrainer_test\modules\ui\TrainUI.py", line 538, in __training_thread_function
    trainer.train()
  File "C:\Users\Max\Desktop\OneTrainer_test\modules\trainer\GenericTrainer.py", line 574, in train
    model_output_data = self.model_setup.predict(self.model, batch, self.config, train_progress)
  File "C:\Users\Max\Desktop\OneTrainer_test\modules\modelSetup\BaseStableDiffusionSetup.py", line 351, in predict
    predicted_latent_noise = model.unet(
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\src\diffusers\src\diffusers\models\unets\unet_2d_condition.py", line 1135, in forward
    emb = self.time_embedding(t_emb, timestep_cond)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\src\diffusers\src\diffusers\models\embeddings.py", line 376, in forward
    sample = self.linear_1(sample)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Max\Desktop\OneTrainer_test\venv\lib\site-packages\torch\nn\modules\linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half
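For reference, a minimal standalone sketch (the layer sizes are illustrative, not taken from the model) that reproduces the same class of dtype mismatch: a float32 activation fed into a linear layer whose weights were cast to float16.

import torch

linear = torch.nn.Linear(320, 1280).half()  # fp16 weights, like the fp16 UNet
sample = torch.randn(1, 320)                # float32 input, e.g. after an explicit .float()

try:
    linear(sample)                          # raises a RuntimeError about mismatched dtypes (Float vs Half)
except RuntimeError as err:
    print(err)

# Casting the input to the layer's weight dtype resolves the mismatch:
_ = linear(sample.to(linear.weight.dtype))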

Output of pip freeze

No response

seniorsolt added the bug label on Jul 23, 2024
@mx
Collaborator

mx commented Jul 23, 2024

This is possibly a problem with using a cache that was created in fp32. I noticed your config has clear_cache_before_training: false. Try clearing the cache and see if that fixes your problem.

@seniorsolt
Contributor Author

Clearing the cache didn't help, but I noticed that the error goes away when I turn on masked training. It seems that without masking, batch['latent_mask'] has dtype float32, since it comes from the GenerateImageLike mgds node.

So

latent_input = torch.concat(
    [scaled_noisy_latent_image, batch['latent_mask'], scaled_latent_conditioning_image], 1
)

produces a float32 tensor, and the fp16 UNet raises the error.
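A minimal sketch of one possible workaround (not necessarily what the eventual fix does): cast the mask and the conditioning image to the latent dtype before concatenating, so the UNet input matches its fp16 weights. It reuses the variable names from the snippet above and assumes the surrounding predict() context.

latent_dtype = scaled_noisy_latent_image.dtype
latent_input = torch.concat(
    [
        scaled_noisy_latent_image,
        batch['latent_mask'].to(dtype=latent_dtype),              # fp32 mask from GenerateImageLike -> fp16
        scaled_latent_conditioning_image.to(dtype=latent_dtype),
    ],
    1,
)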

@djp3k05

djp3k05 commented Jul 23, 2024

I can also confirm this issue (with any model), but I ignored it as I usually go with bf16. Clearing the cache did not help either.

@seniorsolt
Contributor Author

seniorsolt commented Jul 23, 2024

Similar errors also arise in the debug process and in the sampling process. They can be seen after fixing the GenerateImageLike node.

@seniorsolt
Contributor Author

During sampling:

Traceback (most recent call last): | 0/1 [00:00<?, ?it/s]
  File "C:\Users\Max\Desktop\OneTrainer\modules\trainer\GenericTrainer.py", line 245, in __sample_loop
    self.model_sampler.sample(
  File "C:\Users\Max\Desktop\OneTrainer\modules\modelSampler\StableDiffusionSampler.py", line 454, in sample
    image = self.__sample_inpainting(
  File "C:\Users\Max\Desktop\OneTrainer\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\Max\Desktop\OneTrainer\modules\modelSampler\StableDiffusionSampler.py", line 272, in __sample_inpainting
    eroded_mask = erode_kernel(mask)
  .......
  File "C:\Users\Max\Desktop\OneTrainer\venv\lib\site-packages\torch\nn\modules\conv.py", line 453, in _conv_forward
    return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same
Error during sampling, proceeding without sampling

During the debug process:
Traceback (most recent call last):
  File "C:\Users\Max\Desktop\OneTrainer\modules\ui\TrainUI.py", line 538, in __training_thread_function
    trainer.train()
  File "C:\Users\Max\Desktop\OneTrainer\modules\trainer\GenericTrainer.py", line 574, in train
    model_output_data = self.model_setup.predict(self.model, batch, self.config, train_progress)
  File "C:\Users\Max\Desktop\OneTrainer\modules\modelSetup\BaseStableDiffusionSetup.py", line 446, in predict
    predicted_image = model.vae.decode(predicted_latent_image).sample
  .........
  File "C:\Users\Max\Desktop\OneTrainer\venv\lib\site-packages\torch\nn\modules\conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (float) and bias type (struct c10::Half) should be the same
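Both of these are the same pattern as the first traceback: the input tensor and the module weights end up in different dtypes. A self-contained sketch of the sampling case (not OneTrainer's actual code, just an illustration with made-up shapes):

import torch

# An "erode" kernel created with default float32 weights, applied to an fp16 mask,
# reproduces the Input type / weight type mismatch from the sampling traceback.
erode_kernel = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)  # float32 weights
mask = torch.ones(1, 1, 64, 64, dtype=torch.float16)                        # fp16 mask

# erode_kernel(mask)  # RuntimeError: Input type ... and weight type ... should be the same

# Casting one side to the other's dtype avoids the error:
eroded_mask = erode_kernel(mask.to(erode_kernel.weight.dtype))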

seniorsolt added a commit to seniorsolt/OneTrainer_fork that referenced this issue on Jul 24, 2024
Nerogar reopened this on Jul 24, 2024