Triple Clip Encoding by CLIP Text Encode for SD3 Family models Might Not Work with Long Prompts #6325

Open
Mithrillion opened this issue Jan 3, 2025 · 0 comments
Labels: Potential Bug (User is reporting a bug. This should be tested.)

Mithrillion commented Jan 3, 2025

Expected Behavior

When encoding text, the encoder should output conditioning tensors of a consistent size, as described in the technical papers. For SD3/SD3.5, the text embedding shape should be [1, 154, 4096], which is what a short prompt produces. There appears to be a limit on how long the token dimension can grow before the models break.
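For reference, here is a rough sketch (PyTorch; simplified, and the exact padding details are my assumption) of how that [1, 154, 4096] shape is assembled from the three encoders:

import torch
import torch.nn.functional as F

# Per-encoder outputs for a single 77-token prompt window (channel sizes are
# the standard CLIP-L / CLIP-G / T5-XXL hidden sizes).
clip_l = torch.randn(1, 77, 768)
clip_g = torch.randn(1, 77, 1280)
t5xxl  = torch.randn(1, 77, 4096)

lg = torch.cat([clip_l, clip_g], dim=-1)   # channel concat -> [1, 77, 2048]
lg = F.pad(lg, (0, 4096 - lg.shape[-1]))   # zero-pad channels -> [1, 77, 4096]
cond = torch.cat([lg, t5xxl], dim=-2)      # sequence concat -> [1, 154, 4096]
print(cond.shape)                          # torch.Size([1, 154, 4096])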

Actual Behavior

When the prompt is very long, the CLIP Text Encode node appears to concatenate the individual chunk embeddings directly, producing unusual embedding shapes such as [1, 465, 4096]. When passed to SD3.5, the prompt still functions, but the resulting images contain edge artifacts. Slightly overshooting 154 tokens seems to be fine, but there is a limit to how long the token sequence can grow before issues appear: around the 400-token mark, broken edges start to show up on some generated images.
Glitched example: ex_00002_
Normal example: ex_00004_
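For illustration, a conceptual sketch (not ComfyUI's actual implementation) of the chunk-and-concatenate behaviour described above: each extra 77-token window adds another block along the token axis, so the sequence length grows with the prompt instead of staying fixed.

import torch

def encode_chunks(num_chunks: int, window: int = 77, channels: int = 4096) -> torch.Tensor:
    # stand-in for the per-chunk encoder output, joined along the token axis
    return torch.cat([torch.randn(1, window, channels) for _ in range(num_chunks)], dim=1)

print(encode_chunks(2).shape)  # short prompt -> torch.Size([1, 154, 4096])
print(encode_chunks(6).shape)  # long prompt  -> torch.Size([1, 462, 4096]), in the same range as the 465 seen here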

Steps to Reproduce

The reproducing workflow (normal):
sd35_glitch.json
To produce the glitch, add the following sample prompt to the positive prompt text box:
best quality, double exposure, realistic, whimsical, fantastic, splash art, intricate detailed, hyperdetailed, maximalist style, psychedelic, post-apocalyptic, photorealistic, concept art, sharp focus, harmony, serenity, tranquility, mysterious glow, ambient occlusion, halation, cozy ambient lighting, dynamic lighting,masterpiece, liiv1, linquivera, metix, mentixis, excellent composition, finest details, highest aesthetics, strong muted Highlighter of the turquoise ocean around, Blue hue, mystical glowing, best quality sharp focus, high contrast, stylized, clear, colorful, surreal, ultra quality, 8k, best quality, a breathtaking masterpiece, award winning

Observe the different conditioning tensor shape in the console.
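(The "Shapes found:" lines in the log below come from an ad-hoc debug print. A minimal sketch of a hypothetical passthrough node that would produce the same output, assuming the standard ComfyUI custom-node interface and that each conditioning entry is a [tensor, dict] pair carrying a pooled_output:)

class PrintCondShapes:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"conditioning": ("CONDITIONING",)}}

    RETURN_TYPES = ("CONDITIONING",)
    FUNCTION = "run"
    CATEGORY = "utils/debug"

    def run(self, conditioning):
        # each entry is [cond_tensor, extras_dict]; SD3/SDXL/Flux keep the
        # pooled text embedding under "pooled_output"
        shapes = [[list(c[0].shape), list(c[1]["pooled_output"].shape)]
                  for c in conditioning if "pooled_output" in c[1]]
        print("Shapes found:", shapes)
        return (conditioning,)

NODE_CLASS_MAPPINGS = {"PrintCondShapes": PrintCondShapes}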

Debug Logs

[START] Security scan
[DONE] Security scan
## ComfyUI-Manager: installing dependencies done.
** ComfyUI startup time: 2025-01-03 11:52:31.457522
** Platform: Linux
** Python version: 3.11.10 | packaged by conda-forge | (main, Oct 16 2024, 01:27:36) [GCC 13.3.0]
** Python executable: /mnt/Nova/Envs/comfy/bin/python
** ComfyUI Path: /mnt/Nova/Apps/comfy
** Log path: /mnt/Nova/Apps/comfy/comfyui.log

Prestartup times for custom nodes:
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/rgthree-comfy
   1.3 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Manager

Total VRAM 24151 MB, total RAM 64200 MB
pytorch version: 2.5.0
xformers version: 0.0.28.post2
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3090 : cudaMallocAsync
Using xformers attention
[Prompt Server] web root: /mnt/Nova/Apps/comfy/web
### Loading: ComfyUI-Manager (V2.56.1)
### ComfyUI Version: v0.3.10-21-g0f11d60a | Released on '2025-01-01'
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/alter-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/model-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/github-stats.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/extension-node-map.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/custom-node-list.json
WAS Node Suite: OpenCV Python FFMPEG support is enabled
WAS Node Suite Warning: `ffmpeg_bin_path` is not set in `/mnt/Nova/Apps/comfy/custom_nodes/was-node-suite-comfyui/was_suite_config.json` config file. Will attempt to use system ffmpeg binaries if available.
WAS Node Suite: Finished. Loaded 220 nodes successfully.

        "Creativity takes courage." - Henri Matisse

------------------------------------------
Comfyroll Studio v1.76 :  175 Nodes Loaded
------------------------------------------
** For changes, please see patch notes at https://github.com/Suzie1/ComfyUI_Comfyroll_CustomNodes/blob/main/Patch_Notes.md
** For help, please see the wiki at https://github.com/Suzie1/ComfyUI_Comfyroll_CustomNodes/wiki
------------------------------------------
Searge-SDXL v4.3.1 in /mnt/Nova/Apps/comfy/custom_nodes/SeargeSDXL
[comfyui_controlnet_aux] | INFO -> Using ckpts path: /mnt/Nova/Apps/comfy/custom_nodes/comfyui_controlnet_aux/ckpts
[comfyui_controlnet_aux] | INFO -> Using symlinks: False
[comfyui_controlnet_aux] | INFO -> Using ort providers: ['CUDAExecutionProvider', 'DirectMLExecutionProvider', 'OpenVINOExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider', 'CoreMLExecutionProvider']
### Loading: ComfyUI-Impact-Pack (V7.12)
### Loading: ComfyUI-Impact-Pack (Subpack: V0.8)
[Impact Pack] Wildcards loading done.

[rgthree-comfy] Loaded 42 fantastic nodes. 🎉

### Loading: ComfyUI-Inspire-Pack (V1.9.1)
Total VRAM 24151 MB, total RAM 64200 MB
pytorch version: 2.5.0
xformers version: 0.0.28.post2
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3090 : cudaMallocAsync
[Crystools INFO] Crystools version: 1.21.0
[Crystools INFO] CPU: AMD Ryzen 9 3950X 16-Core Processor - Arch: x86_64 - OS: Linux 6.11.8-1-default
[Crystools INFO] Pynvml (Nvidia) initialized.
[Crystools INFO] GPU/s:
[Crystools INFO] 0) NVIDIA GeForce RTX 3090
[Crystools INFO] NVIDIA Driver: 560.35.05
FizzleDorf Custom Nodes: Loaded
/mnt/Nova/Envs/comfy/lib/python3.11/site-packages/albumentations/__init__.py:24: UserWarning: A new version of Albumentations is available: 1.4.24 (you have 1.4.21). Upgrade using: pip install -U albumentations. To disable automatic update checks, set the environment variable NO_ALBUMENTATIONS_UPDATE to 1.
  check_for_updates()
[ReActor] - STATUS - Running v0.5.2-a1 in ComfyUI
Torch version: 2.5.0
All packages from requirements.txt are installed and up to date.
llama-cpp installed
All packages from requirements.txt are installed and up to date.
(pysssss:WD14Tagger) [DEBUG] Available ORT providers: TensorrtExecutionProvider, CUDAExecutionProvider, CPUExecutionProvider
(pysssss:WD14Tagger) [DEBUG] Using ORT providers: CUDAExecutionProvider, CPUExecutionProvider

🌊 DEPTHFLOW NODES 🌊


Depthcrafter Nodes Loaded


Import times for custom nodes:
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/websocket_image_save.py
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-APGScaling
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/sdxl_prompt_styler
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ControlNet-LLLite-ComfyUI
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Image-Selector
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_Noise
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/Vector_Sculptor_ComfyUI
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/Skimmed_CFG
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_AdvancedRefluxControl
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/lora-info
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/tiled_ksampler
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/stability-ComfyUI-nodes
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Inpaint-CropAndStitch
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/cg-image-picker
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfy-image-saver
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_TiledKSampler
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_SimpleTiles
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui-et_stringutils
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/masquerade-nodes-comfyui
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-WD14-Tagger
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-InstantX-IPAdapter-SD3
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_InstantID
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-AutomaticCFG
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/AuraSR-ComfyUI
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-IC-Light
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui-portrait-master
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-TiledDiffusion
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui-inpaint-nodes
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_IPAdapter_plus
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-DepthAnythingV2
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyMath
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_FizzNodes
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Custom-Scripts
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Florence2
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-APISR-KJ
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_essentials
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_UltimateSDUpscale
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Frame-Interpolation
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/x-flux-comfyui
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-LivePortraitKJ
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_InstantIR_Wrapper
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/rgthree-comfy
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-KJNodes
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-LTXVideo
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Advanced-ControlNet
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-DepthCrafter-Nodes
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-segment-anything-2
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_ExtraModels
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui_controlnet_aux
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-AnimateDiff-Evolved
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Inspire-Pack
   0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui-prompt-control
   0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui_segment_anything
   0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-VideoHelperSuite
   0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Manager
   0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Crystools
   0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_VLM_nodes
   0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-SUPIR
   0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/SeargeSDXL
   0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui_milehighstyler
   0.2 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Impact-Pack
   0.3 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-PhotoMaker-Plus
   0.3 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-CCSR
   0.3 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_Comfyroll_CustomNodes
   0.4 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-YALLM-node
   0.4 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-layerdiffuse
   0.4 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui-reactor-node
   0.4 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Inspyrenet-Rembg
   0.4 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Depthflow-Nodes
   1.5 seconds: /mnt/Nova/Apps/comfy/custom_nodes/was-node-suite-comfyui

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float16, manual cast: None
model_type FLOW
Using xformers attention in VAE
Using xformers attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
no CLIP/text encoder weights in checkpoint, the text encoder model will not be loaded.
Requested to load SD3ClipModel_
loaded completely 9.5367431640625e+25 10644.189453125 True
CLIP model load device: cuda:0, offload device: cpu, current: cuda:0, dtype: torch.float16
clip missing: ['text_projection.weight']
Shapes found: [[1, 465, 4096], [1, 2048]]
Requested to load SD3
loaded completely 9.5367431640625e+25 15366.797485351562 True
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 31/31 [01:10<00:00,  2.27s/it]
Requested to load AutoencodingEngine
loaded completely 9.5367431640625e+25 159.87335777282715 True
Prompt executed in 89.17 seconds

Other

The text encoder for Flux appears to always return a conditioning tensor of consistent length regardless of prompt length. Perhaps a similar implementation is possible for the SD3 family as well, since these models apparently do not handle elongated conditioning tensors gracefully.
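As a rough illustration of what I mean (a sketch only, operating post hoc on a [batch, tokens, channels] conditioning tensor; the target length of 154 and the zero-padding/truncation choices are my assumptions, not how Flux actually implements it):

import torch
import torch.nn.functional as F

def fix_token_length(cond: torch.Tensor, target_len: int = 154) -> torch.Tensor:
    """Pad or truncate the token dimension: [B, T, C] -> [B, target_len, C]."""
    tokens = cond.shape[1]
    if tokens > target_len:
        return cond[:, :target_len, :]                       # drop the excess tokens
    if tokens < target_len:
        return F.pad(cond, (0, 0, 0, target_len - tokens))   # zero-pad the tail
    return cond

print(fix_token_length(torch.randn(1, 465, 4096)).shape)     # torch.Size([1, 154, 4096])

Truncating this way obviously discards prompt information; padding or truncating at the tokenizer level, which is presumably what Flux does to always end up at 256 tokens, would be the cleaner solution.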

My initial tests suggest that this issue is not related to the specific model version/quantisation or to the precision selected in the diffusion model loader node, and it is not related to the VAE as I initially suspected. It is reproducible across multiple aspect ratios, and the artifacts may appear along the top and left, or bottom and right, edges. Using only the CLIP encoders (without T5) does not appear to cause these issues.

Here are the output shapes of SDXL, SD3 and Flux using clip_l and t5 for encoding, where the two outputs per model correspond to the short prompt and the long prompt respectively:

Requested to load SDXLClipModel
loaded completely 9.5367431640625e+25 1560.802734375 True
Shapes found: [[1, 77, 2048], [1, 1280]]
Shapes found: [[1, 231, 2048], [1, 1280]]
Prompt executed in 4.06 seconds
got prompt
CLIP model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
clip missing: ['text_projection.weight']
Requested to load SD3ClipModel_
loaded completely 9.5367431640625e+25 4777.53759765625 True
Shapes found: [[1, 154, 4096], [1, 2048]]
Shapes found: [[1, 465, 4096], [1, 2048]]
Prompt executed in 7.48 seconds
got prompt
CLIP model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
loaded completely 9.5367431640625e+25 4777.53759765625 True
Shapes found: [[1, 256, 4096], [1, 768]]
Shapes found: [[1, 256, 4096], [1, 768]]
Prompt executed in 2.36 seconds

(I think it is possible that SD3/SD3.5 models were never intended to work with very long prompts; architecturally there is no constraint on the length of the token sequence, so when given long prompts the model simply behaves strangely. This may be an issue on Stability's side, but it is still a bit odd that only SD3 family models appear to have this problem.)

Mithrillion added the Potential Bug label on Jan 3, 2025
Mithrillion changed the title from "Triple Clip Encoding by CLIP Text Encode for SD3 Family models Might be Misconfigured" to "Triple Clip Encoding by CLIP Text Encode for SD3 Family models Might Not Work with Long Prompts" on Jan 4, 2025