Expected Behavior
When encoding text, the encoder should output conditioning tensors of a consistent size, as specified by the technical papers. For SD3/SD3.5, the text embedding shape should be [1, 154, 4096], which is what a short prompt produces. There appears to be a limit to how long the token dimension can grow before the models break.
Actual Behavior
When the prompt is very long, the CLIP Text Encode node appears to concatenate the individual embeddings directly, leading to unusual embedding shapes such as [1, 465, 4096]. When such a tensor is passed to SD3.5, the prompt still functions, but the resulting images contain edge artifacts. Slightly overshooting 154 tokens appears to be fine, but there seems to be a limit to how high the token count can go without causing issues: around the 400-token mark, broken edges start to appear on some generated images.
glitched:
normal:
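A toy illustration (not ComfyUI's actual code) of the behavior described above: if each 77-token chunk is encoded separately and the per-chunk embeddings are concatenated along the sequence axis instead of being padded/cut to a fixed length, the sequence dimension grows with prompt length. The `encode` helper and the chunk size of 77 are assumptions for illustration only.

```python
import numpy as np

CHUNK, DIM = 77, 4096  # assumed per-chunk context length; SD3 joint embedding width

def encode(n_tokens: int) -> np.ndarray:
    """Toy chunked encoder: one [1, 77, DIM] block per 77-token chunk,
    concatenated along the sequence axis (what the node appears to do)."""
    n_chunks = -(-n_tokens // CHUNK)  # ceil division
    return np.concatenate([np.zeros((1, CHUNK, DIM))] * n_chunks, axis=1)

print(encode(30).shape)   # one chunk   -> (1, 77, 4096)
print(encode(200).shape)  # three chunks -> (1, 231, 4096)
```

Under this model, any prompt longer than the expected budget produces a sequence length that is a multiple of 77 rather than the fixed 154 the model was trained with.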
Steps to Reproduce
The reproducing workflow (normal): sd35_glitch.json
To produce the glitch, add the following sample prompt to the positive prompt text box: best quality, double exposure, realistic, whimsical, fantastic, splash art, intricate detailed, hyperdetailed, maximalist style, psychedelic, post-apocalyptic, photorealistic, concept art, sharp focus, harmony, serenity, tranquility, mysterious glow, ambient occlusion, halation, cozy ambient lighting, dynamic lighting,masterpiece, liiv1, linquivera, metix, mentixis, excellent composition, finest details, highest aesthetics, strong muted Highlighter of the turquoise ocean around, Blue hue, mystical glowing, best quality sharp focus, high contrast, stylized, clear, colorful, surreal, ultra quality, 8k, best quality, a breathtaking masterpiece, award winning
Observe the different conditioning tensor shape in the console.
Debug Logs
[START] Security scan
[DONE] Security scan
## ComfyUI-Manager: installing dependencies done.
** ComfyUI startup time: 2025-01-03 11:52:31.457522
** Platform: Linux
** Python version: 3.11.10 | packaged by conda-forge | (main, Oct 16 2024, 01:27:36) [GCC 13.3.0]
** Python executable: /mnt/Nova/Envs/comfy/bin/python
** ComfyUI Path: /mnt/Nova/Apps/comfy
** Log path: /mnt/Nova/Apps/comfy/comfyui.log
Prestartup times for custom nodes:
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/rgthree-comfy
1.3 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Manager
Total VRAM 24151 MB, total RAM 64200 MB
pytorch version: 2.5.0
xformers version: 0.0.28.post2
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3090 : cudaMallocAsync
Using xformers attention
[PromptServer] web root: /mnt/Nova/Apps/comfy/web
### Loading: ComfyUI-Manager (V2.56.1)
### ComfyUI Version: v0.3.10-21-g0f11d60a | Released on '2025-01-01'
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/alter-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/model-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/github-stats.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/extension-node-map.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/custom-node-list.json
WAS Node Suite: OpenCV Python FFMPEG support is enabled
WAS Node Suite Warning: `ffmpeg_bin_path` is not set in `/mnt/Nova/Apps/comfy/custom_nodes/was-node-suite-comfyui/was_suite_config.json` config file. Will attempt to use system ffmpeg binaries if available.
WAS Node Suite: Finished. Loaded 220 nodes successfully.
"Creativity takes courage." - Henri Matisse
------------------------------------------
Comfyroll Studio v1.76 : 175 Nodes Loaded
------------------------------------------
**For changes, please see patch notes at https://github.com/Suzie1/ComfyUI_Comfyroll_CustomNodes/blob/main/Patch_Notes.md
**For help, please see the wiki at https://github.com/Suzie1/ComfyUI_Comfyroll_CustomNodes/wiki
------------------------------------------
Searge-SDXL v4.3.1 in /mnt/Nova/Apps/comfy/custom_nodes/SeargeSDXL
[comfyui_controlnet_aux] | INFO -> Using ckpts path: /mnt/Nova/Apps/comfy/custom_nodes/comfyui_controlnet_aux/ckpts
[comfyui_controlnet_aux] | INFO -> Using symlinks: False
[comfyui_controlnet_aux] | INFO -> Using ort providers: ['CUDAExecutionProvider', 'DirectMLExecutionProvider', 'OpenVINOExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider', 'CoreMLExecutionProvider']
### Loading: ComfyUI-Impact-Pack (V7.12)
### Loading: ComfyUI-Impact-Pack (Subpack: V0.8)
[ImpactPack] Wildcards loading done.
[rgthree-comfy] Loaded 42 fantastic nodes. 🎉
### Loading: ComfyUI-Inspire-Pack (V1.9.1)
Total VRAM 24151 MB, total RAM 64200 MB
pytorch version: 2.5.0
xformers version: 0.0.28.post2
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3090 : cudaMallocAsync
[Crystools INFO] Crystools version: 1.21.0
[Crystools INFO] CPU: AMD Ryzen 9 3950X 16-Core Processor - Arch: x86_64 - OS: Linux 6.11.8-1-default
[Crystools INFO] Pynvml (Nvidia) initialized.
[Crystools INFO] GPU/s:
[Crystools INFO] 0) NVIDIA GeForce RTX 3090
[Crystools INFO] NVIDIA Driver: 560.35.05
FizzleDorf Custom Nodes: Loaded
/mnt/Nova/Envs/comfy/lib/python3.11/site-packages/albumentations/__init__.py:24: UserWarning: A new version of Albumentations is available: 1.4.24 (you have 1.4.21). Upgrade using: pip install -U albumentations. To disable automatic update checks, set the environment variable NO_ALBUMENTATIONS_UPDATE to 1.
check_for_updates()
[ReActor] - STATUS - Running v0.5.2-a1 in ComfyUI
Torch version: 2.5.0
All packages from requirements.txt are installed and up to date.
llama-cpp installed
All packages from requirements.txt are installed and up to date.
(pysssss:WD14Tagger) [DEBUG] Available ORT providers: TensorrtExecutionProvider, CUDAExecutionProvider, CPUExecutionProvider
(pysssss:WD14Tagger) [DEBUG] Using ORT providers: CUDAExecutionProvider, CPUExecutionProvider
🌊 DEPTHFLOW NODES 🌊
Depthcrafter Nodes Loaded
Import times for custom nodes:
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/websocket_image_save.py
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-APGScaling
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/sdxl_prompt_styler
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ControlNet-LLLite-ComfyUI
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Image-Selector
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_Noise
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/Vector_Sculptor_ComfyUI
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/Skimmed_CFG
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_AdvancedRefluxControl
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/lora-info
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/tiled_ksampler
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/stability-ComfyUI-nodes
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Inpaint-CropAndStitch
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/cg-image-picker
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfy-image-saver
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_TiledKSampler
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_SimpleTiles
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui-et_stringutils
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/masquerade-nodes-comfyui
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-WD14-Tagger
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-InstantX-IPAdapter-SD3
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_InstantID
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-AutomaticCFG
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/AuraSR-ComfyUI
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-IC-Light
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui-portrait-master
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-TiledDiffusion
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui-inpaint-nodes
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_IPAdapter_plus
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-DepthAnythingV2
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyMath
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_FizzNodes
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Custom-Scripts
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Florence2
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-APISR-KJ
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_essentials
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_UltimateSDUpscale
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Frame-Interpolation
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/x-flux-comfyui
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-LivePortraitKJ
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_InstantIR_Wrapper
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/rgthree-comfy
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-KJNodes
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-LTXVideo
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Advanced-ControlNet
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-DepthCrafter-Nodes
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-segment-anything-2
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_ExtraModels
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui_controlnet_aux
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-AnimateDiff-Evolved
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Inspire-Pack
0.0 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui-prompt-control
0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui_segment_anything
0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-VideoHelperSuite
0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Manager
0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Crystools
0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_VLM_nodes
0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-SUPIR
0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/SeargeSDXL
0.1 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui_milehighstyler
0.2 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Impact-Pack
0.3 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-PhotoMaker-Plus
0.3 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-CCSR
0.3 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI_Comfyroll_CustomNodes
0.4 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-YALLM-node
0.4 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-layerdiffuse
0.4 seconds: /mnt/Nova/Apps/comfy/custom_nodes/comfyui-reactor-node
0.4 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Inspyrenet-Rembg
0.4 seconds: /mnt/Nova/Apps/comfy/custom_nodes/ComfyUI-Depthflow-Nodes
1.5 seconds: /mnt/Nova/Apps/comfy/custom_nodes/was-node-suite-comfyui
Starting server
To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float16, manual cast: None
model_type FLOW
Using xformers attention in VAE
Using xformers attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
no CLIP/text encoder weights in checkpoint, the text encoder model will not be loaded.
Requested to load SD3ClipModel_
loaded completely 9.5367431640625e+25 10644.189453125 True
CLIP model load device: cuda:0, offload device: cpu, current: cuda:0, dtype: torch.float16
clip missing: ['text_projection.weight']
Shapes found: [[1,465,4096], [1,2048]]
Requested to load SD3
loaded completely 9.5367431640625e+25 15366.797485351562 True
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 31/31 [01:10<00:00, 2.27s/it]
Requested to load AutoencodingEngine
loaded completely 9.5367431640625e+25 159.87335777282715 True
Prompt executed in 89.17 seconds
Other
The text encoder for Flux appears to always return a conditioning tensor of consistent length regardless of prompt length. Perhaps a similar implementation is possible for the SD3 family as well, since the models apparently do not handle elongated conditioning tensors gracefully.
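As a rough sketch of the fixed-length idea suggested above (this is a hypothetical helper, not ComfyUI's or Flux's actual implementation), the conditioning tensor could be padded or truncated along the sequence axis to the 154-token shape the model expects. Note that truncation silently discards prompt content, so a real fix would more likely enforce the length inside the encoder itself.

```python
import numpy as np

TARGET = 154  # fixed SD3/SD3.5 conditioning sequence length

def to_fixed_length(cond: np.ndarray, target: int = TARGET) -> np.ndarray:
    """Pad with zeros or truncate along the sequence axis so the
    conditioning tensor always has shape [batch, target, dim]."""
    seq = cond.shape[1]
    if seq >= target:
        return cond[:, :target, :]  # truncate (drops trailing tokens)
    pad = np.zeros((cond.shape[0], target - seq, cond.shape[2]), cond.dtype)
    return np.concatenate([cond, pad], axis=1)

print(to_fixed_length(np.zeros((1, 465, 4096))).shape)  # (1, 154, 4096)
print(to_fixed_length(np.zeros((1, 100, 4096))).shape)  # (1, 154, 4096)
```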
My initial tests suggest that this issue is not related to the specific model version/quantisation or the precision selected in the diffusion model loader node, and it is not related to the VAE as I initially suspected. It is reproducible across multiple aspect ratios, and the artifacts may appear along the top & left or bottom & right edges. Using only the CLIP encoders without T5 does not appear to cause these issues.
Here are the output shapes of SDXL, SD3 and Flux using clip_l and t5 for encoding, where the two outputs per model correspond to long prompt and short prompt respectively:
(I think it is possible that the SD3/SD3.5 models were never intended to work with very long prompts, but architecturally there is no constraint limiting the length of the token sequence, so when given long prompts the model behaves erratically. This issue may be on Stability's side, but it is still a bit strange that apparently only SD3-family models have this problem.)
Mithrillion changed the title from "Triple Clip Encoding by CLIP Text Encode for SD3 Family models Might be Misconfigured" to "Triple Clip Encoding by CLIP Text Encode for SD3 Family models Might Not Work with Long Prompts" on Jan 4, 2025.