Wan2.2 i2v (clarifications needed regarding settings on low vram system)

hey fellas, how are you all doing?

i’ll get right to the point. im new to i2v but after a few weeks of trial and error im starting to get the hang of things. there is a lot of information available out there but my god is it contradictory. sometimes ive gotten better results by doing the exact opposite of what a lot of people absolutely swear by. it doesnt help that im on a very modest entry level setup (8gb 4060 laptop). a lot of forum posts and articles seem to be written with heftier setups in mind. so i think i have reached the limit of what i can achieve through personal experimentation and following advice not designed for my quantized setup. things like “oh, just increase cfg for more obedience” while i increased gradually one decimal at a time with the same exact seed and prompt till cfg3 and saw absolutely ZERO difference. my render time is way too slow to effectively do micro tweaks and get any real results in an acceptable time frame. (im working on getting a 24gb setup in about 4 months but i dont wanna sit idle in the meantime)

“I’m running a highly optimized Wan I2V setup. It’s a GGUF-based workflow using the wan2.2-rapid-aio weights. quantized Q4_K GGUF. I’ve got SageAttention and BlockSwapping enabled to handle the VRAM load.”

i gotta be honest, ive gotten very good results on occasion, even when it comes to very specific things. my problem is consistency. i will work on one picture for an entire day through trial and error, then as soon as its a slightly different picture i have to start from scratch. im using sa_solver (beta) 4 steps, 1.0cfg, denoise 0.6, sd3 shift 8. i know this sounds ridiculous but i swear of all the things ive tried this is the only one that gets me any results so far (and quite quickly as well). the reason i use ksampler and not ksampler advance is because when i used wan2.1 the video was exactly as the source picture but after transitioning to 2.2 the video was way blurrier , less vivid, less sharp, felt less hd and the general hue is much more “reddish”. the denoise option set to 0.6 helps with all that

i just wanna know what should be my settings starting point and what should i be gradually increasing to see improvement. the most important aspect for me is face consistency and obedience to prompts. remember, i have an 8gb card so things start to turn into a deep fried lsd fever dream with cfg past 3. would you suggest any additional nodes to my setup? different sampler? different settings? my goal is to be 100% authentic to the source image. no embellishment from ai. no scifi themes or fantasy or anything like that. i basically wanna make the picture move, thats it. if im in a scenario where the ai isnt obeying prompts what should i do? (other than “try different prompts”. cause im trying to isolate how far i can push the ai before refining prompts. i wanna get a good baseline first)

please give me some pointers for my specific setup and goals, thank you very much in advance

4-step variation might not be suitable for your purpose:


Wan2.2 I2V on 8GB VRAM: practical baseline for source-faithful animation

For your exact goal — make the picture move, keep the same face, keep the same identity, keep the same lighting/background/clothing, and avoid AI embellishment — I would not tune this like a normal high-VRAM Wan2.2 setup.

Your current results are not strange:

  • CFG from 1 to ~3 doing almost nothing is plausible in a 4-step Rapid/Lightning-style workflow.
  • CFG above ~3 turning the output into overcooked chaos is also plausible.
  • Denoise around 0.6 helping sharpness/color/source fidelity is not ridiculous.
  • Different source images needing different settings usually means the workflow has too many interacting variables: GGUF quantization, Rapid/distilled weights, sampler, scheduler, shift, text encoder quality, VAE, offloading, source-image difficulty, and the Wan2.2 High/Low-noise expert split.

The core point:

Do not treat CFG as the main “obedience knob” in your setup.
For 8GB VRAM + GGUF + 4-step Rapid/Lightning-style I2V, CFG is a small final adjustment, not the steering wheel.

The knobs I would tune first are:

  1. source image quality / crop
  2. denoise
  3. motion size
  4. shift
  5. Low-noise step count / Low-noise quantization
  6. sampler branch
  7. text encoder quantization
  8. CFG last

Useful references:


1. Why your current setup is hard to tune

You are not simply running “Wan2.2.” You are running a stacked compromise:

Wan2.2-style I2V
+ Rapid/AIO or distilled behavior
+ GGUF quantization
+ Q4-class compression
+ 4-step sampling
+ SageAttention
+ BlockSwap/offload
+ 8GB laptop VRAM
+ denoise below 1.0
+ SD3 shift
+ image conditioning

That matters because one setting can appear useless when another part of the stack is dominating.

For example, CFG may appear to do nothing because:

  • the model was distilled/merged for CFG 1
  • 4 steps are too few for CFG to gradually steer the output
  • image conditioning dominates the text
  • the negative prompt is weak or mostly inactive at CFG 1
  • quantization reduces sensitivity to small guidance changes
  • the sampler/scheduler/shift combination matters more than CFG
  • the High/Low-noise split is doing more than the text guidance

Some Rapid/AIO model cards explicitly say their models are intended for CFG 1 and 4 steps. See the WAN2.2 Rapid All-in-One model card. Wan2.2-Lightning similarly describes a 4-step distilled path, so it should not be tuned like a normal 20–30 step diffusion workflow. See Wan2.2-Lightning.

So your observation — “CFG 1 to 3 did nothing, then above 3 broke everything” — is consistent with this kind of workflow.


2. The most important Wan2.2 idea: High-noise vs Low-noise experts

Wan2.2 A14B uses a Mixture-of-Experts style denoising structure. The official Wan2.2 repo describes MoE as separating the denoising process across timesteps with specialized expert models. See Wan2.2 official GitHub.

In practical I2V terms:

Part Mostly affects If weak/wrong, you may see
High-noise expert broad motion, layout, pose, composition, camera direction scene drift, pose weirdness, motion chaos, composition changes
Low-noise expert face detail, eyes, mouth, skin, clothing texture, color, final sharpness face melting, blur, color shift, unstable eyes/mouth, loss of likeness

For your goal, Low-noise behavior is extremely important.

If the face changes, the first fix is usually not “raise CFG.” More likely fixes are:

  • lower denoise
  • reduce the requested motion
  • add more Low-noise steps
  • use a better Low-noise quant if possible
  • check the VAE
  • crop/use a clearer source face
  • avoid cinematic/camera-heavy prompts
  • avoid LoRAs until the baseline is stable

WanMoeKSampler is relevant if you are using separate High/Low Wan2.2 A14B models. Its README says it is designed for Wan2.2 A14B-style MoE workflows and avoids manually guessing the High-to-Low switch point. See WanMoeKSampler.


3. Best starting point for your actual goal

Your goal is not “maximum cinematic transformation.” Your goal is:

same person
same face
same identity
same clothing
same lighting
same background
small natural movement
static camera
no embellishment

So I would start conservative.

Recommended baseline for your current Rapid/AIO-style setup

Sampler: sa_solver / beta, if that is your current most reliable branch
Steps: 4
CFG: 1.0
Denoise: 0.55–0.60
SD3 shift: 8 as current control, then test 5 and 6
Resolution: 512–640px long side while testing
Frames: 33–49 while testing
FPS: 12–16
Motion: subtle
Camera: static
LoRAs: none during baseline
Upscaling/interpolation: none during baseline
Face restore: none during baseline

This is not meant to be the final “best possible” setup. It is the control setup. You need a repeatable control before changing settings.


4. Do not micro-tweak CFG

On your hardware, micro-tweaking CFG by 0.1 is a bad use of time.

Instead of:

1.0
1.1
1.2
1.3
1.4
...

Use coarse tests:

CFG 1.0
CFG 1.5
CFG 2.0
CFG 2.5
CFG 3.0 only as a limit test

For your setup, I would treat CFG like this:

CFG Practical meaning
1.0 safest Rapid/Lightning-style baseline
1.5 mild text pressure
2.0 moderate text pressure
2.5 upper useful range to test
3.0 stress-test boundary
>3.0 likely to overcook identity, color, texture, or motion

If CFG 1.5–2.5 gives no meaningful obedience improvement, stop chasing CFG. The bottleneck is probably elsewhere.


5. Denoise is probably more important than CFG for you

For source-faithful I2V, denoise is one of the strongest identity controls.

Denoise Expected behavior
0.40–0.50 most faithful, least motion, may look stiff
0.50–0.60 best starting zone for “make the image move”
0.60–0.70 more motion, more identity risk
0.70+ more transformation, more AI invention

Since you already found 0.6 useful, I would not abandon it. I would test:

Denoise 0.50
Denoise 0.55
Denoise 0.60
Denoise 0.65

Pick the best identity/motion balance.

If the face changes:

lower denoise first
reduce motion second
add Low-noise steps third
only then try CFG changes

If there is no movement:

raise denoise slightly
make the action simpler and more literal
avoid cinematic wording

6. Shift: test coarse values only

Do not test tiny shift increments. Test meaningful jumps.

For your current setup:

Shift 5
Shift 6
Shift 8

The LightX2V Wan2.2 I2V working-guide discussion recommends:

Euler sampler
Simple scheduler
Shift 5
2 High steps
2 Low steps

Source: LightX2V Wan2.2 I2V working guide discussion

That does not automatically mean shift 5 is best for your current Rapid/AIO branch, but it is a strong branch to test.


7. Sampler advice

For your current Rapid/AIO branch

If sa_solver / beta / 4 steps / CFG 1 / denoise 0.6 / shift 8 is the only thing giving you usable results, keep it as the control.

Do not throw it away just because it sounds weird.

Rapid/distilled/merged models can have very specific intended recipes. The model card for the Rapid AIO family says the models are intended for CFG 1 and 4 steps, and different versions list different sampler recommendations. See WAN2.2 Rapid All-in-One.

For a Lightning-style branch

Test this separately:

Sampler: Euler
Scheduler: Simple
Steps: 4
CFG: 1.0
Shift: 5
Denoise: 0.55–0.60

That lines up with public LightX2V/Wan2.2-Lightning guidance. See Wan2.2-Lightning and the LightX2V working-guide discussion.

Compare this branch against your current sa_solver / beta control. Do not mix the two while testing.


8. Low-noise steps may help face consistency more than CFG

If your workflow exposes the High/Low split, test this before pushing CFG:

Test High steps Low steps Purpose
A 2 2 fastest 4-step baseline
B 2 4 more face/detail finishing
C 4 4 balanced reference
D 4 6 stronger finishing if time allows
E 6 4 more broad structure/motion

For your goal, I would test:

2 High / 2 Low
2 High / 4 Low
4 High / 4 Low

If 2/2 is blurry but 2/4 improves face/detail, that tells you the Low-noise stage was underpowered.


9. Quantization: Q4_K_M is not automatically best on 8GB

On paper, higher quantization quality is better. In practice, on an 8GB laptop GPU, a heavier quant can cause more offload pressure, swapping, instability, or unusable render times.

The QuantStack Wan2.2 I2V A14B GGUF repo lists approximate model sizes such as:

Q3_K_S: 6.52 GB
Q3_K_M: 7.18 GB
Q4_K_S: 8.75 GB
Q4_K_M: 9.65 GB
Q5_K_S: 10.1 GB
Q5_K_M: 10.8 GB
Q6_K: 12 GB
Q8_0: 15.4 GB

Source: QuantStack Wan2.2 I2V A14B GGUF

For an 8GB 4060 laptop, I would test:

Test High-noise Low-noise Why
A Q3_K_M Q3_K_M safest low-VRAM baseline
B Q4_K_S Q4_K_S better quality if stable
C Q3_K_M Q4_K_S prioritize face/detail
D Q4_K_S Q3_K_M prioritize structure/motion
E Q4_K_M Q4_K_M only if the above are stable

For your priority, I would try:

High-noise: Q3_K_M
Low-noise: Q4_K_S

before assuming:

High-noise: Q4_K_M
Low-noise: Q4_K_M

Why: Low-noise has more influence on final face detail, skin, eyes, mouth, color, and sharpness. If you can only “spend” quality somewhere, spend it on Low-noise first.


10. Text encoder quantization matters for prompt obedience

If prompt obedience feels weak, do not only blame CFG. The text encoder can matter too.

The city96 UMT5 XXL encoder GGUF card recommends Q5_K_M or larger for best results, while noting that smaller models may still be acceptable in resource-constrained situations. It lists Q3_K_M around 3.06GB, Q4_K_M around 3.66GB, and Q5_K_M around 4.15GB. See city96 UMT5 XXL encoder GGUF.

For your system:

UMT5 Q3_K_M: safest
UMT5 Q4_K_M: reasonable baseline
UMT5 Q5_K_M: better prompt understanding if RAM/offload behavior is tolerable

If CFG does not improve obedience, a better text encoder may help more than CFG micro-tweaks.


11. VAE check: important for color and softness

If Wan2.2 looks redder, softer, or less vivid than expected, check the VAE.

The official ComfyUI Wan2.2 guide distinguishes the model components for different workflows. The 14B I2V workflow uses separate High/Low I2V models and a Wan VAE component; the 5B TI2V workflow uses its own 5B model/VAE setup. See ComfyUI official Wan2.2 guide.

A VAE mismatch can show up as:

red/yellow color cast
soft decode
loss of vividness
skin tone shift
general haze
reconstruction blur

If color is your issue, test VAE/workflow correctness before trying to fix it with prompt words like “neutral color” or “no red tint.”


12. Source image quality matters more than people admit

For face consistency, the source image should have:

clear face
visible eyes
visible mouth
not too small in frame
not heavily compressed
not extreme side profile
not harsh shadow over one eye
not heavy motion blur
not strong fisheye distortion
not sunglasses covering identity
not hands blocking the face

A simple rule:

If the source face is small or unclear, the model has to invent face detail during motion.
When it invents face detail, identity changes.

For baseline testing, use a clean portrait or half-body image. You can do fancy shots later.


13. Prompt style for source-faithful animation

Use a boring prompt. Do not make it cinematic. Do not add style words. Do not describe a new scene.

Positive prompt baseline

A realistic image-to-video animation of the person in the source image. Preserve the exact same face, identity, hairstyle, clothing, colors, lighting, and background. The person makes only very subtle natural movement: slight breathing, a small blink, and minimal head movement. Static camera. No zoom. No scene change. Natural colors. Sharp facial details.

Negative prompt baseline

different person, face change, identity change, distorted face, warped eyes, asymmetrical eyes, deformed mouth, changing hairstyle, changing clothes, changing background, camera movement, zoom, scene change, fantasy, sci-fi, anime, painting, overexposed, oversaturated, red tint, blurry, low detail, melted face, extra teeth

Important: at CFG 1, the negative prompt may do very little. Judge negative prompting mostly at CFG 1.5–2.5.


14. Prompt obedience testing

Do not test obedience with complex motion first.

Bad obedience tests:

turns around
walks forward
raises both hands
laughs widely
talks
dances
camera orbits around the subject
wind blows hair dramatically

Good obedience tests:

one subtle blink
gentle breathing only
slight smile
very small head tilt
tiny eye movement

A model that cannot obey “one subtle blink” is not ready for “turns head, smiles, and raises hand.”

Better prompt wording

Instead of:

The woman turns her head and smiles at the camera while wind blows through her hair.

Use:

The person makes a very small natural smile while keeping the same face, same pose, same hairstyle, same clothing, same lighting, and same background. Static camera.

The second prompt gives the model less room to invent.


15. What to do when the model does not obey

First classify the failure.

Failure Likely cause First fix
prompt action ignored too few steps, weak text encoder, action too subtle, distilled limitation slightly raise denoise or simplify action
face changes denoise too high, Low-noise weak, source face unclear, motion too large lower denoise / add Low steps
red tint VAE/model/sampler/shift issue check VAE, test shift/sampler
blurry face Low-noise too weak, too few steps, low quant, low resolution add Low steps / better Low quant
background changes denoise too high, prompt invites scene change lower denoise / static camera prompt
too much motion denoise/CFG/shift too high, Rapid merge exaggeration lower denoise or reduce action
no motion denoise too low, prompt too static denoise +0.05

The order I would use:

1. Keep CFG at 1.0.
2. Make the action simpler and more literal.
3. Tune denoise: 0.50 / 0.55 / 0.60 / 0.65.
4. Test shift: 5 / 6 / 8.
5. Add Low-noise steps if available.
6. Improve Low-noise quantization if possible.
7. Test CFG 1.5 / 2.0 / 2.5.
8. Stop before CFG 3 if identity starts changing.

16. Recommended experiment matrix

Do not run huge matrices at full resolution. Use short clips first.

Keep these fixed:

same image
same seed
same prompt
same resolution
same frame count
same workflow branch

Matrix A — denoise

CFG: 1.0
Steps: 4
Shift: current value
Sampler: current best

Test:

0.50
0.55
0.60
0.65

Pick the best identity/motion balance.

Matrix B — shift

Use the best denoise from Matrix A.

Shift 5
Shift 6
Shift 8

Pick the best.

Matrix C — CFG

Use best denoise + best shift.

CFG 1.0
CFG 1.5
CFG 2.0
CFG 2.5
CFG 3.0 only as a limit test

Pick the highest CFG that does not alter identity.

Matrix D — High/Low steps

If available:

2 High / 2 Low
2 High / 4 Low
4 High / 4 Low

If face detail improves with more Low steps, you found a better lever than CFG.

Matrix E — quantization

If using separate GGUF High/Low models:

Q3_K_M High / Q3_K_M Low
Q3_K_M High / Q4_K_S Low
Q4_K_S High / Q4_K_S Low

Avoid assuming Q4_K_M is worth the offload cost on 8GB.


17. Additional nodes: what I would and would not add

Worth testing later: WanMoeKSampler

Use it if you are working with separate Wan2.2 A14B High/Low models.

Good for:

clean A14B High/Low workflows
reducing manual High/Low split guessing
debugging MoE transition behavior

Not a fix for:

bad source image
bad VAE
too much denoise
bad prompt
4-step model limitations

Source: WanMoeKSampler

Required for GGUF: ComfyUI-GGUF

Use the proper GGUF loader rather than treating GGUF like a normal checkpoint. The ComfyUI-GGUF README says to replace the stock “Load Diffusion Model” with the “Unet Loader (GGUF)” node. See ComfyUI-GGUF.

Probably skip at 4 steps: CacheDiT

CacheDiT is more useful when you have enough steps to amortize the cache/warmup overhead. For Wan2.2 14B, its README says to use the dedicated Wan Cache Optimizer for best results with the MoE High/Low structure. See ComfyUI-CacheDiT.

My practical rule:

4 steps: skip CacheDiT
6–8 steps: probably skip unless testing
12–20 steps: consider CacheDiT

Useful but separate branch: Kijai WanVideoWrapper

Kijai’s wrapper is useful and often gets Wan-specific optimizations quickly. The official Wan2.2 repo lists it as an alternative implementation. See Wan2.2 official GitHub and Kijai ComfyUI-WanVideoWrapper.

But treat it as a separate branch. Do not change wrapper + sampler + LoRAs + resolution all at once.


18. Things I would avoid during baseline testing

Avoid:

720p
81+ frames
large camera movement
large head turns
talking/lip motion
multiple LoRAs
face restore nodes
interpolation while judging motion
upscaling while judging source fidelity
CFG above 3
high denoise
changing sampler + CFG + denoise together
testing tiny CFG increments

Especially avoid this kind of starting point:

8GB VRAM
A14B Q4_K_M
720p
81 frames
4-step Rapid/AIO
SageAttention
BlockSwap
multiple LoRAs
CFG above 3

That can produce occasional good clips, but it is a terrible learning baseline because too many variables are interacting.


19. Practical final recommendation

For your current setup, I would start here:

Sampler: sa_solver / beta if this is your current reliable branch
Steps: 4
CFG: 1.0
Denoise: 0.55–0.60
Shift: 8 as current control; test 5 and 6 separately
Resolution: 512–640px long side
Frames: 33–49
Prompt: subtle realistic motion only
Camera: static
LoRAs: none
Upscale/interpolation/face restore: off during testing

Then tune in this order:

1. Source image/crop
2. Denoise
3. Shift
4. Low-noise steps
5. Low-noise quantization
6. Text encoder quantization
7. CFG 1.5–2.5
8. Sampler branch
9. Extra nodes

My strongest advice:

Stop trying to solve source fidelity with CFG.
For your setup, source fidelity is mostly denoise + Low-noise behavior + source image quality + correct workflow components. CFG is only a small final pressure adjustment.


20. Short summary

  • Your current results are plausible for 8GB VRAM + GGUF + 4-step Rapid/Lightning-style I2V.
  • CFG is not the main control. Do not micro-tune it.
  • Use CFG 1.0 as baseline; test 1.5 / 2.0 / 2.5 only after denoise and shift.
  • Use denoise 0.50–0.60 for source-faithful animation.
  • If faces drift, lower denoise or improve Low-noise behavior before raising CFG.
  • If available, test 2 High / 4 Low and 4 High / 4 Low.
  • For 8GB VRAM, test Q3_K_M and Q4_K_S before assuming Q4_K_M is best.
  • If you can only improve one expert, improve Low-noise first for face/detail.
  • Use short 33–49 frame clips at 512–640px while testing.
  • Avoid 720p, long clips, multiple LoRAs, and post-processing until the baseline is stable.

Thank you for sharing.

before i posted i wanted to message you directly because ive seen you give great advice to so many people on the forum (until i realized you cant dm users here). thank you very much, youre doing gods work! i will try all your advice and take a good look at the links! much appreciated

i gave the low/high noise unets a try. at first it went from normal first 3 frames to complete blur (like 90% blur). then i fiddled with the settings and got it to remove the blur in the center of the image but sides still blurry. but the quality seems very poor. theres a weird pixelation, not digital like squares, its more like crosshatching. and it introduced a weird lighting artifact. strong yellow light flashing in the middle. any recommended base settings to start with? i started with cfg1 on both but it was a no go

To put it simply, there’s a suspicion that settings intended for a different model have gotten mixed in?:


Wan2.2 I2V-A14B High/Low UNets: blur, crosshatching, yellow flash — likely causes and clean baseline

Looking at the workflow screenshot, the problem is probably not mainly the prompt. It looks more like a sampling schedule / High-Low boundary / step count / VAE / distilled-vs-normal workflow mismatch problem.

The suspicious settings in the screenshot are:

HighNoise GGUF -> ModelSamplingSD3 shift 5.00 -> WanMoeKSampler model_high_noise
LowNoise GGUF  -> ModelSamplingSD3 shift 5.00 -> WanMoeKSampler model_low_noise

WanMoeKSampler:
  boundary: 0.750
  add_noise: enable
  steps: 6
  cfg_high_noise: 1.5
  cfg_low_noise: 2.0
  sampler_name: euler
  scheduler: simple
  sigma_shift: 4.00
  return_with_leftover_noise: disable

The short version:

The screenshot looks like a hybrid between a normal Wan2.2 High/Low UNet workflow and a 4-step Lightning/LightX2V-style workflow. That hybrid zone can easily cause heavy blur, side blur, crosshatching texture, and yellow lighting flashes.


1. Biggest issue: boundary = 0.750

For Wan2.2 I2V, boundary = 0.750 is the first thing I would change.

The WanMoeKSampler README says the Wan2.2 boundary is around:

Wan2.2 T2V: 0.875
Wan2.2 I2V: 0.900

It also explains that this boundary is a diffusion timestep, not a denoising step. The actual switch step depends on total steps, sampler, scheduler, and sigma shift.

So for Wan2.2 I2V, reset this:

boundary: 0.750

to this:

boundary: 0.900

Why this matters

Wan2.2 A14B uses separate denoising experts:

Expert Main job
High-noise expert early structure, broad layout, motion, pose, composition
Low-noise expert later detail, face, eyes, mouth, skin, color, texture, final sharpness

The Wan2.2 I2V-A14B model card describes this High-noise / Low-noise MoE design and the idea that the experts specialize in different denoising stages.

If the boundary is too low, the High-noise model can stay active too long and the Low-noise model may not get enough useful refinement time.

That can look like:

first frames look okay
then the clip turns blurry
center improves but sides remain mushy
fine texture looks scratchy/crosshatched
lighting becomes unstable
faces fail to refine

So the first clean correction is:

boundary: 0.900

2. Second issue: steps = 6 is too low for judging normal High/Low UNets

Six steps is very low for the normal Wan2.2 I2V-A14B High/Low model pair.

It can be useful as a quick smoke test, but it is not a fair quality test unless you are using a proper distilled / Lightning / LightX2V setup.

For the normal High/Low UNets, I would test:

steps: 12

If that is too slow on 8GB VRAM, use this only as a compromise:

steps: 8

But I would not judge the normal High/Low pair from 6 steps. At 6 steps, the Low-noise expert may simply not have enough time to resolve detail.

Symptoms of too few steps:

crosshatching texture
unfinished skin/detail
soft edges
side blur
poor face detail
color flicker
lighting pulses

3. Third issue: you may be applying shift twice

The screenshot shows:

ModelSamplingSD3 shift: 5.00

before both models, plus:

WanMoeKSampler sigma_shift: 4.00

inside the WanMoeKSampler.

While debugging, that is too ambiguous. Use one source of shift only.

Recommended cleanup

For the first stable baseline, remove the two ModelSamplingSD3 nodes:

HighNoise GGUF -> WanMoeKSampler model_high_noise
LowNoise GGUF  -> WanMoeKSampler model_low_noise

Then set this inside WanMoeKSampler:

sigma_shift: 5.0

This gives you one clear place controlling the shift.

Why 5.0? The LightX2V Wan2.2 I2V model card recommends Euler with:

shift: 5.0
guidance_scale: 1.0

for its distilled branch. More importantly, 5.0 is also a sane first test value when cleaning up the graph.

The key point is:

Do not run ModelSamplingSD3 shift 5 plus WanMoeKSampler sigma_shift 4 while trying to diagnose artifacts.

After you get a stable baseline, you can test whether the external ModelSamplingSD3 nodes help. But they should not be part of the first diagnosis pass.


4. Fourth issue: CFG values are in the wrong middle zone

The screenshot uses:

cfg_high_noise: 1.5
cfg_low_noise: 2.0

That is neither a strict Lightning/LightX2V recipe nor a normal High/Low baseline.

You need to decide which branch you are testing.


Branch A — normal Wan2.2 I2V-A14B High/Low UNets

Use this branch if you are loading the normal HighNoise and LowNoise GGUFs without Lightning/LightX2V LoRAs.

In this branch, CFG 1.0 is usually too weak. CFG 1.0 is mostly a Rapid/Lightning/distilled habit, not a universal Wan2.2 setting.

Recommended baseline:

High model:
  Wan2.2 I2V-A14B HighNoise GGUF

Low model:
  Wan2.2 I2V-A14B LowNoise GGUF

Remove:
  ModelSamplingSD3 nodes before WanMoeKSampler

WanMoeKSampler:
  boundary: 0.900
  add_noise: enable
  steps: 12
  cfg_high_noise: 3.0
  cfg_low_noise: 3.0
  sampler_name: euler
  scheduler: simple
  sigma_shift: 5.0
  start_at_step: 0
  end_at_step: 10000
  return_with_leftover_noise: disable

VAE:
  wan_2.1_vae.safetensors

Test size:
  33 frames
  512-640px long side
  fixed seed

Disable during baseline:
  LoRAs
  upscalers
  interpolation
  face restore
  post-sharpening
  color correction

If 12 steps is too slow:

boundary: 0.900
steps: 8
cfg_high_noise: 3.0
cfg_low_noise: 3.0
sampler_name: euler
scheduler: simple
sigma_shift: 5.0

But treat 8 steps as a sanity test, not a final quality test.


Branch B — Lightning / LightX2V / distilled 4-step branch

Use this branch only if you are using matching Lightning/LightX2V I2V LoRAs or a proper distilled LightX2V setup.

The LightX2V Wan2.2 I2V card recommends:

Euler scheduler
shift: 5.0
guidance_scale: 1.0

It describes this as running without CFG. The README also says the distilled model is built for substantially fewer inference steps, specifically 4-step-style use.

Strict distilled baseline:

High model:
  compatible Wan2.2 I2V-A14B HighNoise model

Low model:
  compatible Wan2.2 I2V-A14B LowNoise model

LoRAs:
  matching I2V Lightning/LightX2V High LoRA
  matching I2V Lightning/LightX2V Low LoRA
  strength: 1.0 each

Remove:
  external ModelSamplingSD3 nodes during baseline

WanMoeKSampler:
  boundary: 0.900
  add_noise: enable
  steps: 4
  cfg_high_noise: 1.0
  cfg_low_noise: 1.0
  sampler_name: euler
  scheduler: simple
  sigma_shift: 5.0
  start_at_step: 0
  end_at_step: 10000
  return_with_leftover_noise: disable

VAE:
  wan_2.1_vae.safetensors

Test size:
  33 frames
  512-640px long side
  fixed seed

Do not mix this with the normal branch.

Bad hybrid zone:

normal High/Low GGUFs
+ no matching distilled LoRAs
+ 6 steps
+ CFG around 1-2
+ boundary 0.750
+ external shift 5
+ internal sigma_shift 4

That is exactly the kind of setup that can produce blur, crosshatching, and flashing.


5. VAE check: very important

For Wan2.2 14B I2V, check that you are using:

wan_2.1_vae.safetensors

The ComfyUI Wan2.2 docs and ComfyUI Wan2.2 examples point to wan_2.1_vae.safetensors for the 14B workflows.

A wrong or mismatched VAE can look like:

soft decode
general haze
yellow/red color cast
skin tone shift
center glow
lighting flash
poor reconstruction
blurred details

Do not try to fix a VAE mismatch with prompts like “no yellow light.” Fix the VAE first.


6. Artifact-by-artifact diagnosis

A. “First 3 frames normal, then 90% blur”

Most likely causes:

boundary too low
too few total steps
Low-noise expert starts too late
shift schedule conflict
wrong VAE
normal UNets being run like a distilled 4-step model

Fix order:

1. boundary: 0.900
2. remove external ModelSamplingSD3 nodes
3. sigma_shift: 5.0 inside WanMoeKSampler
4. VAE: wan_2.1_vae.safetensors
5. normal branch: steps 12, CFG 3.0 / 3.0
6. distilled branch: steps 4, CFG 1.0 / 1.0, matching LoRAs only

B. “Center improved but sides are still blurry”

Likely causes:

not enough Low-noise refinement
bad High/Low boundary
low step count
resolution/aspect stress
VAE softness
quantization/offload instability
post-processing or resize issue

Try:

33 frames only
512-640px long side
boundary 0.900
steps 12 if normal branch
correct VAE
no post nodes
no upscaler
no interpolation
no face restore

Also use clean dimensions. Examples:

512x288
576x320
640x360
640x384
384x640 for portrait

Avoid large or odd dimensions while debugging.


C. “Crosshatching texture, not square pixelation”

That usually means incomplete or unstable denoising, not classic video compression.

Most likely causes:

6 steps is too low
boundary is wrong
GGUF quantization is stressed
shift schedule is confused
Low-noise refinement is underpowered
VAE decode is wrong or mismatched

The QuantStack Wan2.2 I2V-A14B GGUF page lists approximate quant sizes such as:

Q2_K:    5.3 GB
Q3_K_S:  6.52 GB
Q3_K_M:  7.18 GB
Q4_K_S:  8.75 GB
Q4_K_M:  9.65 GB
Q5_K_S: 10.1 GB
Q5_K_M: 10.8 GB
Q6_K:   12 GB
Q8_0:   15.4 GB

On an 8GB laptop GPU, Q4_K_M can be theoretically better but practically worse if it causes too much offloading, swapping, or instability.

Low-VRAM quant tests:

Test A:
  High: Q3_K_M
  Low:  Q3_K_M

Test B:
  High: Q3_K_M
  Low:  Q4_K_S

Test C:
  High: Q4_K_S
  Low:  Q4_K_S

For face/detail fidelity, the most interesting test is:

High: Q3_K_M
Low:  Q4_K_S

Reason: the Low-noise model is the detail finisher.


D. “Strong yellow light flashing in the middle”

This is probably not a prompt issue.

Likely causes:

wrong VAE
double shift / schedule conflict
LightX2V LoRA trajectory mismatch
normal High/Low UNets using distilled settings
too few steps
bad High/Low boundary
quantization + low-step instability

Fix order:

1. confirm VAE = wan_2.1_vae.safetensors
2. remove external ModelSamplingSD3 nodes
3. boundary = 0.900
4. sigma_shift = 5.0
5. normal branch: 12 steps, CFG 3.0 / 3.0
6. distilled branch: 4 steps, CFG 1.0 / 1.0, matching LoRAs only
7. disable upscaler/interpolation/face restore
8. test 33 frames at 512-640px long side

A negative prompt can include yellow flash, but if the denoising path or VAE is wrong, the prompt will not reliably fix it.


7. What to check in the console

Check where WanMoeKSampler actually switches from High-noise to Low-noise.

Look for something equivalent to:

switching model at step X

Do not reason from boundary alone. The WanMoeKSampler README explains that diffusion timestep is not the same thing as denoising step.

For a 4-step distilled branch, you generally want something close to:

High: 2 steps
Low:  2 steps

For a normal 12-step branch, you want enough Low-noise steps left to refine detail. If Low-noise only gets a tiny part of the run, blur and poor texture are expected.


8. Text encoder check

If prompt obedience is weak, do not only raise CFG. Text encoder quantization can matter.

The city96 UMT5 XXL encoder GGUF page says Q5_K_M or larger is recommended for best results, while smaller models can still be acceptable in constrained setups.

Approximate sizes listed there include:

Q3_K_M: about 3.06 GB
Q4_K_M: about 3.66 GB
Q5_K_M: about 4.15 GB
Q8_0:   about 6.04 GB
F16:    about 11.4 GB

For an 8GB GPU setup:

UMT5 Q3_K_M:
  safest memory option

UMT5 Q4_K_M:
  good low-VRAM baseline

UMT5 Q5_K_M:
  better prompt understanding if system RAM/offload behavior allows it

Weak prompt obedience may be text-encoder-related, not just CFG-related.


9. Suggested prompt while debugging

Use a boring source-faithful prompt. Do not use cinematic lighting while debugging a yellow lighting artifact.

Positive

A realistic image-to-video animation of the person in the source image. Preserve the exact same face, identity, hairstyle, clothing, colors, lighting, and background. The person makes only very subtle natural movement: slight breathing, a small blink, and minimal head movement. Static camera. No zoom. No scene change. Natural colors. Sharp facial details.

Negative

different person, face change, identity change, distorted face, warped eyes, asymmetrical eyes, deformed mouth, changing hairstyle, changing clothes, changing background, camera movement, zoom, scene change, fantasy, sci-fi, anime, painting, overexposed, oversaturated, red tint, yellow flash, blurry, low detail, melted face, extra teeth, crosshatching, noisy texture

At CFG 1.0, the negative prompt may have little practical effect. It should matter more in the normal branch at CFG around 3.0.


10. Minimal troubleshooting plan

Run these in order. Change only one branch at a time.

Test 1 — normal High/Low sanity test

Remove:
  both ModelSamplingSD3 nodes

WanMoeKSampler:
  boundary: 0.900
  add_noise: enable
  steps: 12
  cfg_high_noise: 3.0
  cfg_low_noise: 3.0
  sampler_name: euler
  scheduler: simple
  sigma_shift: 5.0
  start_at_step: 0
  end_at_step: 10000
  return_with_leftover_noise: disable

VAE:
  wan_2.1_vae.safetensors

Video:
  33 frames
  512-640px long side
  fixed seed

Disable:
  LoRAs
  upscaler
  interpolation
  face restore
  postprocessing

If this improves blur/crosshatching/yellow flash, the previous issue was probably:

boundary too low
too few steps
CFG too low for normal branch
shift conflict
VAE mismatch

Test 2 — cheaper normal-branch sanity test

If 12 steps is too slow:

Same as Test 1, but:

steps: 8

If 8 looks bad but 12 improves, the issue is mainly under-refinement.


Test 3 — strict Lightning/LightX2V branch

Only use this if you are using matching I2V Lightning/LightX2V LoRAs or a proper distilled LightX2V setup.

Use:
  matching I2V Lightning/LightX2V LoRAs
  LoRA strength: 1.0 each

Remove:
  both ModelSamplingSD3 nodes

WanMoeKSampler:
  boundary: 0.900
  add_noise: enable
  steps: 4
  cfg_high_noise: 1.0
  cfg_low_noise: 1.0
  sampler_name: euler
  scheduler: simple
  sigma_shift: 5.0
  start_at_step: 0
  end_at_step: 10000
  return_with_leftover_noise: disable

VAE:
  wan_2.1_vae.safetensors

Video:
  33 frames
  512-640px long side

If this still has yellow flashing, suspect:

wrong LoRA pair
T2V LoRA used in I2V
High/Low LoRAs mismatched
wrong VAE
wrong model pair
double shift
workflow node mismatch

11. Recommended settings table

Scenario Boundary Steps CFG high CFG low Sampler Scheduler Shift Notes
Normal High/Low sanity baseline 0.900 12 3.0 3.0 Euler Simple 5.0 Best next test
Normal low-cost test 0.900 8 3.0 3.0 Euler Simple 5.0 Debug only
Strict Lightning/LightX2V 0.900 4 1.0 1.0 Euler Simple 5.0 Only with matching distilled LoRAs/model
Current screenshot 0.750 6 1.5 2.0 Euler Simple 4.0 + external 5.0 Likely unstable hybrid

12. Things I would avoid right now

Avoid this while debugging:

boundary 0.750
steps 6
CFG 1-ish on normal High/Low UNets
external ModelSamplingSD3 shift + WanMoe sigma_shift
720p
49+ frames before baseline works
upscale
interpolation
face restore
multiple LoRAs
cinematic lighting prompts
large camera movement

Especially avoid judging the workflow from this combination:

8GB VRAM
normal High/Low GGUFs
boundary 0.750
steps 6
CFG 1.5 / 2.0
external shift 5
internal sigma_shift 4
large frame count
large resolution

That setup has too many unstable variables.


13. Short answer

Your screenshot has three likely problems:

1. boundary should probably be 0.900, not 0.750, for Wan2.2 I2V
2. six steps is too low for judging the normal High/Low UNets
3. shift is likely being applied ambiguously through both ModelSamplingSD3 and WanMoeKSampler

My recommended reset for the normal High/Low branch:

Remove:
  ModelSamplingSD3 nodes

WanMoeKSampler:
  boundary: 0.900
  steps: 12
  cfg_high_noise: 3.0
  cfg_low_noise: 3.0
  sampler: euler
  scheduler: simple
  sigma_shift: 5.0
  add_noise: enable
  return_with_leftover_noise: disable

VAE:
  wan_2.1_vae.safetensors

Test:
  33 frames
  512-640px long side
  no LoRAs
  no upscale
  no interpolation
  no face restore

If you are actually using Lightning/LightX2V LoRAs, use the strict distilled branch instead:

boundary: 0.900
steps: 4
cfg_high_noise: 1.0
cfg_low_noise: 1.0
sampler: euler
scheduler: simple
sigma_shift: 5.0
LoRA strength: 1.0
VAE: wan_2.1_vae.safetensors

The blur, crosshatching, and yellow flash are most likely from:

wrong boundary
+ too few steps
+ ambiguous/double shift
+ possibly wrong VAE
+ possibly mixing normal and distilled recipes

14. Useful references

yup this did it. the two sd3’s were the culprits. you nailed it! its working now, no distortions, no artifacts, has better obedience and face permanence. now if i could only fix the quality part. everything has some kind of low def bloom, airbrushed, blended quality to it. this isnt just this specific setups issue. i noticed when i switched from [rapidWAN22I2VGGUF_q4KMRapidBase.gguf] to [wan2.2-i2v-rapid-aio-v10-nsfw-Q4_K.gguf] a week ago. [rapidWAN22I2VGGUF_q4KMRapidBase.gguf] basically kept true to the source image no matter what it was. even low res screengrabs. it just made whatever i fed it move. [wan2.2-i2v-rapid-aio-v10-nsfw-Q4_K.gguf] and the two low/high unets always gave me this weird dream sequence kind of bloom.

i tried the

A Q3_K_M Q3_K_M safest low-VRAM baseline

its stable , no oom, no hick ups. im gonna move forward and test the B option. any setting changes for options A to tweak in order to squeeze more juice out of it before i move to the next step?

i tried them all, unfortunately results were very poor. very ai slop looking, refused to follow complex prompts, halucinated, just not feasible for my setup. im just gonna have to go back to this humble but effective setup that worked surprisingly well. it literally takes any source image and it animates it staying faithful to the quality 1:1. everything else i tried was a bust. however obedience to prompts and sometimes face morph are very hit and miss based on see. one final question and then i promise i stop bothering you. how can i get the best results out of my setup (in the screenshot). since im sticking with this till i get a better gpu i wanna at least squeeze the most out of it. im 100% satisfied with the image quality, its literally like the picture came to life. i just need more obedience and adherence to prompts. and ensuring the face stays the same (thats the biggest issue sometimes. it loses face permanence) which ksampler advanced settings to tweak to get the best result? and finally, is there a free website or some other resource for prompt restructuring? i cant use ollama etc cause it takes too big a bite of vram inside comfy . is there anything you would suggest me to add to my setup? tyvm for all your help

(btw full name of unet is rapidWAN22I2VGGUF_q4KMRapidBase.gguf (cant see it fully in screenshot)

I’m glad the correct answer was included.:laughing:

Hmm… I think I’ve got a pretty good grasp of the situation now. By the way, distilled models like Lightning tend to struggle with accurately reflecting prompt details—especially negative prompts—but there’s still room for improvement. Their responsiveness to positive prompts is actually quite good. Also, if you’re looking for highly complex prompt responses, I think it’s worth considering other variations (if exist).

Distilled models are often created by retraining a model after drastically pruning it, but in the distilled version, parts that shouldn’t be pruned for your specific purpose might have been removed. Well, I guess it can’t be helped if the goal is to save VRAM… But in any case, this means you also have to consider the performance of the model itself—or rather, the inherent characteristics of the distilled model.

By the way, if you’ll use an LLM for prompt refinement, I think using the Gemini or ChatGPT API is the easiest way, but if you want to do it entirely locally, an OSS LLM might be better. For this purpose, I think a smaller model from a high-quality OSS model family is perfectly sufficient. The models provided by Liquid (which includes 1.2B or even 350M variant) run just fine locally on a CPU. Other SOTA models like Qwen 3.5 and Gemma 4 in the 4B class or smaller can also run on a CPU alone. A 4B model is a bit heavy for a CPU, but at least these don’t consume VRAM… they run on RAM. Of course, they’d be very faster if with VRAM!


Wan2.2 RapidBase I2V on 8GB VRAM: getting more prompt obedience without losing source-image fidelity

At this point I would stop chasing the normal High/Low UNet route for this GPU and use rapidWAN22I2VGGUF_q4KMRapidBase.gguf as the main workflow.

That is not a downgrade. For the actual goal here — make the source image look like it came to life while preserving the same face, same lighting, same color, same texture, same source quality, and no AI-looking bloom — this model is doing the right kind of thing. The normal High/Low route may be more flexible in theory, but on an 8GB card it is costing too much source fidelity.

The new goal should be:

Keep RapidBase.
Keep source-image fidelity.
Add only mild prompt pressure.
Reduce face morphing.
Avoid turning the workflow into a repainting/generative workflow.

Useful references:


1. Why RapidBase is the right baseline for this specific goal

The High/Low UNet experiment was still useful because it proved one thing: the duplicated SD3 shift setup really was causing artifacts. Removing those conflicting shift nodes fixed distortion and improved obedience/face permanence. But the second lesson is more important:

A technically cleaner High/Low workflow still did not give the desired look.

The preferred model, rapidWAN22I2VGGUF_q4KMRapidBase.gguf, behaves more like a source-preserving animator than a full generative video model. That is exactly why it works well for this use case.

It is good at:

keeping the source image quality
keeping low-res screengrabs looking like themselves
preserving lighting and colors
preserving background
avoiding the airbrushed Wan2.2 dream-sequence look
making the original picture move

It is weaker at:

complex multi-action prompts
large head turns
speaking / mouth motion
hand gestures
strong semantic obedience
large expression changes
camera moves

That tradeoff is expected. A workflow that preserves the source image 1:1 is not going to be as willing to invent new actions. More obedience usually requires more invention; more invention means more risk of face drift.

So the right strategy is not:

force the model to obey huge prompts

The right strategy is:

ask for one small action
add only mild prompt pressure
use seed batching
choose outputs by face permanence first

2. Current control setup

From the screenshot, the current effective workflow is roughly:

Model:
  rapidWAN22I2VGGUF_q4KMRapidBase.gguf

VAE:
  wan_2.1_vae.safetensors

Text encoder:
  umt5-xxl-encoder-Q8_0.gguf

KSampler Advanced:
  add_noise: enable
  steps: 10
  cfg: 1.0
  sampler_name: sa_solver
  scheduler: beta
  start_at_step: 1
  end_at_step: 10000
  return_with_leftover_noise: enable

Save this as the control workflow.

Do not overwrite it. Duplicate it before experiments.

Testing rule:

same image
same prompt
same seed
same frame count
same resolution
change one setting only

If you change CFG, steps, start step, sampler, and prompt at the same time, the result becomes impossible to interpret.


3. Why CFG should stay low

The Rapid/AIO family is explicitly described as a fast all-in-one merge designed around few steps and CFG 1. One README snapshot recommends:

4 steps
1 cfg
sa_solver sampler
beta scheduler

Source: Phr00t Rapid AIO README snapshot

That does not mean the exact best value for your workflow must be exactly 4 steps. Your screenshot already works at 10 steps. But it does mean this model should be tuned like a few-step distilled / rapid model, not like a normal 20-30 step diffusion workflow.

Do not jump to:

cfg: 3.0
cfg: 4.0
cfg: 5.0

That is likely to cause:

face drift
new skin texture
bloom
over-smoothing
changed lighting
new expression
hallucinated details

Use a micro-range instead.


4. CFG test range

Current baseline:

cfg: 1.0

Recommended test values:

1.00
1.15
1.25
1.35
1.50

Interpretation:

CFG Expected behavior
1.00 maximum source fidelity, weakest negative-prompt effect
1.15 tiny prompt pressure
1.25 likely first useful obedience bump
1.35 upper mild test
1.50 stress test for face drift
2.00+ probably too much if face permanence matters

The likely useful zone is:

cfg: 1.15-1.35

Rule:

Use the highest CFG that does not change the face.

Test like this:

Run A:
  cfg: 1.00

Run B:
  cfg: 1.15

Run C:
  cfg: 1.25

Run D:
  cfg: 1.35

Run E:
  cfg: 1.50

Keep everything else identical.

Judge in this order:

1. same face / same identity
2. same source-image quality
3. no morphing
4. no artifacts
5. prompt obedience
6. natural motion

Prompt obedience is not the first priority. A clip that obeys perfectly but changes the face is a failed clip for this workflow.


5. Negative prompts are weak at CFG 1

A common trap is adding a giant negative prompt and expecting it to control the output. In many few-step Wan/Rapid/Lightning-style workflows, CFG 1 means negative prompts are weak or mostly inactive.

The Wan prompting guide explains this directly: in standard diffusion, CFG above 1 gives the model a stronger positive-vs-negative comparison, but in few-step CFG 1 workflows, negative prompts often do little. See How to get the most out of prompts for WAN models.

Practical consequence:

Do not rely on a huge negative prompt.
Put the important preservation rules in the positive prompt.

Positive prompt should explicitly say:

same face
same identity
same hairstyle
same clothing
same lighting
same colors
same camera angle
same background
static camera
no zoom
no scene change
only subtle motion

A short negative prompt is still fine, but it is secondary.


6. start_at_step: test 1 vs 0

Current screenshot:

start_at_step: 1

This may be helping source fidelity. Starting at step 1 can skip a tiny early part of the denoising path, which may reduce repainting.

Test only:

start_at_step: 1
start_at_step: 0

Expected tradeoff:

Setting Likely benefit Risk
1 better source fidelity and face permanence weaker motion / weaker prompt response
0 more motion and prompt response more face drift / more repainting

Suggested test:

Run A:
  cfg: 1.25
  start_at_step: 1
  steps: 10

Run B:
  cfg: 1.25
  start_at_step: 0
  steps: 10

Possible decisions:

Result Keep
0 improves obedience and face stays stable start_at_step: 0
0 gives more motion but face changes start_at_step: 1
no meaningful difference start_at_step: 1
0 adds bloom/repainting start_at_step: 1

My expectation: start_at_step: 1 may remain the safest default.


7. Steps: test 8 / 10 / 12

Current setting:

steps: 10

This may already be close to the sweet spot.

Few-step distilled models do not always improve with more steps. Sometimes extra steps create more smoothing, blending, or repainting.

Test only:

steps: 8
steps: 10
steps: 12

Expected behavior:

Steps Likely behavior
8 faster, possibly more source-faithful, possibly weaker obedience
10 current working baseline
12 may improve smoothness/obedience, but may add bloom or airbrushing
16+ not recommended for this model unless intentionally stress-testing

Suggested test:

Run A:
  steps: 8

Run B:
  steps: 10

Run C:
  steps: 12

Keep the best balance. If 12 adds the “dream sequence” look, go back to 10.


8. return_with_leftover_noise: test once

Current screenshot:

return_with_leftover_noise: enable
end_at_step: 10000

Since end_at_step is far beyond the actual step count, the sampler is probably completing its pass. This setting may not matter much, but test it once.

Run A:
  return_with_leftover_noise: enable

Run B:
  return_with_leftover_noise: disable

Keep whichever preserves the “picture came to life” look.

Do not spend a whole day on this. It is unlikely to be the main obedience or face-permanence control.


9. add_noise: keep enabled

Keep:

add_noise: enable

For image-to-video, the model needs noise to create motion. If you disable it, you may get a more frozen output or odd behavior depending on the rest of the graph.

Only test add_noise: disable if diagnosing a very specific problem:

every seed changes the face
motion is always too aggressive
the image is being repainted too much

Even then, treat it as a diagnostic test, not the likely final setting.


10. Sampler and scheduler: keep sa_solver / beta

Your current best branch uses:

sampler_name: sa_solver
scheduler: beta

Keep that as the main branch.

The Rapid/AIO README snapshot specifically recommends sa_solver and beta for that family. Source: Rapid AIO README snapshot.

If you want to test alternatives, do it only after the CFG/start/steps tests, and keep them as separate branches:

Branch A:
  sa_solver / beta

Branch B:
  euler / beta

Branch C:
  euler_a / beta

Branch D:
  euler / simple

Expected behavior:

Sampler / scheduler Likely behavior
sa_solver / beta best current source-fidelity branch
euler / beta may obey differently, possibly less faithful
euler_a / beta more variation/motion, higher face-drift risk
euler / simple more relevant to Lightning/LightX2V-style workflows

I would not change sampler/scheduler unless the smaller tests fail.


11. Seed batching is now one of the strongest tools

You already noticed face morphing is seed-dependent. That is real.

In video generation, the seed affects:

eye behavior
mouth behavior
micro-expression
small head motion
whether face identity drifts
whether the source texture holds

Use two phases.

Phase A — setting tests

Use one fixed seed:

fixed seed
same image
same prompt
same resolution
same frame count
change one setting only

This tells you what the setting does.

Phase B — production seed search

After choosing settings, run:

8-16 seeds
same image
same prompt
same final settings
short preview first

Pick by this priority:

1. same face / same identity
2. same source-image quality
3. no morphing
4. natural motion
5. prompt obedience
6. no artifacts

For your goal, a seed that keeps the face and obeys 70% is better than a seed that obeys 100% and changes the person.


12. Exact tuning plan

Matrix 0 — save control

Model:
  rapidWAN22I2VGGUF_q4KMRapidBase.gguf

VAE:
  wan_2.1_vae.safetensors

Text encoder:
  umt5-xxl-encoder-Q8_0.gguf

KSampler Advanced:
  add_noise: enable
  steps: 10
  cfg: 1.0
  sampler_name: sa_solver
  scheduler: beta
  start_at_step: 1
  end_at_step: 10000
  return_with_leftover_noise: enable

Save this output as the reference.

Matrix 1 — CFG

cfg: 1.00
cfg: 1.15
cfg: 1.25
cfg: 1.35
cfg: 1.50

Pick the highest CFG that does not alter identity.

Matrix 2 — start step

Use the best CFG.

start_at_step: 1
start_at_step: 0

Keep 1 unless 0 clearly improves obedience without face drift.

Matrix 3 — steps

Use best CFG and best start step.

steps: 8
steps: 10
steps: 12

Keep the one with the least bloom/airbrushing and best face permanence.

Matrix 4 — leftover noise

Use best CFG/start/steps.

return_with_leftover_noise: enable
return_with_leftover_noise: disable

Keep the more source-faithful result.

Matrix 5 — seed batch

Use final settings.

8-16 seeds
short preview
same prompt
same image

Pick the seed by face permanence first.


13. Recommended presets

Preset A — safest source fidelity

Use when the face must stay the same.

Model:
  rapidWAN22I2VGGUF_q4KMRapidBase.gguf

VAE:
  wan_2.1_vae.safetensors

Text encoder:
  umt5-xxl-encoder-Q8_0.gguf

KSampler Advanced:
  add_noise: enable
  steps: 10
  cfg: 1.0
  sampler_name: sa_solver
  scheduler: beta
  start_at_step: 1
  end_at_step: 10000
  return_with_leftover_noise: enable

Use for:

portraits
faces
low-res screengrabs
source-quality preservation
subtle motion

Preset B — slightly more obedient

Same as Preset A, except:

cfg: 1.15

Then test:

cfg: 1.25

Stop if the face changes.

Preset C — stronger motion test

Same as Preset A, except:

start_at_step: 0
cfg: 1.15

If the face changes, return to:

start_at_step: 1

Preset D — smoothness test

Same as Preset A, except:

steps: 12

If it adds bloom or airbrushing, return to:

steps: 10

Preset E — faster seed scouting

Same as Preset A, except:

steps: 8
shorter frame count
lower test resolution

Use this only for finding seeds quickly, then rerun good seeds at normal settings.


14. Prompt strategy: one action only

This workflow needs simple prompts.

Bad prompt:

The person turns their head, smiles, raises their hand, looks into the camera, hair moves in the wind, camera slowly zooms in, cinematic lighting.

Why this is bad:

too many actions
requires new expression
requires new pose
requires new hair behavior
requires camera motion
invites lighting changes
increases face drift

Better prompt:

The same person from the source image gently blinks once. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.

Best rule:

one generation = one small action

Safe actions:

one subtle blink
gentle breathing
tiny natural smile
slight eye movement
very small head tilt

Risky actions:

speaking
laughing widely
turning head far
walking
dancing
raising hands
hair blowing strongly
camera zoom
camera orbit
lighting change

For this workflow, obedience improves when the requested action is simple enough that the model does not need to repaint the person.


15. Positive prompt templates

Since negative prompts are weak at CFG 1, put preservation constraints in the positive prompt.

Safe source-faithful template

The same person from the source image gently blinks once. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No pan. No scene change. Natural subtle motion. Sharp face.

Slightly more expressive template

The same person from the source image makes a tiny natural smile while gently breathing. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.

Minimal template

Same person, same face, same identity, same lighting and background. One subtle blink. Static camera.

Face permanence template

The same person keeps the exact same face and identity throughout the video. Only subtle natural breathing and one small blink. Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera.

The repetition of “same face” and “same identity” is not elegant, but it is useful conditioning.


16. Negative prompt template

Keep it short.

different person, face change, identity change, warped face, distorted eyes, changing hairstyle, changing clothes, changing background, camera movement, zoom, scene change, blurry face

Optional additions:

extra teeth, melted face, asymmetrical eyes, over-smoothed skin, airbrushed, bloom

Do not spend all your effort on negative prompting. At CFG 1, it may do very little. At CFG 1.15-1.35, it may help slightly, but positive prompt structure and seed selection matter more.

Reference: Wan prompting guide on CFG 1 / negative prompts


17. Handling complex prompts

The model refuses or hallucinates complex prompts because they ask for too many inventions at once.

A complex prompt often includes:

subject action
facial expression
body motion
camera motion
lighting change
background interpretation
style direction

That is too much for a source-faithful RapidBase workflow.

Instead of:

She turns to the camera, smiles, raises her hand, and the camera slowly zooms in.

Use separate clips:

Clip 1:
  same person gently blinks once

Clip 2:
  same person makes a tiny natural smile

Clip 3:
  same person slightly raises one hand, only if the hand is already visible

Do not ask for a hand raise if the hand is not clearly visible in the source image. If the model must invent a hand, it may also invent a new body or face.


18. Face permanence rules

Face permanence is mostly controlled by:

source image clarity
motion size
CFG
start_at_step
seed
prompt complexity
frame count
camera motion

Do:

use clear face images
keep motion small
use static camera
use one action only
keep CFG low
batch seeds
choose face permanence first

Avoid:

large head turns
speaking
wide smiles
looking away then back
hands crossing the face
camera movement
dramatic emotion
lighting changes
long clips before seed selection

The model is most likely to morph the face when asked for mouth/teeth motion, big expression changes, or head rotation. Blinks and breathing are much safer.


19. Should you add nodes?

Main recommendation:

Add almost nothing.

Your current workflow’s value is that it does not repaint too much. Extra nodes can easily destroy that.

Avoid adding during optimization:

face restore
style LoRAs
multiple LoRAs
high-strength LoRAs
upscalers before judging motion
interpolation before judging motion
color correction before judging model behavior

Upscale/interpolation should happen only after you choose:

prompt
seed
settings
motion
face permanence

20. Optional node: NAG

NAG is the one optional control idea that fits the problem.

Why it may help:

the model runs near CFG 1
negative prompts are weak
raising CFG can morph the face
NAG may add negative-prompt-like control without pushing CFG too hard

The ComfyUI-NAG README says NAG restores effective negative prompting in few-step diffusion models and can complement CFG. The NAG project page similarly describes NAG as a method for restoring negative prompting in few-step sampling.

How to test:

copy the workflow
add NAG only in the copy
keep CFG low
use the same seed and prompt
compare against the saved control

Remove it if it causes:

bloom
airbrushing
texture changes
face drift
loss of source quality

Do not make NAG part of the main workflow until it beats the control.


21. LoRAs: only one, only low strength

The Phr00t Rapid/AIO model card notes Wan 2.1 LoRA compatibility and low-noise Wan 2.2 LoRA compatibility, but warns against high-noise Wan 2.2 LoRAs for that family. See Phr00t WAN2.2 Rapid All-in-One.

If testing LoRAs:

one LoRA only
strength 0.15
strength 0.25
strength 0.35

Avoid:

1.0 strength
multiple LoRAs
style LoRAs
high-noise Wan2.2 LoRAs
character LoRAs unless necessary

For this workflow, LoRAs are more likely to hurt source fidelity than help, unless very targeted.


22. Free prompt restructuring resources

Do not run Ollama or a local LLM on the same GPU while using ComfyUI. On an 8GB card, that competes directly with Wan.

Use web tools or CPU-only local tools.

Free web options

Good enough:

ChatGPT Free
Google AI Studio / Gemini

References:

Use one batched request rather than many small requests.


23. Prompt rewriter request template

Paste this into ChatGPT, Gemini, or a local helper.

Rewrite this as a short Wan2.2 image-to-video prompt for a low-VRAM RapidBase workflow.

Rules:
- one small action only
- preserve exact face and identity
- preserve hairstyle, clothing, lighting, colors, camera angle, and background
- static camera
- no zoom
- no pan
- no scene change
- avoid cinematic embellishment
- avoid new details not visible in the source image
- keep it literal and short
- output exactly 3 versions:
  1. safest source-faithful version
  2. slightly more expressive version
  3. shortest version

Original idea:
<put idea here>

In normal prose, refer to the placeholder as <put idea here>. Inside code blocks, use raw <put idea here>.

This is better than asking “make the prompt better,” because “better” usually means more cinematic, more detailed, and more inventive — exactly what you do not want.


24. CPU-only local prompt helper

A local helper is optional.

Goal:

rewrite prompts
do not use GPU VRAM
do not compete with ComfyUI

A good tiny local option is LFM2.5-1.2B-Instruct-GGUF. LiquidAI’s docs explain that LFM models are available in GGUF format for llama.cpp-style use: LiquidAI llama.cpp deployment guide.

Example CPU-only server command:

llama-server \
  -hf LiquidAI/LFM2.5-1.2B-Instruct-GGUF:Q4_K_M \
  -c 2048 \
  -ngl 0 \
  --host 127.0.0.1 \
  --port 8080

Important part:

-ngl 0

The llama.cpp server docs expose GPU layer offload settings through ngl / GPU layer options; setting GPU layers to zero is the relevant CPU-only principle. See llama.cpp server README.

Recommended order:

1. ChatGPT Free or Gemini
2. LFM2.5-1.2B Q4_K_M CPU-only
3. Qwen 2B-4B CPU-only if you want smarter rewriting
4. larger local models only if you have spare CPU/RAM

25. Prompt helper system prompt

Use this as the system prompt in ChatGPT, Gemini, LFM, Qwen, or any prompt helper.

You are a prompt rewriting assistant for Wan2.2 image-to-video.

Rewrite the user's idea into a short, literal, source-faithful I2V prompt.

Rules:
- Use one small action only.
- Preserve the exact same face and identity.
- Preserve hairstyle, clothing, lighting, colors, camera angle, and background.
- Keep the camera static.
- No zoom.
- No pan.
- No scene change.
- No cinematic embellishment.
- No new objects.
- Avoid talking, dancing, walking, large head turns, and large expression changes.
- Prefer subtle motion: blink, gentle breathing, tiny smile, very small eye movement.

Output exactly:
1. Safest:
2. Slightly more expressive:
3. Shortest:

Do not explain.

Then give it:

Rewrite this idea for Wan2.2 I2V:

<your idea>

Example input:

make her look at the camera and smile a bit, maybe some hair movement

Expected output style:

1. Safest:
The same person from the source image gently blinks once and makes a tiny natural smile. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.

2. Slightly more expressive:
The same person from the source image looks naturally toward the camera and makes a very small smile. Preserve the same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Only subtle natural motion. Static camera.

3. Shortest:
Same person, same face and identity. One subtle blink and tiny smile. Static camera. Same lighting and background.

26. What I would do next

  1. Keep rapidWAN22I2VGGUF_q4KMRapidBase.gguf as the main branch.
  2. Save the current workflow as the control.
  3. Test CFG 1.00 / 1.15 / 1.25 / 1.35 / 1.50.
  4. Test start_at_step: 1 vs 0.
  5. Test steps: 8 / 10 / 12.
  6. Test return_with_leftover_noise: enable vs disable once.
  7. Use seed batches after choosing settings.
  8. Use one-action prompts.
  9. Put preservation constraints in the positive prompt.
  10. Try NAG only in a duplicate workflow if negative prompting remains weak.
  11. Use ChatGPT/Gemini or CPU-only LFM2.5 for prompt rewriting, not a GPU LLM inside ComfyUI.

Short summary

  • Keep rapidWAN22I2VGGUF_q4KMRapidBase.gguf; it matches the source-fidelity goal.
  • Keep sa_solver / beta as the main branch.
  • Do not chase CFG 3+.
  • Test CFG only in a tiny range: 1.00 / 1.15 / 1.25 / 1.35 / 1.50.
  • Test start_at_step: 1 versus 0.
  • Test steps: 8 / 10 / 12.
  • Use seed batches; face permanence is seed-sensitive.
  • At CFG 1, negative prompts are weak. Put identity/background/camera constraints in the positive prompt.
  • Use one small action per prompt.
  • Add almost nothing to the workflow. NAG is the only optional control node worth testing, and only in a copy.
  • For prompt rewriting, use ChatGPT Free, Gemini/AI Studio, or a CPU-only tiny model like LFM2.5.

thanks man, i been experimenting little by little as you said. i achieved close to perfect face permanence and obedience and nsfw without loras. its still trial and error and things start to break when i push with too many specific requests and details in one scene, its a balancing act but im getting there little by little. question : is there any way to formulate the prompts that would imply . do action A, after action A is done do action B? does it understand some kind of sequential instructions like these? also, what is your thoughts on (((weighted prompts:1.9))) for this? any effect whatsoever?

is there any way to formulate the prompts that would imply . do action A, after action A is done do action B?

There seem to be several methods available, but some of them are difficult to use in an 8GB VRAM environment:


Wan2.2 RapidBase I2V: sequential actions, prompt weights, and continuity-safe A → B workflows

Short answer:

Yes, you can write prompts like “do action A, then after A is done, do action B.” The model can understand that language. But in a normal single-prompt I2V workflow, that instruction is usually a soft temporal suggestion, not a reliable frame-accurate command.

For the current RapidBase workflow, the safest ranking is:

1. One-clip two-beat prompt
2. Two clips with a handoff frame
3. Neutral overlap + short crossfade
4. FLF2V bridge clip
5. Prompt Relay
6. Prompt Schedule / FizzNodes

For prompt weights:

Avoid:
  (((action:1.9)))

Prefer:
  (same face and identity:1.10)
  (preserve exact face:1.10)
  (static camera:1.10)
  (tiny natural smile:1.05)

The key rule is:

Weight preservation more than action.

The current workflow is working because it preserves the source image. Anything that pushes too hard toward complex action can also push the model into repainting, hallucination, or face drift.


1. Does the model understand “A, then B”?

It can understand the wording, but it does not necessarily execute it as an exact timeline.

A prompt like this is understandable:

The same person first blinks once, then after a brief pause makes a tiny natural smile.

But in a normal I2V generation, the text prompt conditions the whole clip. It is not automatically split into exact frame ranges like:

frames 0-16:
  action A

frames 17-33:
  action B

So the model may interpret “first A, then B” loosely.

Possible outcomes:

Prompt Possible model behavior
blink once, then smile blink and smile happen in the right order
blink once, then smile smile starts before the blink finishes
blink once, then smile only the smile happens
look down, then look back gaze drifts vaguely instead of following exact order
A then B then C one action is skipped or the face starts drifting

This is normal for a single-prompt video model. The model sees the whole instruction, but it is not a strict animation timeline unless you use timeline-control tools.

Useful background:


2. Best first method: one-clip two-beat prompting

For the current RapidBase workflow, this is the best first method.

It does not add nodes, LoRAs, bridge models, scheduling tools, or extra VRAM load. It also protects the main thing the current setup is good at:

same source image
same face
same lighting
same texture
same background
low hallucination

The limitation is that A → B order is only approximate.

Good use cases

Use one-clip two-beat prompts for small actions:

blink once -> tiny smile
gentle breathing -> blink once
look slightly downward -> return eyes to camera
tiny smile -> neutral expression
eyes shift slightly left -> eyes return to camera
neutral expression -> tiny smile

Bad use cases

Avoid large or multi-stage sequences:

turn head -> talk -> raise hand
walk forward -> gesture -> camera zooms in
look away -> laugh -> turn back
large smile -> speaking -> hair blowing
pose change -> lighting change -> background reaction

Each extra action increases the chance of:

face drift
changed mouth shape
changed eye shape
new lighting
new camera angle
background mutation
AI-looking repainting

3. Good wording for “after A is done, do B”

Use completion language, not just a loose list.

Weak:

blink and smile

Better:

first blinks once, then after the blink is complete, slowly forms a tiny natural smile

Good sequencing phrases:

first <action A>, then after a brief pause <action B>
after <action A> is complete, <action B>
begins still, then <action A>, then settles into <action B>
first holds a neutral expression, then gradually <action B>
after returning to neutral, <action B>

Avoid vague or overloaded phrasing:

blink and smile naturally
perform a sequence of expressions
react emotionally
do a cute expression
move seductively
act naturally

Vague words invite the model to improvise. Improvisation is where identity drift usually starts.


4. Practical one-clip formula

Use this structure:

[identity lock] + [starting state] + [action A] + [pause/settle] + [action B] + [camera lock] + [scene lock]

Example:

The same person from the source image keeps the exact same face and identity. The video begins with a calm neutral expression. First, the person gently blinks once. After the blink is complete, the person slowly forms a tiny natural smile. Preserve the same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No pan. No scene change.

This is better than:

she blinks then smiles

because it tells the model:

who must remain the same
what state to start from
what action comes first
what happens after
what must not change

5. One-clip two-beat prompt templates

Safest A → B template

The same person from the source image first blinks once, then after a brief pause makes a tiny natural smile. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No pan. No scene change. Subtle natural motion only.

More explicit timing template

The video begins with the same person holding still. First, the person gently blinks once. After the blink is complete, the person slowly forms a tiny natural smile. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No scene change.

Face-first template

Preserve the exact same face and identity throughout the video. The same person first blinks once, then after a brief pause makes a tiny natural smile. Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.

Short template

Same person, same face and identity. First one subtle blink, then a tiny natural smile. Static camera. Same lighting and background.

Very safe template

The same person keeps the exact same face and identity throughout the video. First, one small blink. Then, a tiny natural smile. Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera.

6. When a single prompt is not enough

If exact order matters, use two clips.

Do not do this:

Clip 1:
  original source image -> action A

Clip 2:
  original source image -> action B

That creates two independent clips from the same original starting point. Clip 2 does not know where Clip 1 ended.

Better:

Clip 1:
  original source image -> action A only

Handoff frame:
  clean stable frame near the end of Clip 1

Clip 2:
  handoff frame -> action B only

This is the most practical way to get reliable A → B ordering without adding complex nodes.


7. Handoff-frame workflow

Process

1. Generate Clip 1 with action A only.
2. Inspect the last 3-10 frames.
3. Do not blindly use the final frame.
4. Pick the cleanest stable frame:
   - best face
   - least blur
   - stable lighting
   - stable background
   - expression suitable for the next action
5. Save that frame as PNG.
6. Use it as the source image for Clip 2.
7. Prompt Clip 2 for action B only.
8. Keep settings consistent:
   - same resolution
   - same FPS
   - same VAE
   - same text encoder
   - same sampler
   - same scheduler
   - same CFG
   - same steps
   - same prompt style

Clip 1 example

The same person gently blinks once, then returns to a calm neutral expression. Preserve the same face, identity, lighting, clothing, colors, camera angle, and background. Static camera.

Clip 2 example

The same person begins from a calm neutral expression, then slowly forms a tiny natural smile. Preserve the same face, identity, lighting, clothing, colors, camera angle, and background. Static camera.

This is more reliable than trying to force a complex sequence into one prompt.


8. Neutral overlap and crossfade

If using two clips, make the join happen during a neutral moment.

Bad join:

Clip 1 ends during a blink.
Clip 2 starts with a smile.

Better join:

Clip 1 ends after returning to neutral.
Clip 2 starts from the neutral handoff frame.

If the join is slightly visible, use a short crossfade.

Typical overlap:

4-8 frames
same FPS
same resolution
same color settings
same encoding settings

FFmpeg example:

ffmpeg \
  -i clip1.mp4 \
  -i clip2.mp4 \
  -filter_complex "xfade=transition=fade:duration=0.25:offset=2.75" \
  -c:v libx264 -crf 18 -preset slow \
  output.mp4

offset must be adjusted to match the length of Clip 1 and the desired transition point.

Reference:

Important:

A crossfade can hide a small seam. It cannot fix a true face, lighting, or background mismatch.

If the face is different between the clips, a crossfade may create ghosting or a double-face dissolve.


9. FLF2V bridge clip

FLF2V means First-Last Frame to Video.

Instead of simply crossfading Clip A into Clip B, you provide:

first frame = stable end frame of Clip A
last frame  = stable start frame of Clip B

Then the model generates the transition between them.

Concept:

Clip A:
  source -> action A

Clip B:
  handoff/source -> action B

Bridge:
  first frame = stable end frame of Clip A
  last frame  = stable start frame of Clip B
  prompt      = smooth subtle transition, same face, same lighting, static camera

Why it can help:

more natural transition than crossfade
can reduce a sudden jump between two clips
uses actual visual endpoints

Why it may not be ideal for the current 8GB RapidBase workflow:

separate workflow family
may be heavier
may not preserve the same RapidBase look
may introduce bloom or airbrushed style
may require more setup and testing

Use FLF2V only if:

Clip A is good.
Clip B is good.
The join is visibly bad.
A simple crossfade is not good enough.
The bridge can be short.

References:


10. Prompt Relay

Prompt Relay is closer to the real solution for “A happens in one segment, B happens in another segment.”

Instead of relying on a single prompt, Prompt Relay routes different prompts through different temporal segments.

Concept:

Global prompt:
  same person, same face, same identity, same lighting, same background, static camera

Segment 1:
  blink once

Segment 2:
  tiny natural smile

Why it is attractive:

A and B happen inside one timeline
less independent-clip continuity drift
global identity/camera constraints can stay active
different segments can receive different action prompts

Why it should be treated carefully:

changes the workflow structure
may not plug cleanly into the current RapidBase GGUF workflow
may increase complexity
8GB behavior is uncertain
could break the source-fidelity look

Do not add it to the working workflow directly. Test only in a duplicate workflow.

References:


11. Prompt Schedule / FizzNodes

Prompt scheduling is the general concept of changing prompt conditioning over time.

Concept:

Frames 0-16:
  same person gently blinks once

Frames 17-33:
  same person slowly forms a tiny smile

Why it may help:

more explicit temporal control
frame/segment-based prompt changes
better than hoping a single prompt follows order

Why it is not the first recommendation here:

not guaranteed to fit the current RapidBase GGUF workflow
can change conditioning behavior
may break the current source-fidelity look
adds complexity

References:


12. Recommended method ranking

Rank Method A → B reliability Source fidelity 8GB friendliness Recommendation
1 One-clip two-beat prompt Medium High High Try first
2 Two clips + handoff frame High Medium-high High Best practical method
3 Two clips + neutral crossfade Medium-high Medium-high High Good polish
4 FLF2V bridge High for transition Medium Medium-low Separate experiment
5 Prompt Relay High conceptually Unknown Unknown Advanced experiment
6 Prompt Schedule / FizzNodes Medium-high conceptually Unknown Medium Experimental

Best practical rule:

simple A -> B:
  use one-clip two-beat prompt

strict A -> B:
  use two clips with a handoff frame

smooth transition:
  use handoff frame + optional crossfade

true timeline control:
  test Prompt Relay or Prompt Schedule only in a duplicate workflow

13. Prompt weights: do they work?

Yes, ComfyUI prompt weights can work.

Common syntax:

(phrase:1.2)

Plain parentheses also increase weight. ComfyUI’s CLIPTextEncode documentation says plain parentheses apply a default weight of 1.1, and the ComfyUI Community Manual says nested weights multiply.

Examples:

(phrase)
  roughly increases emphasis

(phrase:1.2)
  explicit weight

((phrase:1.2):0.5)
  nested weights multiply

References:


14. Is (((weighted prompts:1.9))) useful here?

Probably not. For this workflow, it is more likely to hurt than help.

Avoid:

(((turns head and smiles:1.9)))

Avoid:

(((first blinks then smiles:1.9)))

Avoid:

(((action A then action B:1.9)))

Why? Because a huge action weight tells the model:

This action matters more than preserving the source image.

That can cause:

face drift
changed facial geometry
changed skin texture
changed lighting
hallucinated details
mouth/teeth weirdness
background changes
overcooked motion
loss of source fidelity

The current RapidBase workflow works because it is conservative. Heavy action weights fight that.


15. Better prompt-weight strategy

Do not heavily weight the action. Lightly weight preservation.

Better:

The same person first blinks once, then after a brief pause makes a tiny natural smile. (Preserve the exact same face and identity:1.15). Same hairstyle, clothing, lighting, colors, camera angle, and background. (Static camera:1.10). No zoom. No scene change.

Risky:

The same person (((first blinks then smiles:1.9))). Same face and background.

The first prompt says:

identity and camera are important
action is small
do not repaint

The second prompt says:

force this action even if the model has to invent

For face permanence, that is the wrong priority.


16. Suggested weight ranges

Weight Use
1.00 normal baseline
1.05 tiny emphasis
1.10 safe emphasis
1.15 useful emphasis for identity/static camera
1.20 upper normal test
1.25 mild stress test
1.35 risky; use sparingly
1.50+ likely too strong
1.90 avoid for source-faithful I2V

For this workflow, use mostly:

1.05-1.20

Maybe test:

1.25

Avoid:

1.50+
1.90
triple-parentheses action forcing

17. What to weight

Good things to weight

(same face and identity:1.10)
(preserve exact face:1.10)
(preserve source image:1.10)
(static camera:1.10)
(no scene change:1.05)
(same lighting and background:1.10)
(tiny natural smile:1.05)
(one subtle blink:1.05)

Risky things to weight

(turns head:1.4)
(speaks:1.4)
(laughs widely:1.4)
(raises hand:1.4)
(hair blowing:1.4)
(camera zooms in:1.4)

Very risky

(((wide smile:1.9)))
(((speaking:1.9)))
(((turning head:1.9)))
(((complex action sequence:1.9)))

Large facial motion and mouth motion are exactly where face permanence usually breaks.


18. Weighted A → B examples

Safe weighted A → B prompt

The same person from the source image first blinks once, then after a brief pause makes a tiny natural smile. (Preserve the exact same face and identity:1.15). Same hairstyle, clothing, lighting, colors, camera angle, and background. (Static camera:1.10). No zoom. No pan. No scene change.

Slightly stronger action prompt

The same person begins still, then (gently blinks once:1.05), then slowly forms a (tiny natural smile:1.10). Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No scene change.

Face-first prompt

(Preserve the exact same face and identity:1.15). The same person first blinks once, then slowly forms a tiny natural smile. Same hairstyle, clothing, lighting, colors, camera angle, and background. (Static camera:1.10). No zoom. No scene change.

Minimal weighted prompt

(Same face and identity:1.15). First one subtle blink, then a tiny smile. Same lighting and background. (Static camera:1.10).

19. What not to do

Avoid this:

(((The person first blinks, then smiles, then turns their head, then speaks:1.9)))

That stacks three problems:

too many actions
too much weight
weighting the part that causes identity drift

Also avoid:

First she blinks, then smiles, then speaks, then turns her head, while the camera zooms in and the lighting becomes cinematic.

That asks the model to solve:

facial motion
mouth motion
head rotation
camera motion
lighting change
identity preservation
background stability

That is too much for a source-faithful RapidBase clip.


20. Practical A → B workflow

Step 1 — choose the smallest version of the action

Instead of:

turns head and smiles

use:

tiny eye movement and tiny smile

Instead of:

speaks

use:

subtle mouth movement

Instead of:

laughs

use:

tiny natural smile

Step 2 — write a two-beat prompt

The same person first <action A>, then after a brief pause <action B>. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No scene change.

Step 3 — add light preservation weights

(Preserve the exact same face and identity:1.15)
(Static camera:1.10)

Step 4 — batch seeds

8-16 seeds
same prompt
same settings
short preview

Pick by:

1. face permanence
2. correct order
3. natural motion
4. prompt obedience

Step 5 — split into two clips if order fails

If the model keeps blending A and B, use:

Clip 1:
  action A only

Handoff:
  clean stable frame near the end of Clip 1

Clip 2:
  action B only

21. Best practical recommendation

For the current RapidBase workflow:

Use one-clip two-beat prompts first.
Use "first A, then after a brief pause B."
Keep A and B very small.
Batch seeds.
Weight preservation, not action.
Avoid 1.9 weights.
Use handoff-frame two-clip generation when strict order matters.
Only test Prompt Relay / scheduling in a duplicate workflow.

22. Example final prompts

Blink → smile

The same person from the source image first blinks once, then after a brief pause makes a tiny natural smile. (Preserve the exact same face and identity:1.15). Same hairstyle, clothing, lighting, colors, camera angle, and background. (Static camera:1.10). No zoom. No pan. No scene change.

Look down → return gaze

The same person from the source image first looks slightly downward with only a tiny eye movement, then returns the eyes to the camera. (Preserve the exact same face and identity:1.15). Same hairstyle, clothing, lighting, colors, camera angle, and background. (Static camera:1.10). No zoom. No scene change.

Neutral → tiny smile

The video begins with the same person holding a calm neutral expression. Then the person slowly forms a tiny natural smile. (Preserve the exact same face and identity:1.15). Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No scene change.

Breathing → blink

The same person from the source image keeps the exact same face and identity. The person gently breathes with subtle natural motion, then blinks once after a brief pause. Same hairstyle, clothing, lighting, colors, camera angle, and background. Static camera. No zoom. No scene change.

23. Reference links

Wan prompting / CFG behavior

Rapid / AIO background

FLF2V / Wan2.2 workflows

Prompt Relay / timeline control

Prompt scheduling

Prompt weighting

Crossfade

i been experimenting using the methodology you suggested, working by exclusion. i gotta say, the results are inconclusive. it seems to be EXTREMELY “picture dependent”. like every picture works best with its unique set of settings. if anybody stumbling onto this post is interested, the most solid results are cfg 1/ steps 4/ sa_solver,beta/start at step 0

start at step 1 gives more prompt obedience but it also takes quite a few artistic liberties in some areas (it seems like it defaults on what it encountered in its training rather than staying true to the source image). the thing i would really like to solve with start at step 1 is locking the camera movement. for some reason it keeps doing cinematic zooms etc even tho both pos and neg prompts explicitly state not to

final question (for real this time). i figured i might have better luck with starting with i2i to get the base image i want and then feed that into i2v. since its just an image and not a video i figured my 8gb vram would fare much better and its probably easier to get what i want with i2i.

can you link a solid, obedient, uncensored workflow for i2i(realism)?

Since you have limited VRAM, the method using i2i is a bit more involved, but it looks quite promising. This is because generative AI tends to produce more predictable results when the task is more specific and constrained by specific data in advance.

workflow for i2i(realism)?

The SDXL model used in the explanation below is based on a slightly older standard, so please substitute it with a model that works well for you at the moment as you read:


Short answer

Yes: for this specific goal, I think i2i first → i2v second is probably the most sensible route.

The important mental split is:

i2i = still-image preproduction
i2v = subtle animation pass

I would not ask Wan2.2 I2V to do everything at once. I would use i2i to create the exact still image I want first, then ask I2V for only a tiny amount of motion.

The main reason: the actual goal is not “make a cinematic video.” The goal is more like:

same face
same identity
same lighting
same background
same camera angle
subtle natural motion
static camera
low VRAM

That is a source-fidelity problem more than a “maximum motion” problem.


Why this is difficult

“Make the picture move” sounds simple, but technically it is a difficult balancing act.

The model has to do two opposite things:

Change enough pixels to create motion.
Do not change enough pixels to change identity.

If it changes too little, the result is stiff or nearly static.

If it changes too much, the result becomes:

different face
different skin texture
different lighting
different expression
different background
camera drift
zoom
airbrushed / dreamlike repainting

This is why I2V feels unstable. It is not literally “puppeting” the input image. It is generating a video conditioned on the image and prompt. The image is a strong reference, but not a hard lock.

So the practical strategy is:

Do not give I2V a hard source image and hope settings fix it.
Make the source image easier first.
Then ask I2V for less.

My recommended production route

For this case, I would use this pipeline:

1. SDXL img2img realism prep
2. optional inpainting for small defects
3. Wan2.2 / Rapid-style I2V for tiny motion
4. seed batch
5. optional two-clip handoff for A → B actions
6. final upscale/interpolation only after the raw video is already good

The key is that each stage has one job.

Stage Job Why
i2i make the still image good easier and lighter than video
inpaint fix small local defects preserves the rest of the image
i2v add tiny motion avoids asking video model to redesign the image
seed batch find the lucky identity-preserving run often more reliable than forcing CFG
two-clip handoff handle A → B better than one overloaded prompt
postprocess final polish should not be used to rescue bad identity drift

Best actual i2i workflow for realism

The most solid i2i base is not a huge “everything workflow.” I would start with the simple official ComfyUI img2img pattern:

Load Image
→ VAE Encode
→ KSampler
→ VAE Decode
→ Save Image

The key setting is denoise.

ComfyUI’s image-to-image docs explain the basic rule clearly: lower denoise keeps the generated image closer to the reference image; higher denoise changes it more.

Useful link:

For this use case, that is exactly the kind of control you want. Video settings are messy. i2i denoise is much more direct.


Recommended realism models

Option 1: RealVisXL V5.0

Use this first if the goal is photoreal local image generation with fewer built-in content restrictions.

Useful link:

Its model card says it is aimed at photorealism and can produce SFW and NSFW images. It also recommends negative terms such as bad anatomy, face asymmetry, eye asymmetry, deformed eyes, deformed mouth, and open mouth.

This makes it a good fit for:

photoreal portrait cleanup
realistic skin/lighting
less filtered local generation
base-image preparation before I2V

Safety/common-sense note: “uncensored” should mean fewer artificial style/content refusals for legal local image generation, not illegal, non-consensual, or underage sexual content.

Option 2: Juggernaut XL v9

Use this if you want a mature general photoreal SDXL model.

Useful link:

Its model page presents it as a refined photoreal SDXL checkpoint and emphasizes the maturity of the SDXL ecosystem.

This makes it good for:

stable photorealism
general portrait/photo realism
natural skin and lighting
8GB-friendly SDXL work

My practical model choice

I would test both, but start like this:

First: RealVisXL V5.0
Second: Juggernaut XL v9

RealVisXL if I care more about less-restricted photoreal realism. Juggernaut if I care more about broad, mature, predictable SDXL photoreal output.


Suggested i2i settings

Start with this grid:

Model:
  RealVisXL V5.0
  or Juggernaut XL v9

Sampler:
  DPM++ 2M Karras
  or DPM++ SDE Karras

Steps:
  30–50

CFG:
  4–6

Denoise:
  0.20
  0.25
  0.30
  0.35

Resolution:
  768–1024 long side

For this use case, I would expect the useful i2i denoise range to often be around:

0.22–0.35

Not because it is universal, but because:

0.10–0.18 = may preserve too many source flaws
0.20–0.35 = good cleanup range
0.40–0.55 = stronger rewrite, identity risk
0.60+ = usually too much if identity matters

Use the lowest denoise that produces an improved still image.


i2i prompt template

Positive prompt

photorealistic natural portrait of the same person, realistic skin texture, natural lighting, sharp clear eyes, detailed face, natural facial proportions, same hairstyle, same clothing, same background, neutral calm expression, realistic camera photo, natural colors

Negative prompt

different person, changed identity, face asymmetry, eyes asymmetry, deformed eyes, deformed mouth, open mouth, bad anatomy, plastic skin, over-smoothed skin, airbrushed, waxy, blurry, low detail, oversaturated, overexposed, fantasy, anime, painting, 3d render, cartoon

Important: do not ask i2i for a huge expression change if the image is meant to become an I2V source.

Better I2V source:

neutral expression
clear eyes
relaxed mouth
natural lighting
stable face angle
simple background
not too glossy
not too airbrushed

Riskier I2V source:

wide smile
open mouth
visible teeth
extreme side angle
strong shadow over one eye
hair covering eye
hand covering face
heavy bloom
heavy compression noise

When to use inpainting

Use inpainting only after the global i2i result is mostly good.

Useful links:

Use inpainting for:

one bad eye
mouth edge artifact
small skin artifact
hairline issue
bad tooth
bad hand/finger near face
small background distraction

Do not use inpainting for:

rebuilding the whole face
large expression changes
major pose changes
major lighting changes
turning a bad source into a totally different image

Rule of thumb:

If the image is 90% good, inpaint.
If the image is 50% wrong, redo img2img or regenerate.

When to add IPAdapter

Add IPAdapter only if plain img2img changes identity too much.

Useful link:

The ComfyUI IPAdapter repo describes IPAdapter as powerful image-to-image conditioning, like a “1-image LoRA.”

Good for:

stronger reference-image pull
same subject feel
same style/lighting
same identity impression

Tradeoffs:

extra setup
extra model files
more VRAM than plain img2img
can overconstrain the output
can copy unwanted lighting/style
can make faces uncanny if pushed too hard

My advice: do not start with IPAdapter. Start with plain img2img. Add IPAdapter only if the face keeps drifting during still-image prep.


When to add ControlNet

Add ControlNet only if pose, framing, or composition drifts too much.

Good for:

same pose
same framing
same head angle
same body silhouette
same rough composition

Tradeoffs:

more VRAM
more setup
can make the result stiff
can create edge/outline artifacts
does not guarantee identity

Useful starting point:

For this use case, ControlNet is usually a secondary tool. It solves structure/composition, not identity by itself.


After i2i: feed the still into I2V

Once you have the best still image, save it as the new source.

Then the I2V prompt should become extremely conservative.

I2V prompt

A realistic image-to-video animation of the same person in the source image. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. The person makes only one small natural blink and very subtle breathing. A single locked-off tripod shot. The camera remains completely fixed and steady. No zoom. No pan. No scene change. Natural colors. Sharp facial details.

I2V negative prompt

different person, face change, identity change, distorted face, warped eyes, asymmetrical eyes, deformed mouth, changing hairstyle, changing clothes, changing background, camera movement, zoom, pan, dolly, orbit, scene change, fantasy, anime, painting, overexposed, oversaturated, blurry, low detail, melted face, airbrushed, waxy skin

The i2i stage already did the “make the image good” work. The I2V stage should only do:

one blink
subtle breathing
tiny smile
tiny eye movement

Do not ask I2V to also improve the image, redesign the face, change lighting, fix anatomy, and perform multiple timed actions.


Keep the current low-VRAM I2V branch as the control

I would keep the current Rapid/AIO-style I2V branch as the control branch.

Useful links:

The Rapid/AIO model page frames these models around CFG 1 and 4 steps, and describes the base version as stable with sa_solver recommended.

That matters because it means normal SDXL instincts may be wrong here:

CFG 1 is not automatically too low.
4 steps is not automatically too few.
Higher CFG may repaint instead of obey.
More steps may overprocess instead of improve.

Suggested I2V baseline

CFG:
  1.0

Steps:
  4 first

Sampler:
  sa_solver if using a stable/base Rapid-style model

Scheduler:
  beta if that is the current stable pairing

Frames:
  33–49

Resolution:
  512–640 long side

Motion:
  one tiny action only

Then test one thing at a time.

CFG grid:

1.00
1.15
1.25
1.35
1.50

Steps grid:

4
8
10
12

Seed batch:

8–16 seeds after choosing the best source still

Success is not “highest obedience.” Success is:

highest obedience before face/source fidelity starts breaking

Best practical A → B method

For A → B, I would not start with Prompt Relay or FLF2V. I would start with two short clips.

Instead of:

first blink, then smile

do:

Clip A:
  blink once, return to neutral

Handoff:
  choose the most stable late frame, not automatically the final frame

Clip B:
  tiny smile from neutral

Clip A prompt:

The same person gently blinks once, then returns to a calm neutral expression. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. A single locked-off tripod shot. No zoom. No pan. No scene change.

Clip B prompt:

The same person begins from a calm neutral expression, then slowly forms a tiny natural smile. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. A single locked-off tripod shot. No zoom. No pan. No scene change.

This is low-VRAM friendly and more controllable than asking one prompt to handle exact timeline logic.


Alternatives table

Method / workflow Good for Tradeoff Low-VRAM friendliness Recommendation
Official ComfyUI img2img + RealVisXL V5.0 photoreal base image, less-restricted local realism still image only; too much denoise changes identity very high best i2i-first option
Official ComfyUI img2img + Juggernaut XL v9 mature SDXL photoreal still prep less specifically “uncensored” oriented very high test against RealVisXL
ComfyUI inpainting fixing one bad eye, mouth, skin area, artifact masking skill required; can patchwork if overused high use after img2img if needed
IPAdapter stronger identity/style reference in still prep extra setup; can overconstrain medium-high add only if img2img changes identity
ControlNet Canny/Depth/OpenPose preserving pose/composition does not guarantee identity; can stiffen output medium add only if structure drifts
Current Rapid/AIO-style I2V tiny motion, low VRAM, seed batching limited prompt obedience; weak for A → B very high keep as main I2V control branch
Official Wan2.2 5B workflow official 8GB-friendly sanity check lower quality ceiling than 14B high use as baseline/check
Cordux low-VRAM Wan2.2 workflow packaged low-VRAM Wan workflow with GGUF/offload/chaining more custom nodes; “runs” does not mean best face preservation high good comparison branch
Official Wan2.2 14B + ComfyUI-GGUF more standard high/low Wan behavior on lower VRAM slower, heavier, quantization tradeoffs medium learning/comparison branch
Two-clip handoff A → B timing while preserving identity manual; seams possible very high best practical A → B method
Official FLF2V / first-last-frame specific start image → specific end image can morph faces; end-state can leak early; heavier medium-low later experiment for tiny expression transitions
Prompt Relay segment-level prompt timing, A → B → C advanced; more custom-node complexity medium-low later, after simple baseline works
WanVideoWrapper advanced Wan experiments and non-native features WIP-style complexity; more debugging medium-low future/advanced branch
RIFE/interpolation final smoothness can create eye/mouth ghosts; does not fix identity high final step only
Upscale / face restore final polish can change face or introduce flicker medium avoid during testing

Recommended test plan

Step 1: Make still-image candidates

Generate:

original.png
i2i_realvis_denoise020.png
i2i_realvis_denoise025.png
i2i_realvis_denoise030.png
i2i_realvis_denoise035.png
i2i_juggernaut_denoise025.png
i2i_juggernaut_denoise030.png

Pick the best 2–3.

Do not pick only the prettiest image. Pick the most I2V-friendly image:

clear face
clear eyes
natural mouth
neutral expression
not over-smoothed
not over-sharpened
not too stylized
not too cinematic

Step 2: Run the same I2V test on each

Use the same prompt, same settings, same frame count.

The same person in the source image makes one small natural blink with very subtle breathing. Preserve the exact same face, identity, hairstyle, clothing, lighting, colors, camera angle, and background. A single locked-off tripod shot. The camera remains completely fixed and steady. No zoom. No pan. No scene change.

Run 4 seeds per still.

This tells you which still image animates best.

Step 3: Score the results

Metric Priority
same face / same identity 1
same lighting/background 2
static camera 3
no eye/mouth warping 4
natural motion 5
prompt obedience 6

Prompt obedience is last because a clip that obeys but changes the person is still a failure.

Step 4: Tune I2V lightly

Only after finding the best source image:

CFG:
  1.00
  1.15
  1.25
  1.35
  1.50

Steps:
  4
  8
  10
  12

Do not jump to high CFG.

Step 5: For A → B, use two clips

Clip 1:
  blink, return to neutral

Handoff:
  choose most stable late frame

Clip 2:
  tiny smile from neutral

Finish:
  trim/crossfade only if needed

Why I would not chase one giant “obedient workflow”

“Obedient” means different things in different stages.

For i2i, obedience means:

follow the still-image prompt
preserve the source image
change only what denoise allows

For i2v, obedience means:

perform the requested motion at the right time
without changing the source image

Those are very different problems.

A giant “obedient i2i/i2v workflow” may include many controls, but more controls do not automatically mean better results. They can create:

more setup problems
more missing nodes
more VRAM use
more hidden defaults
more ways to change the face
more difficulty isolating the cause of failure

For this use case, simplicity is not a limitation. It is an advantage.


What I would use exactly

Still image

Workflow:
  Official ComfyUI img2img

Model:
  RealVisXL V5.0 first
  Juggernaut XL v9 second

Sampler:
  DPM++ 2M Karras

Steps:
  35–50

CFG:
  4.5–5.5

Denoise:
  0.25
  0.30
  0.35

Optional:
  inpaint eyes/mouth only if needed

Video

Workflow:
  stable RapidBase-style I2V branch

Settings:
  CFG 1.0
  4 steps first
  33–49 frames
  512–640 long side
  static camera prompt
  one tiny motion only

Test:
  8–16 seeds after source image is chosen

A → B

Clip 1:
  blink, return to neutral

Handoff:
  choose most stable late frame

Clip 2:
  tiny smile from neutral

Finish:
  trim/crossfade only if needed

Bottom line

The best route is:

SDXL i2i realism prep
→ local repair/inpaint if needed
→ conservative Rapid/AIO-style I2V
→ seed batch
→ two-clip handoff for A → B

I would not try to solve everything with I2V.

Use i2i to make the image good. Use I2V only to make it subtly alive.

Short summary

  • Yes, i2i first is a strong idea for this case.
  • Best i2i base: official ComfyUI img2img.
  • Best realism models to test: RealVisXL V5.0 and Juggernaut XL v9.
  • Best i2i starting settings: denoise 0.25–0.35, CFG 4–6, 30–50 steps.
  • Use inpainting only for small local defects.
  • Add IPAdapter only if identity drifts during i2i.
  • Add ControlNet only if pose/composition drifts.
  • Keep the Rapid/AIO-style I2V branch for subtle motion.
  • Use CFG 1 / 4 steps as the first I2V baseline.
  • For A → B, use two clips with a stable handoff frame.
  • Treat Cordux, official 5B, 14B GGUF, FLF2V, and Prompt Relay as comparison or later branches, not the first production path.

i tried all the settings but no results. doesnt obey prompts AT ALL , just changes the person into someone completely different

This applies to I2I in general, but it is difficult to maintain the subject’s identity solely through prompt control. This is because it is hard for the AI to understand what should be redrawn and what should be retained.

Especially in the case of simple, standard I2I, the original image is treated merely as a reference. With standard Image-to-Image, the process generally focuses on generating a different image with a similar composition. In cases where precision isn’t critical, standard inpainting without a mask might work.

However, if you want high accuracy, more precise control is desirable. (For example, if you “absolutely” want to preserve a face.) A common method is to create a mask excluding the face and perform inpainting, but since manually creating a mask (which you could even do in MSPaint…) is a hassle, it’s important to figure out how to automate the process by having the AI handle face detection and other tasks. There are plenty of components available online for this purpose, but the challenge lies in how to combine them…


Short answer:

If img2img turns the person into someone completely different, I would stop treating that mainly as a prompt-obedience problem.

That is probably more like an identity-rewrite problem.

Whole-image img2img is still regeneration. “same person” in the prompt is not an identity lock. If SDXL is allowed to touch the whole image, it can improve the frame, but it can also reinterpret the face.

The Diffusers img2img guide is useful background here: img2img starts from an initial image, adds noise, and denoises toward a new result. That is not the same thing as Photoshop-style editing.

So I would not keep searching whole-image img2img settings first.

I would change the workflow structure.

Do not use SDXL to redraw the face first.
Use detection to protect the identity-critical area, then inpaint everything else.

The practical idea

Instead of this:

source frame
→ whole-image SDXL img2img
→ hope the prompt preserves identity
→ I2V

I would try this:

source frame
→ detect face or full person
→ make a protection mask
→ grow/dilate the mask
→ blur/feather the mask
→ invert the mask
→ inpaint only the non-face or non-person area
→ stitch back into the original frame
→ save fixed PNG
→ feed fixed PNG to I2V

This changes the task from:

make img2img preserve identity

to:

remove the identity area from the repair target

That is a much easier problem.

Why I would not start with FaceDetailer as a face redraw tool

FaceDetailer is useful, but I would not start by letting it redraw the face.

For this specific failure mode, I would use the detector part only.

Something like:

YOLO / Impact Pack detects the face
→ face mask
→ grow / blur
→ invert
→ SDXL repairs everything except the face

In other words, FaceDetailer-style tools are useful here because they can locate the face.

Not because we want SDXL to repaint the face.

Good references for this detector/detailer ecosystem:

Important note:

ComfyUI-Impact-Pack says UltralyticsDetectorProvider is not part of Impact Pack itself anymore. For YOLO / Ultralytics detection, install ComfyUI-Impact-Subpack too.

The Subpack README also says Ultralytics models should be placed under:

models/ultralytics/bbox
models/ultralytics/segm

depending on the model type.

For face/person detection models, Bingsu/adetailer is a common source.

Minimal face-protect workflow

This is the first workflow I would try.

Load Image
→ UltralyticsDetectorProvider
→ YOLO face detector, for example face_yolov8m
→ BBOX Detector / Simple Detector
→ face mask
→ grow/dilate mask
→ blur/feather mask
→ invert mask
→ Inpaint Crop
→ SDXL inpaint sampler
→ Inpaint Stitch
→ Save fixed PNG
→ use that PNG as I2V input

Mask meaning:

white = repair this
black = preserve this

So if the detector gives you:

white = face
black = everything else

then invert it.

After inversion:

white = non-face area
black = protected face area

Now SDXL is asked to repair the frame while not touching the face.

The basic ComfyUI inpaint concept is covered in the official ComfyUI Inpainting Workflow. That workflow uses a manual mask, but conceptually the manual mask can be replaced with an automatically generated detector mask.

If you use Impact Pack SEGS, the shape is usually:

UltralyticsDetectorProvider
→ BBOX Detector (SEGS) or Simple Detector (SEGS)
→ SEGS to MASK (combined)
→ preview mask
→ grow/blur
→ invert
→ inpaint

Useful node references:

Face-protect vs person-protect

I would probably make two versions.

Mode Protects Repairs Use when
face-protect face / identity center background, clothing, non-face defects the face is the main identity risk, but clothing/background may need repair
person-protect whole person mostly background hair, clothing, body shape, pose, or full identity must not change

Face-protect route:

face detector
→ face mask
→ grow/blur
→ invert
→ inpaint non-face area

Person-protect route:

person segmentation detector
→ person mask
→ grow/blur
→ invert
→ inpaint background only

The tradeoff is simple:

face-protect = more repair freedom, more risk to hair/clothes/body
person-protect = safer identity/clothing preservation, less repair freedom

If the person is changing too much, use person-protect mode.

If only the face is changing, face-protect mode may be enough.

For person masks, look at segmentation detector routes in Impact Pack detector tutorial, and put segmentation models under models/ultralytics/segm as described in ComfyUI-Impact-Subpack.

Do not use the raw mask directly

A raw face mask is usually too tight.

It may protect the middle of the face, but not enough of:

face outline
hairline
ears
chin
neck
jaw shadow
skin/background transition

So I would not do:

face mask
→ invert
→ inpaint

I would do:

face mask
→ grow/dilate
→ blur/feather
→ invert
→ inpaint

Possible starting values:

face mask grow/dilate: 24-64 px
face mask blur/feather: 12-32 px
person mask grow/dilate: 16-48 px
person mask blur/feather: 8-24 px

Those are not magic values. They are just a reasonable diagnostic range.

The mask should be previewed before sampling.

Why I would use Crop & Stitch

I would strongly consider using Inpaint Crop & Stitch rather than sampling the entire frame.

The reason is simple:

we do not want to resample the whole image
we only want to repair the selected area
then stitch that repair back into the original frame

Useful node/packages:

The important part is that Crop & Stitch can crop around the masked area, sample that region, then stitch it back while preserving the unmasked area.

That is exactly the kind of behavior I would want before I2V.

A useful comment I have seen summarized the same idea as:

Ultralytics detects BBOX/SEGM
→ Detector node gets SEGS/MASK
→ convert SEGS to mask if needed
→ connect to Inpaint Crop
→ KSampler
→ Inpaint Stitch

That is basically the route I would try here.

Suggested first test

Do not put Wan, ControlNet, SAM, IPAdapter, FaceID, upscalers, and inpaint all in one big workflow at first.

First test the still image repair step only.

Use one source image and compare:

A. original frame
B. whole-image img2img result
C. face-protect inpaint result
D. person-protect inpaint result

Success condition for this stage:

the fixed PNG still looks like the same person

Not:

the prompt was perfectly followed

Not yet:

the final I2V clip is perfect

First prove that the still PNG is not being identity-rewritten.

Starting SDXL inpaint settings

For identity-preserving prep, I would start conservative.

denoise: 0.12 / 0.18 / 0.24
steps: 20-30
cfg: 3-5
sampler: whatever is stable in your SDXL workflow

I would avoid starting with high denoise.

If denoise is too high, the result may look cleaner but less like the source.

That is exactly the failure mode we are trying to avoid.

Suggested prompt for the SDXL repair pass

For this pass, I would not use glamour / beauty / cinematic language.

Positive:

realistic photo cleanup, preserve the original photo, same lighting, same camera angle, same clothing, same background structure, natural texture, realistic details, no stylization

Negative:

different person, changed face, changed identity, changed hairstyle, changed clothing, changed lighting, beauty filter, airbrushed skin, plastic skin, waxy skin, cinematic lighting, dreamlike, over-smoothed, cartoon, painting, 3d render

If you are using face-protect mode, the prompt is mostly for the non-face repair area.

The face should be protected by the mask, not by the prompt.

Useful ComfyUI parts / references

Basic inpainting:

ComfyUI Inpainting Workflow

This is the basic official inpaint workflow. It uses manual masks, but conceptually you can replace the manual mask with an automatically generated face/person mask.

Background concept:

Diffusers img2img guide

Useful if you want the conceptual reason whole-image img2img can drift.

Diffusers inpainting guide

Useful if you want the conceptual difference between whole-image img2img and masked repair.

Impact Pack / detection:

ComfyUI-Impact-Pack

Impact Pack has detector/detailer/upscaler/pipe nodes. Important note: UltralyticsDetectorProvider is not part of Impact Pack itself anymore. Install Impact Subpack too.

ComfyUI-Impact-Subpack

This provides UltralyticsDetectorProvider, which loads YOLO / Ultralytics models and provides BBOX_DETECTOR / SEGM_DETECTOR.

Impact Pack detector tutorial

This explains the detector side: BBOX, SEGM, SAM, and SEGS.

BBOX Detector (SEGS)

Useful for understanding the face/person detection step.

SEGS to MASK (combined)

Useful if your detector route returns SEGS and you need a normal mask.

Impact Pack node list mirror

Useful for checking exact node names.

YOLO / detection models:

Bingsu/adetailer

Common source for face/person/clothing detection models.

Ultralytics assets

General Ultralytics model assets.

Workflow examples / wiring examples:

FaceDetailerPipe workflow index

I would not necessarily use FaceDetailer to redraw the face here, but these workflows can be useful for learning the YOLO / detector / pipe wiring.

ComfyUI Face Detailer guide

Again, I would treat this mainly as a guide to the detector/detailer ecosystem, not as the first thing to use for identity preservation.

Improving faces with Impact-Pack Detailers

Useful background for Impact Pack detailer workflows.

Crop and stitch:

ComfyUI-Inpaint-CropAndStitch

This is probably one of the most relevant pieces for this use case.

Comfy-Org crop-and-stitch nodes

Same general idea: crop before sampling, stitch back afterward, preserve unmasked areas.

RunComfy: ComfyUI-Inpaint-CropAndStitch

Readable node overview.

RunComfy: Inpaint Crop

Useful if you want the specific node page.

Optional second stage

Only after the mask workflow works, I would test extra control.

If normal SDXL inpaint is not good enough:

Option 1: Fooocus-style inpaint support

Acly/comfyui-inpaint-nodes

This can add Fooocus / LaMa / MAT-style inpaint tools to ComfyUI.

RunComfy: ComfyUI Inpaint Nodes

Readable guide for the Acly inpaint nodes.

Fooocus inpaint files

Fooocus inpaint model files.

Fooocus inpaint patch

Specific Fooocus inpaint patch file.

Option 2: SDXL inpainting model

SDXL Inpainting 0.1

This is a dedicated SDXL inpainting model.

Option 3: ControlNet

I would treat ControlNet as a second-stage fix, not the first fix.

First solve the mask design.

Then:

Tile ControlNet = if texture/clothing/background changes too much
Canny ControlNet = if outlines drift too much
Inpaint ControlNet = if fill/boundary quality is poor
Union ControlNet = more advanced route, more modes, more complexity

Possible references:

ControlNet Canny SDXL

Xinsir ControlNet Canny SDXL

Xinsir ControlNet Tile SDXL

Xinsir ControlNet Union SDXL

Xinsir ControlNetPlus GitHub

controlnetXL_inpaint

controlnet-inpaint-dreamer-sdxl

Diffusers ControlNet with SDXL docs

I would not start with all of these.

For this case, I would probably test in this order:

1. face/person-protect mask without ControlNet
2. Crop & Stitch
3. Tile ControlNet if texture changes too much
4. Canny ControlNet if shape drifts too much
5. inpaint ControlNet or Fooocus inpaint if fill quality is weak

Optional third stage

If you still need stronger identity preservation, then I would start looking at identity-reference systems.

But I would not start there.

Only after the simpler mask route fails would I look at things like:

IPAdapter FaceID
InstantID
LivePortrait
manual mask correction

The reason is simple:

the first problem to solve is not "how do I force SDXL to know the person?"
the first problem is "how do I stop SDXL from touching the person?"

Final rule before I2V

Only feed the PNG to I2V after the still image still looks like the same person.

If the still-image prep already changes the person, I2V cannot fix that.

It will just animate the changed person.

So the diagnostic order should be:

1. Can I make a fixed still PNG that preserves identity?
2. Does that fixed PNG animate better than the original frame?
3. Only then tune the I2V settings.

For your specific failure mode, I would debug the still-image prep first.

Not the Wan settings first.

Incredibly useful information in here everyone! :face_with_monocle:

Not recommended to digest all in one sitting though (unless you’re an agent and not a mere hooman). Kinda makes you miss the “simplicity” of Wan 2.1. The 3-step workflow seems like a cakewalk in comparison. Pick your resolution, quantization, and top it off with customization (LoRAs+experiments).

I’m still seeing incredibly high usage of Wan2.1 (in relative terms) and some hesitation/reluctance to adopt wan v2.2 (and up) by the general public as well as “veterans” on HF and other platforms that aren’t ComfyUI-centric. And… understandably so, IMO. 2.1 when used correctly is still incredibly powerful and robust, and the economics of running these models, for the majority of people, heavily favors 2.1.

The high/low noise differentiation could have been better explained or conceptualized, IMO, by the WAN team as well as by us, the community. Probably a bit late though, as 2.2 itself is on the cusp of being overshadowed by the next iteration of models.

Anyways, thanks for the wealth of knowledge @John6666 - bookmarkin’ forsure!