SFTTrainer flags model as VLM and blocks assistant_only_loss=True

Hi. I came across an interesting case. I’m fine-tuning Qwen3.5-4B-Base for language modeling only (no multimodal inputs), using AutoTokenizer instead of AutoProcessor. I updated the tokenizer’s chat_template to add the {% generation %} tag so I can use assistant_only_loss=True.

When I set assistant_only_loss=True, I get the following objection from SFTTrainer:

ValueError: Assistant-only loss is not yet supported for vision-language models. Please set `assistant_only_loss=False` in the `SFTConfig`.

The justification I got is: “It seems like the error is coming from SFTTrainer itself; it inspects the model’s config to detect if it’s a VLM. Qwen3.5 has vision capabilities registered in its config, so even if you’re using it as a language-only model, SFTTrainer flags it as a VLM and blocks assistant_only_loss=True.”

Is there a workaround that makes SFTTrainer treat the model as language-only, given that I’m loading it with AutoModelForCausalLM (from transformers import AutoModelForCausalLM)?

Thanks!

It looks like this is falling into a gap between supported paths.

I think your read is basically right, but I would frame it slightly differently:

This does not look like a simple user mistake. It looks like a boundary case between:

  1. a model artifact that is published/handled as a VLM,
  2. a user workflow that is intentionally text-only,
  3. TRL’s current SFTTrainer VLM path, and
  4. the still-fragile implementation details behind assistant_only_loss=True.

The short version is:

assistant_only_loss=True is supported for the text/conversational SFT path when assistant-token masks can be produced from the chat template, but TRL currently blocks it for VLMs. Since Qwen/Qwen3.5-4B-Base resolves through the VLM/processor path, this check is triggered even if your dataset is text-only.

That makes the error understandable, but the UX is confusing.

Why this happens

The model page for Qwen/Qwen3.5-4B-Base does not present it as a plain text-only causal LM. The Hub usage panel shows image-text-to-text, AutoProcessor, and AutoModelForImageTextToText usage for this artifact.

So even if your training data is text-only, the model artifact itself is still a vision-language capable artifact from the library/tooling point of view.

On the TRL side, the important current signal is this tracking issue:

That issue explicitly lists VLM families, including Qwen3.5, and notes:

VLMs currently don’t support assistant_only_loss in SFT (blocked by a separate check).

So I would not treat this as a random failure. It is a known unsupported combination.

Why assistant_only_loss=True is more fragile than it looks

assistant_only_loss=True is not just a boolean that says “ignore user messages.” It depends on a whole chain working correctly:

  • the dataset must be in the expected conversational format;
  • the chat template must support assistant-token masking;
  • the template needs {% generation %} / {% endgeneration %} markers;
  • the tokenizer/processor must return a correct assistant mask;
  • truncation must not remove the supervised assistant span;
  • the data collator must actually apply that mask to labels;
  • and in VLMs, image placeholders / image tokens / processor-specific expansion must not shift or break the mask.
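
To make the last steps of that chain concrete, here is a minimal pure-Python sketch of how a 0/1 assistant mask turns into labels, and how truncation can leave nothing to train on. The function name labels_from_assistant_mask and the toy token IDs are illustrative assumptions, not TRL's actual collator code; the only convention taken from the libraries is -100 as the ignored label.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def labels_from_assistant_mask(input_ids, assistant_mask, max_length=None):
    """Keep labels only where assistant_mask == 1; everything else is ignored.

    Hypothetical helper for illustration: truncation here keeps the first
    max_length tokens.
    """
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    if max_length is not None:
        input_ids = input_ids[:max_length]
        assistant_mask = assistant_mask[:max_length]
    return [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(input_ids, assistant_mask)]


# Toy sequence: 4 prompt tokens followed by 3 assistant tokens.
input_ids = [11, 12, 13, 14, 21, 22, 23]
assistant_mask = [0, 0, 0, 0, 1, 1, 1]

full = labels_from_assistant_mask(input_ids, assistant_mask)
truncated = labels_from_assistant_mask(input_ids, assistant_mask, max_length=4)

print(full)       # [-100, -100, -100, -100, 21, 22, 23]
print(truncated)  # [-100, -100, -100, -100]: no supervised tokens survive
```

In the truncated case the example still goes through training, but it contributes no loss at all, which is easy to miss unless you count supervised tokens.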

The TRL docs describe the text/conversational path here:

The Transformers docs describe the lower-level assistant-mask mechanism here:

There are also related issues showing why this area is brittle, especially once ProcessorMixin / multimodal paths are involved:

That is why the TRL check is conservative. Allowing assistant_only_loss=True on the VLM path without correct masking would be worse than raising an error, because it could silently train on the wrong tokens.

What I would try first for a strictly text-only experiment

If your workload is truly text-only, I would first try to force the text/tokenizer path by explicitly passing the tokenizer as processing_class.

For example:

from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig

model_id = "Qwen/Qwen3.5-4B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="<your-output-dir>",
        assistant_only_loss=True,
        # other args...
    ),
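    train_dataset=train_dataset,  # your conversational dataset, defined elsewhere
    processing_class=tokenizer,  # explicit tokenizer, so SFTTrainer does not fall back to the processor path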
)

The important part is not just creating an AutoTokenizer. The important part is passing it into SFTTrainer:

processing_class=tokenizer

Otherwise, SFTTrainer may resolve the default processing class through the processor path, and the model is then treated as a VLM.

However, I would treat this as a workaround, not as proof that VLM + assistant_only_loss is officially supported.

After doing this, I would always inspect the labels before training:

# Use the trainer's tokenized dataset; the raw train_dataset has no input_ids yet.
batch = trainer.data_collator([trainer.train_dataset[0]])
labels = batch["labels"][0]

print("supervised tokens:", (labels != -100).sum().item())
print(tokenizer.decode(labels[labels != -100]))

You want the decoded supervised text to contain only the assistant answer span. If system/user/prompt text appears there, then the loss mask is not doing what you think it is doing.

This check is important because “the trainer runs” and “the correct tokens receive loss” are not the same thing.
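
If decoding alone is hard to eyeball on long examples, a small helper that lists the contiguous supervised spans can help. This is a sketch on plain Python lists (call labels.tolist() first if you have a tensor); supervised_spans is a hypothetical name, not a TRL utility.

```python
def supervised_spans(labels, ignore_index=-100):
    """Return (start, end) index pairs, end exclusive, for runs of supervised labels."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label != ignore_index and start is None:
            start = i  # a supervised run begins
        elif label == ignore_index and start is not None:
            spans.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        spans.append((start, len(labels)))  # run extends to the end of the sequence
    return spans


# Toy labels for a two-turn conversation: two assistant spans, everything else ignored.
labels = [-100, -100, 5, 6, -100, -100, -100, 7, 8, 9]
spans = supervised_spans(labels)
print(spans)  # [(2, 4), (7, 10)]
```

With a single assistant turn you would expect exactly one span. Zero spans means nothing is being trained on, and a span starting at index 0 usually means prompt tokens are leaking into the loss.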

What I would prefer for a more robust text-only setup

If the actual goal is:

train only on the answer/completion, not on the prompt/user/system text

then I would prefer a prompt-completion dataset and completion_only_loss=True, instead of relying on assistant masks from a chat template.

That path is conceptually simpler because the supervised boundary is explicit in the dataset schema:

{
    "prompt": "<|im_start|>system\n...<|im_end|>\n<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n",
    "completion": "The assistant answer goes here.<|im_end|>"
}
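
If your data is already stored as a messages list, one way to get an explicit boundary without hand-writing special tokens is to split off the final assistant turn. The sketch below assumes TRL's conversational prompt-completion layout (prompt and completion each given as a list of messages) and a conversation that ends with the assistant turn you want to supervise; split_last_assistant_turn is a hypothetical helper.

```python
def split_last_assistant_turn(messages):
    """Split a conversation into prompt messages and a one-message completion.

    Assumes the final message is the assistant answer to supervise.
    """
    if not messages or messages[-1]["role"] != "assistant":
        raise ValueError("expected the conversation to end with an assistant message")
    return {"prompt": messages[:-1], "completion": [messages[-1]]}


example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4"},
    ]
}

row = split_last_assistant_turn(example["messages"])
print(row["prompt"][-1]["role"])        # user
print(row["completion"][0]["content"])  # 4
```

With a datasets.Dataset this could be applied through dataset.map(lambda ex: split_last_assistant_turn(ex["messages"]), remove_columns=["messages"]). Note that earlier assistant turns in a multi-turn conversation end up in the prompt and receive no loss, which may or may not be what you want.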

Then configure SFT like this:

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="<your-output-dir>",
        completion_only_loss=True,
        assistant_only_loss=False,
    ),
    train_dataset=prompt_completion_dataset,
    processing_class=tokenizer,
)

Relevant docs:

For this particular case, I would consider completion_only_loss=True the more durable route if you do not actually need multimodal inputs.

If you really want to train it as a VLM

If you are actually using images/multimodal messages, I would not currently expect assistant_only_loss=True to work through the normal TRL SFT path.

The relevant upstream signal is still the TRL tracking issue mentioned above.

In that case, the choices are roughly:

  1. accept whole-sequence loss for now;
  2. restructure into prompt-completion if your data shape allows it;
  3. write a custom collator and verify labels very carefully;
  4. wait for upstream VLM assistant-mask support.

A custom collator is possible, but I would be careful. For VLMs, masking is not just “find assistant text and set everything else to -100.” You also need to account for processor output, image placeholders, image tokens, multimodal token expansion, truncation, special tokens, and model-specific label handling.
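
To illustrate why the order of operations matters, here is a toy sketch in plain Python. All token IDs, the expand_image_placeholders step, and mask_non_assistant are hypothetical stand-ins, not any real processor's behavior. The point is only that assistant spans have to be located on the final, expanded sequence, because positions computed before expansion no longer line up.

```python
IGNORE_INDEX = -100

# Hypothetical token IDs for a toy vocabulary.
IMAGE_PLACEHOLDER = 900  # single placeholder in the templated text
IMAGE_PATCH = 901        # what the placeholder expands into
ASSISTANT_START = 910    # marks the start of an assistant turn
TURN_END = 911           # marks the end of any turn


def expand_image_placeholders(ids, patches_per_image):
    """Replace each placeholder with patches_per_image patch tokens."""
    out = []
    for tok in ids:
        if tok == IMAGE_PLACEHOLDER:
            out.extend([IMAGE_PATCH] * patches_per_image)
        else:
            out.append(tok)
    return out


def mask_non_assistant(ids):
    """Label only tokens after ASSISTANT_START, up to and including TURN_END."""
    labels = [IGNORE_INDEX] * len(ids)
    in_assistant = False
    for i, tok in enumerate(ids):
        if tok == ASSISTANT_START:
            in_assistant = True  # the header token itself stays masked
        elif in_assistant:
            labels[i] = tok
            if tok == TURN_END:
                in_assistant = False
    return labels


# User turn: [image] 1 2 <end>; assistant turn: <start> 3 4 <end>
ids = [IMAGE_PLACEHOLDER, 1, 2, TURN_END, ASSISTANT_START, 3, 4, TURN_END]
expanded = expand_image_placeholders(ids, patches_per_image=4)
labels = mask_non_assistant(expanded)

supervised = [tok for tok in labels if tok != IGNORE_INDEX]
print(supervised)                # [3, 4, 911]
print(len(expanded) - len(ids))  # 3: every later position shifted by this much
```

A mask computed on the unexpanded ids would be three positions off here, which is exactly the kind of silent misalignment that has to be ruled out by inspecting labels on real batches.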

What I think TRL could improve

The current error is technically reasonable:

Assistant-only loss is not yet supported for vision-language models.

But for this specific failure mode, it does not explain why a text-only user hit a VLM error.

A more helpful error would say something like:

This model is being treated as a VLM because the resolved processing class is a ProcessorMixin/AutoProcessor path. VLMs do not currently support assistant_only_loss in SFT. If your workload is strictly text-only, try passing a tokenizer explicitly as processing_class=tokenizer, or use a prompt-completion dataset with completion_only_loss=True.

That would make the workaround and the distinction much clearer.

My practical recommendation

For a quick experiment:

processing_class=tokenizer

then inspect labels != -100.

For a more reliable text-only fine-tuning setup:

completion_only_loss=True
assistant_only_loss=False

with an explicit prompt-completion dataset.

For actual multimodal/VLM SFT:

assistant_only_loss=False

unless you are prepared to write and validate a custom collator.

So overall: I think your intuition is right that this is a “gap” rather than a simple user error. But given the current TRL issue tracker and docs, I would treat the current behavior as an intentional unsupported-path guard, not as a straightforward bug.