This is where the whole series starts to smell less like a framework and more like a real worksite.
The earlier pieces could still rely on maps, layers and clean distinctions.
This one cannot.
This one has to deal with the questions that start showing up the moment you actually train something:
- why is this so slow
- why does it look stuck
- why are some warnings harmless and others a real problem
- why can LoRA just about run while partial FT collapses into memory pain
- why does “only 2.7% of weights” still feel enormous
- why does generation look dead when it is sometimes merely being brutally honest about how slow it is
If I had to put the central point right at the top, it would be this:
Why training is so much heavier than inference
Saying “because the model is big” is not wrong, but it is not enough.
The pain you felt earlier did not come only from size.
It came from the fact that training does much more than inference.
What inference is doing
At its core:
- forward pass
- logits
- then the next token
If all you do is a single forward pass, the cost is comparatively plain. That is exactly why your “forward pass only” DPO smoke test could succeed even when generation felt unusable.
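In code, that inference core really is tiny. A minimal sketch, assuming a transformers causal LM (the model id here is a placeholder, not the actual checkpoint from the earlier posts):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-8b-model"  # placeholder: use whatever checkpoint you are smoke-testing

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

inputs = tokenizer("A quick smoke-test prompt", return_tensors="pt")

# Inference at its plainest: one forward pass, read the logits, pick the next token.
with torch.no_grad():
    logits = model(**inputs).logits               # [batch, seq_len, vocab]
    next_token = logits[:, -1, :].argmax(dim=-1)  # greedy choice of the next token

print(tokenizer.decode(next_token))
```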
What training is doing
Training has to carry extra work:
- loss computation
- backward pass
- gradient storage
- optimiser state
- parameter updates
So training is not “inference done many times”.
It is a different workload category.
That is why the same 8B model can feel:
- survivable in one type of inference
- much harsher the moment training begins
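In torch terms, the extra work stacks up like this. A toy sketch, using a small linear module rather than the real model or trainer, purely to show where the additional cost lives:

```python
import torch

# A toy module stands in for the model: the point is where the extra cost appears,
# not the architecture.
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters())

x = torch.randn(8, 1024)
target = torch.randn(8, 1024)

out = model(x)                                    # forward pass: inference also does this much

loss = torch.nn.functional.mse_loss(out, target)  # loss computation
loss.backward()                                   # backward pass: gradients stored on every trainable parameter
optimizer.step()                                  # parameter update, plus Adam's running stats (exp_avg, exp_avg_sq)
optimizer.zero_grad()                             # clear the stored gradients before the next step
```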
What loss is, and why it exists
This sounds like a beginner question, but it matters.
The plain version is:
loss is the numerical form of how far the model is from the target you want it to move towards
Training needs something optimisable. The model cannot merely be told “that answer felt a bit off”. It needs a quantity that the backward pass can use.
In SFT
Loss is closer to:
- how far the model’s output is from the demonstrated target answer
In DPO
The shape is different, but the principle is similar:
- how far the model’s current preferences are from the desired preference relation
So loss is not dashboard decoration.
It is part of the actual force that moves parameters.
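As a rough sketch of the two shapes in torch (not TRL's exact implementation, just the principle):

```python
import torch.nn.functional as F

# SFT-shaped loss: cross-entropy between the model's logits and the demonstrated target tokens.
def sft_loss(logits, target_ids):
    # logits: [batch, seq_len, vocab], target_ids: [batch, seq_len]
    return F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))

# DPO-shaped loss (simplified): move the policy towards the chosen answer and away
# from the rejected one, measured relative to a frozen reference model.
def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(beta * margin).mean()
```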
Why training loss does not equal quality
You already proved this the hard way.
Some LoRA runs:
- completed successfully
- produced checkpoints
- logged reasonable numbers
and yet once deployed back into use, the model still:
- became duller
- invented strange terminology
- lost its balance
- felt less intelligent overall
That is why I never want to turn loss into the protagonist.
Loss matters.
But it only tells you:
how well the model fit this training signal
It does not guarantee:
- stable generalisation
- preserved deployment quality
- good behaviour on real prompts
So the stable workflow is always:
- read the loss
- but also compare versions
- and also test them in use
Why LoRA training takes so long
Your Mac runs were very honest about this.
The slowness was not imaginary.
At least four things were stacked together.
1. You were not only doing forward passes
This is the main one.
Backward pass and optimiser steps fundamentally change the cost.
2. You were using MPS
MPS matters because it makes local Apple Silicon training possible at all.
But it is not a drop-in substitute for a mature high-end CUDA training stack. PyTorch treats MPS as an official backend and documents its dedicated memory controls, which tells you it is real support, not a hack. It does not tell you that large-model training will be comfortable.
3. Your settings were conservative but still non-trivial
Small batch sizes, modest accumulation and non-trivial sequence lengths all make the cost feel very real.
4. Saving can itself be slow
Not every apparent stall is training.
Checkpoint writing, merged shard writing and large-file output can be painfully slow on their own.
Why generation keeps looking stuck
You built a very complete case study of this.
You saw things like:
- generation appearing dead
- but forward pass succeeding
- shorter generation limits still feeling awful
- the eventual realisation that this was 8B + MPS + adapter + generate, not a broken model
The most important judgement here is:
stuck does not automatically mean broken
Sometimes it really means:
- first-token latency is terrible
- MPS generation is heavy
- the model and adapter are fine, but your requested workload is unfriendly to the hardware
That is exactly why moving to:
- forward-pass-only smoke tests
- logits checks
- chosen/rejected log-prob comparisons
was not just a workaround. It was, in several cases, a more honest evaluation route.
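A sketch of what such a log-prob comparison can look like, assuming a loaded model and tokenizer; note that naively concatenating prompt and answer before tokenising is an approximation, and the exact token boundary can shift slightly depending on the tokenizer:

```python
import torch
import torch.nn.functional as F

def answer_logprob(model, tokenizer, prompt, answer):
    """Sum of log-probs the model assigns to the answer tokens, given the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)        # position i predicts token i+1
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logprobs[:, prompt_len - 1:].sum().item()  # keep only the answer span

# A higher score for the chosen answer than for the rejected one is the signal
# you care about, and it never calls generate() at all.
# chosen_score = answer_logprob(model, tokenizer, prompt, chosen)
# rejected_score = answer_logprob(model, tokenizer, prompt, rejected)
```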
The temperature warning and do_sample
You saw this many times:
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']
This is not the main problem.
It usually just means:
- do_sample=False
- therefore sampling parameters such as temperature and top_p do not apply
- transformers is warning you that they are being ignored
Annoying, yes.
Fatal, no.
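Two ways to make it go away, assuming a transformers model and already-tokenised inputs: either stop passing the sampling knobs, or actually turn sampling on.

```python
# Greedy decoding: do not pass sampling knobs at all.
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Or, if you actually want temperature / top_p to matter, turn sampling on.
output = model.generate(**inputs, max_new_tokens=64,
                        do_sample=True, temperature=0.7, top_p=0.9)
```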
What the torch_dtype warning is telling you
You also saw:
torch_dtype is deprecated! Use dtype instead!
That is not a sign that the model cannot run.
It is an API-evolution warning.
In plain terms:
- this parameter name is being phased out
- it still works now
- but the newer spelling should replace it later
Worth cleaning up, not worth panicking about.
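In practice the cleanup is a one-word change, assuming a recent transformers release (the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "your-8b-model"  # placeholder

# Older spelling: still works for now, but triggers the deprecation warning.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Newer spelling on recent transformers releases.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.float16)
```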
What MPS actually is
The most useful plain-language definition is:
MPS is the route that lets PyTorch use Apple’s GPU stack on Apple Silicon
For your purposes, that mattered because:
- it made local training on a Mac worth attempting at all
- but it also exposed the real limits of that path
PyTorch documents MPS-specific backend and memory settings, which is a good reminder that this is a supported path, not a side alley.
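The usual availability check and device selection look like this; model here is whatever you have already loaded:

```python
import torch

# Use the Apple GPU backend if it is available, otherwise fall back to CPU.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
model = model.to(device)
```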
What the MPS high watermark is
This is the piece you started touching when PYTORCH_MPS_HIGH_WATERMARK_RATIO showed up.
In plain terms, it is a memory guardrail for the MPS allocator.
It is not a permission slip to consume memory infinitely.
So when you see errors describing:
- MPS allocated
- max allowed
- tried to allocate
that usually means: this configuration does not fit on this machine
What PYTORCH_MPS_HIGH_WATERMARK_RATIO is
It is one of PyTorch’s MPS memory-control environment variables. The official docs expose it as part of allocator tuning. That matters because it helps you understand what it is not: it is not a magic model shrink ray.
Loosening it can sometimes shift behaviour. It does not necessarily fix the underlying mismatch between workload and hardware.
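If you do want to experiment with it, a common pattern is to set it before torch ever touches MPS. This is a sketch, not a recommendation: it moves the guardrail, it does not create memory.

```python
import os

# Set before the MPS allocator is initialised, i.e. at the very top of the script
# (or export it in the shell before launching).
# "0.0" removes the upper limit entirely; that is a guardrail change, not extra memory.
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

import torch  # imported only after the environment variable is in place
```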
Why OOM happens
Your partial FT run failed in a very educational way.
The reason is straightforward.
1. You are not just loading weights
You are also carrying:
- gradients
- optimiser states
- activations
- update steps
2. Partial FT really opens original weights
That is the large difference from LoRA.
LoRA is closer to:
- frozen base
- small trainable increments
Partial FT is closer to:
- genuine trainable base weights
- with gradients and optimiser states attached
3. exp_avg and exp_avg_sq are not free
These are optimiser states associated with Adam-style methods.
That means you are not only paying for the trainable parameters themselves, but also for those extra running statistics.
That is why many runs do not explode during forward pass. They explode when the optimiser step arrives.
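A back-of-envelope sketch makes the bill visible. The parameter count and the fp32 assumption here are illustrative, not measured from any particular run:

```python
# Rough memory for Adam-style training of the trainable slice,
# assuming fp32 weights, gradients and optimiser states.
trainable_params = 216_000_000   # roughly 2.7% of an 8B model
bytes_per_value = 4              # fp32

weights    = trainable_params * bytes_per_value
grads      = trainable_params * bytes_per_value
exp_avg    = trainable_params * bytes_per_value   # Adam first moment
exp_avg_sq = trainable_params * bytes_per_value   # Adam second moment

total_gb = (weights + grads + exp_avg + exp_avg_sq) / 1024**3
print(f"~{total_gb:.1f} GB before activations are even counted")  # roughly 3.2 GB here
```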
What trainable params and trainable % are actually telling you
These numbers are worth understanding properly.
trainable params
The raw number of parameters that will actually be updated.
trainable %
How large that trainable set is relative to the entire model.
So when you see:
- 17.4%
- or 2.7%
the human instinct is to think “2.7% sounds small”.
On an 8B model, it is not small in the way your intuition wants it to be.
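Those two numbers come from a very plain count over the model's parameters, which you can reproduce without any framework helper (model here is whatever you have loaded; PEFT-wrapped models report the same figures via print_trainable_parameters()):

```python
# The two numbers are a straightforward count over the model's parameters.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())

print(f"trainable params: {trainable:,} || all params: {total:,} "
      f"|| trainable%: {100 * trainable / total:.4f}")

# PEFT-wrapped models expose the same information directly:
# model.print_trainable_parameters()
```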
Is roughly 2.7% a lot?
For local 8B training on MPS, yes.
Certainly large enough to push you out of LoRA territory and into something that genuinely feels like base-weight work.
What 2.7% really means
The easiest way to think about it is:
you are not opening a symbolic flag.
You are opening hundreds of millions of original parameters.
So “small percentage” does not mean “small bill”.
That point matters especially in large-model work.
Why partial FT feels so much harsher than LoRA
Because it sits in the most deceptive middle zone:
- much deeper than LoRA
- but not obviously as radical as full fine-tuning
That makes it easy to underestimate.
You tell yourself:
- I am only opening the final few layers
- it should not be that different
Then the machine answers with a very firm “it is different”.
How to read different kinds of “stuck”
I think this is worth keeping as a practical diagnostic table.
Type 1: merely slow
Signs:
- no explicit error
- the system is still doing work
- generation just sits there
Likely cause:
- terrible first-token latency
- heavy MPS generation
- large model plus adapter
Type 2: true incompatibility
Signs:
- immediate TypeError
- trainer arguments rejected
- API calls not recognised
Likely cause:
- version mismatch across TRL / PEFT / Transformers
Type 3: resource collapse
Signs:
- OOM
- allocator errors
- failure during optimiser step
Likely cause:
- too much trainable scope
- partial FT too heavy
- sequence too long
- optimiser state too large
Type 4: text-editing and shell-layer accidents
Signs:
- kimport-style typos that end in a SyntaxError
- heredoc leftovers
- running commands from the wrong directory
Not glamorous, but extremely common.
Why compare scripts can also become expensive
You proved this too.
Train
Heavy, yes, but the trainer knows what kind of heavy it is.
Compare
Looks harmless, but once it becomes:
- 8B base
- adapter attached
- generation
- or long-sequence log-prob comparison
it can absolutely become a serious workload in its own right.
That is why you ended up trimming compare scripts into:
- one sample only
- shorter answers
- disable-adapter base comparisons
- margin-based scoring instead of full long-form generation
That was not cheating.
It was respecting the machine.
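A sketch of what such a trimmed compare can look like, assuming model is a PEFT-wrapped model and tokenizer is already loaded; disable_adapter() is PEFT's context manager for temporarily falling back to the frozen base weights:

```python
import torch

# One prompt, short answers, adapter on versus adapter off.
prompt = "One representative prompt from your evaluation set"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    tuned_out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    with model.disable_adapter():
        base_out = model.generate(**inputs, max_new_tokens=64, do_sample=False)

print("tuned:", tokenizer.decode(tuned_out[0], skip_special_tokens=True))
print("base: ", tokenizer.decode(base_out[0], skip_special_tokens=True))
```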
The one sentence worth keeping
If this piece leaves one sentence behind, it should be this:
training is not slow inference. It is forward pass, loss, backward pass, gradients, optimiser states and updates all competing inside the same space.
Once that sentence really lands, a great deal of the rest becomes legible:
- why LoRA is still survivable
- why partial FT explodes so quickly
- why MPS often feels abrasive
- why generation looking stuck is not always a bug
- why compare scripts can become unexpectedly painful too