By the time you truly reach this stage, one fact becomes impossible to miss: a finished training run is not yet a usable model.
A lot of first local fine-tuning attempts are driven by a very intuitive picture:
- training completes
- files appear
- therefore the model must be ready
Reality is longer than that.
What training often gives you is not a finished product, but an intermediate artefact. That artefact might be:
- an adapter
- a checkpoint
- a safetensors weight file
- a pre-merge intermediate state
Turning those things back into something genuinely usable on a local machine, especially through Ollama, usually requires a different engineering path altogether.
Why adapters do not have to be merged
This is worth stating early, because many people begin by assuming merge is mandatory.
It is not.
Adapters are designed to be able to:
- live separately from the base model
- be attached when needed
- let one base model support multiple behavioural variants
That is part of why LoRA is so attractive in engineering terms. You do not need to duplicate a full base model every time you create a new variant. You only need to store the increment.
So an adapter that is not merged is not incomplete.
It simply means:
your deployment route still allows the base and the increment to exist separately
What the difference is between an adapter and a merge
This needs to stay very clean.
Adapter route
- the base model remains the base model
- the adapter is stored separately
- inference combines them
Merge route
- the adapter’s learned increment is folded back into the base model
- the result becomes one new complete weight set
So merge is not a training method.
It is a packaging and deployment step.
That matters because it affects:
- file size
- versioning flexibility
- what downstream tools can ingest directly
Why adapters are so much more storage-efficient
Because they only store the delta.
The files you kept seeing:
adapter_model.safetensors
adapter_config.json
are not whole-model copies. They are compact trainable increments. That is why they remain so much smaller than a fully merged model.
Which is also why:
- adapters are excellent for experimentation
- merge feels more like a pre-production step
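The size gap is easy to check with arithmetic. The sketch below counts parameters for one weight matrix and for the LoRA pair that sits beside it. The dimensions and rank are hypothetical values picked for illustration, not figures from your run.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in one full weight matrix of shape (d_out, d_in)."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in one LoRA pair: B is (d_out, rank), A is (rank, d_in)."""
    return d_out * rank + rank * d_in


# Hypothetical sizes, chosen only to show the order of magnitude.
D_OUT, D_IN, RANK = 4096, 4096, 16

full = full_param_count(D_OUT, D_IN)
delta = lora_param_count(D_OUT, D_IN, RANK)
ratio = delta / full

print(f"full matrix: {full:,} params")
print(f"LoRA delta:  {delta:,} params ({ratio:.2%} of the full matrix)")
```

Under these assumed sizes the increment is well under one percent of the matrix it modifies, which is the whole reason the adapter files stay small.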
What merge is actually doing
The plain version is:
merge folds the adapter’s learned increment back into the base model weights
So after merge, you no longer have:
- base + adapter as separate pieces
You now have:
- one full model weight set
That is why merged models become much larger. You are no longer storing a small delta. You are storing the whole thing.
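A toy version makes the fold-back concrete. The sketch below uses plain Python lists in place of real tensors and assumes the usual LoRA form, where the merged weight is the base plus a scaled product of two small matrices. The matrices, the scale value and the helper names are all made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply(weight, x):
    """Apply a weight matrix to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def merge(base, lora_b, lora_a, scale):
    """Fold the adapter increment into the base: W' = W + scale * (B @ A)."""
    delta = matmul(lora_b, lora_a)
    return [
        [base[i][j] + scale * delta[i][j] for j in range(len(base[0]))]
        for i in range(len(base))
    ]


# Toy values (assumptions for illustration only).
base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]    # shape (2, 1)
lora_a = [[0.5, -0.5]]     # shape (1, 2)
scale = 2.0
x = [3.0, 1.0]

# Adapter route: base and increment stay separate and are combined at inference.
base_out = apply(base, x)
increment = apply(lora_b, apply(lora_a, x))
adapter_out = [b + scale * d for b, d in zip(base_out, increment)]

# Merge route: one complete weight set, no adapter needed afterwards.
merged = merge(base, lora_b, lora_a, scale)
merged_out = apply(merged, x)

print(adapter_out, merged_out)
```

Both routes produce the same output for the same input. What changes is where the increment lives: next to the base, or inside it.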
What Safetensors is
Safetensors is not a model category.
It is best understood as a weight-file format.
So when you saw:
adapter_model.safetensors
model.safetensors
the difference was not that one was somehow more “advanced”.
The difference was what they were storing:
- adapter increment
- or full merged weights
Safetensors appears constantly in this workflow because it fits very naturally into:
- Hugging Face model storage
- pre- and post-merge exchange
- movement between training and local deployment stages
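One way to see that Safetensors is only a container is to look at its layout: an 8-byte little-endian length, then a JSON header describing each tensor, then the raw bytes. The sketch below builds a tiny in-memory blob in that layout and reads the header back, using only the standard library. The tensor name is hypothetical, and a real workflow would normally go through the safetensors library rather than hand-rolled parsing.

```python
import json
import struct


def build_safetensors_bytes(tensors: dict) -> bytes:
    """Pack {name: (dtype, shape, raw_bytes)} into the Safetensors layout."""
    header = {}
    payload = b""
    for name, (dtype, shape, raw) in tensors.items():
        start = len(payload)
        payload += raw
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [start, len(payload)],
        }
    header_bytes = json.dumps(header).encode("utf-8")
    # 8-byte little-endian header length, then the JSON header, then the data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_safetensors_header(blob: bytes) -> dict:
    """Return the JSON header without touching the tensor data."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8:8 + header_len])


# Hypothetical tensor name; two float32 values stored as raw bytes.
blob = build_safetensors_bytes(
    {"toy.lora_A.weight": ("F32", [1, 2], struct.pack("<2f", 0.5, -0.5))}
)
header = read_safetensors_header(blob)

for name, info in header.items():
    print(name, info["dtype"], info["shape"])
```

Listing the header of a real file in the same way tells you at a glance whether it holds adapter increments or full merged weights, because the tensor names and shapes differ.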
Where GGUF sits
GGUF is also not a different personality of model.
It is better understood as another container format, especially common in local quantised inference ecosystems.
The most practical way to hold the distinction is:
- Safetensors: training / merge / interchange
- GGUF: local quantised inference ecosystem
Neither is the model itself.
They are containers.
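Because both are containers, you can often tell them apart from the first few bytes alone. The sketch below is a rough heuristic, not a validator: GGUF files begin with the ASCII magic GGUF, while a Safetensors file begins with an 8-byte length followed by a JSON header. The function name is made up for illustration.

```python
import struct


def sniff_container(prefix: bytes) -> str:
    """Guess the container format from the first bytes of a file.

    Heuristic only: pass at least the first 16 bytes.
    """
    if prefix[:4] == b"GGUF":
        return "gguf"
    if len(prefix) >= 9:
        (header_len,) = struct.unpack("<Q", prefix[:8])
        # Safetensors: non-zero header length, then JSON starting with "{".
        if header_len > 0 and prefix[8:9] == b"{":
            return "safetensors"
    return "unknown"


# Synthetic prefixes standing in for real files.
gguf_like = b"GGUF" + struct.pack("<I", 3) + b"\x00" * 8
safetensors_like = struct.pack("<Q", 2) + b"{}"

print(sniff_container(gguf_like))
print(sniff_container(safetensors_like))
print(sniff_container(b"hello world, not a model"))
```

With a real file you would read the prefix with something like `open(path, "rb").read(16)` and pass it in.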
Why merge is often followed by quantisation
This was one of the clearest lived differences in your own session.
You merged a model, brought it back into Ollama, and saw the most direct possible outcome:
- the unquantised version could run
- but it was painfully slow
Then the q4 version appeared, and the speed improved dramatically.
The main judgement here is:
quantisation is not for making the model smarter. Quantisation is for making the model runnable
What quantisation is
The plainest version is:
replace a heavier, higher-precision weight representation with a lighter one
A useful metaphor is:
- same luggage contents
- different suitcase
- lighter to carry
- easier to deploy
Why do it at all?
Because local deployment pain usually comes from:
- memory
- load cost
- inference latency
Quantisation is a direct fight against those constraints.
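The suitcase swap can be shown in a few lines. The sketch below is a deliberately simplified symmetric 4-bit scheme: one scale per block, integer codes in the range -8 to 7. It illustrates the idea only and is not the exact on-disk layout of any GGUF quantisation type. The block values are made up.

```python
def quantise_block(weights):
    """Map a block of floats to one scale plus 4-bit signed integer codes."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, codes


def dequantise_block(scale, codes):
    """Recover approximate floats from the scale and the codes."""
    return [scale * c for c in codes]


# Toy block (assumption for illustration).
block = [0.7, -0.3, 0.12, 0.0]

scale, codes = quantise_block(block)
restored = dequantise_block(scale, codes)
worst_error = max(abs(w - r) for w, r in zip(block, restored))

print("codes:", codes)
print("restored:", restored)
print("worst error:", worst_error)
```

The restored values are close to the originals but not identical. That small loss is the price paid for storing each weight in 4 bits instead of 16 or 32.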
What q4_0, q4_K_M, fp16 and fp32 are
You do not need to fall into the deepest quantisation theory on day one. The broad shape is enough to begin with.
fp32
Higher precision, heavier.
fp16
Still fairly high precision, but lighter than fp32.
q4_0
A common, basic 4-bit quantised format.
q4_K_M
Another common 4-bit local-inference format. This is the kind of route that eventually gave you a version that felt usable again.
These names are not describing different personalities.
They are describing different weight representations of essentially the same model.
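A back-of-envelope estimate shows why the representation matters so much on a local machine. The sketch below assumes a round 8 billion parameters and ignores metadata and any tensors kept at higher precision. The q4_0 figure of 4.5 bits per weight comes from its block layout (32 four-bit values plus one 16-bit scale). q4_K_M mixes block types, so no single figure is assumed for it here.

```python
# Approximate bits per weight for each representation.
BITS_PER_WEIGHT = {
    "fp32": 32.0,
    "fp16": 16.0,
    "q4_0": 4.5,  # 32 weights * 4 bits + one 16-bit scale, per block
}


def estimated_bytes(n_params: int, fmt: str) -> float:
    """Rough weight-storage size, ignoring headers and mixed-precision tensors."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8


# Assumed round figure for an 8B-class model.
N_PARAMS = 8_000_000_000

sizes = {fmt: estimated_bytes(N_PARAMS, fmt) for fmt in BITS_PER_WEIGHT}
for fmt, size in sizes.items():
    print(f"{fmt:>5}: about {size / 1e9:.1f} GB")
```

The same model shrinks by roughly a factor of seven between fp32 and a 4-bit form, which is often the difference between fitting in memory and not.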
What quantisation helps, and what it does not
It helps with:
- storage
- fitting into local memory
- inference speed
It does not automatically fix:
- a poorly trained adapter
- a distorted model
- a bad recipe
- a degraded alignment balance
That is one of the most important conclusions from the whole path.
Quantisation can solve “slow”.
It cannot solve “this model was trained into the weeds”.
Why some versions are slow and others feel stupid
This is worth preserving as a central judgement.
Slow
Usually means:
- too large
- not quantised
- expensive to load and run
Stupid
Usually means:
- data too small
- recipe unstable
- adapter distorted the model
- evaluation happened too late
Those two problems can absolutely appear in the same run.
But they are not the same disease.
Why a Modelfile is not the same thing as a system prompt
You already felt this distinction earlier.
System prompt
More like a role note for the actor.
It tells the model:
- who it is
- how it should speak
- what not to do
Modelfile
More like the full production brief.
It can define:
- the base model source
- system text
- template
- parameters
- messages
- and sometimes the route by which the model or adapter is packaged
So a Modelfile is not merely a longer system prompt.
It sits one layer above that.
How a Modelfile is built
You already walked the two most important routes.
Route 1: from an existing Ollama model
For example:
FROM llama3.1:8b
SYSTEM ...
PARAMETER ...
TEMPLATE ...
Route 2: from a merged local model directory
For example:
FROM /Users/daniel/llama31-lora-lab/out/merged-qkvo-full
The difference is:
- the first wraps an existing Ollama model
- the second repackages your own completed model artefact
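The two routes differ only in what follows FROM, which a small generator makes visible. The sketch below renders Modelfile text from a few inputs. The helper name, the system text and the temperature value are assumptions for illustration, and only FROM, SYSTEM and PARAMETER are covered.

```python
def render_modelfile(source, system=None, parameters=None):
    """Render minimal Modelfile text: FROM, optional SYSTEM, optional PARAMETERs."""
    lines = [f"FROM {source}"]
    if system:
        # Triple quotes let the system text span several lines.
        lines.append(f'SYSTEM """{system}"""')
    for key, value in (parameters or {}).items():
        lines.append(f"PARAMETER {key} {value}")
    return "\n".join(lines) + "\n"


# Route 1: wrap an existing Ollama model (system text and value are hypothetical).
route_one = render_modelfile(
    "llama3.1:8b",
    system="You are a concise technical assistant.",
    parameters={"temperature": 0.7},
)

# Route 2: point at a merged local model directory.
route_two = render_modelfile("/Users/daniel/llama31-lora-lab/out/merged-qkvo-full")

print(route_one)
print(route_two)
```

Writing either string to a file named Modelfile gives you something `ollama create` can be pointed at with `-f`.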
Why ollama create sometimes failed to read your Modelfile properly
This was not simply “Ollama being broken”.
More often the cause was one of these:
- wrong working directory
- -f path not pointed correctly
- the FROM target not being something Ollama could ingest in that route
- mixing adapter-based assumptions with full-model assumptions
That is why the eventually successful path became:
- merge first
- quantise next
- create a new Ollama model from the resulting complete model
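Most of the failure causes above can be caught before Ollama is ever invoked. The sketch below checks that the Modelfile exists and that a path-style FROM target is present on disk, then builds the `ollama create` command as a list without running it. The helper names are made up, relative FROM paths are assumed to resolve against the Modelfile's own directory, and the `--quantize` flag is worth confirming against your installed Ollama version.

```python
import tempfile
from pathlib import Path


def from_target(modelfile_text):
    """Return whatever follows the first FROM instruction, or None."""
    for line in modelfile_text.splitlines():
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2 and parts[0].upper() == "FROM":
            return parts[1]
    return None


def preflight(modelfile_path):
    """List obvious problems before calling ollama create."""
    path = Path(modelfile_path)
    if not path.is_file():
        return [f"Modelfile not found: {path}"]
    target = from_target(path.read_text())
    if target is None:
        return ["no FROM line"]
    # Only path-style targets are checked; names like llama3.1:8b are left alone.
    if target.startswith(("/", "./", "../")) and not (path.parent / target).exists():
        return [f"FROM path does not exist: {target}"]
    return []


def create_command(name, modelfile_path, quantize=None):
    """Build the command with an absolute -f path, so the working directory is irrelevant."""
    cmd = ["ollama", "create", name, "-f", str(Path(modelfile_path).resolve())]
    if quantize:
        cmd += ["--quantize", quantize]
    return cmd


with tempfile.TemporaryDirectory() as tmp:
    modelfile = Path(tmp) / "Modelfile"
    modelfile.write_text("FROM ./merged\n")
    before = preflight(modelfile)      # merged directory is missing
    (Path(tmp) / "merged").mkdir()
    after = preflight(modelfile)       # now it exists
    command = create_command("my-model", modelfile, quantize="q4_K_M")

print(before, after, command)
```

The command list can be handed to `subprocess.run` once the pre-flight list comes back empty.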
What “returning to Ollama” really means
This is worth keeping as a single very plain sentence:
you are not merely copying training files back into Ollama; you are turning training artefacts into deployment-ready model products
That makes a great many things suddenly reasonable:
- why adapters are not always directly consumable
- why merge can be the smoother path
- why quantisation is practically mandatory in local deployment
- why the Modelfile belongs to packaging, not to training itself
The one sentence worth keeping
If this piece leaves one sentence behind, it should be this:
training usually gives you artefacts, not a finished product. To get a usable local model, you still need adapter management, merging, format conversion, quantisation and Modelfile packaging.
That sentence matters because it separates “training succeeded” from “this is now usable”.