This appendix is not meant to retell the whole series.

Its job is simpler:

The main series is the map.
This appendix is the pocket index.


1. Layer map at a glance

TermClosest layerPlainest interpretation
Single requestapplication / interaction layerwhat you actually send into the model this time
System Promptprompt / system layerthe director’s role note for the actor
TEMPLATE / Chat Templateprompt-format layerthe script format for the dialogue
MESSAGE / few-shotprompt-demonstration layera few sample performances
PARAMETERinference / runtime layershooting-style controls like temperature and top-p
Modelfileservice / packaging layerthe whole production brief
tokenizerinput encoding layerthe translator that turns text into tokens
base model weightsmodel-weight layerthe actor’s original brain and training base
adapter / LoRAparameter-increment layera trainable prosthetic or muscle-memory increment
DPO / SFTtraining-method layerhow you choose to teach the model
mergedeployment conversion layerfolding the increment back into a full model
quantizationdeployment optimisation layermaking the model lighter to run locally
Ollamalocal service layerpackaging the model into a runnable local service

2. Core glossary

base model

The starting point for all later customisation.
If it is a raw base, it is closer to pretrained foundation. If it is instruct, it has already gone through instruction tuning.

instruct

A version that has already been trained to follow instructions and interact more usefully.
Most of your own experiments sat on Llama-3.1-8B-Instruct.

weights

The model’s real parameters.
If prompts are role notes and templates are script format, weights are the actor’s original brain and training.

adapter

A separate trainable weight block attached to the base model.
It can be stored separately and does not have to be merged.

fine-tuned adapter

An adapter that has already been trained, such as:

  • adapter_model.safetensors
  • adapter_config.json

LoRA

Low-Rank Adaptation.
A PEFT route that changes model behaviour through a relatively small number of trainable parameters. LoRA is not a training objective. It is an update strategy.

PEFT

Parameter-Efficient Fine-Tuning.
The broader family of methods that aim to avoid opening the whole model.

SFT

Supervised Fine-Tuning.
Teaching the model how to answer through target demonstrations.

LoRA SFT

Doing SFT through a LoRA-based PEFT route.

DPO

Direct Preference Optimization.
Teaching the model through prompt + chosen + rejected preference pairs.

TRL

Transformers Reinforcement Learning.
The Hugging Face-adjacent library that provides common trainers for SFT, DPO and other alignment-style workflows.

SFTTrainer

A trainer built around demonstration-style supervised fine-tuning.

DPOTrainer

A trainer built around preference data with chosen and rejected answers.

tokenizer

Turns text into tokens and often cooperates with chat templates in chat-style models.

chat template

The formatting logic that arranges system, user and assistant turns into the sequence the model expects.

prompt-completion

A simple data format:

  • prompt
  • completion

conversational

A chat-style format:

  • messages
  • assistant reply

few-shot

Putting a few demonstrations directly into the prompt context.
This does not write into model weights. It writes into the current request.

prefix tuning

A tuning route closer to parameter space than plain prompting, but usually still lighter than LoRA.

prompt tuning

A route where a small learnable prompt representation is trained rather than opening large sections of the model.

q_proj / k_proj / v_proj / o_proj

Core projection layers inside transformer attention.
Your baseline and qkvo routes were mainly deciding how much of this region to touch.

all-linear

A shorthand for applying LoRA to a broader set of linear modules, not just the main attention projections.

target_modules

The LoraConfig field that decides where LoRA will be attached.

layers_to_transform

Decides on which layers those target modules will be modified.

layers_pattern

Helps match the correct model-layer structure when doing selective layer targeting.

model.model.norm

A normalisation layer near the output end of the model.
Opening it can have a strong effect and a high cost.

lm_head

The output head that maps hidden states to vocabulary logits.
Also very sensitive because of how close it is to final outputs.

partial FT

Partial fine-tuning.
Opening only some original weights rather than the entire model.

full fine-tune

Opening and training the original model weights much more broadly or completely.

merge

Folding the adapter’s learned increment back into the base model weights.

Safetensors

A weight-file format.
Not a model category.

GGUF

A weight-container format common in local quantised inference ecosystems.

quantization

Replacing heavier high-precision weight representations with lighter ones so the model is easier to run locally.

fp32 / fp16

Higher-precision representation families.
Heavier than q-formats.

q4 / q4_0 / q4_K_M

Common 4-bit quantised families and variants.
The main thing to remember is that they are much lighter than fp16.

blob

Usually not a separate model, but an internal stored object or layer artefact in a toolchain.

Modelfile

Ollama’s packaging blueprint.
It can define:

  • FROM
  • SYSTEM
  • TEMPLATE
  • PARAMETER
  • MESSAGE
  • and some model-source paths

Ollama

A local model service and packaging tool.
It turns model artefacts into something you can actually run locally.


3. Common parameter map

r

One of the main LoRA capacity knobs.
Not automatically better when larger.

lora_alpha

A scaling factor for LoRA influence.
Not the learning rate, but still important.

learning_rate

The size of each parameter update step.
Not something you should always maximise.

num_train_epochs

How many full passes through the dataset are performed.

max_length

How long each example is allowed to be.
Longer is heavier.

gradient_accumulation_steps

Accumulate gradients over several smaller steps before updating.
A way of trading time for memory.

dataloader_pin_memory

Often discussed in CUDA contexts; in your MPS experiments it was not especially helpful.

temperature

Controls sampling spread.
Only meaningful when sampling is enabled.

top_p

A sampling control, again mostly meaningful when sampling is actually enabled.

do_sample

Whether generation is running in sampling mode or deterministic mode.


4. Sentences worth remembering

  • SFT teaches the model how to answer; DPO teaches the model which answer to prefer.
  • LoRA is an update strategy, not a training objective.
  • A Modelfile is not just a longer system prompt; it is a higher-level packaging blueprint.
  • Adapters do not have to be merged; merge is simply convenient for some deployment paths.
  • Quantization can rescue slowness, but not necessarily stupidity.
  • The best mainline is not the most dramatic version, but the one that preserves the base while actually solving the problem.