LLM Naming Explained (What do the options mean?)

By Martin Kollie
2 minute read

Model Naming Explained

You've browsed the Ollama homepage and are unsure which model to download. There are a bunch of options: should you get q4_K_M or fp16? What does it all mean?

Luckily, most models follow a similar naming convention. For example, let's look at a model from the Llama series:

llama3.3-70b-instruct-q4_K_M
model name - number of parameters - fine-tune type - quantization type

Parameters

The 70b means 70 billion parameters, which is an indication of the model's capacity and complexity. Larger models typically perform better but also require more capable hardware.
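To get a feel for what "more capable hardware" means, here is a rough back-of-the-envelope sketch (the numbers are illustrative estimates, not measurements; real usage is higher once you add the KV cache and runtime overhead):

```python
# Rough memory footprint: parameter count x bytes stored per parameter.
def approx_size_gb(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param / 1e9

# A 70b model stored as 16-bit floats (2 bytes per parameter):
print(f"~{approx_size_gb(70e9, 2.0):.0f} GB")  # ~140 GB just for the weights
```

That is far more than a typical consumer GPU can hold, which is exactly where quantization (below) comes in.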

Instruction Model

An instruction model is a type of large language model that has been fine-tuned to closely follow instructions or prompts given by the user. If you want to chat with it like ChatGPT, then an instruct model is probably the best choice.

Quantization Type

Quantization is used to compress the model's weights, and it comes in different levels. q4, for example, stands for 4-bit quantization, which is a good balance between performance and accuracy.

Some types explained:

  • fp16 - 16-bit floating point, i.e. the original unquantized weights. This is the least compressed and highest quality, hence the massive file size.
  • q - refers to the quantization level. This model family is offered from q2 all the way up to q8; higher numbers keep more bits per weight, so they are more accurate but require more memory.
  • Suffixes:
    • 0 or 1 - the older uniform (legacy) quantization formats
    • K - the K-quant quantization method
    • _S, _M, _L - Small, Medium, or Large variants of a K-quant (small: lowest memory usage, lowest precision)

If not specified, it's usually q4_K_M by default.

Quantization is similar to watching a video at 1080p vs. 720p vs. 480p: you are trading quality (resolution) for a smaller file size.

Less aggressive quantization (more bits per weight) can give you better results but requires more memory and compute. For most use cases you want q4_K_M, which offers a good balance between performance and accuracy; it's also the default quantization for most models.
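To put rough numbers on that video-resolution analogy, here is an illustrative sketch using only the nominal bits per weight (actual GGUF files are somewhat larger because they also store scaling factors and metadata, so treat these as ballpark figures):

```python
# Approximate download size of a 70b model at different quantization levels,
# counting only the nominal bits per weight (real files have extra overhead).
params = 70e9  # 70 billion parameters

nominal_bits = {"q2": 2, "q3": 3, "q4": 4, "q5": 5, "q6": 6, "q8": 8, "fp16": 16}

for level, bits in nominal_bits.items():
    size_gb = params * bits / 8 / 1e9
    print(f"{level:>4}: ~{size_gb:.0f} GB")
# q4 comes out around 35 GB, while fp16 is around 140 GB for the same weights.
```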

Here's a table with what the values mean:

| Category      | Size            | Quantization Method | Quality Impact             | Recommendation       |
|---------------|-----------------|---------------------|----------------------------|----------------------|
| Q2_K          | smallest        | K-quant             | extreme quality loss       | Not Recommended      |
| Q3_K / Q3_K_M | very small      | K-quant             | very high quality loss     | Not Recommended      |
| Q3_K_S        | very small      | K-quant             | very high quality loss     | Not Recommended      |
| Q4_0          | small           | Uniform             | very high quality loss     | prefer using Q3_K_M  |
| Q4_1          | small           | Uniform             | substantial quality loss   | prefer using Q3_K_L  |
| Q4_K_S        | small           | K-quant             | significant quality loss   | Not Recommended      |
| Q4_K / Q4_K_M | medium          | K-quant             | balanced quality           | Recommended          |
| Q5_0          | medium          | Uniform             | balanced quality           | prefer using Q4_K_M  |
| Q5_1          | medium          | Uniform             | low quality loss           | prefer using Q5_K_M  |
| Q5_K_S        | large           | K-quant             | low quality loss           | Recommended          |
| Q5_K / Q5_K_M | large           | K-quant             | very low quality loss      | Recommended          |
| Q6_K          | very large      | K-quant             | extremely low quality loss | Not Specified        |
| Q8_0          | very large      | Uniform             | extremely low quality loss | Not Recommended      |
| F16 / FP16    | extremely large | N/A                 | virtually no quality loss  | Not Recommended      |
| F32 / FP32    | absolutely huge | N/A                 | lossless                   | Not Recommended      |

Source: llama.cpp GitHub reference

Now you should be able to decode a name like

deepseek-r1:70b-llama-distill-q4_K_M

Oooh, that's the DeepSeek R1 70-billion-parameter model distilled into Llama, with 4-bit K-quantization in the medium variant, which offers a good balance between performance and accuracy.
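If you want to automate that decoding, here is a toy sketch (my own illustrative helper, not part of Ollama or llama.cpp) that splits a colon-style tag like the one above into its parts:

```python
# Toy parser for Ollama-style model tags such as
# "deepseek-r1:70b-llama-distill-q4_K_M". Illustrative only.
def parse_tag(tag: str) -> dict:
    name, _, rest = tag.partition(":")
    parts = rest.split("-") if rest else []

    info = {"model": name, "parameters": None, "variant": None, "quantization": None}
    if parts and parts[0].endswith("b"):
        info["parameters"] = parts.pop(0)   # e.g. "70b"
    if parts and (parts[-1].startswith("q") or parts[-1].startswith("fp")):
        info["quantization"] = parts.pop()  # e.g. "q4_K_M"
    if parts:
        info["variant"] = "-".join(parts)   # e.g. "llama-distill" or "instruct"
    return info

print(parse_tag("deepseek-r1:70b-llama-distill-q4_K_M"))
# {'model': 'deepseek-r1', 'parameters': '70b', 'variant': 'llama-distill',
#  'quantization': 'q4_K_M'}
```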

