Importing Quantized Models in GGUF Format: A Beginner’s Guide to Ollama

In this tutorial I'll demonstrate how to import any large language model from Hugging Face and run it locally on your machine using Ollama, focusing specifically on GGUF files. As an example, I'll use the CapybaraHermes model from "TheBloke".

If you prefer learning through a visual approach or want to gain additional insight into this topic, be sure to check out my YouTube video on this subject!


Setup

Here's what I'll be using:

  • Ollama: a framework for running large language models locally
    • Open-source and easy to set up
    • Installation instructions are available on the Ollama website
  • Microsoft Visual Studio Code (VSCode): my editor of choice for this project, but any other editor will do
    • Free and easy to install
  • A Hugging Face account for downloading quantized models (GGUF files)
    • Free and easy to sign up
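
Once Ollama is installed, a quick sanity check from the terminal confirms that everything is working (both commands are part of the standard Ollama CLI):

ollama --version
ollama list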

What are Quantized Models?

Quantized models are smaller versions of large language models that have been "shrunk" using a technique called quantization.

How does Quantization work?
Quantization stores the model's weights at a lower numerical precision, for example as 4-bit integers instead of 16-bit floating-point numbers. It's a bit like compressing a file into a zip archive: the goal is to reduce the size of the model so it can run efficiently on devices with limited resources, such as laptops or mobile phones.
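
To get a feel for the savings: a 7-billion-parameter model stored at 16-bit precision needs about 2 bytes per weight, roughly 14 GB in total, while the same model at 4-bit precision needs only around 3.5 to 4 GB. That is why the quantized file used later in this tutorial is a fraction of the size of the original weights.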

Trade-off between Size and Accuracy:
Lowering the precision of the weights can cost some accuracy, but for everyday use cases the impact is often not significant. A quantized model can still perform remarkably well compared to its full-sized version.

In summary, quantized models are compressed versions of large language models that trade a little accuracy for a much smaller footprint, allowing them to run efficiently on devices with limited resources.

Finding Quantized Models in GGUF Format

Step 1: Navigate to the Hugging Face Models Section

Head over to the Models section on Hugging Face and look for the filter list on the left side of the page.

Step 2: Filter by GGUF Tag

Under Libraries, find the GGUF tag and click on it to narrow your search results down to models that come with a GGUF file.

Choosing the Right File

I'll be importing a model called CapybaraHermes 2.5 Mistral 7B - GGUF from TheBloke.

TheBloke provides a detailed overview of the available files and their implications.

The more heavily compressed files are smaller but lose more accuracy; the less compressed ones are larger but perform better.
It's up to you to decide which trade-off fits your needs.

I'll be choosing the Q4_K_M version, which strikes a good balance between compression and quality.

To access the file, click on its link on the model page and then press the download button.
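
Alternatively, the file can be fetched from the command line with the Hugging Face CLI. Treat this as a sketch: it assumes the huggingface_hub package is installed, and the repository and file names below are taken from TheBloke's model page at the time of writing, so double-check them before running:

pip install huggingface_hub
huggingface-cli download TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF capybarahermes-2.5-mistral-7b.Q4_K_M.gguf --local-dir .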

Importing a Quantized Model

  1. Create a Modelfile:
    • Outline how your customized model should behave.
  2. Generate the Custom Model:
    • Take the modelfile and build the actual custom model from it.

Step 1 - Creating a Modelfile

Either create a new modelfile from scratch and populate it manually, or, for more efficiency, copy the modelfile of the quantized model's base model (Mistral 7B). Copying requires the base model to be available locally (ollama pull mistral), and in either case the FROM line must point to the GGUF file you downloaded.
The command is:

ollama show <source-model-name> --modelfile > <target-modelfile-name>

Example:

ollama show mistral --modelfile > capyhermes-modelfile
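
For reference, here is a minimal sketch of what the finished modelfile might look like. The file name in the FROM line and the ChatML template are assumptions based on TheBloke's model card, so verify them against the card before building:

# capyhermes-modelfile (sketch) - adjust the path to your downloaded file
FROM ./capybarahermes-2.5-mistral-7b.Q4_K_M.gguf

# CapybaraHermes uses the ChatML prompt format
TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

# Stop tokens keep the model from generating past its own turn
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"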

Step 2 - Generate a Custom Model from Modelfile

Create a custom model from the modelfile.
The command is:

ollama create <target-model-name> -f <target-modelfile-name>

For the CapybaraHermes example:

ollama create capyhermes -f capyhermes-modelfile

As a result, a new model called capyhermes will be created. You can confirm this by running ollama list.
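
To try the model out, start an interactive session with the standard run command:

ollama run capyhermes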

Improving your Imported Model

If you're not happy with the performance of your imported model, or you simply want to change it, it takes three easy steps (see the sketch after this list):

  1. Remove your custom model using the instruction ollama rm <custom-model-name>
  2. Change the instructions in the modelfile
  3. Generate a custom model from the changed modelfile (step 2 above)
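
Putting those steps together on the command line, using the names from this tutorial:

ollama rm capyhermes
# edit capyhermes-modelfile in your editor, then rebuild:
ollama create capyhermes -f capyhermes-modelfile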

For an example, please see this tutorial on creating custom models.
