Try Llama2 Locally at No Cost

Today I will unravel the process of running Llama2, a state-of-the-art llm model from Meta, for free! Let's dive into using this powerful tool.

I will assume a minimal level of familiarity with the command line and also that you have a basic machine with at least 8GB of RAM and without a GPU. If you have a GPU, you can use it to speed up the inference process (explore the official docs).

Clone the Repository

First and foremost, We need to clone the repository. Using the GitHub llama.cpp, we may clone the repository on our machine. Of course, we need to have Git installed for this to work.

After the clone, we follow the build instructions to compile the code and make the program executable.

Build Instructions

We have three different options for building the llama.cpp project:

Using 'make'

For Linux or MacOS systems, we can simply build the program in the terminal using:

make

For Windows, we do the following:

Download the latest version of w64devkit.
Extract 'w64devkit' on my computer.
Run 'w64devkit.exe'.
Use the 'cd' command to get to the 'llama.cpp' directory.
From here we all set to run:
```
make
```

Using CMake

mkdir build
cd build
cmake ..
cmake --build . --config Release

Download Llama2

Proceed to download the latest release of Llama2 (a quantized version brought by TheBloke) from here. We look for what best suits our system's RAM following the table provided here.

Note: The GGML versions are used to run the model on the CPU, while the GPTQ versions are used to run the model on GPU. However, with the GGML versions it may offload some layers to the GPU, which will speed up the inference process. Explore the official docs for more information. Update: The GGML versions are deprecated. Use the new GGUF versions instead.

For an 8GB RAM, is recommended to begin with the llama-2-7b-chat.ggmlv3.q4_K_S.bin version.

Following the download, put the obtained file in the models/7B directory.

Spring Into Action

All set? Start the server!

For unix-based systems (such as Linux and macOS):

./server -m models/7B/llama-2-7b-chat.ggmlv3.q4_K_S.bin -c 2048

For Windows systems:

server.exe -m models\7B\llama-2-7b-chat.ggmlv3.q4_K_S.bin -c 2048

That's all there is to it!

Get Engaged with Llama2

Once everything is set into place, simply type into the web browser "http://127.0.0.1:8080".

Now, start inference with my Llama2. Observe how human the interactions seem, get plenty set of useful information, and simply have fun!

In a time when the role of AI is amplifying and human-like chatbots are soaring in popularity, mastering the ground-level implementation of tools like Llama2, can be overwhelming with the expansive possibilities of AI systems. So, go ahead and apply this simple demonstration to host Llama2 on your device.

Try More

The downloaded model is a 7B model, which means it has 7 billion parameters. The larger the model, the more powerful it is. However, the larger the model, the more RAM it requires. If you have a GPU, you can use it to speed up the inference.

Different models have different capabilities. For example, the Chat model is fine-tuned to have conversations, while the basic model is unrestricted and may generate more diverse content.

Announcing Goat, A simple way to run LLMs locally

While there are many UI to run LLMs I haven't found an easy and as zero-config as possible. So I created Goat, a simple way to run LLMs locally. It's an installable app that runs locally on your machine and allows you to run LLMs simply. It will be open-source and free to use. I am planning to release it in the next few months supporting Windows, Mac, and Linux.

If you want to be notified when it's released, you may subscribe to the mailing list.