Looking to experience AI at its best? We try Llama2, a new model from Meta, locally on our own system for free! Transform your PC into a playground for next-gen AI, at absolutely zero cost!
Today I will unravel the process of running Llama2, a state-of-the-art large language model (LLM) from Meta, for free! Let's dive into using this powerful tool.
I will assume minimal familiarity with the command line, and that you have a basic machine with at least 8GB of RAM; no GPU is required. If you have a GPU, you can use it to speed up inference (explore the official docs).
Clone the Repository
First and foremost, we need to clone the llama.cpp repository from GitHub onto our machine. Of course, we need to have Git installed for this to work.
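The clone step looks like this (using the official GitHub URL of llama.cpp):

```shell
# Clone the llama.cpp repository and enter its directory
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```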
After the clone, we follow the build instructions to compile the code and make the program executable.
We have a couple of different options for building the llama.cpp project, depending on platform:
For Linux or macOS systems, we can simply build the program in the terminal using:
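A minimal sketch of that build, assuming `make` and a C/C++ compiler are already installed:

```shell
# Build llama.cpp with its default Makefile targets
make
```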
For Windows, we do the following:
- Download the latest version of w64devkit.
- Extract 'w64devkit' on your computer.
- Run 'w64devkit.exe'.
- Use the 'cd' command to get to the 'llama.cpp' directory.
- From here, we're all set to run:
mkdir build
cd build
cmake ..
cmake --build . --config Release
Note: The GGML versions are used to run the model on the CPU, while the GPTQ versions are used to run the model on the GPU. However, the GGML versions can offload some layers to the GPU, which will speed up the inference process. Explore the official docs for more information. Update: The GGML versions are deprecated. Use the new GGUF versions instead.
For a machine with 8GB of RAM, it is recommended to begin with the llama-2-7b-chat.ggmlv3.q4_K_S.bin version.
Following the download, put the obtained file in the `models` directory of the cloned repository (the server commands below expect it at `models/7B/`).
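As a concrete sketch, assuming the file was downloaded into the repository root and using the filename recommended above:

```shell
# Create the directory the server command expects, then move the model into it.
mkdir -p models/7B
# The mv is a harmless no-op here if the file has not been downloaded yet.
mv llama-2-7b-chat.ggmlv3.q4_K_S.bin models/7B/ 2>/dev/null || true
```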
Spring Into Action
All set? Start the server!
For unix-based systems (such as Linux and macOS):
./server -m models/7B/llama-2-7b-chat.ggmlv3.q4_K_S.bin -c 2048
For Windows systems:
server.exe -m models\7B\llama-2-7b-chat.ggmlv3.q4_K_S.bin -c 2048
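Besides the browser UI, the server can also be queried over HTTP. A sketch using curl; the `/completion` endpoint and JSON fields here follow the llama.cpp server examples at the time of writing, so check the docs for your build:

```shell
# Ask the local server for a completion (the server must already be running)
curl --request POST http://127.0.0.1:8080/completion \
     --header "Content-Type: application/json" \
     --data '{"prompt": "What is a llama?", "n_predict": 64}'
```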
That's all there is to it!
Get Engaged with Llama2
Once everything is set in place, simply open "http://127.0.0.1:8080" in your web browser.
Now, start inferencing with your Llama2. Observe how human the interactions seem, get plenty of useful information, and simply have fun!
In a time when the role of AI is growing and human-like chatbots are soaring in popularity, the expansive possibilities of AI systems can make the ground-level implementation of tools like Llama2 seem overwhelming. So go ahead and use this simple demonstration to host Llama2 on your own device.
The downloaded model is a 7B model, which means it has 7 billion parameters. The larger the model, the more powerful it is. However, the larger the model, the more RAM it requires. If you have a GPU, you can use it to speed up the inference.
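A rough back-of-the-envelope for the memory footprint of a quantized model (illustrative arithmetic only; the ~4.5 bits/weight figure is an assumption for a q4_K_S-style quantization, and it ignores context and runtime overhead):

```shell
# 7 billion parameters at ~4.5 bits per weight, converted to gigabytes
awk 'BEGIN{printf "%.1f GB\n", 7e9 * 4.5 / 8 / 1e9}'
```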
Different models have different capabilities. For example, the chat model is fine-tuned to have conversations, while the base model is unrestricted and may generate more diverse content.
Announcing Goat, A simple way to run LLMs locally
While there are many UIs for running LLMs, I haven't found one that is easy and as close to zero-config as possible. So I created Goat, a simple way to run LLMs locally. It's an installable app that runs on your machine and lets you run LLMs with minimal setup. It will be open-source and free to use. I plan to release it in the next few months, supporting Windows, Mac, and Linux.
If you want to be notified when it's released, you may subscribe to the mailing list.