The Simplest Way to Set Up ChatGPT Locally
by Dennis Bakhuis | Jan 2024



The secret to running LLMs on consumer hardware

Figure 1: Cute tiny little robots working in a futuristic soap factory (Unsplash: Gerard Siderius).

As a data scientist, I have dedicated numerous hours to delving into the intricacies of Large Language Models (LLMs) like BERT, GPT-2, GPT-3, GPT-4, and ChatGPT. These advanced models have grown significantly in scale, making it increasingly challenging to run the latest high-performance models on standard consumer equipment. Regrettably, I still do not have an 8x A100 machine at home.

I do not (yet) have an 8x A100 machine at home

In the last few years, a new technique has emerged to make models smaller and faster: quantization. This method elegantly trims down the once-bulky LLMs to a size more digestible for consumer-grade hardware. It’s akin to putting these AI giants on a digital diet, making them fit comfortably into the more modest confines of our home computers. Meanwhile, the open-source community, with trailblazers like 🤗 HuggingFace and 🦄 Mistral, has been instrumental in democratizing access to these models. They’ve essentially turned the exclusive AI club into a ‘come one, come all’ tech fest — no secret handshake required!
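To get a feel for what quantization does, here is a minimal sketch of symmetric 8-bit weight quantization in plain NumPy. This is an illustration only; the block-wise GGUF formats that llama.cpp actually uses (such as Q4_K_M) are considerably more sophisticated:

```python
# A minimal sketch of symmetric int8 weight quantization (illustration
# only, not the scheme llama.cpp uses in practice).
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights to int8 plus a single scale factor."""
    scale = max(np.abs(weights).max(), 1e-8) / 127.0  # largest magnitude maps to 127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximately reconstruct the original float32 weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print(f"storage: {w.nbytes} -> {q.nbytes} bytes")       # 4x smaller
print(f"max abs error: {np.abs(w - w_hat).max():.5f}")  # small rounding error
```

The int8 tensor takes a quarter of the memory of the float32 original, at the cost of a small, bounded rounding error per weight. That trade-off is exactly what lets 7B-parameter models fit on a laptop.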

While instruction-tuned model weights are a significant piece of the puzzle, they’re not the whole picture. Think of these weights as the brain of the operation — essential, yet incomplete without a body. This is where a so-called wrapper comes into play, acting as the limbs that enable the model to process our prompts. And let’s not forget, to really bring this AI show to life, we typically need the muscle of hardware accelerators, like a GPU. It’s like having a sports car (the model) without a turbocharged engine (the GPU) — sure, it looks good, but you won’t be winning any races! 🚗💨💻

In this article, I’ll show you how to query various Large Language Models locally, directly from your laptop. This works on Windows, Mac, and even Linux (beta). It is based on llama.cpp, so it supports not only CPUs but also common accelerators through CUDA and Metal.
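The program we install below handles all of this through a friendly interface, but if you would rather script against llama.cpp directly, a minimal sketch with the llama-cpp-python bindings looks like this (the model path is a placeholder; substitute any GGUF file you have downloaded):

```python
# A minimal sketch using the llama-cpp-python bindings
# (pip install llama-cpp-python). The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # hypothetical file
    n_ctx=2048,       # context window in tokens
    n_gpu_layers=-1,  # offload all layers to the GPU (CUDA/Metal) if available
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
)
print(output["choices"][0]["message"]["content"])
```

When llama.cpp is built without GPU support, the `n_gpu_layers` setting is simply ignored and everything runs on the CPU.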

In the first section we will install the program that processes and manages your prompts for the various models. The second section will help you get started quickly, and in the last section I’ll give some suggestions for models to try. So let’s get started!
