In the first part of this story, we used a free Google Colab instance to run a Mistral-7B model and extract information with a FAISS (Facebook AI Similarity Search) database. In the second part, we used a LLaMA-13B model and the LangChain library to build a chat with text summarization and other features. In this part, I will show how to use the HuggingFace 🤗 Text Generation Inference (TGI) toolkit. TGI allows us to run a large language model (LLM) as a service. As in the previous parts, we will test it in a Google Colab instance, completely for free.
Text Generation Inference
Text Generation Inference (TGI) is a production-ready toolkit for deploying and serving large language models (LLMs). Running an LLM as a service allows us to use it with different clients, from Python notebooks to mobile apps. It was interesting to test TGI's functionality, but it turned out that its system requirements are pretty high, and not everything works as smoothly as expected:
- A free Google Colab instance provides only 12.7 GB of RAM, which is often not enough to load a 13B or even a 7B model "in one piece." The `AutoModelForCausalLM` class from HuggingFace can load "sharded" models that were split into smaller chunks. This works well in plain Python (a sketch is shown after this list), but for some reason, it does not work in TGI, and the instance crashes with a "not enough memory" error.
- VRAM size can be a second issue. In my tests with TGI v1.3.4, 8-bit quantization worked well with the `bitsandbytes` library, but 4-bit quantization (the `bitsandbytes-nf4` option) did not. I specifically verified this in Colab Pro on a 40 GB NVIDIA A100 GPU: even with `bitsandbytes-nf4` or `bitsandbytes-fp4` enabled, the required VRAM size was 16.4 GB, which is too high for a free Colab instance (and even for Colab Pro users, the 40 GB NVIDIA A100 usage price is 2–4x higher compared to the 16 GB NVIDIA T4). The launch command I used is sketched after this list.
- TGI needs Rust to be installed. A free Google Colab instance does not provide a full-fledged terminal, so a proper installation is also a challenge (a non-interactive install is sketched after this list).
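For comparison, this is how a sharded model loads directly with Transformers, even within Colab's memory limits. It is a minimal sketch: the model name is only an example, and `device_map` and `low_cpu_mem_usage` are standard Transformers options, not anything TGI-specific:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example sharded repository; any model whose weights are split
# into smaller chunks will load the same way.
model_id = "mistralai/Mistral-7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",       # place layers on GPU/CPU automatically
    low_cpu_mem_usage=True,  # load shards one by one instead of all at once
)
```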
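For reference, this is roughly how the TGI server was started in my quantization tests. Treat it as a sketch rather than a definitive setup: the model name and port are examples, while the `--quantize` values are the ones discussed above:

```python
# Colab cell: start the TGI server (assumes text-generation-inference
# is already installed). The --quantize flag selects the quantization mode.
!text-generation-launcher \
    --model-id mistralai/Mistral-7B-Instruct-v0.1 \
    --quantize bitsandbytes \
    --port 8080
```

Once the server is up, it can be queried from any HTTP client; the `text-generation` Python package provides a thin wrapper (the address below assumes the launch command above):

```python
from text_generation import Client

client = Client("http://127.0.0.1:8080")
response = client.generate("What is Text Generation Inference?",
                           max_new_tokens=64)
print(response.generated_text)
```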
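As for the Rust requirement, rustup can at least be installed non-interactively from a notebook cell. A sketch, assuming Colab's default root user (the `-y` flag accepts all installer defaults):

```python
# Colab cell: install Rust without an interactive terminal.
!curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

import os
# rustup installs to ~/.cargo/bin; make it visible to later cells.
os.environ["PATH"] += ":/root/.cargo/bin"
```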