Deploying LLMs Into Production Using TensorRT LLM | by Het Trivedi

Contents

Hands-On Python Tutorial
Step 1: Compiling the model
Step 2: Deploying the compiled model
Deploying the model in GKE
Performance Benchmarks
Conclusion
Images

Hands-On Python Tutorial

There are two steps to deploy a model using TensorRT-LLM:

1. Compile the model
2. Deploy the compiled model as a REST API endpoint

Step 1: Compiling the model

For this tutorial, we will be working with Mistral 7B Instruct v0.2. As mentioned earlier, the compilation phase requires a GPU. I found the easiest way to compile a model is on a Google Colab notebook.

TensorRT LLM is primarily supported on high-end NVIDIA GPUs. I ran the Google Colab notebook on an A100 40GB GPU and will use the same GPU for deployment as well.

!git clone https://github.com/NVIDIA/TensorRT-LLM.git
%cd TensorRT-LLM/examples/llama

Clone the TensorRT-LLM git repo. This repo contains all of the modules and scripts we need to compile the model.

!pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
!pip install huggingface_hub pynvml mpi4py
!pip install -r requirements.txt

Install the necessary Python dependencies.

from huggingface_hub import snapshot_download
from google.colab import userdata

snapshot_download(
    "mistralai/Mistral-7B-Instruct-v0.2",
    local_dir="tmp/hf_models/mistral-7b-instruct-v0.2",
    max_workers=4
)

Download the Mistral 7B Instruct v0.2 model weights from Hugging Face and store them in a local directory at tmp/hf_models/mistral-7b-instruct-v0.2.
If you look inside the tmp/hf_models directory in Colab, you should see the model weights there.
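If you'd rather check the download from a notebook cell than from the file browser, here is a minimal sketch. The helper name list_files is my own, not part of huggingface_hub; it simply walks the local directory and reports each file with its size.

```python
from pathlib import Path
from typing import List, Tuple


def list_files(root: str) -> List[Tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under root."""
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(
        (p.relative_to(base).as_posix(), p.stat().st_size)
        for p in base.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    # Directory used by the snapshot_download call in this tutorial.
    for name, size in list_files("tmp/hf_models/mistral-7b-instruct-v0.2"):
        print(f"{size / 1e9:8.2f} GB  {name}")
```

An empty result means the directory does not exist or the download did not finish.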

!python convert_checkpoint.py --model_dir ./tmp/hf_models/mistral-7b-instruct-v0.2 \
--output_dir ./tmp/trt_engines/1-gpu/ \
--dtype float16

The raw model weights cannot be compiled directly. Instead, they must first be converted into a specific TensorRT LLM format.
The convert_checkpoint.py script takes the raw Mistral weights and converts them into a compatible format.
The --model_dir is the path to the raw model weights.
The --output_dir is the path where the converted weights are saved.

!trtllm-build --checkpoint_dir ./tmp/trt_engines/1-gpu/ \
--output_dir ./tmp/trt_engines/compiled-model/ \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--max_input_len 32256

The trtllm-build command compiles the model. At this stage, you can pass in various optimization flags as well. To keep things simple, I have not used any additional optimizations.
The --checkpoint_dir is the path to the converted model weights.
The --output_dir is where the compiled model gets saved.
Mistral 7B Instruct v0.2 supports a 32K context length. I’ve set the maximum input length accordingly using the --max_input_len flag.
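To get a feel for what a long input costs in GPU memory, here is a rough back-of-the-envelope estimate of the key/value cache for a single sequence. The architecture numbers are taken from the published Mistral 7B configuration; the formula is the standard KV-cache size calculation, and the real footprint depends on how the runtime allocates memory, so treat the result as an approximation rather than what TensorRT LLM reserves.

```python
# Architecture values for Mistral 7B (from the model's published config).
NUM_LAYERS = 32
NUM_KV_HEADS = 8      # grouped-query attention uses 8 key/value heads
HEAD_DIM = 128        # hidden size 4096 / 32 attention heads
BYTES_PER_VALUE = 2   # float16


def kv_cache_bytes(num_tokens: int) -> int:
    """Approximate KV-cache size in bytes for one sequence of num_tokens."""
    # Factor 2: one tensor for keys and one for values in every layer.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return num_tokens * per_token


print(kv_cache_bytes(1))                 # 131072 bytes (128 KiB) per token
print(kv_cache_bytes(32256) / 2**30)     # 3.9375 GiB for a full-length input
```

So a single full-length input needs roughly 4 GiB of cache on top of the float16 weights.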

Note: Compiling the model can take 15–30 minutes.

Once the model is compiled, you can upload it to the Hugging Face Hub. In order to upload files to the Hugging Face Hub, you will need a valid access token that has WRITE access.

import os
from google.colab import userdata
from huggingface_hub import HfApi

# Create the client once; the token must have WRITE access.
api = HfApi(token=userdata.get('HF_WRITE_TOKEN'))

for root, dirs, files in os.walk("tmp/trt_engines/compiled-model", topdown=False):
    for name in files:
        filepath = os.path.join(root, name)
        # Keep the last two path components, e.g. compiled-model/<file name>.
        filename = "/".join(filepath.split("/")[-2:])
        print("uploading file: ", filename)
        api.upload_file(
            path_or_fileobj=filepath,
            path_in_repo=filename,
            repo_id="<your-repo-id>/mistral-7b-v0.2-trtllm"
        )

This code uploads the compiled model (the .engine file) to Hugging Face under your user ID.
Replace <your-repo-id> in the code with your Hugging Face namespace, which is usually your Hugging Face user ID.
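If you want to preview where each file will end up in the repo before uploading anything, the path rule from the loop can be pulled out into a small dry-run helper. This is only a sketch: plan_uploads is a name I made up, and it performs no network calls.

```python
import os
from typing import List, Tuple


def plan_uploads(local_dir: str) -> List[Tuple[str, str]]:
    """Return (local path, path in repo) pairs without uploading anything."""
    plan = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            filepath = os.path.join(root, name)
            # Same rule as the upload loop: keep the last two path components.
            parts = filepath.replace(os.sep, "/").split("/")
            plan.append((filepath, "/".join(parts[-2:])))
    return sorted(plan, key=lambda pair: pair[1])


if __name__ == "__main__":
    for src, dst in plan_uploads("tmp/trt_engines/compiled-model"):
        print(f"{src} -> {dst}")
```

Because the directory name is kept as a prefix, the files land under compiled-model/ in the repo, which is the sub-directory the deployment code reads from later.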

Awesome! That finishes the model compilation part. Onto the deployment step.

Step 2: Deploying the compiled model

There are a lot of ways to deploy this compiled model. You can use a simple tool like FastAPI or something more complex like the Triton Inference Server.

When using a tool like FastAPI, the developer has to set up the API server, write the Dockerfile, and configure CUDA correctly. Managing these things can be a real pain and it ruins the overall developer experience.

To avoid these issues, I’ve decided to use a simple open-source tool called Truss. Truss allows developers to easily package their models with GPU support and run them on any cloud environment. It has a ton of great features that make containerizing models a breeze:

GPU support out of the box. No need to deal with CUDA.
Automatic Dockerfile creation. No need to write it yourself.
Production ready API server
Simple Python interface

The main benefit of using Truss is that you can easily containerize a model with GPU support and deploy it to any cloud environment.

Build the Truss once. Deploy it anywhere.

Creating the Truss

Create or open a Python virtual environment with Python version ≥ 3.8 and install the following dependency:

pip install --upgrade truss

(Optional) If you want to create your Truss project from scratch you can run the command:

truss init mistral-7b-tensorrt-llm

You will be prompted to give your model a name. Any name such as Mistral 7B TensorRT LLM will do. Running the command above auto-generates the files required to deploy a Truss.

To speed the process up a bit, I have a GitHub repository that contains the required files. Please clone the GitHub repository below:

This is what the directory structure should look like for mistral-7b-tensorrt-llm-truss :

├── mistral-7b-tensorrt-llm-truss
│   ├── config.yaml
│   ├── model
│   │   ├── __init__.py
│   │   ├── model.py
│   │   └── utils.py
│   └── requirements.txt

Here’s a quick breakdown of what the files above are used for:

1. The config.yaml is used to set various configurations for your model, including its resources, dependencies, environment variables, and more. This is where we can specify the model name, which Python dependencies to install, as well as which system packages to install.

2. The model/model.py is the heart of Truss. It contains the Python code that will get executed on the Truss server. In the model.py there are two main methods: load() and predict() .

The load method is where we’ll download the compiled model from hugging face and initialize the TensorRT LLM engine.
The predict method receives HTTP requests and calls the model.

3. The model/utils.py contains some helper functions for the model.py file. I did not write the utils.py file myself; I took it directly from the TensorRT LLM repository.

4. The requirements.txt contains the necessary Python dependencies to run our compiled model.
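To make the load() and predict() split concrete before looking at the real file, here is a toy stand-in with the same two-method shape. EchoModel is a made-up class for illustration; it loads nothing and just upper-cases the prompt. The last three lines imitate what I assume the server does: call load() once at startup, then predict() once per request.

```python
class EchoModel:
    """Toy stand-in with the same two-method shape as model/model.py."""

    def __init__(self, **kwargs):
        # The real model receives a data directory through kwargs.
        self._data_dir = kwargs.get("data_dir")
        self._ready = False

    def load(self):
        # A real model would download weights and build the engine here.
        self._ready = True

    def predict(self, request: dict) -> dict:
        if not self._ready:
            raise RuntimeError("load() must run before predict()")
        prompt = request.pop("prompt")
        return {"output": prompt.upper()}


model = EchoModel(data_dir="/tmp/data")
model.load()
print(model.predict({"prompt": "hello"}))  # {'output': 'HELLO'}
```

Keeping the expensive work in load() means each request only pays for generation, not for downloading or initializing the model.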

Deeper Code Explanation:

The model.py contains the main code that gets executed, so let’s dig a bit deeper into that file. Let’s first take a look at the load function.

import subprocess

# Install TensorRT LLM when the module is imported so it is available inside the container.
subprocess.run(["pip", "install", "tensorrt_llm", "-U", "--pre", "--extra-index-url", "https://pypi.nvidia.com"])

import torch
from model.utils import (DEFAULT_HF_MODEL_DIRS, DEFAULT_PROMPT_TEMPLATES,
                         load_tokenizer, read_model_name, throttle_generator)
import tensorrt_llm
import tensorrt_llm.profiler
from tensorrt_llm.runtime import ModelRunnerCpp, ModelRunner
from huggingface_hub import snapshot_download

STOP_WORDS_LIST = None
BAD_WORDS_LIST = None
PROMPT_TEMPLATE = None


class Model:
    def __init__(self, **kwargs):
        self.model = None
        self.tokenizer = None
        self.pad_id = None
        self.end_id = None
        self.runtime_rank = None
        self._data_dir = kwargs["data_dir"]

    def load(self):
        # Download the compiled engine from the Hugging Face Hub.
        snapshot_download(
            "htrivedi99/mistral-7b-v0.2-trtllm",
            local_dir=self._data_dir,
            max_workers=4,
        )
        self.runtime_rank = tensorrt_llm.mpi_rank()

        # The engine files were uploaded under the compiled-model/ prefix.
        engine_dir = f"{self._data_dir}/compiled-model"
        model_name, model_version = read_model_name(engine_dir)

        tokenizer_dir = "mistralai/Mistral-7B-Instruct-v0.2"
        self.tokenizer, self.pad_id, self.end_id = load_tokenizer(
            tokenizer_dir=tokenizer_dir,
            vocab_file=None,
            model_name=model_name,
            model_version=model_version,
            tokenizer_type="llama",
        )

        runner_cls = ModelRunner
        runner_kwargs = dict(engine_dir=engine_dir,
                             lora_dir=None,
                             rank=self.runtime_rank,
                             debug_mode=False,
                             lora_ckpt_source="hf",
                             )
        self.model = runner_cls.from_dir(**runner_kwargs)

What’s happening here:

At the top of the file we import the necessary modules, most importantly tensorrt_llm.
Next, inside the load function, we download the compiled model using the snapshot_download function. My compiled model is at the following repo id: htrivedi99/mistral-7b-v0.2-trtllm . If you uploaded your compiled model elsewhere, update this value accordingly.
Then, we download the tokenizer for the model using the load_tokenizer function that comes with model/utils.py .
Finally, we use TensorRT LLM to load our compiled model using the ModelRunner class.
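A wrong engine path is an easy mistake to make here, so a quick sanity check before handing the directory to ModelRunner can save a slow debugging loop. The sketch below is not part of the original model.py. check_engine_dir is a hypothetical helper, and the expectation that the directory holds a config.json next to at least one .engine file is my assumption about what trtllm-build writes.

```python
from pathlib import Path
from typing import List


def check_engine_dir(engine_dir: str) -> List[str]:
    """Return a list of problems; an empty list means the directory looks usable."""
    root = Path(engine_dir)
    if not root.is_dir():
        return [f"{engine_dir} is not a directory"]
    problems = []
    # Assumption: the build step writes config.json beside the engine file(s).
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    if not list(root.glob("*.engine")):
        problems.append("no .engine file found")
    return problems
```

Inside load(), you could call it on the compiled-model directory right after snapshot_download and raise an error if the returned list is not empty.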

Cool, let’s take a look at the predict function as well.

# Method of the Model class shown above.
def predict(self, request: dict):
    # Read the generation settings from the request, falling back to defaults.
    prompt = request.pop("prompt")
    max_new_tokens = request.pop("max_new_tokens", 2048)
    temperature = request.pop("temperature", 0.9)
    top_k = request.pop("top_k", 1)
    top_p = request.pop("top_p", 0)
    streaming = request.pop("streaming", False)
    streaming_interval = request.pop("streaming_interval", 3)

    batch_input_ids = self.parse_input(tokenizer=self.tokenizer,
                                       input_text=[prompt],
                                       prompt_template=None,
                                       input_file=None,
                                       add_special_tokens=None,
                                       max_input_length=1028,
                                       pad_id=self.pad_id,
                                       )
    input_lengths = [x.size(0) for x in batch_input_ids]

    outputs = self.model.generate(
        batch_input_ids,
        max_new_tokens=max_new_tokens,
        max_attention_window_size=None,
        sink_token_length=None,
        end_id=self.end_id,
        pad_id=self.pad_id,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        num_beams=1,
        length_penalty=1,
        repetition_penalty=1,
        presence_penalty=0,
        frequency_penalty=0,
        stop_words_list=STOP_WORDS_LIST,
        bad_words_list=BAD_WORDS_LIST,
        lora_uids=None,
        streaming=streaming,
        output_sequence_lengths=True,
        return_dict=True)

    if streaming:
        streamer = throttle_generator(outputs, streaming_interval)

        def generator():
            total_output = ""
            for curr_outputs in streamer:
                if self.runtime_rank == 0:
                    output_ids = curr_outputs['output_ids']
                    sequence_lengths = curr_outputs['sequence_lengths']
                    batch_size, num_beams, _ = output_ids.size()
                    for batch_idx in range(batch_size):
                        for beam in range(num_beams):
                            # Skip the prompt tokens and decode only the generated ones.
                            output_begin = input_lengths[batch_idx]
                            output_end = sequence_lengths[batch_idx][beam]
                            token_ids = output_ids[batch_idx][beam][
                                output_begin:output_end].tolist()
                            output_text = self.tokenizer.decode(token_ids)
                            # Yield only the text added since the previous chunk.
                            current_length = len(total_output)
                            total_output = output_text
                            yield total_output[current_length:]

        return generator()
    else:
        if self.runtime_rank == 0:
            output_ids = outputs['output_ids']
            sequence_lengths = outputs['sequence_lengths']
            batch_size, num_beams, _ = output_ids.size()
            for batch_idx in range(batch_size):
                for beam in range(num_beams):
                    output_begin = input_lengths[batch_idx]
                    output_end = sequence_lengths[batch_idx][beam]
                    token_ids = output_ids[batch_idx][beam][
                        output_begin:output_end].tolist()
                    output_text = self.tokenizer.decode(token_ids)
            return {"output": output_text}

What’s happening here:

The predict function accepts a few model inputs such as the prompt , max_new_tokens , temperature , etc. We extract all of these values at the top of the function using the request.pop method.
Next, we format the prompt into the required format for TensorRT LLM using the self.parse_input helper function.
Then, we call our LLM to generate the outputs using the self.model.generate function. The generate function accepts a variety of arguments that help control the output of the LLM.
I’ve also added some code to enable streaming by producing a generator object. If streaming is disabled, the tokenizer simply decodes the output of the LLM and returns it as a JSON object.
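The streaming branch combines two small ideas that are easier to see in isolation: only every N-th intermediate result is processed, and each step sends just the text that is new since the previous step. The sketch below re-creates both in plain Python. throttle and stream_deltas are my own illustrative functions, written to mirror the behavior described above; they are not the helpers shipped in utils.py.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttle(items: Iterable[T], interval: int) -> Iterator[T]:
    """Yield every interval-th item, and always the final one."""
    last_index = -1
    last_item = None
    for last_index, last_item in enumerate(items):
        if last_index % interval == 0:
            yield last_item
    # Emit the final item if the loop did not already yield it.
    if last_index >= 0 and last_index % interval != 0:
        yield last_item


def stream_deltas(cumulative_texts: Iterable[str]) -> Iterator[str]:
    """Turn growing full texts into the newly added suffixes."""
    emitted = 0
    for text in cumulative_texts:
        yield text[emitted:]
        emitted = len(text)


print(list(throttle(range(5), 3)))                        # [0, 3, 4]
print(list(stream_deltas(["He", "Hello", "Hello wor"])))  # ['He', 'llo', ' wor']
```

A larger interval means fewer decode calls and fewer, bigger chunks on the wire, at the cost of a slightly choppier stream.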

Awesome! That covers the coding portion. Let’s containerize it.

Containerizing the model:

In order to run our model in the cloud we need to containerize it. Truss will take care of creating the Dockerfile and packaging everything for us, so we don’t have to do much.

Outside of the mistral-7b-tensorrt-llm-truss directory create a file called main.py . Paste the following code inside it:

import truss
from pathlib import Path

tr = truss.load("./mistral-7b-tensorrt-llm-truss")
command = tr.docker_build_setup(build_dir=Path("./mistral-7b-tensorrt-llm-truss"))
print(command)

Run the main.py file and look inside the mistral-7b-tensorrt-llm-truss directory. You should see a bunch of auto-generated files. We don’t need to worry about what these files mean; it’s just Truss doing its magic.

Next, let’s build our container using Docker. Run the commands below sequentially:

docker build mistral-7b-tensorrt-llm-truss -t mistral-7b-tensorrt-llm-truss:latest
docker tag mistral-7b-tensorrt-llm-truss <docker_user_id>/mistral-7b-tensorrt-llm-truss
docker push <docker_user_id>/mistral-7b-tensorrt-llm-truss

Sweet! We’re ready to deploy the model in the cloud!

Deploying the model in GKE

For this section, we’ll be deploying the model on Google Kubernetes Engine (GKE). If you recall, during the model compilation step we ran the Google Colab notebook on an A100 40GB GPU. A TensorRT LLM engine is built for the GPU it was compiled on, so we need to deploy the model on the same type of GPU for inference.
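One way to guard against a mismatch is to note the GPU name when you compile and compare it with the GPU the container sees at startup. The sketch below is an assumption-laden illustration rather than part of the original project. It uses pynvml (installed earlier in the Colab step) to read the device name, and it treats an exact name match as the criterion, which is stricter than strictly necessary. The example name in the usage lines is illustrative only.

```python
def current_gpu_name(index: int = 0) -> str:
    """Read the GPU product name through NVML (requires an NVIDIA driver)."""
    import pynvml  # imported lazily so same_gpu_model() works without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        name = pynvml.nvmlDeviceGetName(handle)
    finally:
        pynvml.nvmlShutdown()
    # Older pynvml releases return bytes, newer ones return str.
    return name.decode() if isinstance(name, bytes) else name


def same_gpu_model(build_name: str, runtime_name: str) -> bool:
    """Compare two GPU names, ignoring case and repeated whitespace."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())

    return normalize(build_name) == normalize(runtime_name)


# Illustrative names only; record the real one when you compile the model.
print(same_gpu_model("NVIDIA A100-SXM4-40GB", "nvidia a100-sxm4-40gb"))  # True
print(same_gpu_model("NVIDIA A100-SXM4-40GB", "Tesla T4"))               # False
```

In a deployment you would call current_gpu_name() inside load() and fail fast with a clear message if the comparison returns False.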

I won’t go super deep into how to set up a GKE cluster as it’s not in the scope of this article. But, I will provide the specs I used for the cluster. Here are the specs:

1 node, standard Kubernetes cluster (not Autopilot)
GKE Kubernetes version 1.28.5
1 NVIDIA A100 40GB GPU
a2-highgpu-1g machine (12 vCPU, 85 GB memory)
Google-managed GPU driver installation (otherwise we need to install the CUDA driver manually)
All of this will run on a spot instance

Once the cluster is configured, we can launch it and connect to it. After the cluster is active and you’ve successfully connected to it, create the following Kubernetes deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b-v2-trt
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      component: mistral-7b-v2-trt-layer
  template:
    metadata:
      labels:
        component: mistral-7b-v2-trt-layer
    spec:
      containers:
      - name: mistral-container
        image: htrivedi05/mistral-7b-v0.2-trt:latest
        ports:
        - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: 1
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-7b-v2-trt-service
  namespace: default
spec:
  type: ClusterIP
  selector:
    component: mistral-7b-v2-trt-layer
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080

This is a standard Kubernetes deployment that runs a container with the image htrivedi05/mistral-7b-v0.2-trt:latest . If you created your own image in the previous section, go ahead and use that. Feel free to use mine otherwise.

Save the manifest above to a file named mistral-deployment.yaml, then create the deployment by running the command:

kubectl create -f mistral-deployment.yaml

It takes a few minutes for the Kubernetes pod to be provisioned. Once the pod is running, the load function we wrote earlier will get executed. You can check the logs of the pod by running the command:

kubectl logs <pod-name>

Once the model is loaded, you will see something like Completed model.load() execution in 449234 ms in the pod logs. To send a request to the model via HTTP we need to port-forward the service. You can use the command below to do that:

kubectl port-forward svc/mistral-7b-v2-trt-service 8080
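Since loading takes several minutes, a script that talks to the service may want to wait until the endpoint actually answers instead of failing on the first refused connection. Here is a minimal polling sketch. wait_until and predict_endpoint_answers are hypothetical helpers of mine; the check simply sends a tiny predict request to the port-forwarded URL used in this tutorial and treats an HTTP 200 as "ready".

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float,
               interval_s: float = 5.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Call check() repeatedly until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def predict_endpoint_answers(
        url: str = "http://127.0.0.1:8080/v1/models/model:predict") -> bool:
    """Return True if a tiny predict request succeeds."""
    import requests  # lazy import so wait_until() has no third-party dependency

    try:
        res = requests.post(url, json={"prompt": "ping", "max_new_tokens": 1},
                            timeout=30)
        return res.status_code == 200
    except requests.exceptions.RequestException:
        return False
```

For example, wait_until(predict_endpoint_answers, timeout_s=900) gives the pod up to 15 minutes to finish loading before giving up.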

Great! We can finally start sending requests to the model! Open up any Python script and run the following code:

import requests

data = {"prompt": "What is a mistral?"}
res = requests.post("http://127.0.0.1:8080/v1/models/model:predict", json=data)
res = res.json()
print(res)

You will see an output like the following:

{"output": "A Mistral is a strong, cold wind that originates in the Rhone Valley in France. It is named after the Mistral wind system, which is associated with the northern Mediterranean region. The Mistral is known for its consistency and strength, often blowing steadily for days at a time. It can reach speeds of up to 130 kilometers per hour (80 miles per hour), making it one of the strongest winds in Europe. The Mistral is also known for its clear, dry air and its role in shaping the landscape and climate of the Rhone Valley."}

The speed of TensorRT LLM is most noticeable when the tokens are streamed. Here’s an example of how to do that:

import requests

data = {"prompt": "What is mistral wind?", "streaming": True, "streaming_interval": 3}
res = requests.post("http://127.0.0.1:8080/v1/models/model:predict", json=data, stream=True)

for content in res.iter_content():
    print(content.decode("utf-8"), end="", flush=True)

This Mistral model has a fairly large context window, so feel free to try it out with different prompts.
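One caveat when printing a byte stream chunk by chunk: iter_content() yields one byte at a time by default, and a multi-byte UTF-8 character (an accented letter, for example) split across chunks cannot be decoded on its own. A small sketch of a safer loop follows, using the standard library's incremental decoder; decode_stream is a helper name I chose for illustration.

```python
import codecs
from typing import Iterable, Iterator


def decode_stream(chunks: Iterable[bytes]) -> Iterator[str]:
    """Decode UTF-8 byte chunks, buffering characters split across chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush the decoder; this raises if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Simulate a response delivered one byte at a time.
raw = "Rhône café".encode("utf-8")
print("".join(decode_stream(raw[i:i + 1] for i in range(len(raw)))))  # Rhône café
```

With the streaming request above, the print loop would become: for text in decode_stream(res.iter_content()): print(text, end="", flush=True).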

Performance Benchmarks

Just by looking at the tokens being streamed, you can probably tell TensorRT LLM is really fast. However, I wanted to get real numbers to capture the performance gains of using TensorRT LLM. I ran some custom benchmarks and got the following results:

Small prompt: [image: benchmark results]

Medium prompt: [image: benchmark results]

Large prompt: [image: benchmark results]
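If you want to collect comparable numbers on your own deployment, a simple approach is to time a few requests and divide the number of generated tokens by the elapsed time. The sketch below is not the benchmark code behind the results above. measure is a hypothetical helper, and it expects you to supply a generate callable that runs one request and returns how many tokens were produced.

```python
import time
from typing import Callable, Dict


def measure(generate: Callable[[], int], runs: int = 3,
            clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Time several generate() calls and report latency and throughput."""
    latencies = []
    tokens = 0
    for _ in range(runs):
        start = clock()
        tokens += generate()  # generate() returns the number of output tokens
        latencies.append(clock() - start)
    total = sum(latencies)
    return {
        "mean_latency_s": total / runs,
        "tokens_per_s": tokens / total if total > 0 else float("inf"),
    }
```

Note that the service in this tutorial returns only text, so to count tokens you would have to tokenize the returned string yourself, for example with the same Hugging Face tokenizer the model uses. Run the same prompts against each setup you want to compare so the numbers line up.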

Conclusion

In this blog post, my goal was to demonstrate how state-of-the-art inference can be achieved using TensorRT LLM. We covered everything from compiling an LLM to deploying the model in production.

While TensorRT LLM is more complex than other inference optimizers, the performance speaks for itself. This tool provides state-of-the-art LLM optimizations, is completely open-source, and is designed for commercial use. The framework is still in its early stages and under active development, so the performance we see today should only improve in the coming years.

I hope you found something valuable in this article. Thanks for reading!

Images

If not otherwise stated, all images are created by the author.