Deploying Large Language Models with SageMaker Asynchronous Inference | by Ram Vegiraju | Jan, 2024

Queue Requests for Near Real-Time Applications

Image from Unsplash by Gerard Siderius

LLMs continue to surge in popularity, and so does the number of ways to host and deploy them for inference. The challenges of LLM hosting are well documented, particularly around model size and making optimal use of the hardware the model is deployed on. LLM use cases also vary: some require real-time response times, while others can tolerate near real-time latency.

For the latter, and for more offline inference use cases, SageMaker Asynchronous Inference is a great option. With Asynchronous Inference, as the name suggests, we target near real-time workloads where latency is not necessarily strict, but which still require an active endpoint that can be invoked and scaled as needed. Within LLMs these workloads are becoming increasingly popular, with use cases such as content editing/generation, summarization, and more. None of these workloads need sub-second responses, but they still require timely inference that can be invoked on demand, as opposed to the fully offline nature of a SageMaker Batch Transform.
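The invocation pattern above differs from a real-time endpoint: the request payload lives in S3, and `invoke_endpoint_async` returns immediately with an S3 location where the result will eventually land. A minimal sketch of that flow is below; the endpoint name and S3 URI are placeholders, and the payload shape follows TGI's `{"inputs": ..., "parameters": ...}` request format (parameter values here are illustrative, not from the article).

```python
import json


def build_tgi_payload(prompt: str, max_new_tokens: int = 128) -> bytes:
    # TGI expects a JSON body of the form {"inputs": ..., "parameters": {...}};
    # the generation parameters chosen here are just examples.
    return json.dumps(
        {
            "inputs": prompt,
            "parameters": {"max_new_tokens": max_new_tokens},
        }
    ).encode("utf-8")


def invoke_async(endpoint_name: str, input_s3_uri: str) -> str:
    """Kick off an asynchronous invocation; requires AWS credentials,
    so this function is a sketch and is not executed here."""
    import boto3

    smr = boto3.client("sagemaker-runtime")
    response = smr.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_s3_uri,  # payload must already be uploaded to S3
        ContentType="application/json",
    )
    # S3 URI where SageMaker will write the inference result once ready
    return response["OutputLocation"]
```

Unlike a real-time `invoke_endpoint` call, the caller then polls (or subscribes to an SNS notification) for the object at `OutputLocation` rather than blocking on the response.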

In this example, we’ll take a look at how we can use the HuggingFace Text Generation Inference (TGI) server in conjunction with SageMaker Asynchronous Endpoints to host the Flan-T5-XXL model.
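At a high level, the deployment combines the managed TGI container image with an `AsyncInferenceConfig`. The sketch below shows that shape using the SageMaker Python SDK; the role ARN, S3 output path, instance type, and TGI environment values are assumptions for illustration, not the article's exact configuration.

```python
def tgi_env(model_id: str, num_gpus: int) -> dict:
    # Environment variables read by the HuggingFace TGI container;
    # the token limits here are illustrative assumptions.
    return {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": str(num_gpus),
        "MAX_INPUT_LENGTH": "1024",
        "MAX_TOTAL_TOKENS": "2048",
    }


def deploy_flan_t5_async(role_arn: str, output_s3_path: str):
    """Deploy Flan-T5-XXL behind an asynchronous endpoint.
    Requires AWS credentials and the sagemaker SDK; not executed here."""
    from sagemaker.async_inference import AsyncInferenceConfig
    from sagemaker.huggingface import (
        HuggingFaceModel,
        get_huggingface_llm_image_uri,
    )

    model = HuggingFaceModel(
        image_uri=get_huggingface_llm_image_uri("huggingface"),
        env=tgi_env("google/flan-t5-xxl", num_gpus=4),
        role=role_arn,
    )
    # Async results are written to S3 rather than returned inline
    async_config = AsyncInferenceConfig(
        output_path=output_s3_path,
        max_concurrent_invocations_per_instance=4,
    )
    return model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.12xlarge",  # assumed multi-GPU instance
        async_inference_config=async_config,
    )
```

The key difference from a real-time deployment is the `async_inference_config` argument to `deploy`, which tells SageMaker to queue requests internally and write responses to the configured S3 output path.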

NOTE: This article assumes a basic understanding of Python, LLMs, and Amazon SageMaker. To get started with Amazon SageMaker Inference, I would reference the following guide. We will cover the basics of SageMaker Asynchronous Inference, but for a deeper introduction refer to the starter example here, which we will be building on.

DISCLAIMER: I am a Machine Learning Architect at AWS and my opinions are my own.

  1. When to use SageMaker Asynchronous Inference
  2. TGI Asynchronous Inference Implementation
    a. Setup & Endpoint Deployment
    b. Asynchronous Inference Invocation
    c. AutoScaling Setup
  3. Additional Resources & Conclusion