Hosting Multiple LLMs on a Single Endpoint

By Ram Vegiraju, Jan 2024



Utilize SageMaker Inference Components to Host Flan & Falcon in a Cost- and Performance-Efficient Manner

Image from Unsplash by Michael Dziedzic

The past year has witnessed an explosion in the Large Language Model (LLM) space, with a number of new models paired with various technologies and tools to help train, host, and evaluate them. Hosting and inference in particular is where the power of LLMs, and Machine Learning in general, is realized: without inference, these models produce no visible result and serve no purpose.

As I’ve documented in the past, hosting these LLMs can be quite challenging due to the size of the models and the need to utilize the hardware behind them efficiently. While we’ve worked with model serving technologies such as DJL Serving, Text Generation Inference (TGI), and Triton, in conjunction with a model/infrastructure hosting platform such as Amazon SageMaker, to host these LLMs, another question arises as we try to productionize our LLM use-cases: how can we do this for multiple LLMs?

Why does this question arise in the first place? In production-level use-cases, it is common to rely on multiple models. For instance, a Llama model might power your summarization use-case while a Falcon model powers your chatbot. We could host each of these models on its own persistent endpoint, but that carries heavy cost implications. What we need is a solution that accounts for both cost and performance through efficient resource allocation.

In this article, we will explore how we can utilize an advanced hosting option known as SageMaker Inference Components to address this problem, building out an example where we host both a Flan and a Falcon model on a single endpoint.
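As a preview of what this looks like in practice, below is a minimal boto3 sketch of the Inference Components flow: create an endpoint that only defines the instance fleet, then attach each model as its own inference component with a reserved slice of compute. All of the names, the instance type, and the resource numbers here are illustrative placeholders; we will walk through the real setup step by step later in the article.

```python
import boto3

sm_client = boto3.client("sagemaker")

# 1. The endpoint config only defines the instance fleet; with inference
# components, no model is attached at this stage.
sm_client.create_endpoint_config(
    EndpointConfigName="multi-llm-config",  # placeholder name
    ExecutionRoleArn="arn:aws:iam::<account-id>:role/<sagemaker-role>",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "InstanceType": "ml.g5.12xlarge",  # example multi-GPU instance
            "InitialInstanceCount": 1,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 2,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)
sm_client.create_endpoint(
    EndpointName="multi-llm-endpoint",
    EndpointConfigName="multi-llm-config",
)

# 2. Once the endpoint is InService, each LLM becomes its own inference
# component that reserves a slice of the instance's accelerators and
# memory (repeat this call for the Falcon model).
sm_client.create_inference_component(
    InferenceComponentName="flan-component",
    EndpointName="multi-llm-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "flan-t5-model",  # a previously created SageMaker Model
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},  # number of model copies to load
)

# 3. At invocation time, InferenceComponentName routes the request to the
# right model behind the shared endpoint.
smr_client = boto3.client("sagemaker-runtime")
response = smr_client.invoke_endpoint(
    EndpointName="multi-llm-endpoint",
    InferenceComponentName="flan-component",
    ContentType="application/json",
    Body='{"inputs": "What is the capital of France?"}',
)
print(response["Body"].read().decode())
```

Because each component declares its own compute requirements and copy count, SageMaker can pack multiple models onto the same instances and scale each one independently, which is exactly the cost and performance trade-off described above.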

NOTE: This article assumes an intermediate understanding of Python, LLMs, and Amazon SageMaker Inference. I would suggest following this article for getting started with Amazon SageMaker Inference.

DISCLAIMER: I am a Machine Learning Architect at AWS and my opinions are my own.

  1. Inference Components Introduction
  2. Other Multi-Model SageMaker Inference Hosting Options