Mar 22, 2023 • Written By Aaron Pham
Recent months have seen a surge of development around large language models (LLMs) and their applications, such as ChatGPT, Stable Diffusion, and Copilot.
However, deploying and serving LLMs at scale is a challenging task that requires specific domain expertise and inference infrastructure. Rough estimates of the cost of running ChatGPT show that serving efficiency is critical to making such models work at scale. These operations are often known as Large Language Model Operations (LLMOps). LLMOps is generally considered a subset of MLOps, a set of practices combining software engineering, DevOps, and data science to automate and scale the end-to-end lifecycle of ML models.
Teams can encounter several problems when running inference on large models, including poor hardware utilisation, the overhead of Python-based serving, and inefficient batching.
In this blog post, we will be demonstrating the capabilities of BentoML and Triton Inference Server to help you solve these problems.
Triton Inference Server is a high-performance, open-source inference server for serving deep learning models. It is designed to serve models from a variety of frameworks, such as ONNX, TensorFlow, PyTorch, and TensorRT, and it includes optimisations that maximise hardware utilisation through concurrent model execution and efficient batching strategies.
Triton Inference Server is great for serving large language models, where you want a high-performance inference server that can utilise all available resources with complex batching strategies.
BentoML is an open-source platform designed to facilitate the development, shipping, and scaling of AI applications. It empowers teams to rapidly develop AI applications that involve multiple models and custom logic using Python. Once developed, BentoML allows these applications to be seamlessly shipped to production on any cloud platform with engineering best practices already integrated. Additionally, BentoML makes it easy to scale these applications efficiently based on usage, ensuring that they can handle any level of demand.
Starting with BentoML v1.0.16, Triton Inference Server can be seamlessly used as a Runner. Runners are abstractions of logic that can execute on either CPU or GPU and scale independently. Prior to the Triton integration, one of the drawbacks of Python runners was the Global Interpreter Lock (GIL), which allows only one thread to execute Python bytecode at a time. While model inference can still run on the GPU or on multiple CPU threads, the IO logic remains subject to the limitations of the GIL, which caps the utilisation of the underlying hardware (CPU and GPU). Triton's C++ runtime, by contrast, is optimised for high-throughput model serving. By using Triton as a runner, users can take full advantage of Triton's high-performance inference while continuing to enjoy all the features that BentoML offers.
In the following tutorial, we will use a PyTorch YOLOv5 object detection example. The source code can be found in the Triton PyTorch YOLOv5 example project. You can also find the TensorFlow and ONNX examples under the same directory.
The TL;DR is that BentoML provides the capability for users to run Triton Inference Server as a Runner via the bentoml.triton API.
In order to use the bentoml.triton API, users are required to have the Triton Inference Server container image available locally.
Install the extension for BentoML with Triton support:
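A minimal sketch of the install step, assuming the triton extra that shipped alongside BentoML v1.0.16:

```shell
pip install -U "bentoml[triton]"
```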
The following section assumes that you have a basic understanding of the BentoML architecture. If you are new to BentoML, we recommend reading our Getting Started guide first.
To prepare your model repository under your BentoML project, you will need to put your model in the following file structure:
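The layout below is a sketch following Triton's model repository convention, using torchscript_yolov5s as the model name from this example project:

```
model_repository
└── torchscript_yolov5s
    ├── 1
    │   └── model.pt
    └── config.pbtxt
```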
Here, 1 is the version of the model, and model.pt is the TorchScript model.
Note that the model weight file must be named model.<extension> for all Triton models. The config.pbtxt file is the model configuration that tells Triton how to serve this model.
An example config.pbtxt for the YOLOv5 model is as follows:
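This is a sketch of such a configuration; the input and output dimensions below assume the standard yolov5s export at 640×640 on COCO and may differ for your checkpoint:

```
name: "torchscript_yolov5s"
platform: "pytorch_libtorch"
max_batch_size: 16
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 25200, 85 ]
  }
]
```

The platform field pytorch_libtorch tells Triton to use its LibTorch backend, and the double-underscore input/output names follow Triton's naming convention for TorchScript models.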
Note that for PyTorch models, you will need to export your model to TorchScript first. Refer to PyTorch's guide to learn more about how to convert your model to TorchScript.
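As an illustrative sketch of the conversion step, using a toy module rather than YOLOv5 itself (the upstream YOLOv5 repository also provides its own export script that can emit TorchScript):

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Stand-in for the real detection model."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2.0

model = TinyModel().eval()
# Trace the model with an example input to produce a TorchScript module
example = torch.rand(1, 3, 640, 640)
scripted = torch.jit.trace(model, example)
# The resulting file is what goes under <model_name>/1/model.pt
scripted.save("model.pt")
```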
Now that we have our model repository ready, we can create a Triton Runner that works alongside other BentoML Runners.
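A sketch of what this can look like in service.py; the runner name, repository path, and CLI flags here follow the example project and are illustrative:

```python
import bentoml

# A Triton runner backed by the local model repository;
# cli_args are forwarded to the tritonserver process.
triton_runner = bentoml.triton.Runner(
    "triton_runner",
    model_repository="./model_repository",
    cli_args=["--model-control-mode=explicit", "--load-model=torchscript_yolov5s"],
)

svc = bentoml.Service("triton-yolov5", runners=[triton_runner])
```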
You can also use S3 or GCS as your model repository by passing the path to your S3 or GCS bucket as the model_repository argument. If an S3 or GCS bucket is used, the models will not be packaged into the Bento image; instead, they are downloaded at runtime before serving.
Each model in the model repository can be accessed as an attribute of the Triton runner. For example, the model torchscript_yolov5s can be accessed via triton_runner.torchscript_yolov5s, and you can invoke inference on it with the async_run method. This is similar to how BentoML's other built-in Runners work.
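A sketch of what the service code can look like; the preprocessing details and the output tensor name here are illustrative assumptions, not the example project's exact code:

```python
import bentoml
import numpy as np
from bentoml.io import Image, NumpyNdarray

triton_runner = bentoml.triton.Runner(
    "triton_runner", model_repository="./model_repository"
)
svc = bentoml.Service("triton-yolov5", runners=[triton_runner])

@svc.api(input=Image(), output=NumpyNdarray())
async def predict(im) -> np.ndarray:
    # Preprocess: resize to 640x640, scale to [0, 1], reorder HWC -> NCHW
    arr = np.asarray(im.resize((640, 640)), dtype=np.float32) / 255.0
    arr = np.transpose(arr, (2, 0, 1))[np.newaxis, ...]
    # Run inference on the Triton-served TorchScript model
    result = await triton_runner.torchscript_yolov5s.async_run(arr)
    return result.as_numpy("OUTPUT__0")
```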
Let's unpack this code snippet. First, we define an async API that takes in an image and returns a NumPy array. We then apply some pre-processing to the input image and pass it to the model through the Triton runner.
The run (and async_run) method can take either all positional arguments or all keyword arguments, and the arguments must match the input signature of the model as specified in its config.pbtxt.
From the aforementioned config.pbtxt, we can see that the input signature of the model is INPUT__0, a 3-dimensional tensor of type TYPE_FP32 with a batch dimension. This means the run method can take either a single positional argument or a single keyword argument named INPUT__0.
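To illustrate, both call styles below are equivalent. This is a fragment, assuming the triton_runner from this example and a (1, 3, 640, 640) float32 array named arr:

```python
# Positional: arguments follow the order of the inputs in config.pbtxt
result = await triton_runner.torchscript_yolov5s.async_run(arr)

# Keyword: names must match the input names declared in config.pbtxt
result = await triton_runner.torchscript_yolov5s.async_run(INPUT__0=arr)
```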
async_run returns an InferResult object, which is a wrapper around the response from Triton Inference Server. Refer to its doc-string for more details.
Additionally, the Triton runner exposes all tritonclient model management APIs, so users can fully utilise all the features provided by Triton Inference Server.
For example, one can load and unload models dynamically at runtime.
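As a sketch, this is a fragment whose method names mirror tritonclient's model management API; it assumes the triton_runner from this example and a server started with --model-control-mode=explicit:

```python
# Unload the model, then load it back; useful for hot-swapping weights
await triton_runner.unload_model("torchscript_yolov5s")
await triton_runner.load_model("torchscript_yolov5s")

# Inspect what is currently available in the model repository
index = await triton_runner.get_model_repository_index()
```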
To package your BentoService with Triton Inference Server, add the following to your existing bentofile.yaml:
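A sketch of the relevant bentofile.yaml fields; the image tag 22.12-py3 is an assumption, so pick a Triton image that matches your environment:

```yaml
service: service:svc
include:
  - /model_repository
  - /service.py
docker:
  base_image: nvcr.io/nvidia/tritonserver:22.12-py3
```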
Note that the base_image is the Triton Inference Server docker image from NVIDIA's container catalog (NGC).
If the model repository is stored in S3 or GCS, there is no need to add it to the include list, since the models are downloaded at runtime.
That's it! Build the Bento and containerize it with bentoml build and bentoml containerize respectively:
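For instance, assuming the service tag triton-yolov5 used in the sketches above:

```shell
bentoml build
bentoml containerize triton-yolov5:latest
```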
Congratulations! You can now fully utilize the power of Triton Inference Server with BentoML through bentoml.triton. You can read more about this integration in our documentation. If you enjoyed this article, feel free to support us by starring our GitHub repository and joining our community Slack channel!