October 26, 2023 • Written By Fog Dong
Note: This blog post is converted from Fog Dong’s keynote speech at KubeCon China 2023.
There's no doubt that artificial intelligence (AI) is once again catching everyone's eye. The surge in AI startups globally over the past decade is a testament to this. This year’s KubeCon further highlighted the growing interest in AI, with more AI-related discussions than ever before.
While the AI landscape flourishes with exciting opportunities, not all companies can, or need to, build their own models. Fortunately, the open-source community has produced a wide variety of machine learning models that companies can build on.
Compared to closed models, open-source models win on customizability, data privacy, and cost-efficiency. By fine-tuning these models with proprietary datasets and deploying them in private environments, businesses can keep their data private and pay only for the resources the models actually use.
However, converting a model into a fully functional application isn't a straightforward task. For example, how do we turn an open-source model like Llama 2 into a production-ready application that runs on the cloud and generates advertising proposals?
If we dive deeper into this, we can consider a model as ML code, but the journey involves more than that. Among other things, code, configuration, data collection, and serving infrastructure are all building blocks. In order to bridge the gap between the model and the application, we need to first find an intermediate station. With this in mind, we split the process into two main phases: build and deploy.
The build phase presents challenges like model packaging, environment management, and model versioning. The challenges of the deploy phase will look familiar, since they are the ones that Kubernetes and the cloud-native ecosystem are trying to resolve.
Now that we know the challenges, let's try to resolve them.
To build an intermediate artifact, you need a way to package the model, its dependencies, and everything else you need into something that can be easily deployed. That's where BentoML comes into play.
BentoML is an open-source Python framework that can help you build your application. Drawing inspiration from the Asian bento box, which offers a complete meal in one package, BentoML bundles APIs, dependencies, models, and other necessary components into a single deployable unit, also known as a Bento.
A Bento has three key components:
Note: Model inference is fundamentally a compute-intensive workload. Once a model is turned into an application, however, it also has to handle things like concurrent requests, which makes the serving side an IO-intensive workload. This is why we separate the API Server from the Runner within a Bento.
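The separation described in the note can be illustrated with a minimal sketch that uses only the Python standard library. It is not BentoML's implementation: `run_model`, `Runner`, `handle_request`, and `serve_many` are hypothetical names chosen for illustration, and a short `time.sleep` stands in for real inference. The point is only that the request-handling side stays asynchronous and cheap, while the blocking model call runs in a separate, bounded worker pool.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def run_model(prompt: str) -> str:
    """Hypothetical stand-in for inference: a blocking, compute-style call."""
    time.sleep(0.01)  # pretend this is a forward pass
    return f"proposal for: {prompt}"


class Runner:
    """Owns the model and a small, fixed pool of workers (assumed design)."""

    def __init__(self, workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)

    async def infer(self, prompt: str) -> str:
        loop = asyncio.get_running_loop()
        # Offload the blocking call so the event loop can keep accepting requests.
        return await loop.run_in_executor(self._pool, run_model, prompt)

    def close(self) -> None:
        self._pool.shutdown()


async def handle_request(runner: Runner, prompt: str) -> str:
    """API-server side: light, IO-style work (cleanup), then hand off to the Runner."""
    return await runner.infer(prompt.strip())


async def serve_many(prompts: list[str], workers: int = 2) -> list[str]:
    """Handle all prompts concurrently; results come back in request order."""
    runner = Runner(workers)
    try:
        return await asyncio.gather(*(handle_request(runner, p) for p in prompts))
    finally:
        runner.close()


if __name__ == "__main__":
    print(asyncio.run(serve_many(["  shoes ", "coffee", "bikes"])))
```

Because the two sides are decoupled, the number of concurrent requests the front end accepts and the number of workers attached to the model can be tuned independently, which is the property the API Server and Runner split is meant to provide.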
Another important consideration is how to organize a Bento so that it is built the way we want. When building a Bento, the first step is to write a Bento configuration file in YAML, typically named bentofile.yaml. In this file, you specify all the configurations you care about, such as the versions of the dependencies and the entry points of your API Server and Runner.
With a single command, bentoml build, a Bento is built and stored in the BentoML local Bento Store by default. You can use commands like bentoml push to push a Bento to S3 or other registries.
Once the Bento is built, the next step is deployment. While BentoML offers commands like bentoml serve for local testing, stable production deployment requires the scalability and resilience of cloud-native solutions and Kubernetes.
To deploy a Bento as a microservice on Kubernetes, we need two more controllers working within the cluster.
We now know both the challenges and the solutions, but that doesn’t make the journey less daunting. Over the past year, we’ve seen a boom in open-source large language models (LLMs), which raised a question: can every AI developer easily serve open-source language models on the cloud?
OpenLLM, an open-source project we launched several months ago, aims to make the deployment of popular LLMs on the cloud easy and efficient. The initial idea of this project is to combine the best practices in the industry, enabling AI developers to deploy LLMs with just one command.
However, productionizing LLMs like Llama 2 is much more complicated, with major challenges such as scalability, throughput, latency, and cost. To address these challenges, we have added several enhancements to the project.
Another important improvement in OpenLLM is continuous batching via vLLM. For a typical Bento, when multiple requests arrive, the API Server scales first and then redirects the requests to the Runner, which batches the requests before sending them to the model. However, this is not sufficient for language models.
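That's why we need continuous batching. The essence of continuous batching is its dynamic nature: instead of waiting for every request in a batch to finish, the batch is updated at each iteration, so completed requests leave and waiting requests join. This "giant batch" cyclically processes requests, ensuring that the model remains optimally engaged without unnecessary idle time. See this blog post to learn more.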
Why does this matter? For models as complex and resource-intensive as LLMs, the real-time, iteration-level batching ensures maximum resource utilization, faster processing time, and, ultimately, a more cost-effective and responsive system.
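To make the difference concrete, here is a toy simulation that counts decoding iterations under the two strategies. It is a simplified sketch, not vLLM's scheduler: it assumes each request is described only by the number of tokens it still has to generate, that every iteration produces one token per active request, and it ignores prefill and memory limits. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Continuous batching: free slots are refilled at every iteration."""
    waiting = deque(lengths)
    active: list[int] = []  # tokens remaining for each in-flight request
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots before this iteration.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Requests with one token left finish now; the rest move one token closer.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


if __name__ == "__main__":
    # Two long requests mixed with short ones, four slots on the model.
    lengths = [8, 2, 2, 2, 8, 2, 2, 2]
    print(static_batching_steps(lengths, batch_size=4))      # 16
    print(continuous_batching_steps(lengths, batch_size=4))  # 10
```

In this example, static batching holds the three short requests' slots idle while each long request finishes, whereas continuous batching starts the second long request as soon as slots free up, so the same work completes in fewer iterations.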
In production deployment, it's essential to remember that about 95% of models remain idle most of the time, yet reserved instances continue to generate costs. This is why we need to leverage the power of serverless deployment. Simply put, when there is no request, we want replicas to be scaled down to zero to prevent GPU waste. This represents an important feature of BentoCloud, our serverless platform that allows developers to deploy AI applications in production with ease and speed.
To achieve this, we can use the Horizontal Pod Autoscaler (HPA) and Kubernetes Event-driven Autoscaling (KEDA) together with three key components: the Interceptor, the Scaler, and the Proxy container.
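The scaling decision itself can be sketched as a small pure function. This is an illustrative assumption, not the logic of BentoCloud, KEDA, or the HPA: the function name and all parameters (`target_per_replica`, `cooldown_seconds`, and so on) are hypothetical, and real autoscalers add stabilization windows and metric smoothing on top of a rule like this.

```python
import math


def desired_replicas(
    in_flight: int,
    target_per_replica: int,
    max_replicas: int,
    current: int,
    idle_seconds: float,
    cooldown_seconds: float,
) -> int:
    """Return the replica count a scale-to-zero autoscaler might aim for."""
    if target_per_replica <= 0 or max_replicas <= 0:
        raise ValueError("target_per_replica and max_replicas must be positive")

    if in_flight > 0:
        # Traffic present: at least one replica, enough to meet the target, capped.
        needed = math.ceil(in_flight / target_per_replica)
        return min(max_replicas, max(1, needed))

    # No traffic: release all replicas only after the idle cooldown has elapsed.
    if idle_seconds >= cooldown_seconds:
        return 0
    # Still inside the cooldown window: keep at most one warm replica.
    return min(current, 1)


if __name__ == "__main__":
    # Idle for longer than the cooldown: scale to zero.
    print(desired_replicas(0, 10, 5, current=2, idle_seconds=400, cooldown_seconds=300))
    # A request arrives while at zero: wake up one replica.
    print(desired_replicas(3, 10, 5, current=0, idle_seconds=0, cooldown_seconds=300))
```

The cooldown is the design choice worth noting: scaling to zero immediately would save the most GPU time, but every subsequent request would pay a cold start, so a grace period trades a little cost for lower latency under bursty traffic.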
Here's a breakdown of how the process works:
The journey from model to application is undoubtedly challenging. While this article offers one approach to the problem, countless other strategies exist. If you're interested in this, feel free to join our community and we can have further discussions.
To learn more about BentoML and OpenLLM, check out the following resources: