Our open-source model serving framework, BentoML, offers a unified standard for AI inference, model packaging, and serving optimizations. It makes it easy to build high-performance AI API services with custom code.
import uuid
from typing import Annotated, AsyncGenerator

import bentoml
from annotated_types import Ge, Le

# Matches the max_tokens used in the request example below.
MAX_TOKENS = 1024
# Minimal Llama-2 chat prompt format.
PROMPT_TEMPLATE = "<s>[INST] {user_prompt} [/INST] "


@bentoml.service(
    traffic={"timeout": 300},
    resources={"gpu": 1, "gpu_type": "nvidia-l4"},
)
class VLLM:
    def __init__(self) -> None:
        from vllm import AsyncEngineArgs, AsyncLLMEngine

        ENGINE_ARGS = AsyncEngineArgs(
            model="meta-llama/Llama-2-7b-chat-hf",
            max_model_len=MAX_TOKENS,
        )
        self.engine = AsyncLLMEngine.from_engine_args(ENGINE_ARGS)

    @bentoml.api
    async def generate(
        self,
        prompt: str = "Explain superconductors like I'm five years old",
        max_tokens: Annotated[int, Ge(128), Le(MAX_TOKENS)] = MAX_TOKENS,
    ) -> AsyncGenerator[str, None]:
        from vllm import SamplingParams

        SAMPLING_PARAM = SamplingParams(max_tokens=max_tokens)
        prompt = PROMPT_TEMPLATE.format(user_prompt=prompt)
        stream = await self.engine.add_request(uuid.uuid4().hex, prompt, SAMPLING_PARAM)

        # Stream only the newly generated text on each iteration.
        cursor = 0
        async for request_output in stream:
            text = request_output.outputs[0].text
            yield text[cursor:]
            cursor = len(text)
A prototype running locally is just one command away from a reliable, secure cloud deployment, ready to scale in production.
bentoml deploy .
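The deploy command reads the project's bentofile.yaml to know what to package. A minimal sketch, assuming the service above lives in a file named service.py (the label value is illustrative):

```yaml
# bentofile.yaml (sketch) -- tells `bentoml deploy` how to build the Bento
service: "service:VLLM"   # module:class of the service defined above
labels:
  owner: my-team          # illustrative label
include:
  - "*.py"                # source files to package
python:
  packages:               # Python dependencies installed into the Bento
    - bentoml
    - vllm
```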
Call the deployed endpoint via the auto-generated web UI, the Python client, or the REST API.
curl -s -X POST \
  'https://vllm-llama-7b-e3c1c7db.mt-guc1.bentoml.ai/generate' \
  -H 'Content-Type: application/json' \
  -d '{
    "max_tokens": 1024,
    "prompt": "Explain superconductors like I'"'"'m five years old"
  }'
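The same call can be made from Python with BentoML's SyncHTTPClient, which exposes each @bentoml.api endpoint as a method. A minimal sketch, reusing the deployment URL and request fields from the curl example above:

```python
# Sketch: calling the deployed /generate endpoint with BentoML's Python client.

ENDPOINT_URL = "https://vllm-llama-7b-e3c1c7db.mt-guc1.bentoml.ai"

# Request fields mirror the curl example above.
REQUEST = {
    "prompt": "Explain superconductors like I'm five years old",
    "max_tokens": 1024,
}


def stream_generate() -> None:
    import bentoml  # requires `pip install bentoml`

    # Each @bentoml.api endpoint becomes a client method; a streaming
    # endpoint yields response chunks as the server produces them.
    with bentoml.SyncHTTPClient(ENDPOINT_URL) as client:
        for chunk in client.generate(**REQUEST):
            print(chunk, end="", flush=True)
```

Running stream_generate() against a live deployment prints the streamed completion as it arrives.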
We've built the fastest cloud infrastructure for AI inference, with everything you need to streamline the path to production AI.
Automatically scales up under load and down to zero when idle, so you only pay for what you use
Flexible APIs for deploying online API services, batch inference jobs, and async job queues
See how your models are performing and troubleshoot issues with built-in observability tools
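The async job queue flow above can be sketched from the client side. This assumes a hypothetical `summarize` endpoint defined on the server with BentoML's @bentoml.task decorator, and an illustrative deployment URL:

```python
# Sketch of BentoML's async job-queue flow, client side.
# Assumes the server defines a `summarize` endpoint with @bentoml.task
# (a hypothetical endpoint name); the URL below is illustrative.

DEPLOYMENT_URL = "https://my-service.example.bentoml.ai"


def run_async_job(document: str) -> str:
    import bentoml  # requires `pip install bentoml`

    with bentoml.SyncHTTPClient(DEPLOYMENT_URL) as client:
        # submit() enqueues the job and returns a task handle immediately,
        # so the client is not blocked while the model runs.
        task = client.summarize.submit(text=document)
        # get() blocks until a worker finishes and returns the result.
        return task.get()
```

Submitting returns a task handle right away, which is what lets long-running inference jobs outlive any single HTTP request.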
Self-host the BentoCloud runtime in your own cloud account
Let your models and data live within your Virtual Private Cloud (VPC)
Full visibility and control over the compute resources and network access
Easily utilize compute across multiple clouds