June 11, 2025 • Written By Bo Jiang and Sherlock Xu
The landscape of LLM inference is rapidly evolving, with a clear shift toward distributed serving.
What’s happening? Single-node GPU optimizations like dynamic batching, paged attention, and CUDA tweaks are starting to show limitations as LLM inference scales. Larger models like DeepSeek-R1 and tasks with longer context, such as reasoning or agentic use cases, stretch the limits even further.
Today, inference optimization is no longer just about squeezing more out of a single node, but rethinking how to distribute inference across a cluster of workers. If you’re working to deploy and scale LLM services, this is a trend you can’t afford to ignore. It heralds a new era, one that brings better resource allocation, smarter GPU usage, lower token latency, and reduced cost per generated token.
Leading AI teams and open-source communities are already pioneering distributed inference strategies. We’ve seen optimization efforts converge on three key areas: prefill-decode (PD) disaggregation, cache- and load-aware request routing, and prefix caching across requests.
In this blog post, we’ll walk through each of these and highlight the active progress to address the challenges they present.
To understand PD disaggregation, let’s start with how LLM inference actually works.
For transformer-based LLMs, every time you send a prompt, the model goes through two key steps:

Prefill: The model processes the entire prompt in parallel, building the KV cache for the prompt tokens and producing the first output token. This phase is compute-heavy.

Decode: The model then generates the response one token at a time, with each step reusing the KV cache built so far. This phase is dominated by memory access rather than raw compute.
For a long time, the standard way of doing inference was to run these two steps together. On the surface, this might seem straightforward.
In practice, you’ll often have multiple requests arriving at once. Each one has its own prefill and decode needs, but only one phase can run at a time. When the GPU is occupied with compute-heavy prefill tasks, decode tasks must wait, which increases inter-token latency (ITL), and vice versa.
Since prefill primarily determines the time to first token (TTFT) and decode drives ITL, collocating them makes it difficult to optimize both metrics at once.
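To make the split concrete, here is a toy Python sketch of the two phases. It is not a real transformer: the `KVCache`, `prefill`, and `decode_step` names and the token arithmetic are invented purely to show where the KV cache is built and where it is reused.

```python
# Toy sketch of the two inference phases (illustrative only, no real model).
# Prefill processes the whole prompt at once and builds the KV cache;
# decode then generates one token at a time, reusing and extending that cache.

from dataclasses import dataclass, field

@dataclass
class KVCache:
    # One entry per processed token; real caches hold per-layer K/V tensors.
    entries: list = field(default_factory=list)

def prefill(prompt_tokens: list[int], cache: KVCache) -> int:
    # Compute-bound: attention over all prompt tokens in a single pass.
    for tok in prompt_tokens:
        cache.entries.append(("kv", tok))        # stand-in for K/V tensors
    return sum(prompt_tokens) % 100              # placeholder for real sampling

def decode_step(last_token: int, cache: KVCache) -> int:
    # Memory-bound: attends to all cached tokens, appends one new entry.
    cache.entries.append(("kv", last_token))
    return (last_token + len(cache.entries)) % 100  # placeholder for real sampling

cache = KVCache()
token = prefill([101, 2023, 2003, 1037, 7953], cache)  # TTFT is dominated by this step
output = [token]
for _ in range(8):                                     # ITL is dominated by these steps
    token = decode_step(token, cache)
    output.append(token)
print(output)
```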
The idea of PD disaggregation is simple: separate these two very different tasks so they don’t get in each other’s way. Key benefits include:

Independent optimization: Prefill workers can be tuned for TTFT and decode workers for ITL, without the two phases competing for the same GPU.

Independent scaling: Prefill and decode capacity can be scaled separately to match how prompt-heavy or generation-heavy your traffic is.

Better resource allocation: Each phase can run on the hardware and parallelism configuration that suits it best.
Several open-source frameworks and projects are actively exploring PD disaggregation, including SGLang, vLLM, Dynamo, and llm-d.
As promising as PD disaggregation sounds, it’s not a one-size-fits-all fix.
Thresholds matter: If your workload is too small, or your GPU setup isn’t tuned for this approach, performance can drop (by 20-30% in our tests).
Local prefill can be faster: For shorter prompts, or when the decode engine already has a high prefix cache hit rate, running prefill locally on the decode worker is often faster and simpler.
Data transfer cost: Disaggregation requires moving KV caches rapidly and reliably between prefill and decode workers, so your solution must support fast, low-latency transports that are both hardware- and network-agnostic. Unless the performance gains from disaggregation outweigh this transfer cost, overall performance can actually degrade. Existing transport options include the NVIDIA Inference Xfer Library (NIXL), CXL, and NVMe-oF.
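For illustration, here is a minimal, in-process sketch of the shape of a disaggregated request flow, assuming hypothetical `prefill_worker`, `transfer_kv`, and `decode_worker` functions. Real systems run these phases on separate workers and move the cache over a fast transport such as NIXL; the point here is only the handoff.

```python
# Minimal sketch of a disaggregated request flow (illustrative only).

def prefill_worker(prompt_tokens: list[int]) -> tuple[int, list[int]]:
    # Runs the compute-heavy pass over the full prompt; returns the first
    # token plus the KV cache it produced.
    kv_cache = list(prompt_tokens)             # stand-in for per-layer K/V tensors
    first_token = sum(prompt_tokens) % 100     # placeholder for real sampling
    return first_token, kv_cache

def transfer_kv(kv_cache: list[int]) -> list[int]:
    # In a real system this is the expensive step: serialize the cache and push
    # it into the decode worker's GPU memory. Here it is just a copy.
    return list(kv_cache)

def decode_worker(first_token: int, kv_cache: list[int], max_new_tokens: int) -> list[int]:
    # Memory-bound loop: each step reuses the transferred cache and extends it.
    tokens, last = [first_token], first_token
    for _ in range(max_new_tokens):
        kv_cache.append(last)
        last = (last + len(kv_cache)) % 100    # placeholder for real sampling
        tokens.append(last)
    return tokens

first, cache = prefill_worker([101, 2023, 2003, 1037, 7953])
print(decode_worker(first, transfer_kv(cache), max_new_tokens=8))
```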
For traditional web applications, load balancing is usually pretty simple. Requests are small, responses are quick, and any backend instance can handle any request equally well. Load balancers can use simple strategies like round-robin to distribute traffic evenly.
But things are completely different for LLM inference. A major factor here is the KV cache built during the prefill phase.
Traditional load balancers treat LLM workers like identical black boxes. They don’t see what’s going on inside each worker, including:

KV cache utilization: How much GPU memory is already occupied by in-flight requests.

Queue length: How many requests are already waiting on that worker.

Prefix cache contents: Whether the worker has already computed the KV cache for a request’s shared prefix.

LoRA adapters: Whether the adapter a request needs is already loaded on that worker.
When a load balancer can’t see these details, it starts making bad decisions: requests pile up behind already-busy workers, prefixes cached on one replica get recomputed from scratch on another, and both TTFT and ITL suffer as a result.
The open-source community is already working on smarter solutions. For example, the Gateway API Inference Extension project uses an endpoint picker (EPP) to collect information on KV cache utilization, queue length, and LoRA adapters on each worker, and routes requests to the optimal replica for better inference.
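As a rough sketch of the idea (not the actual EPP implementation), a cache- and load-aware picker might score each replica on the metrics above. The `WorkerStats` fields and the weights below are invented for illustration.

```python
# Simplified sketch of cache- and load-aware endpoint picking.
# Metrics and weights are made up for illustration.

from dataclasses import dataclass

@dataclass
class WorkerStats:
    name: str
    queue_len: int          # requests waiting on this replica
    kv_cache_util: float    # fraction of KV cache memory in use, 0.0-1.0
    has_lora: bool          # whether the requested LoRA adapter is already loaded

def score(w: WorkerStats, needs_lora: bool) -> float:
    # Lower is better: penalize long queues, near-full caches, and replicas
    # that would need to load the adapter first.
    s = w.queue_len + 5.0 * w.kv_cache_util
    if needs_lora and not w.has_lora:
        s += 10.0
    return s

def pick_endpoint(workers: list[WorkerStats], needs_lora: bool) -> WorkerStats:
    return min(workers, key=lambda w: score(w, needs_lora))

workers = [
    WorkerStats("gpu-0", queue_len=4, kv_cache_util=0.9, has_lora=True),
    WorkerStats("gpu-1", queue_len=1, kv_cache_util=0.4, has_lora=False),
    WorkerStats("gpu-2", queue_len=2, kv_cache_util=0.3, has_lora=True),
]
print(pick_endpoint(workers, needs_lora=True).name)   # -> gpu-2
```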
The term "KV cache" originally described caching within a single inference request. As mentioned above, LLMs decode autoregressively: each new token is generated based on all previously generated tokens, whose keys and values are reused from the KV cache. Without it, the model would have to recompute the keys and values for every previous token at each decode step, a huge waste of compute.
When extending this caching concept across multiple requests, it’s more accurate to call it prefix caching.
Imagine you have a chatbot system with a prompt like this:
You are a helpful AI writer. Please write in a professional manner.
This system prompt doesn’t change from one conversation to the next. When new messages come in, the model can reuse the stored prefix cache, only processing the new part of the prompt.
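Here is a small sketch of the arithmetic behind that reuse. The `shared_prefix_len` helper and the word-level "tokens" are simplifications for illustration; real systems match token IDs, often at the granularity of KV cache blocks.

```python
# With a fixed system prompt, only the new suffix of each request needs
# fresh prefill work. Word-level tokens stand in for real token IDs.

def shared_prefix_len(cached: list[str], incoming: list[str]) -> int:
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

system_prompt = "You are a helpful AI writer. Please write in a professional manner.".split()

cached_tokens = system_prompt + "Write a launch blog post.".split()
new_request   = system_prompt + "Write a product changelog.".split()

hit = shared_prefix_len(cached_tokens, new_request)
print(f"{hit} of {len(new_request)} tokens can be served from the prefix cache;")
print(f"only {len(new_request) - hit} tokens need fresh prefill compute.")
```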
Here’s the challenge: How can a new request be routed to the worker that already has the right prefix cached? How does the router know what’s in each worker’s cache?
Different open-source projects are exploring their own approaches:
Worker-reported prefix status
Dynamo has workers actively report which prefixes they’ve cached. The router then uses this real-time data to make smart routing decisions.
Router-predicted cache status
SGLang maintains an approximate radix tree for each worker based on past requests. This helps the router predict which worker is most likely to have the needed prefix, without constant updates from the workers.
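As a rough illustration of the router-side idea (not SGLang’s actual data structure), the sketch below keeps a plain trie over previously routed token sequences and uses it to guess which worker holds the longest cached prefix; the `PrefixIndex` class and worker names are hypothetical.

```python
# Simplified router-side prefix index: record which worker served which token
# prefixes, then predict the best match for a new request without any live
# updates from workers. SGLang's real radix tree differs in structure and eviction.

class PrefixIndex:
    def __init__(self) -> None:
        self.root: dict = {}          # token -> child node; "_workers" holds owners

    def record(self, tokens: list[int], worker: str) -> None:
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})
            node.setdefault("_workers", set()).add(worker)

    def best_worker(self, tokens: list[int]) -> tuple[str | None, int]:
        # Walk as deep as the request's tokens match; return the worker most
        # likely to cache the longest prefix, plus the matched length.
        node, best, depth = self.root, (None, 0), 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            depth += 1
            if node.get("_workers"):
                best = (next(iter(node["_workers"])), depth)
        return best

index = PrefixIndex()
index.record([1, 2, 3, 4, 5], worker="decode-0")   # past request served by decode-0
index.record([1, 2, 9],       worker="decode-1")
print(index.best_worker([1, 2, 3, 4, 7]))          # -> ('decode-0', 4)
```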
Hybrid efforts
The Gateway API Inference Extension project is exploring multiple strategies for implementing routing algorithms in the EPP:
The llm-d project uses a component called the Inference Scheduler to implement filtering and scoring algorithms, making routing decisions based on a combination of factors such as cache availability, prefill/decode status, SLAs, and load.
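To illustrate the general filter-then-score pattern (not llm-d’s actual scheduler logic), here is a small sketch in which a filter drops replicas that cannot serve the request’s phase and a weighted score ranks the rest; the `Replica` fields and weights are invented examples.

```python
# Illustrative filter-then-score pipeline: filter out ineligible replicas,
# then rank the remainder. Filters, scorers, and weights are made up.

from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    role: str              # "prefill", "decode", or "hybrid"
    load: float            # normalized load, 0.0-1.0
    prefix_hit: float      # estimated fraction of the prompt already cached

def schedule(replicas: list[Replica], phase: str) -> Replica:
    # Filter: only replicas that can run the requested phase.
    candidates = [r for r in replicas if r.role in (phase, "hybrid")]
    # Score: prefer high prefix-cache affinity, penalize load (illustrative weights).
    return max(candidates, key=lambda r: 2.0 * r.prefix_hit - 1.0 * r.load)

replicas = [
    Replica("prefill-0", "prefill", load=0.7, prefix_hit=0.1),
    Replica("decode-0",  "decode",  load=0.5, prefix_hit=0.8),
    Replica("decode-1",  "decode",  load=0.2, prefix_hit=0.3),
]
print(schedule(replicas, phase="decode").name)   # -> decode-0
```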
Distributed inference is becoming essential for deploying and scaling LLMs. If an enterprise wants to truly optimize for metrics like latency or throughput, distributed LLM inference is the only real path forward, going far beyond what serverless API endpoints can achieve.
At BentoML, we’re working to ensure our customers and users can tap into these latest LLM inference optimizations. As we’ve seen in our initial experiments, there’s no one-size-fits-all solution. The best approach depends on your specific workloads, models, and what matters most to you (e.g., latency, throughput, or cost).
We’ll be sharing more about distributed LLM inference with detailed benchmarks and best practices in future posts. If you want to collaborate or stay in the loop, feel free to reach out: