S. Roy

Blog Post·2024-06-19·4 min read

Continuous Batching: How vLLM Serves Thousands of Requests

Static batching wastes GPU capacity whenever sequences in a batch finish at different times: slots freed by short sequences sit idle until the longest one completes. Continuous batching fixes this by treating the decode loop as a queue, admitting a new request the moment a slot opens up.
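
To make the difference concrete, here is a toy simulation that counts decode steps under both policies. It is a minimal sketch, not vLLM's scheduler: the function names are hypothetical, each request is reduced to the number of tokens it still has to generate (assumed to be at least 1), and prefill cost and KV-cache memory limits are ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch is held until its slowest member is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue right away."""
    waiting = deque(lengths)  # requests not yet admitted
    running = []              # tokens left to generate per active sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any open slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active sequence emits one token,
        # and sequences that just finished release their slot.
        running = [left - 1 for left in running if left - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 3, 1, 4, 2]  # illustrative output lengths, in tokens
    print(static_batching_steps(lengths, batch_size=2))      # 15
    print(continuous_batching_steps(lengths, batch_size=2))  # 10
```

With these made-up lengths, the continuous policy reaches the lower bound of 20 tokens / 2 slots = 10 steps, while the static policy spends 5 extra steps waiting on the longest sequence in each batch.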

inference vllm batching systems

