Great write-up. We use vLLM's KV cache and continuous batching as the foundation for request handling in ScalarLM, and we add further batching optimizations via a centralized queue and explicit batching support in our client.
https://www.scalarlm.com
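For anyone curious what explicit client-side batching can look like, here's a minimal sketch (not the ScalarLM client; the endpoint and model name are placeholders). It coalesces several prompts into a single request to a vLLM OpenAI-compatible server, so the server's continuous batcher sees them in the same scheduling window instead of as N separate HTTP calls:

```python
# Minimal sketch of client-side explicit batching against a vLLM
# OpenAI-compatible server. Illustrative only: endpoint URL and model
# name are placeholders, and this is not the ScalarLM client API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompts = [
    "Summarize continuous batching in one sentence.",
    "Summarize paged KV cache in one sentence.",
    "Summarize speculative decoding in one sentence.",
]

# One batched request instead of N independent ones: fewer HTTP round
# trips, and all sequences arrive at the scheduler together.
resp = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    prompt=prompts,  # the completions endpoint accepts a list of prompts
    max_tokens=64,
)

for choice in sorted(resp.choices, key=lambda c: c.index):
    print(choice.index, choice.text.strip())
```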
There is more perf you can squeeze out of vLLM.