HN Reader

Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference

426

155

This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70B implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (e.g., speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?

3 days ago by zackangelo

I'm not sure if they're comparing apples to apples on the latency here. There are roughly three parts to the latency: the throughput of the context/prompt, the time spent queueing for hardware access, and the other standard API overheads (network, etc).

From what I understand, several, maybe all, of the comparison services are not based on provisioned capacity, which means that the measurements include the queue time. For LLMs this can be significant. The Cerebras number on the other hand almost certainly doesn't have some unbounded amount of queue time included, as I expect they had guaranteed hardware access.

The throughput here is amazing, but to get that throughput at a good latency for end-users means over-provisioning, and it's unclear what queueing will do to this. Additionally, does that latency depend on the machine being ready with the model, or does that include loading the model if necessary? If using a fine-tuned model, does this change the latency?

I'm sure it's a clear win for batch workloads where you can keep Cerebras machines running at 100% utilisation and get 1k tokens/s constantly.

3 days ago by danpalmer
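
A rough way to make danpalmer's breakdown concrete is to add the parts up per request. The sketch below is only an illustration: the `RequestTiming` class, its field names, and every number in the example are hypothetical, and it adds a decode term (output tokens divided by generation speed) to the three parts listed in the comment, since that is what the headline tokens/s figure measures.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request latency model; not any provider's real metrics."""
    queue_s: float            # time spent waiting for hardware access
    overhead_s: float         # network and other API overheads
    prompt_tokens: int        # size of the context/prompt
    prefill_tok_per_s: float  # throughput while processing the prompt
    output_tokens: int        # tokens generated in the response
    decode_tok_per_s: float   # generation speed (the headline number)

    def total_s(self) -> float:
        prefill = self.prompt_tokens / self.prefill_tok_per_s
        decode = self.output_tokens / self.decode_tok_per_s
        return self.queue_s + self.overhead_s + prefill + decode

    def effective_tok_per_s(self) -> float:
        # Output tokens per second of wall-clock time as seen by the caller.
        return self.output_tokens / self.total_s()


# Made-up numbers: identical hardware speed, differing only in queue time.
provisioned = RequestTiming(0.0, 0.1, 1000, 10_000, 969, 969)
shared = RequestTiming(2.0, 0.1, 1000, 10_000, 969, 969)

for name, req in [("provisioned", provisioned), ("shared", shared)]:
    print(f"{name}: {req.total_s():.2f} s total, "
          f"{req.effective_tok_per_s():.0f} tok/s effective")
```

With these invented inputs, two seconds of queueing cuts the effective rate the caller sees by more than half even though the decode speed is unchanged, which is the apples-to-apples concern raised above.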

Given what you can do with current-gen models, along with RAG, multi-agent setups, and code interpreters, the wall is very much model latency, not accuracy any more.

There are so many interactive experiences that could be made possible at this level of token throughput from 405B class models.

3 days ago by LASR

To be clear, a Cerebras chip consumes a whole wafer and has only 44 GB of SRAM on it. To fit a 405B model in bf16 precision (excluding KV cache and activation memory usage) you need 19 of these “chips” (and the requirement will grow as the sequence length increases, because of the KV cache). Looking online, it seems one wafer can fit between 60 and 80 H100 chips, so it’s equivalent to using >1500 H100s, using wafer manufacturing cost as a metric.

3 days ago by perfobotto
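
The arithmetic behind perfobotto's estimate can be reproduced in a few lines. This is a minimal sketch using only the figures quoted in the comment (44 GB of SRAM per wafer, 60 to 80 H100 dies per wafer). It assumes decimal gigabytes and 2 bytes per bf16 parameter, and, like the comment, it ignores KV cache and activation memory. The function and constant names are made up for illustration.

```python
import math

PARAMS = 405e9              # Llama 3.1 405B parameter count
BYTES_PER_PARAM = 2         # bf16 stores each parameter in 2 bytes
SRAM_PER_WAFER_GB = 44      # on-wafer SRAM figure quoted in the comment
H100_PER_WAFER = (60, 80)   # the comment's range of H100 dies per wafer


def wafers_for_weights(params, bytes_per_param, sram_gb):
    """Return (weight size in GB, wafers needed to hold the weights alone)."""
    weight_gb = params * bytes_per_param / 1e9
    return weight_gb, math.ceil(weight_gb / sram_gb)


weight_gb, wafers = wafers_for_weights(PARAMS, BYTES_PER_PARAM, SRAM_PER_WAFER_GB)
low, high = (wafers * dies for dies in H100_PER_WAFER)

print(f"weights: {weight_gb:.0f} GB, wafers needed: {wafers}")
print(f"H100-equivalent wafer area: {low} to {high} dies")
```

Under these assumptions the weights come to 810 GB, which needs 19 wafers. The H100-equivalent count spans 1140 to 1520 dies, so the ">1500" figure corresponds to the upper end of the 60 to 80 range.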

This is seriously impressive performance. I think there's a high probability Nvidia attempts to acquire Cerebras.

3 days ago by shreezus

They have a waitlist for trying their API. You have to be a bit skeptical when a company makes claims but does not offer its services for purchase.

3 days ago by sumedh

So out of all AI chip startups, Cerebras is probably the real deal.

3 days ago by brcmthrowaway

No mention of their direct competitor Groq?

3 days ago by fillskills

I am wondering how much it costs to serve at such a latency. Of course, for customers, the static cost depends on the pricing strategy. But still, the cost really determines how widely this can be adopted. Is it only for those businesses that really need the latency, or can it be generally deployed?

3 days ago by WiSaGaN

Given that such a boost is possible with new hardware, I wonder what the ceiling is for improving training performance via hardware as well.

3 days ago by owenpalmer

I'd like to see a tokens / second / watt comparison.

3 days ago by qwertox

Their hardware is cool and bizarre. It has to be seen in person to be believed. It reminds me of the old days when supercomputers were weird.

3 days ago by bargle0

Pretty amazing speed, especially considering this is bf16. But how many racks is this using? They used 4 racks for 70B, so this is, what, at least 24? A whole data center for one model?!

3 days ago by germanjoey
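
The "at least 24" guess is what you get by scaling the rack count linearly with parameter count. Here is a short sketch of that estimate. The 4-rack figure for 70B is the commenter's claim, the linear scaling is an assumption, and the function name is hypothetical.

```python
import math


def racks_scaled_by_params(base_racks, base_params_b, target_params_b):
    """Scale a rack count linearly with model size (parameters in billions)."""
    return math.ceil(base_racks * target_params_b / base_params_b)


# 4 racks for 70B, per the comment; 405B is the model in the headline.
estimate = racks_scaled_by_params(4, 70, 405)
print(f"estimated racks for 405B: {estimate}")
```

Linear scaling covers the weights only. KV cache growth and interconnect overheads could push the real number either way.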

Normally, I don't think 1000 tokens/s is that much more useful than 50 tokens/s.

However, given that CoT makes models a lot smarter, I think Cerebras chips will be in huge demand from now on. You can have a lot more CoT runs when the inference is 20x faster.

Also, I assume financial applications such as hedge funds would be buying these things in bulk now.

3 days ago by aurareturn
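
To put the 50 versus 1000 tokens/s comparison in terms of chain-of-thought work, one can count how many reasoning runs fit into a fixed response-time budget. This is a sketch under loud assumptions: the 10-second budget and 500 tokens per run are invented numbers, runs are generated one after another, and prompt processing and queueing are ignored. The function name is hypothetical.

```python
def cot_runs_in_budget(tok_per_s, budget_s, tokens_per_run):
    """Number of complete sequential CoT runs that fit in the time budget."""
    return int(tok_per_s * budget_s // tokens_per_run)


BUDGET_S = 10          # hypothetical acceptable wait for the user
TOKENS_PER_RUN = 500   # hypothetical length of one reasoning chain

slow = cot_runs_in_budget(50, BUDGET_S, TOKENS_PER_RUN)
fast = cot_runs_in_budget(1000, BUDGET_S, TOKENS_PER_RUN)
print(f"50 tok/s: {slow} run(s), 1000 tok/s: {fast} run(s)")
```

With these made-up inputs the slower rate finishes a single chain in the budget while the faster one finishes twenty, matching the 20x ratio mentioned above.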

I'm so curious to see some multi-agent systems running with inference this fast.

3 days ago by gdiamos

This gets tons of press and discussion here on HN, but frankly AMD has a better overall product with the upcoming MI325x [0].

I love to see the development and activity, but companies like Cerebras are trying to compete on a single use case and doing a poor job of it because they can only offer a tightly controlled API.

Ask yourself how much capex + power/space/cooling (opex) it requires to run that model (and how many people it can really serve) and then compare that against what AMD is offering.

[0] https://www.amd.com/en/products/accelerators/instinct/mi300/...

3 days ago by latchkey

Genuinely curious and willing to learn: what are the different inference approaches broadly? Is there any difference in the approach between Cerebras and simplismart.ai which claims to be the fastest?

3 days ago by frogfish

Cerebras features in the internal OpenAI emails that recently came out. One example:

Ilya Sutskever to Elon Musk, Sam Altman, (cc: Greg Brockman, Sam Teller, Shivon Zilis) - Sep 20, 2017 2:08 PM

> In the event we decide to buy Cerebras, my strong sense is that it'll be done through Tesla. But why do it this way if we could also do it from within OpenAI?

3 days ago by leobg

Not open beta until Q1 2025

3 days ago by jadbox

Holy bananas, the title alone is almost its own language.

3 days ago by dgfitz

How does binning work when your chip is the entire wafer?

3 days ago by easeout

nvidia hates this one little trick

3 days ago by gorkempacaci

Damn that's a big model and that's really fast inference.

3 days ago by arthurcolle

I wonder if Cerebras could generate decent-quality video in real time.

3 days ago by kuprel

Transistor (GPU) -> Integrated Circuit (WSE-3)

3 days ago by xwww

Is it just me, or is the most important contender in speed, Groq, missing from the comparison? Not sure why it matters to put Azure there; no one uses it for speed.

3 days ago by adhambadr