> "In our attention design, a transformer block with softmax attention follows every seven transnormer blocks (Qin et al., 2022a) with lightning attention."
Alright, so that's one softmax block per eight blocks: 87.5% linear attention + 12.5% full attention.
TBH I find the terminology around "linear attention" rather confusing.
"Softmax attention" is an information routing mechanism: when token `k` is being computed, it can receive information from tokens 1..k, but it has to be crammed through a channel of a fixed size.
"Linear attention", on the other hand, is just a 'register bank' of a fixed size available to each layer. It's not real attention, it's attention only in the sense it's compatible with layer-at-once computation.