Hi there, I'm also working on LLMs in Texas Hold'em :)
First of all, congrats on your work. Picking a way to present LLMs playing poker is a hard task, and I like your approach of presenting the Action Log.
I can share some interesting insights from my experiments:
- Finding strategies is more interesting than comparing different models. Strategies can get pretty long and specific. For example, if part of the strategy is "bluff on the river if you have a weak hand but the opponent has been playing tight all game", most models, given this strategy, would execute it with the same outcome. Models can only really be compared with some open-ended strategy like "play aggressively", "play tight", or even just "win the tournament".
- I implemented a tournament game, where players drop out when they run out of chips. This creates a more dynamic environment: players have to win the tournament, not just a hand. That requires adding the whole table history to the prompt, which can get quite long, so context management becomes a challenge (see the first sketch after this list).
- I tested playing an LLM against a randomly playing bot (1 vs 1). `grok-4` was able to come up with the winning strategy against a random bot on the first try (I asked: "You play against a random bot. What is your strategy?"). `gpt-5-high` struggled.
- Public chat between LLMs over the poker table is fun to watch, but it is hard to craft a strategy that lets one LLM successfully convince the others to fold. Judging by their chains of thought, they focus more on actions than on what others say. Still, more experiments are needed. Weaker models (looking at you, `gpt-5-nano`) are hard to convince to do anything beyond reviewing their own hand.
- Playing random hands is expensive. You would have to play thousands of hands to get statistically significant results. It's better to put LLMs in predefined situations (e.g. AliceAI has a weak hand, BobAI has a strong hand) and see how they behave (see the scenario sketch after this list).
- 1-on-1 is easier to analyze and work with than multiplayer.
- There is an interesting choice to make when building the context for an LLM: should the previous chains of thought be included in the prompt? I found that including them makes the LLM "stick" to the first strategy it came up with, so it is less likely to adapt to the changing situation at the table. Not including them, on the other hand, makes the LLM "rethink" its strategy every time, which is more error-prone. I'm working on an AlphaEvolve-like approach now. (The last sketch below shows how small that switch is in practice.)
- It would be super interesting to fine-tune an LLM using an AlphaZero-like approach, where the model plays against itself and improves over time. But that is a complex task.
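A few of the points above are easier to show than describe, so here are some rough Python sketches. First, the context-management issue from the tournament bullet: a minimal prompt builder that keeps as much recent table history as fits under a rough token budget. The function names, the budget, and the chars-per-token heuristic are my own illustration, not anything from your project.

```python
# Minimal sketch of tournament prompt assembly (hypothetical names/budgets).
# The table history grows every hand, so older hands get dropped once a
# rough token budget for the history section is exceeded.

ROUGH_CHARS_PER_TOKEN = 4       # crude heuristic, good enough for budgeting
MAX_HISTORY_TOKENS = 6000       # arbitrary budget for the history portion

def build_prompt(system_rules: str, hand_histories: list[str],
                 current_state: str) -> str:
    """Concatenate rules + as much recent table history as fits + current hand."""
    budget = MAX_HISTORY_TOKENS * ROUGH_CHARS_PER_TOKEN
    kept: list[str] = []
    used = 0
    # Walk the history backwards so the most recent hands are always kept.
    for hand in reversed(hand_histories):
        if used + len(hand) > budget:
            kept.append(f"[{len(hand_histories) - len(kept)} earlier hands omitted]")
            break
        kept.append(hand)
        used += len(hand)
    history_block = "\n\n".join(reversed(kept))
    return (f"{system_rules}\n\n=== Table history ===\n{history_block}"
            f"\n\n=== Current hand ===\n{current_state}")
```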
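Second, the "predefined situations" idea: a toy scenario where AliceAI is dealt a weak hand and BobAI a strong one, replayed many times to tally what each model does. The `Scenario` dataclass and the `ask_action` callable are hypothetical placeholders for whatever engine and model call you actually use.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    """A fixed spot: hole cards, board, stacks, and pot are all predetermined."""
    name: str
    hole_cards: dict[str, list[str]]   # player -> two hole cards, e.g. ["7h", "2d"]
    board: list[str]                   # fixed community cards
    stacks: dict[str, int]
    pot: int

WEAK_VS_STRONG = Scenario(
    name="weak_vs_strong_river",
    hole_cards={"AliceAI": ["7h", "2d"], "BobAI": ["As", "Ad"]},
    board=["Ac", "Kd", "9s", "4c", "4h"],
    stacks={"AliceAI": 800, "BobAI": 1200},
    pot=400,
)

def run_scenario(scenario: Scenario,
                 ask_action: Callable[[str, Scenario], str],
                 n_trials: int = 20) -> dict[str, int]:
    """Replay the same spot n_trials times and tally each player's chosen action."""
    counts: dict[str, int] = {}
    for _ in range(n_trials):
        for player in scenario.hole_cards:
            action = ask_action(player, scenario)   # your model call goes here
            key = f"{player}:{action}"
            counts[key] = counts.get(key, 0) + 1
    return counts
```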
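And the chain-of-thought question really comes down to one flag in the context builder. A toy sketch, assuming each past decision stores the reasoning text it was made with:

```python
from dataclasses import dataclass

@dataclass
class PastDecision:
    street: str        # "preflop", "flop", "turn", "river"
    action: str        # "fold", "call", "raise 120", ...
    reasoning: str     # the model's chain of thought for that decision

def build_history_context(decisions: list[PastDecision], include_cot: bool) -> str:
    """Turn past decisions into prompt text, with or without prior reasoning.

    include_cot=True tends to lock the model into its first strategy;
    include_cot=False makes it rethink from scratch (and mis-step more often).
    """
    lines = []
    for d in decisions:
        lines.append(f"[{d.street}] you chose: {d.action}")
        if include_cot:
            lines.append(f"  your reasoning was: {d.reasoning}")
    return "\n".join(lines)
```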