Important testing excerpts:
- "...for the closed (OpenAI) models I tried generating up to 10 times and if it still couldn’t come up with a legal move, I just chose one randomly."
- "I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization"
- "...if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1 e4 e5 2.” (without a space)"
- "I used a temperature of 0.7 for all the open models and the default for the closed (OpenAI) models."
Between the tokenizer weirdness, the temperature setting, the quantization, the random fallback moves, and the prompt format, there are a lot of confounds here, and I'm not sure how to interpret the results. Fascinating article though!
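To make the fallback concrete: here's roughly how I'd implement the retry-then-random-move logic from the first excerpt. This is just a sketch using python-chess, with a hypothetical `query_model` standing in for the actual API call (the article doesn't publish its code):

```python
import random
import chess

def get_move(board: chess.Board, query_model, max_tries: int = 10) -> chess.Move:
    """Ask the model for a move in SAN; if it can't produce a legal
    move within max_tries attempts, fall back to a random legal move."""
    for _ in range(max_tries):
        san = query_model(board)  # hypothetical: returns e.g. "Nf3"
        try:
            return board.parse_san(san)  # raises if illegal or unparseable
        except ValueError:
            continue  # ask the model again
    return random.choice(list(board.legal_moves))
```

Per the quote, this fallback was only used for the closed (OpenAI) models, and how often it actually triggers could matter a lot for the comparison.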
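As for the trailing space: that looks like a classic BPE quirk. Spaces usually get attached to the start of the *following* token, so a prompt ending in " " forces the model to continue with a token that has no leading space, which it rarely saw after a move number in training. The open models each have their own tokenizers, but tiktoken (OpenAI's) is enough to see how the token boundaries shift:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Compare how the token boundaries change with and without the trailing space.
for prompt in ("1. e4 e5 2.", "1. e4 e5 2. "):
    pieces = [enc.decode([t]) for t in enc.encode(prompt)]
    print(repr(prompt), "->", pieces)
```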