Wow, I love this benchmark - I've been doing something similar (as a joke, and much less frequently), where I ask multiple models to attempt to create a data structure like:
```
const melody = [
{ freq: 261.63, duration: 'quarter' }, // C4
{ freq: 0, duration: 'triplet' }, // triplet rest
{ freq: 293.66, duration: 'triplet' }, // D4
{ freq: 0, duration: 'triplet' }, // triplet rest
{ freq: 329.63, duration: 'half' }, // E4
]
```
But with the intro to Smoke on the Water by Deep Purple. Then I run it through the Web Audio API and see how it sounds.
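For reference, the playback side is roughly this - a minimal sketch, not my exact code, assuming 120 BPM and that 'triplet' means one third of a beat (freq 0 entries are treated as rests):

```javascript
// Sketch of playing a { freq, duration } melody with the Web Audio API.
// Assumptions: 120 BPM, quarter note = one beat, 'triplet' = 1/3 of a beat.
const BEAT_SECONDS = 60 / 120; // one quarter note at 120 BPM

function durationToSeconds(duration) {
  const map = {
    whole: BEAT_SECONDS * 4,
    half: BEAT_SECONDS * 2,
    quarter: BEAT_SECONDS,
    eighth: BEAT_SECONDS / 2,
    triplet: BEAT_SECONDS / 3, // assumed meaning of 'triplet'
  };
  return map[duration];
}

function playMelody(melody) {
  const ctx = new AudioContext();
  let t = ctx.currentTime;
  for (const note of melody) {
    if (note.freq > 0) { // freq 0 is a rest: advance time, play nothing
      const osc = ctx.createOscillator();
      osc.frequency.value = note.freq;
      osc.connect(ctx.destination);
      osc.start(t);
      osc.stop(t + durationToSeconds(note.duration));
    }
    t += durationToSeconds(note.duration);
  }
}
```

Scheduling every note up front against `ctx.currentTime` (rather than chaining `setTimeout`s) is what keeps the timing tight enough to judge the melody.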
It's never quite gotten it right, but it's gotten better, to the point where I can ask it to make a website that can play it.
I think yours is a lot more thoughtful about testing novelty, but it's interesting to see them attempt to do things that they aren't really built for (in theory!).
https://codepen.io/mvattuone/pen/qEdPaoW - ChatGPT 4 Turbo
https://codepen.io/mvattuone/pen/ogXGzdg - Claude Sonnet 3.7
https://codepen.io/mvattuone/pen/ZYGXpom - Gemini 2.5 Pro
Gemini is by far the best sounding one, but it's still off. I'd be curious how the latest and greatest (paid) versions fare.
(And just for comparison, here's the first time I did it... you can tell I did the front-end because there isn't much to it!) https://nitter.space/mvattuone/status/1646610228748730368#m