Wow, I love this benchmark - I've been doing something similar (as a joke, and much less frequently), where I ask multiple models to attempt to create a data structure like:
```
const melody = [
{ freq: 261.63, duration: 'quarter' }, // C4
{ freq: 0, duration: 'triplet' }, // triplet rest
{ freq: 293.66, duration: 'triplet' }, // D4
{ freq: 0, duration: 'triplet' }, // triplet rest
{ freq: 329.63, duration: 'half' }, // E4
]
```
But with the intro to Smoke on the Water by Deep Purple. Then I run it through the Web Audio API and see how it sounds.
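For reference, the playback side is roughly this - a minimal sketch, not my exact code, assuming 120 BPM and that 'triplet' means one third of a beat (freq 0 entries are treated as rests):

```javascript
// Sketch of playing a { freq, duration } melody with the Web Audio API.
// Assumptions: 120 BPM, quarter note = one beat, 'triplet' = 1/3 of a beat.
const BEAT_SECONDS = 60 / 120; // one quarter note at 120 BPM

function durationToSeconds(duration) {
  const map = {
    whole: BEAT_SECONDS * 4,
    half: BEAT_SECONDS * 2,
    quarter: BEAT_SECONDS,
    eighth: BEAT_SECONDS / 2,
    triplet: BEAT_SECONDS / 3, // assumed meaning of 'triplet'
  };
  return map[duration];
}

function playMelody(melody) {
  const ctx = new AudioContext();
  let t = ctx.currentTime;
  for (const note of melody) {
    if (note.freq > 0) { // freq 0 is a rest: advance time, play nothing
      const osc = ctx.createOscillator();
      osc.frequency.value = note.freq;
      osc.connect(ctx.destination);
      osc.start(t);
      osc.stop(t + durationToSeconds(note.duration));
    }
    t += durationToSeconds(note.duration);
  }
}
```

Scheduling every note up front against `ctx.currentTime` (rather than chaining `setTimeout`s) is what keeps the timing tight enough to judge the melody.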
It's never quite gotten it right, but it's gotten better, to the point where I can ask it to make a website that can play it.
I think yours is a lot more thoughtful about testing novelty, but it's interesting to see them attempt to do things that they aren't really built for (in theory!).
https://codepen.io/mvattuone/pen/qEdPaoW - ChatGPT 4 Turbo
https://codepen.io/mvattuone/pen/ogXGzdg - Claude Sonnet 3.7
https://codepen.io/mvattuone/pen/ZYGXpom - Gemini 2.5 Pro
Gemini is by far the best sounding one, but it's still off. I'd be curious how the latest and greatest (paid) versions fare.
(And just for comparison, here's the first time I did it... you can tell I did the front-end because there isn't much to it!) https://nitter.space/mvattuone/status/1646610228748730368#m