I built Hot or Slop (https://hotorslop.com) - a visual Turing test where you swipe through photos and guess if they're AI-generated or real. It's part game, part crowd-sourced research experiment.
After 50 players and ~3,600 guesses, the early data is fascinating:
* Average human accuracy: 65% (only 15 points above the 50% coin-flip baseline)
* Top performers hit 85–90% accuracy
* FLUX and Imagen models fool people 80%+ of the time
* Speed matters: rushed guesses drop accuracy by ~15%
Every guess feeds a transparent dataset tracking which AI models are hardest to detect and how our collective detection abilities evolve as models improve. Think of it as real-time measurement of the human-AI perception gap.
Tech stack: React 19 + Vite (frontend), Express + SQLite via sql.js (backend), Hugging Face datasets (OpenFake for synthetic images, COCO-Caption2017 for real photos). The game tracks ~15 metadata points per guess (model, prompt length, latency, confidence, etc.) for analysis.
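For concreteness, here's a rough sketch of what a guess record and the per-model "fool rate" stat could look like. The field names are illustrative, not my actual schema:

```ts
// Hypothetical shape of one guess record; names are illustrative.
interface GuessRecord {
  id: string;                  // unique guess ID
  imageId: string;             // which image was shown
  isAi: boolean;               // ground truth: AI-generated or real
  guessedAi: boolean;          // the player's verdict
  correct: boolean;            // guessedAi === isAi
  model: string | null;        // generator (e.g. "flux"); null for real photos
  promptLength: number | null; // characters in the generation prompt
  latencyMs: number;           // time from image shown to swipe
  confidence: number;          // self-reported confidence, 0–1
  sessionId: string;           // groups guesses into a play session
  createdAt: string;           // ISO timestamp
}

// Per-model "fool rate": share of guesses on that model's images
// where the player was wrong (i.e. thought the image was real).
function foolRates(guesses: GuessRecord[]): Map<string, number> {
  const totals = new Map<string, { fooled: number; seen: number }>();
  for (const g of guesses) {
    if (g.model === null) continue; // skip real photos
    const t = totals.get(g.model) ?? { fooled: 0, seen: 0 };
    t.seen += 1;
    if (!g.correct) t.fooled += 1;
    totals.set(g.model, t);
  }
  return new Map([...totals].map(([m, t]) => [m, t.fooled / t.seen]));
}
```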
Current challenge: sql.js loads the entire DB into memory, so I'm hitting memory limits at ~3,600 guesses. Planning to migrate to better-sqlite3 or Postgres for scale, but wanted to ship fast and iterate based on usage patterns.
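For anyone curious what that migration looks like, here's a minimal better-sqlite3 sketch. Unlike sql.js, it reads pages from disk on demand instead of holding the whole DB in memory, and the table/column names below are placeholders, not my real schema:

```ts
import Database from 'better-sqlite3';

const db = new Database('guesses.db');
db.pragma('journal_mode = WAL'); // readers don't block the writer

db.exec(`
  CREATE TABLE IF NOT EXISTS guesses (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    model      TEXT,                -- NULL for real photos
    correct    INTEGER NOT NULL,    -- 0 or 1
    latency_ms INTEGER NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
  )
`);

// Synchronous API: no async plumbing in the Express handlers.
const insertGuess = db.prepare(
  'INSERT INTO guesses (model, correct, latency_ms) VALUES (?, ?, ?)'
);
insertGuess.run('flux', 0, 1840);

// Analytics stay as plain SQL, e.g. which models fool people most:
const hardest = db
  .prepare(`
    SELECT model, AVG(1 - correct) AS fool_rate, COUNT(*) AS n
    FROM guesses
    WHERE model IS NOT NULL
    GROUP BY model
    ORDER BY fool_rate DESC
  `)
  .all();
```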
The visual indicators during swipes use gradient overlays (red for AI, green for real) with EB Garamond typography for a clean aesthetic. Simple but effective UX - most people play 50+ rounds in their first session.
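The overlay itself is nothing fancy; something along these lines (a sketch, not the production component, and the left-equals-AI swipe mapping is an assumption):

```tsx
import type { CSSProperties } from 'react';

// Swipe-feedback overlay sketch. `dx` is the horizontal drag offset in px.
// Assumption: swiping left means "AI", swiping right means "real".
function SwipeOverlay({ dx }: { dx: number }) {
  const guessingAi = dx < 0;
  const style: CSSProperties = {
    position: 'absolute',
    inset: 0,
    pointerEvents: 'none',
    // Tint fades in with drag distance; fully opaque after 150px.
    opacity: Math.min(Math.abs(dx) / 150, 1),
    background: guessingAi
      ? 'linear-gradient(to left, transparent, rgba(220, 38, 38, 0.55))'  // red = AI
      : 'linear-gradient(to right, transparent, rgba(22, 163, 74, 0.55))', // green = real
    display: 'flex',
    alignItems: 'flex-start',
    justifyContent: guessingAi ? 'flex-start' : 'flex-end',
    padding: '1rem',
    fontFamily: "'EB Garamond', serif",
    fontSize: '2rem',
    color: '#fff',
  };
  return <div style={style}>{guessingAi ? 'AI' : 'Real'}</div>;
}
```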
Curious to hear:
1. What accuracy do HN folks hit? (I suspect higher than average)
2. Any ideas on detection patterns? Some swear by texture/lighting artifacts, others by compositional "tells"
3. Thoughts on scaling the analytics without moving to a full DB service?
Try it: https://hotorslop.com
(Fair warning: weirdly addictive)