We act like small models are inconsistent and incoherent, but we rarely point out that this accurately matches certain mental states and capacities. We may need to actually see how a 0.5B model would handle the presidency, because … it could be accurate. Having a super-large model simulate these things would not be authentic.
For example, it could genuinely be true that a given developer is roughly as good as a 1.5B model, in which case we're not valuing these models for their true simulation power (yet). That might be the best interview test: you must write code by hand that's better than a small model's (or show better judgement).
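As a rough illustration, here's a minimal sketch of what that interview test could look like. Everything here is an assumption not specified above: the Hugging Face transformers API as the harness, "Qwen/Qwen2.5-Coder-1.5B" as the small-model baseline, and an interviewer-supplied grade() scoring function (e.g. fraction of hidden unit tests passed).

```python
# Sketch of the "beat the small model" interview test.
# Assumptions (hypothetical, not from the original text): Hugging Face
# transformers, a 1.5B coder model as the baseline, and a grade()
# callable provided by the interviewer that maps code -> score.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-Coder-1.5B"  # hypothetical choice of baseline


def small_model_solution(prompt: str, max_new_tokens: int = 512) -> str:
    """Generate the small model's attempt at the interview problem."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens, keep only the model's completion.
    completion = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(completion, skip_special_tokens=True)


def candidate_beats_model(candidate_code: str, prompt: str, grade) -> bool:
    """Pass the interview iff the human's code grades strictly higher
    than the small model's attempt at the same prompt."""
    return grade(candidate_code) > grade(small_model_solution(prompt))
```

The interesting design question is the grade() function: a pure unit-test pass rate tests code, while a rubric that scores design decisions would test the "better judgement" half of the claim.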
For the presidency, the current benchmark to beat seems to be GPT-2.