Maybe I'm being too skeptical, and I'm certainly only a layman in this field, but the amount of ANN-based post-processing it takes to produce the final image seems to cast doubt on what the result actually means.
At what point do you reduce the signal to the equivalent of an LLM prompt, with most of the resulting image being explained by the training data?
Yeah, I know that modern phone cameras are also heavily post-processed, but the hardware is at least producing a reasonable optical image to begin with. There's some correspondence between input and output, so the two are comparable.