I note the lack of human portraits in the example cases.
My experience with all these solutions to date (including whatever apple are currently using) is that when viewed stereoscopically the people end up looking like 2d cutouts against the background.
I haven't seen this particular model in use stereoscopically so I can't comment as to its effectiveness, but the lack of a human face in the example set is likely a bit of a tell.
Granted they do call it "Monocular View Synthesis", but i'm unclear as to what its accuracy or real-world use would be if you cant combine 2 views to form a convincing stereo pair.