Note that the NVIDIA container uses CUDA + cuBLAS 13.0.2, which cites "Improved performance on NVIDIA DGX Spark for FP16/BF16 and FP8 GEMMs". That seems to be your use case.
In general, I suspect it mostly comes down to the versions of the libraries.
Interestingly, there is also a cuBLAS 13.1 wheel on PyPI; I'm not sure what that one changes.
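
If you want to test the library-version theory, a quick way is to list the relevant pip-installed packages in both environments and diff the output. The sketch below uses only the standard library. The keyword filter (`cublas`, `cuda`, `torch`) and the function name `matching_versions` are my own choices, not anything official. It also only sees packages installed as Python distributions, so a cuBLAS that the container installs system-wide will not show up here.

```python
from importlib import metadata


def matching_versions(keywords=("cublas", "cuda", "torch")):
    """Return {distribution name: version} for installed Python packages
    whose name contains any of the given keywords (case-insensitive).

    The keywords are an assumption; adjust them to whatever you want to compare.
    """
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with missing metadata.
        if name and any(k in name.lower() for k in keywords):
            found[name] = dist.version
    return found


if __name__ == "__main__":
    # Print in requirements-style format so two environments are easy to diff.
    for name, version in sorted(matching_versions().items()):
        print(f"{name}=={version}")
```

Run it once inside the NVIDIA container and once in your own environment; any difference in the cuBLAS-related lines would be the first thing I'd look at.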