This was mentioned recently and I had a look[0]. It seems like the benchmark is not quite saying what people think it's saying, and the paper even mentions it. The benchmark is constructed by using a model (DeepSeek-Coder-V2-Lite) to filter out problems it can already solve, so that only harder ones remain. That may leave easier problems in place for "low-resource" languages such as Elixir and Racket, since the filtering model struggles with those languages and fails even simple problems, which then never get filtered out. From the actual paper:
> Section 3.3:
> Besides, since we use the moderately capable DeepSeek-Coder-V2-Lite to filter simple problems, the Pass@1 scores of top models on popular languages are relatively low. However, these models perform significantly better on low-resource languages. This indicates that the performance gap between models of different sizes is more pronounced on low-resource languages, likely because DeepSeek-Coder-V2-Lite struggles to filter out simple problems in these scenarios due to its limited capability in handling low-resource languages.
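To make the filtering step concrete, here is a rough sketch of what that kind of difficulty filter looks like (this is my own hypothetical pseudocode, not the paper's pipeline; `generate` and `passes_tests` are assumed helpers):

    # Keep only problems the weaker "differentiator" model fails to solve.
    # If the weak model is bad at a language, it fails even easy problems,
    # so easy problems survive the filter for that language.
    def filter_hard(problems, weak_model, n_samples=1):
        hard = []
        for p in problems:
            attempts = [weak_model.generate(p.prompt) for _ in range(n_samples)]
            if not any(p.passes_tests(a) for a in attempts):
                hard.append(p)  # weak model failed, so the problem is kept as "hard"
        return hard

So the resulting "hard" set is only as hard as the filtering model is capable in that language, which is exactly the caveat the paper flags.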
At the same time, I have used Claude Code on an Elixir codebase and it's done a great job. But I can't say whether it would have done a worse (or better) job if I had picked any other stack.
[0]: https://news.ycombinator.com/item?id=46646007