HN Reader

Show HN: LocalScore – Local LLM Benchmark

124

Hey Folks!

I've been building an open source benchmark for measuring local LLM performance on your own hardware. The benchmarking tool is a CLI written on top of Llamafile to allow for portability across different hardware setups and operating systems. The website is a database of results from the benchmark, allowing you to explore the performance of different models and hardware configurations.

Please give it a try! Any feedback and contribution is much appreciated. I'd love for this to serve as a helpful resource for the local AI community.

For more check out: - Website: https://localscore.ai - Demo video: https://youtu.be/De6pA1bQsHU - Blog post: https://localscore.ai/blog - CLI Github: https://github.com/Mozilla-Ocho/llamafile/tree/main/localsco... - Website Github: https://github.com/cjpais/localscore

Congrats on the effort - the local-first / private space needs more performant AI, and AI in general needs more comparable and trustworthy benchmarks.

Notes: - Olama integration would be nice - Is there an anonymous federated score sharing? That way, users you approximate a model's performance before downloading it.

10 months agoby mentalgear

Contributed scores for the M3 Ultra 512 GB unified memory: https://www.localscore.ai/accelerator/404

Happy to test larger models that utilize the memory capacity if helpful.

10 months agoby jsatok

I don't know if I should trust and run this code. If it was associated to Mozilla I would. It says it is a Mozilla Builders project, but https://builders.mozilla.org/projects/ does not list it. I don't see a way to verify that localscore.ai is associated with Mozilla.

10 months agoby david_draco

The run and/or troubleshooting steps for Windows should probably include the note you need to install https://developer.nvidia.com/cuda-downloads?target_os=Window... if you have an Nvidia GPU (and probably something similar if you have an AMD GPU?). As it is right now the steps happily get you benchmarking your CPU and I'd say that might even be worth adding a "Warning: The benchmark is operating in CPU only mode, press y to continue if this is intended" type message to the program.

Edit: And for the same prompt and generated token counts it runs ~4x slower than `ollama run hf.co/bartowski/Qwen2.5-14B-Instruct-GGUF:Q4_K_M --verbose`. It's possible I'm mixing up a few things there but my results also post in the same ballpark slower than others with the same GPU so it seems something is up with the application in either case.

10 months agoby zamadatix

This is super cool. I finally just upgraded my desktop and one thing I’m curious to do with it is run local models. Of course the ram is late so I’ve been googling trying to get an idea of what I could expect and there’s not much out there to compare to unless you’re running state of the art stuff.

I’ll make sure to run contribute my benchmark to this once my ram comes in.

10 months agoby roxolotl

Congrats on launching!

Stoked to have this dataset out in the open. I submitted a bunch of tests for some models I'm experimenting with on my M4 Pro. Rather paltry scores compared to having a dedicated GPU but I'm excited that running a 24B model locally is actually feasible at this point.

10 months agoby jborichevskiy

I've been waiting for something like this. Have you considered the following based on the benchmark data that's submitted beyond the GPU?

1. User selects a model, size and token output speed and latency. The website generates a hardware list of components that should match the performance requirements.

2. User selects hardware components and the website generates a list of models that performant on that hardware.

3. Monetize through affiliate links the components to fund the project. Think like PC part picker.

I know there's going to be some variability in the benchmarks due to the software stack, but it should give a AI enthusiasts, an educated perspective on what hardware can be relevant for their use case.

10 months agoby FloatArtifact

I’m curious: does this fundamentally need to contain an actual model, or would it be okay if it generated a synthetic model itself, full of random weights? I’m picturing downloading just, say, a 20MB file instead of the multi-gigabyte one, and…

Hang on, why is https://blob.localscore.ai/localscore-0.9.2 380MB? I remember llamafile being only a few megabytes. From https://github.com/Mozilla-Ocho/llamafile/releases, looks like it steadily grew from adding support for GPUs on more platforms, up to 28.5MiB¹ in 0.8.12, and then rocketed up to 230MiB in 0.8.13:

> The llamafile executable size is increased from 30mb to 200mb by this release. This is caused by https://github.com/ggml-org/llama.cpp/issues/7156. We're already employing some workarounds to minimize the impact of upstream development contributions on binary size, and we're aiming to find more in the near future.

Ah, of course, CUDA. Honestly I might be more surprised that it’s only this big. That monstrosity will happily consume a dozen gigabytes of disk space.

llamafile-0.9.0 was still 231MiB, then llamafile-0.9.1 was 391MiB, now llamafile-0.9.2 is 293MiB. Fluctuating all over the place, but growing a lot. And localscore-0.9.2 is 363MiB. Why 70MiB extra on top of llamafile-0.9.2? I’m curious, but not curious enough to investigate concretely.

Well, this became a grumble about bloat, but I’d still like to know whether it would be feasible to ship a smaller localscore that would synthesise a suitable model, according to the size required, at runtime.

—⁂—

¹ Eww, GitHub is using the “MB” suffix for its file sizes, but they’re actually mebibytes (2²⁰ bytes, 1048576 bytes, MiB). I thought we’d basically settled on returning the M/mega- prefix to SI with its traditional 10⁶ definition, at least for file sizes, ten or fifteen years ago.

10 months agoby chrismorgan

Really awesome project!

Clicking on GPU is a nice simple visualization. I was thinking maybe try to put that type of visual representation intuitively accessible immediately on the landing page.

cpubenchmark.net could he an example technique of drawing the site visitor into the paradigm.

10 months agoby alchemist1e9

This is great, congrats for launching!

A couple of ideas .. I would like to benchmark a remote headless server, as well as different methods to run the LLM (vllm vs tgi vs llama.cpp ...) on my local machine, and in this case llamafile is quite limiting. Connecting over an OpenAI-like API instead would be great!

10 months agoby omneity

This looks super useful, especially with so many folks experimenting with local LLMs now. Curious how well it handles edge devices. Will give it a try!

10 months agoby sharmasachin98

Congrats, really cool tool!

Any plans to add a leaderboard, or more stats groupped by hardware?

9 months agoby XCSme

Why choose a combibation of llama and qwen, when you could have used just qwen models with more permissive license?

10 months agoby gunalx

Awesome stuff, congrats on launching!

10 months agoby TheFlyingPanda