Ollama is now powered by MLX on Apple Silicon in preview
by redundantly on 3/31/2026, 3:40:45 AM
Comments
by: babblingfish
LLMs on device are the future. It's more secure, it eases the mismatch between inference demand and data center capacity, and it would use less electricity. It's just a matter of getting the performance good enough. Most users don't need frontier model performance.
3/31/2026, 4:40:12 AM
by: LuxBennu
Already running qwen 70b 4-bit on m2 max 96gb through llama.cpp and it's pretty solid for day to day stuff. The mlx switch is interesting because ollama was basically shelling out to llama.cpp on mac before, so native mlx should mean better memory handling on apple silicon. Curious to see how it compares on the bigger models vs the gguf path.
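For anyone who wants to poke at the native MLX path directly, here's a minimal sketch using the mlx-lm Python package; the load/generate helpers come from mlx-lm, and the 4-bit repo name is just an example community conversion, not necessarily the exact model above:

    # Minimal sketch of the native MLX path via the mlx-lm package.
    # The repo below is an example 4-bit mlx-community conversion; swap in
    # whichever MLX-converted model you actually want to run.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen2.5-72B-Instruct-4bit")

    prompt = "Summarize the trade-offs of 4-bit quantization in one paragraph."
    print(generate(model, tokenizer, prompt=prompt, max_tokens=200))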
3/31/2026, 5:01:03 AM
by: puskuruk
Finally! My local infra has been waiting for this for months!
3/31/2026, 6:48:34 AM
by: codelion
How does it compare to some of the newer MLX inference engines like optiq that support turboquantization? https://mlx-optiq.pages.dev/
3/31/2026, 4:50:16 AM
by: dial9-1
Still waiting for the day I can comfortably run Claude Code with local LLMs on macOS with only 16 GB of RAM.
3/31/2026, 4:51:12 AM
by: mfa1999
How does this compare to llama.cpp in terms of performance?
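One crude way I might measure it: Ollama's /api/generate response includes eval_count and eval_duration (in nanoseconds), so tokens/sec falls out directly and can be lined up against llama.cpp's llama-bench on the same model. Rough sketch, the model tag is just a placeholder:

    # Rough tokens/sec measurement against a local Ollama server.
    # eval_count and eval_duration come straight from the /api/generate
    # response; eval_duration is reported in nanoseconds.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.2",  # example tag, use whatever model you're testing
            "prompt": "Explain KV caching in two sentences.",
            "stream": False,
        },
    ).json()

    print(f"{resp['eval_count'] / (resp['eval_duration'] / 1e9):.1f} tokens/sec")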
3/31/2026, 5:41:00 AM
by: AugSun
"We can run your dumbed down models faster":<p>#The use of NVFP4 results in a 3.5x reduction in model memory footprint relative to FP16 and a 1.8x reduction compared to FP8, while maintaining model accuracy with less than 1% degradation on <i>key language modeling</i> tasks for <i>some</i> models.
3/31/2026, 5:03:38 AM
by: brcmthrowaway
What is the difference between Ollama, llama.cpp, ggml and gguf?
3/31/2026, 5:30:24 AM