Ollama is now powered by MLX on Apple Silicon in preview
by redundantly on 3/31/2026, 3:40:45 AM
Comments
by: babblingfish
LLMs on device are the future. It's more secure, it eases the mismatch between inference demand and data center capacity, and it would use less electricity. It's just a matter of getting the performance good enough. Most users don't need frontier model performance.
3/31/2026, 4:40:12 AM
by: LuxBennu
Already running qwen 70b 4-bit on m2 max 96gb through llama.cpp and it's pretty solid for day to day stuff. The mlx switch is interesting because ollama was basically shelling out to llama.cpp on mac before, so native mlx should mean better memory handling on apple silicon. Curious to see how it compares on the bigger models vs the gguf path.
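For anyone who wants to poke at the native MLX path directly, here's a minimal sketch using the mlx-lm Python package; the load/generate helpers come from mlx-lm, and the 4-bit repo name is just an example community conversion, not necessarily the exact model above:

    # Minimal sketch of the native MLX path via the mlx-lm package.
    # The repo below is an example 4-bit mlx-community conversion; swap in
    # whichever MLX-converted model you actually want to run.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen2.5-72B-Instruct-4bit")

    prompt = "Summarize the trade-offs of 4-bit quantization in one paragraph."
    print(generate(model, tokenizer, prompt=prompt, max_tokens=200))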
3/31/2026, 5:01:03 AM
by: puskuruk
Finally! My local infra has been waiting for this for months!
3/31/2026, 6:48:34 AM
by: codelion
How does it compare to some of the newer MLX inference engines like optiq that support turboquantization? https://mlx-optiq.pages.dev/
3/31/2026, 4:50:16 AM
by: dial9-1
Still waiting for the day I can comfortably run Claude Code with local LLMs on macOS with only 16 GB of RAM.
3/31/2026, 4:51:12 AM
by: mfa1999
How does this compare to llama.cpp in terms of performance?
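One crude way I might measure it: Ollama's /api/generate response includes eval_count and eval_duration (in nanoseconds), so tokens/sec falls out directly and can be lined up against llama.cpp's llama-bench on the same model. Rough sketch, the model tag is just a placeholder:

    # Rough tokens/sec measurement against a local Ollama server.
    # eval_count and eval_duration come straight from the /api/generate
    # response; eval_duration is reported in nanoseconds.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.2",  # example tag, use whatever model you're testing
            "prompt": "Explain KV caching in two sentences.",
            "stream": False,
        },
    ).json()

    print(f"{resp['eval_count'] / (resp['eval_duration'] / 1e9):.1f} tokens/sec")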
3/31/2026, 5:41:00 AM
by: AugSun
"We can run your dumbed down models faster":<p>#The use of NVFP4 results in a 3.5x reduction in model memory footprint relative to FP16 and a 1.8x reduction compared to FP8, while maintaining model accuracy with less than 1% degradation on <i>key language modeling</i> tasks for <i>some</i> models.
3/31/2026, 5:03:38 AM
by: brcmthrowaway
What is the difference between Ollama, llama.cpp, ggml and gguf?
3/31/2026, 5:30:24 AM