Hacker News | cpburns2009's comments

The community edition has been open source since 2009. This isn't new.

Personally, I wouldn't trust any foreign or domestic LLM providers not to train on your data. I also wouldn't trust them not to have a data breach eventually, which is worse. If you're really worried about your data, run it locally. The Chinese models (Qwen, GLM, etc.) are really competitive, to my understanding.
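For sizing a local setup, a rough back-of-envelope helps: a quantized model's weights take roughly parameters × bits-per-weight ÷ 8 bytes, before KV cache and runtime overhead. A minimal sketch (the 35B / 4.5-bit figures below are illustrative assumptions, not specs for any particular model):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    # weight bytes = params * bits / 8; GB here means 10^9 bytes
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# e.g. a hypothetical 35B model at ~4.5 bits/weight needs roughly
# 20 GB for weights alone
print(round(weight_gb(35, 4.5), 1))
```

That's why 64-128GB of unified memory is the practical floor for the larger open-weight models at usable quants.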

Sir, this is 2026. You're not getting 128GB of RAM for under $1k.

Anyone else getting gibberish when running unsloth/Qwen3.6-35B-A3B-GGUF:UD-IQ4_XS on CUDA (llama.cpp b8815)? UD-Q4_K_XL is fine, as is Vulkan in general.

Apparently it's a known issue with CUDA 13.2 [1].

[1] https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/discussi...


Yes, sadly CUDA 13.2 is broken; NVIDIA will push a fix in CUDA 13.3.

I have a 3 monitor setup. The one on HDMI over USB-C blinks off half of the time when I sit down, and my little Noctua desk fan will twitch.

So prosecute the crime. The tool is irrelevant.

That's the point?

If Opus 4.6 was only released two months ago, then it seems reasonable that Qwen hasn't finished fully comparing against the latest Opus.


Well don't we have numbers from both models on these benchmarks already? What else is there to do except include them in the table?


Do we? I admit ignorance in this.


Qwen3-Coder-Next works well on my 128GB Framework Desktop. It seems better at coding Python than Qwen3.5 35B-A3B, and it's not too much slower (43 tg/s compared to 55 tg/s at Q4).

27B is supposed to be really good but it's so slow I gave up on it (11-12 tg/s at Q4).
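For a sense of what those decode speeds mean in practice, here's a trivial sketch converting them to wall-clock time for a 2,000-token response (the 11.5 tg/s entry is just my midpoint of the 11-12 range quoted above):

```python
# decode speeds (tokens/s) from the comment above
speeds = {"Qwen3-Coder-Next": 43, "Qwen3.5 35B-A3B": 55, "27B": 11.5}
tokens = 2000  # size of a typical long response

for name, tps in speeds.items():
    print(f"{name}: {tokens / tps:.0f} s")
```

The gap between ~45 s and ~3 min per response is roughly where "slightly slower" turns into "gave up on it."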


Agreed. Qwen3-Coder-Next seems like the sweet-spot model on my 128GB Framework Desktop. I seem to get better coding results from it than from the 27B, in addition to it running faster.


The 8-bit MLX unsloth quant of Qwen3-Coder-Next seems to be a local best on an MBP M5 Max with 128GB of memory. With oMLX doing prompt caching, I can run two in parallel on different tasks pretty reasonably. I found that lower quants tend to lose the plot after about 170k tokens of context.


That's good to know. I haven't exceeded a 120k context yet. Maybe I'll bite the bullet and try Q6 or Q8. Any of the coder-next quants larger than UD-Q4_K_XL take forever to load, especially with ROCm. I think there's some sort of autotuning or fitting going on in llama.cpp.


In my experience using llama.cpp (which ollama uses internally) on a Strix Halo, whether ROCm or Vulkan performs better really depends on the model, and it's usually within 10%. I have access to an RX 7900 XT I should compare against, though.
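A trivial way to quantify that "within 10%" gap from a pair of per-backend tg/s readings (e.g. out of llama-bench), with hypothetical numbers for illustration:

```python
def rel_diff(a: float, b: float) -> float:
    # relative gap as a fraction of the faster backend's speed
    return abs(a - b) / max(a, b)

# made-up tg/s readings for the same model on ROCm vs Vulkan
rocm_tgs, vulkan_tgs = 52.0, 48.5
print(f"{rel_diff(rocm_tgs, vulkan_tgs):.1%}")
```

A gap that small is easily dwarfed by quant choice or context length, which matches the "depends on the model" experience.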


Perhaps I should just google it, but I'm under the impression that ollama uses llama.cpp internally, not the other way around.

Thanks for that data point. I should experiment with ROCm.


From what I understand, ROCm is a lot buggier and has some performance regressions on a lot of GPUs in the 7.x series. Vulkan performance for LLMs is apparently not far behind ROCm and is far more stable and predictable at this time.


I meant ollama uses llama.cpp internally. Sorry for the confusion.


The NPU works on Linux (Arch at least) on Strix Halo using FastFlowLM [1]. Their NPU kernels are proprietary though (free up to a reasonable amount of commercial revenue). It's neat that you can run some models basically for free (using the NPU instead of CPU/GPU), but the performance is underwhelming. The target for NPUs is really low-power devices; they're not that useful if you already have an APU/GPU like Strix Halo.

[1]: https://github.com/FastFlowLM/FastFlowLM

