More

cpburns2009 · 2026-04-17T14:00:30 1776434430

The community edition has been open source since 2009. This isn't new.

cpburns2009 · 2026-04-16T19:18:26 1776367106

Personally, I wouldn't trust any foreign or domestic LLM providers to not train on your data. I also wouldn't trust them to not have a data breach eventually which is worse. If you're really worried about your data, run it locally. The Chinese models (Qwen, GLM, etc.) are really competitive to my understanding.

cpburns2009 · 2026-04-16T19:09:23 1776366563

Sir, this is 2026. You're not getting any 128GB of RAM for under $1k.

cpburns2009 · 2026-04-16T19:04:35 1776366275

Anyone else getting gibberish when running unsloth/Qwen3.6-35B-A3B-GGUF:UD-IQ4_XS on CUDA (llama.cpp b8815)? UD-Q4_K_XL is fine, as is Vulkan in general.

cpburns2009 · 2026-04-16T21:42:14 1776375734

Apparently it's a known issue with CUDA 13.2 [1].

[1] https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/discussi...

danielhanchen · 2026-04-18T10:24:23 1776507863

Yes sadly CUDA 13.2 is broken - NVIDIA will push a fix in CUDA 13.3

cpburns2009 · 2026-04-16T01:45:55 1776303955

I have a 3 monitor setup. The one on HDMI over USB-C blinks off half of the time when I sit down, and my little Noctua desk fan will twitch.

cpburns2009 · 2026-04-15T17:33:34 1776274414

So prosecute the crime. The tool is irrelevant.

runjake · 2026-04-15T19:05:54 1776279954

That's the point?

cpburns2009 · 2026-04-02T16:21:02 1775146862

If Opus 4.6 was only released two months ago, then it seems reasonable that Qwen hasn't finished fully comparing against the latest Opus.

jgbuddy · 2026-04-03T13:54:22 1775224462

Well don't we have numbers from both models on these benchmarks already? What else is there to do except include them in the table?

cpburns2009 · 2026-04-05T00:36:27 1775349387

Do we? I admit ignorance in this.

cpburns2009 · 2026-04-02T15:16:01 1775142961

Qwen3-Coder-Next works well on my 128GB Framework Desktop. It seems better at coding Python than Qwen3.5 35B-A3B, and it's not too much slower (43 tg/s compared to 55 tg/s at Q4).

27B is supposed to be really good but it's so slow I gave up on it (11-12 tg/s at Q4).

UncleOxidant · 2026-04-02T18:20:37 1775154037

Agreed. Qwen3-coder-next seems like the sweetspot model on my 128GB Framework Desktop. I seem to get better coding results from it vs 27b in addition to it running faster.

vlowther · 2026-04-02T20:33:45 1775162025

The 8 bit MLX unsloth quant of qwen3-coder-next seems to be a local best on an MBB M5 Max with 128GB memory. With oMLX doing prompt caching I can run two in parallel doing different tasks pretty reasonably. I found that lower quants tend to lose the plot after about 170k tokens in context.

cpburns2009 · 2026-04-02T21:13:49 1775164429

That's good to know. I haven't exceeded a 120k context yet. Maybe I'll bite the bullet and try Q6 or Q8. Any of coder-next quants larger than UD-Q4_K_XL take forever to load, especially with ROCm. I think there's some sort of autotuning or fitting going in llama.cpp.

cpburns2009 · 2026-04-02T14:58:18 1775141898

In my experience using llama.cpp (which ollama uses internally) on a Strix Halo, whether ROCm or Vulkan performs better really depends on the model and it's usually within 10%. I have access to an RX 7900 XT I should compare to though.

metalliqaz · 2026-04-02T15:38:32 1775144312

Perhaps I should just google it, but I'm under the impression that ollama uses llama.cpp internally, not the other way around.

Thanks for that data point I should experiment with ROCm

naasking · 2026-04-02T17:36:57 1775151417

From what I understand, ROCm is a lot buggier and has some performance regressions on a lot of GPUs in the 7.x series. Vulkan performance for LLMs is apparently not far behind ROCm and is far more stable and predictable at this time.

cpburns2009 · 2026-04-02T16:15:51 1775146551

I meant ollama uses llama.cpp internally. Sorry for the confusion.

cpburns2009 · 2026-04-02T14:54:44 1775141684

The NPU works on Linux (Arch at least) on Strix Halo using FastFlowLM [1]. Their NPU kernels are proprietary though (free up to a reasonable amount of commercial revenue). It's neat you can run some models basically for free (using NPU instead of CPU/GPU), but the performance is underwhelming. The target for NPUs is really low power devices, and not useful if you have an APU/GPU like Strix Halo.

[1]: https://github.com/FastFlowLM/FastFlowLM