Hacker News | zargon's comments

It's only in preview right now. And anyway, yes, models regularly get updated training.

But in this case, it's more likely just a tooling issue.


I think you mean ollama vs llama.cpp.

I do!

Damn autocorrect :)


I call it autocorrupt :)

Flash is less than 160 GB. No need to quantize to fit in 2x 96 GB. Not sure how much context fits in 30 GB, but it should be a good amount.

It seems to be 160 GB at mixed FP4+FP8 precision, FYI. Full FP8 is 250 GB+. (B)F16 would be around double that, I assume.

There is no BF16. There is no FP8 for the instruct model. The instruct model at full precision is 160 GB (mixed FP4 and FP8). The base model at full precision is 284 GB (FP8). Almost everyone is going to use instruct. But I do love to see base models released.

> ~100GB at 16 bit or ~50GB at 8bit quantized.

V4 is natively mixed FP4 and FP8, so significantly less than that. 50 GB max unquantized.


That article is a total hallucination.

"671B total / 37B active"

"Full precision (BF16)"

And they claim they ran this non-existent model on vLLM and SGLang over a month and a half ago.

It's clickbait keyword slop filled in with V3 specs. Most of the web is slop like this now. Sigh.


The Flash version is 284B A13B in mixed FP8 / FP4 and the full native precision weights total approximately 154 GB. KV cache is said to take 10% as much space as V3. This looks very accessible for people running "large" local models. It's a nice follow up to the Gemma 4 and Qwen3.5 small local models.
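Those two figures imply an average weight precision you can back out with simple arithmetic (the 284B parameter count and ~154 GB total are from the comment above; the calculation is just a sanity check, not a published spec):

```python
# Back-of-envelope: what average precision does
# "284B params in ~154 GB of weights" imply?
params = 284e9
weight_bytes = 154e9
avg_bits = weight_bytes * 8 / params
print(f"~{avg_bits:.1f} bits/weight on average")
# ~4.3 bits/weight: consistent with mostly FP4, some FP8
```

An average around 4.3 bits/weight is what you'd expect if the bulk of the weights are FP4 with a minority of sensitive layers kept at FP8.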

Price is appealing to me. I have been using gemini 3 flash mainly for chat. I may give it a try.

input/output: $0.14/$0.28 (vs. Gemini's $0.50/$3)

Does anyone know why output prices have such a big gap?


Output is where most of the compute goes; generating tokens takes much more hardware time than prompt processing (input), which is far faster.

Input tokens are processed at 10-50 times the speed of output tokens, since you can process them in batches rather than one at a time like output tokens.
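The batching point is the whole story. A toy breakdown with made-up (but order-of-magnitude plausible) throughput numbers shows why output dominates the hardware bill even when there are far fewer output tokens:

```python
# Illustrative request-time breakdown: prefill (input) is batched across
# the whole prompt, decode (output) runs one token at a time.
# The throughput numbers below are assumptions for illustration,
# not measurements of any particular model or provider.
prompt_tokens = 8000
output_tokens = 800
prefill_tps = 5000   # assumed tokens/s for batched prompt processing
decode_tps = 100     # assumed tokens/s for sequential generation

prefill_s = prompt_tokens / prefill_tps
decode_s = output_tokens / decode_tps
print(f"prefill: {prefill_s:.1f}s, decode: {decode_s:.1f}s")
# 10x fewer output tokens still eat ~5x more hardware time,
# which is roughly why providers price output much higher.
```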

I'm going to blow my bandwidth allowance again this month, aren't I.


For FIM, there's Qwen3 Coder Next.

Although Mistral's model card seems to indicate that Devstral 2 doesn't support FIM, it would be very odd if it didn't. I've been meaning to test it.


Qwen Coder 30B A3B is far better than Qwen Coder Next, imo. I may have inference issues, or it may just be a problem with running Coder Next at IQ4_XS (vs. Q8 for the earlier, smaller model), but I don't find the 80B to be much better at coding, even in instruct mode, and the insane speed and low latency of the smaller model are way more useful. Good one-line completions often happen in 300 ms.

I just loaded up Qwen3.6 27B at Q8_0 quantization in llama.cpp, with 131072 context and Q8 kv cache:

  build/bin/llama-server \
    -m ~/models/llm/qwen3.6-27b/qwen3.6-27B-q8_0.gguf \
    --no-mmap \
    --n-gpu-layers all \
    --ctx-size 131072 \
    --flash-attn on \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --jinja \
    --no-mmproj \
    --parallel 1 \
    --cache-ram 4096 -ctxcp 2 \
    --reasoning on \
    --chat-template-kwargs '{"preserve_thinking": true}'
Should fit nicely in a single 5090:

  MiB:  self    model   context   compute
       30968 = 25972 +    4501 +     495
Even bumping up to a 16-bit K cache should fit comfortably if you drop down to 64K context, which is still a pretty decent amount. I would try both. I'm not sure how tolerant the Qwen3.5 series is of dropping the K cache to 8 bits.
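If you want to sanity-check the context line in that breakdown yourself, the KV cache size is a straightforward product. The architecture values below are hypothetical placeholders for illustration, not the actual Qwen config:

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim
#                 * context_length * bytes_per_element.
# Layer/head counts here are assumed, not the real model's config.
n_layers = 36
n_kv_heads = 4        # GQA: far fewer KV heads than attention heads
head_dim = 128
ctx = 131072
bytes_el = 1          # q8_0 cache is ~1 byte/element (ignoring block overhead)

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_el
print(f"~{kv_bytes / 2**30:.1f} GiB KV cache")  # ~4.5 GiB
```

Halving the context halves this number, and going from q8_0 to f16 doubles it, which is why the 64K/16-bit combination lands in roughly the same place.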

These calculators are almost entirely useless. They don't understand specific model architectures. Even the ones that try to support only specific models (like the apxml one) get it very wrong a lot of the time.

For example, the one you linked, when I provide a Qwen3.5 27B Q4_K_M GGUF [0], says that it will require 338 GB of memory with a 16-bit KV cache. That is wrong by over an order of magnitude.

[0] https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF/resol...
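A quick back-of-envelope shows how far off that is; the ~4.8 bits/weight figure for Q4_K_M below is an approximate average, not an exact spec:

```python
# Sanity check on the calculator's 338 GB claim: a Q4_K_M GGUF averages
# roughly 4.8 bits/weight, and the weights are the bulk of what you need
# in memory (plus KV cache and a little compute overhead).
params = 27e9
q4_k_m_bits = 4.8            # approximate average for Q4_K_M quantization
weight_gb = params * q4_k_m_bits / 8 / 1e9
print(f"~{weight_gb:.0f} GB of weights")  # ~16 GB, nowhere near 338 GB
```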



Excellent job with this! I tried a few combinations that completely fail on other calculators, and yours gets VRAM usage pretty much spot on; even the performance estimate is in the ballpark of what I see with mixed VRAM/RAM workloads.

It's a shame that search is so polluted these days that it's impossible to find good tools like yours.

