kgeist · 2026-04-22T21:22:40 1776892960

Custom constrained decoding could have solved this. Penalize comment tokens :)
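To illustrate the idea: a toy sketch of penalizing comment tokens by subtracting a fixed bias from their logits before sampling. The vocabulary, the token ids, and the `penalize_tokens` helper are all made up for this example; a real tokenizer has many more comment-starting tokens, and the exact hook for biasing logits depends on the inference stack.

```python
import math

def penalize_tokens(logits, banned_ids, penalty=5.0):
    """Return a copy of logits with a fixed penalty subtracted from the given token ids."""
    return [x - penalty if i in banned_ids else x for i, x in enumerate(logits)]

def softmax(logits):
    """Turn logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 5-token vocabulary; "#" and "//" stand in for comment tokens.
vocab = ["x", "=", "1", "#", "//"]
comment_ids = {i for i, tok in enumerate(vocab) if tok in ("#", "//")}

# Made-up logits where the model would prefer to start a comment.
logits = [1.0, 0.5, 0.2, 2.0, 0.1]
biased = penalize_tokens(logits, comment_ids)

probs_before = softmax(logits)
probs_after = softmax(biased)
```

A soft penalty like this only makes comments less likely; setting the penalty to infinity would forbid them outright, which is probably too strict for code that genuinely needs a comment.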

kgeist · 2026-04-22T21:19:48 1776892788

Interesting, my assumption used to be that models over-edit when they're run with optimizations in attention blocks (quantization, Gated DeltaNet, sliding window, etc.). I.e. they can't always reconstruct the original code precisely and may end up re-inventing some bits. Couldn't that be one of the reasons too?

kgeist · 2026-04-22T20:56:29 1776891389

From what I understand, ~30b is enough "intelligence" to make coding/reasoning etc. work, in general. Above ~30b, it's less about intelligence, and more about memorization. Larger models fail less and one-shot more often because they can memorize more APIs (documentation, examples, etc). Also from my experience, if a task is ambiguous, Sonnet has a better "intuition" of what my intent is. Probably also because of memorization, it has "access" to more repositories in its compressed knowledge to infer my intent more accurately.

kgeist · 2026-04-21T22:24:05 1776810245

>Latency, throughput, and routes don't matter here. When it's 10 seconds for the first token and then a 1KB/sec streamed response, whatever is fine. You can serve Australia from the US and it'll barely matter.

This may be true for simpler cases where you just stream responses from a single LLM in some kind of no-brain chatbot. If the pipeline is a bit more complex (multiple calls to different models, not only LLMs but also embedding models, rerankers, agentic stuff, etc.), latencies quickly add up. It also depends on the UI/UX expectations.

Funny reading this, because the feature I developed can't go live for a few months in regions where we have to use Amazon Bedrock (for legal reasons), simply because Bedrock has very poor latency and stakeholders aren't satisfied with the final speed (users aren't expected to wait 10-15 seconds in that part of the UI, it would be awkward). And a single roundtrip to AWS Ireland from Asia is already like at least 300ms (multiply by several calls in a pipeline and it adds up to seconds, just for the roundtrips), so having one region only is not an option.
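Back-of-the-envelope version of the roundtrip math. The 300ms figure is the one from above; the count of six sequential calls is just an assumed example, not our actual pipeline.

```python
def network_overhead_ms(rtt_ms, sequential_calls):
    """Total time spent on network roundtrips alone, ignoring model compute."""
    return rtt_ms * sequential_calls

# Assumed pipeline: 6 sequential calls (e.g. embedder, reranker, several LLM steps).
overhead = network_overhead_ms(300, 6)
print(f"{overhead / 1000:.1f} s spent on roundtrips")
```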

Funny though, in one region we ended up buying our own GPUs and running the models ourselves. Response times there are about 3x faster for the same models than on Bedrock on average (and Bedrock often hangs for 20+ seconds for no reason, despite all the tricks like cross-region inference and premium tiers AWS managers recommended). For me, it's been easier and less stressful to run LLMs/embedders/rerankers myself than to fight cloud providers' latencies :)

>then put all of your data centers there

>You definitely don't need a data center in every continent.

Not always possible due to legal reasons. Many jurisdictions already have (or plan to have) strict data processing laws. Also, many B2B clients (and government clients too) require all data processing to stay in the country, or at least the region (like the EU), or we simply lose the deals. So, for example, we're already required to use data centers on at least 4 continents, just 2 more continents to go (if you don't count Antarctica :)

kgeist · 2026-04-19T20:18:43 1776629923

Discussed 10 months ago here: https://news.ycombinator.com/item?id=44125598

Back then the consensus was that the idea was absurd. I'm surprised they're now trying to make it into a product.

kgeist · 2026-04-16T14:28:35 1776349715

Llama.cpp already uses an idea from it internally for the KV cache [0]

So a quantized KV cache should now see less degradation.

[0] https://github.com/ggml-org/llama.cpp/pull/21038

kgeist · 2026-04-16T09:15:10 1776330910

>No mention of the fact that Ollama is about 1000x easier to use

I remember that changing the context size from the unusable 2k default to something bigger that the model actually supports required creating a new model file in Ollama if you wanted the change to persist (another alternative: set an env var before running ollama; although, if you go that low-level route, why not just launch llama.cpp?). How was that easier? Did they change this?

I remember people complaining model X is "dumb" simply because Ollama capped the context size to a ridiculously small number by default.

IMHO trying to model Ollama after Docker actually makes it harder for casual users. And power users will have it easier with llama.cpp directly

kgeist · 2026-04-15T23:17:14 1776295034

I wonder why it's so bad. Do they just paste a CSV into the raw model? Because in my experience, even small local models can handle it reasonably well if the harness forces them to write & run a Python script that parses the table and performs the calculations, instead of relying solely on next-token prediction.
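For example, this is roughly the kind of script a harness could make the model write and run instead of doing the arithmetic "in its head". The CSV contents, column names and the `revenue_by_region` helper are invented for illustration.

```python
import csv
import io

# Made-up table standing in for whatever the user pasted.
csv_text = """region,units,price
north,10,2.50
south,4,3.00
north,6,2.50
"""

def revenue_by_region(text):
    """Parse the CSV and sum units * price per region."""
    totals = {}
    for row in csv.DictReader(io.StringIO(text)):
        revenue = int(row["units"]) * float(row["price"])
        totals[row["region"]] = totals.get(row["region"], 0.0) + revenue
    return totals

totals = revenue_by_region(csv_text)
print(totals)
```

The model only has to get the parsing logic right; the interpreter does the actual calculation.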

kgeist · 2026-04-08T22:38:01 1775687881

>the same Napster vs Gnutella analogy, the same celebrity email filtering idea, the same obscure FDR gold ban interest, the same weird hyphenation errors

Dunno, it assumes their cypherpunk group must always discuss strictly cryptography and never anything else. It could be just some off-topic ideas floating around in their community.

For me, the only solid, damning evidence would be statistical methods of text analysis, like those used to establish the authorship of a literary work.

kgeist · 2026-04-08T22:05:54 1775685954

>and while I agree humans can make similar mistakes/confabulations, I overwhelmingly feel that there is no "there" there.

What really opened my eyes a couple weeks ago (anyone can try this): I asked Sonnet to write an inference engine for Qwen3, from scratch, without any dependencies, in pure C. I gave it GGUF specs for parsing (to quickly load existing models) and Qwen3's architecture description. The idea was to see the minimal implementation without all the framework fluff, or abstractions. Sonnet was able to one-shot it and it worked.

And you know what, Qwen3's entire forward pass is just 50 lines of very simple code (mostly vector-matrix multiplications).

The forward pass is only part of the story; you just get a list of token probabilities from the model, that is all. After the pass, you need to choose the sampling strategy: how to choose the next token from the list. And this is where you can easily make the whole model much dumber, more creative, more robotic, make it collapse entirely by just choosing different decoding strategies. So a large part of a model's perceived performance/feel is not even in the neurons, but in some hardcoded manually-written function.
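A minimal sketch of what such a hand-written decoding function can look like, with greedy picking next to temperature plus top-k sampling. The logits are made up, and the function names are mine, not from any particular engine.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Always pick the most likely token: deterministic, tends to feel robotic."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_top_k(logits, k, temperature, rng):
    """Keep the k most likely tokens, then sample among them by probability."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top], temperature)
    return rng.choices(top, weights=probs, k=1)[0]

# Made-up logits for a 4-token vocabulary.
logits = [2.0, 1.5, 0.3, -1.0]
rng = random.Random(0)

greedy_pick = greedy(logits)
samples = [sample_top_k(logits, k=2, temperature=1.0, rng=rng) for _ in range(20)]
```

Same logits, different function, different behavior: raising `k` or the temperature lets less likely tokens through, and `k=1` collapses back to greedy.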

Then I also performed "surgery" on this model by removing/corrupting layers and seeing what happens. If you do this exercise, you can see that it's not intelligence. It's just a text transformation algorithm. Something like a "semantic template matcher". It generates output by finding, matching and combining several prelearned semantic templates. A slight perturbation in one neuron can break the "finding part" and it collapses entirely: it can't find the correct template to match and the whole illusion of intelligence breaks. Its corrupted output is what you expect from corrupting a pure text manipulation algorithm, not a truly intelligent system.

famouswaffles · 2026-04-09T01:11:19 1775697079

>And you know what, Qwen3's entire forward pass is just 50 lines of very simple code (mostly vector-matrix multiplications).

The code being simple doesn't mean much when all the complexity is encoded in billions of learned weights. The forward pass is just the execution mechanism. Conflating its brevity with simplicity of the underlying computation is a basic misunderstanding of what a forward pass actually is. What you've just said is the equivalent of saying blackbox.py is simple because 'python blackbox.py' only took 1 line. It's just silly reasoning.

>After the pass, you need to choose the sampling strategy: how to choose the next token from the list. And this is where you can easily make the whole model much dumber, more creative, more robotic, make it collapse entirely by just choosing different decoding strategies. So a large part of a model's perceived performance/feel is not even in the neurons, but in some hardcoded manually-written function.

So? I can pick the least likely token every time. The result would be garbage but that doesn't say anything about the model. The popular strategy is to randomly pick from the top n choices. What do you think is keeping thousands of tokens coherent and on point even with this strategy? Why don't you try sampling without a large language model to back it and see how well that goes for you?

>Then I also performed "surgery" on this model by removing/corrupting layers and seeing what happens. If you do this exercise, you can see that it's not intelligence. It's just a text transformation algorithm. Something like a "semantic template matcher". It generates output by finding, matching and combining several prelearned semantic templates. A slight perturbation in one neuron can break the "finding part" and it collapses entirely: it can't find the correct template to match and the whole illusion of intelligence breaks. Its corrupted output is what you expect from corrupting a pure text manipulation algorithm, not a truly intelligent system.

What do you think happens when you remove or corrupt arbitrary regions of the human brain? People can lose language, vision, memory, or reasoning, sometimes catastrophically.

kgeist · 2026-04-09T09:43:00 1775727780

>The code being simple doesn't mean much when all the complexity is encoded in billions of learned weights. The forward pass is just the execution mechanism. Conflating its brevity with simplicity of the underlying computation is a basic misunderstanding of what a forward pass actually is. What you've just said is the equivalent of saying blackbox.py is simple because 'python blackbox.py' only took 1 line. It's just silly reasoning.

Look at what a transformer actually does. Attention is a straightforward dictionary lookup in like 3 matmuls. An FFN is a simple space transform rule with a non-linear cutoff to adjust the signal (i.e. a few more matmuls and an activation function) before doing a new dictionary lookup in the next transformer block. Add a few tricks like residual connections, output projections, and repeat N times.
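Spelled out as a stripped-down sketch: single-head attention plus an FFN with residual connections, in plain Python. This is a simplification under stated assumptions, not Qwen3's actual code: it leaves out normalization, positional encoding, masking, multi-head splitting and the output projection, uses ReLU as a stand-in activation, and the identity weights are placeholders.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def attention(x, wq, wk, wv):
    """Project to queries/keys/values, score queries against keys, mix the values."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[s * scale for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)

def ffn(x, w1, w2):
    """Two matmuls with a non-linear cutoff (ReLU here) in between."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def block(x, wq, wk, wv, w1, w2):
    """One transformer block: attention and FFN, each wrapped in a residual add."""
    x = add(x, attention(x, wq, wk, wv))
    return add(x, ffn(x, w1, w2))

# Placeholder weights: 2 tokens, 2 dimensions, identity matrices everywhere.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
y = block(x, identity, identity, identity, identity, identity)
```

A full model stacks N of these blocks with learned weights in place of the identity matrices.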

So yeah, the actual inference code is 50 lines of code, and the rest is large learned dictionaries to search in, with some transforms. So you're saying my one-liner program that consults a DB with 1 million rows is actually 1 million lines of code? Well, not quite.

This trick, coupled with lots of prelearned templates, is enough to fool people into believing there's "there" there (the OP's post above). Just like ELIZA back in the day. Well, apparently this trick is enough to solve lots of problems, because apparently lots of problems only require search in a known problem (template) space (also with reduced dimensionality). But it's still just a fancy search algorithm. I think the whole thing about "emergent behavior" is that when a human is confronted with a huge prelearned concept space, it's so large they cannot digest what is actually happening, and tend to ascribe magical properties to it like "intelligence" or "consciousness". Like, for example, imagine if there were a huge precreated IF..THEN table for every possible question/answer pair a finite human might ask in their lifetime. It would appear to the human that there's intelligence, that there's "there" there. But at the end of the day it would be just a static table with nothing really interesting happening inside of it. A transformer is just a nice trick that lets you compress this huge IF..THEN table into a few hundred gigabytes.

>So? I can pick the least likely token every time. The result would be garbage but that doesn't say anything about the model. The popular strategy is to randomly pick from the top n choices. What do you think is keeping thousands of tokens coherent and on point even with this strategy? Why don't you try sampling without a large language model to back it and see how well that goes for you?

I was referring to the OP post's:

  there is no "there" there

It doesn't even "know" what the actual text continuation must be, strictly speaking. It just returns a list of probabilities that we must select from. It can't select it itself. To go from "list of probabilities" to "chatbot" requires adding additional hardcoded code (no AI involved) that greatly influences how the chatbot behaves and feels. Imagine if an actual sentient being had a button: you press it, and suddenly Steven the sailor becomes a Chinese lady who discusses Confucius. Or starts saying random gibberish. There's no independent agency whatsoever. It's all a bunch of clever tricks.

>What do you think happens when you remove or corrupt arbitrary regions of the human brain? People can lose language, vision, memory, or reasoning, sometimes catastrophically.

In an actual brain, the structure of the connectome itself drives a lot of behavior. In an LLM, all connections are static and predefined. A brain is much more resistant to failure. In an LLM changing a single hypersensitive neuron can lead to a full model collapse. There are humans who live normal lives with a full hemisphere removed.

famouswaffles · 2026-04-09T14:06:54 1775743614

I get irritated when people act like they know what they are talking about but then it's just nonsense they keep spitting out. I'm honestly sick of it. There's a fair amount of LLM interpretability research out there. If you're actually interested in knowing better then go read it. I'll even link what I find interesting. All this talk of lookup tables is nonsensical. You have no idea what you're talking about.

>It doesn't even "know" what the actual text continuation must be, strictly speaking. It just returns a list of probabilities that we must select from. It can't select it itself. To go from "list of probabilities" to "chatbot" requires adding additional hardcoded code (no AI involved) that greatly influences how the chatbot behaves and feels. Imagine if an actual sentient being had a button: you press it, and suddenly Steven the sailor becomes a Chinese lady who discusses Confucius. Or starts saying random gibberish. There's no independent agency whatsoever. It's all a bunch of clever tricks.

You are not making any sense here. Producing a probability distribution over next tokens is the model’s decision procedure. Sampling is just the readout rule for turning that distribution into a concrete sequence. Yes, decoding choices affect style, creativity, determinism, and failure modes. That is true. It does not follow that the model is therefore “just tricks” or that the intelligence-like behavior lives outside the network.

>In an actual brain, the structure of the connectome itself drives a lot of behavior. In an LLM, all connections are static and predefined. A brain is much more resistant to failure. In an LLM changing a single hypersensitive neuron can lead to a full model collapse. There are humans who live normal lives with a full hemisphere removed.

You are moving goalposts. Fact is: randomly corrupting a system damages it. This is not a meaningful test of whether a system is "truly intelligent." Random lesions to human cortex are also catastrophic. The hemispherectomy cases you mention involve surgical removal of diseased tissue with significant neural reorganization over time, not random weight corruption. That's not even a fair comparison.

LLMs are also deeply redundant. If they weren't, techniques like quantization or layer pruning wouldn't work.