mathisfun123's comments | Hacker News

There are literally only two "fabfull" processor companies (Intel and Samsung), so you're saying something completely meaningless.

Actually there are more if you count the ones that aren't at the cutting edge, but your point still stands: most high-end silicon companies only do design.

You're at G, which is absolutely the only place I'd expect to be doing this in a mature/adult/non-psychotic way.

> Yes, it is possible to complete a PhD in 3-4 years, but it's not really good for your career.

this is such a "trust me bro it's good for you" con.

i graduated in 3.5 years and went directly to FAANG, where i make 2x the highest-paid TT at the T10 school i graduated from. do you really have the gall to tell me that it wasn't good for my career to accelerate my PhD and thereby minimize its cost (i.e., opportunity cost)?

> A PhD is more like an apprenticeship

the vast majority of advisors have no skills other than how to hack the pub game. they literally have zero clue about the research. the remainder are the "exceptions that prove the rule".


> I'm curious and not an expert here, do you know why the TTFT is so much worse on Mac?

because the GPUs aren't as fantastic as everyone assumes?

> might also be less optimised in MLX?

prefill has gotta be one of the most optimized paths in MLX...
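
to put rough numbers on it: prefill is compute-bound (~2 FLOPs per param per prompt token) while decode is bandwidth-bound (the active weights stream once per token), and Apple GPUs carry a lot of bandwidth relative to their FLOPs compared to discrete cards. a back-of-envelope sketch, where the 30/160 TFLOPS and 400/1000 GB/s figures are assumed round numbers, not real specs:

    # Prefill is compute-bound; decode is bandwidth-bound.
    # All hardware figures below are assumed round numbers, not measurements.

    def prefill_s(params_b, prompt_toks, tflops):
        # ~2 FLOPs per parameter per prompt token, at ideal utilization
        return (2 * params_b * 1e9 * prompt_toks) / (tflops * 1e12)

    def decode_tps(params_b, bytes_per_param, gbps):
        # every generated token streams the active weights once
        return (gbps * 1e9) / (params_b * 1e9 * bytes_per_param)

    # hypothetical 8B dense model, 4-bit weights, 8k-token prompt
    for name, tflops, gbps in [("Mac-class GPU", 30, 400),
                               ("discrete GPU", 160, 1000)]:
        print(f"{name}: prefill ~{prefill_s(8, 8192, tflops):.1f}s, "
              f"decode ceiling ~{decode_tps(8, 0.5, gbps):.0f} t/s")

the discrete card comes out ~5x faster on TTFT but only ~2.5x faster on decode, which is roughly the asymmetry people keep noticing.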


No you don't understand, on Apple Silicon my CPU has comparable memory bandwidth to a $400 Pascal-era GPU. With the unified memory architecture, that means my iGPU gets 2016-levels of DDR transfer speed with none of the upsides of CUDA. It's the most cutting-edge hardware ever put in a personal computer, without a doubt.

Please show me on the 2016-era $400 Pascal GPU where you can install the 256 GB of VRAM.

We're quite lucky that Nvidia didn't ship a 256 GB system at sub-500 GB/s transfer rate, is my point.

> Nvidia didn't ship a 256 GB system at sub-500 GB/s transfer rate

DGX Spark has 128 GB and only 273 GB/s BW. Are we lucky that NVIDIA did ship something even worse than what you specified? I'm confused.

People have been complaining [1] about how little VRAM NVIDIA ships with their GPUs for decades. Their whole game has been "oh, you want more VRAM? Buy more or pay us 50x for server grade with 10x as much VRAM. The more you buy, the more you save."

Apple did everyone a solid by shipping something way out of that distribution. We now know more than we did before! We know that a 284B parameter model with 13B active params (or 35B with 3B active, or 671B with 37B active) can outperform a 2T model and draw a fraction as much power. How can you think that's a bad thing?

You could point out that Apple didn't invent the idea of MoE. Everyone knows that. But other than Macs, there simply were no machines with >100GB VRAM directly coupled to ~50 TFLOP/s of compute until the DGX Spark last Dec. If you wanted to run a model with more than 32 GB of weights, you had to either pay up for dozens of GPUs idling at hundreds of watts or really pay up for some $50,000 server GPUs idling at... also 100-200W each.
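
The arithmetic behind that claim: decode is memory-bound, so the tokens/s ceiling scales with *active* params, not total params. A quick sketch using the model shapes above, assuming 4-bit weights (0.5 bytes/param); the two bandwidth figures are round stand-ins for a DGX Spark and a high-end Mac:

    # t/s ceiling for memory-bound decode:
    #   bytes/token ~= active_params * bytes_per_param
    #   t/s         ~= bandwidth / bytes_per_token
    BPP = 0.5  # assumed 4-bit quantization

    def ceiling_tps(active_b, gbps):
        return (gbps * 1e9) / (active_b * 1e9 * BPP)

    models = [("2T dense", 2000),
              ("671B MoE, 37B active", 37),
              ("284B MoE, 13B active", 13)]
    for gbps in (273, 800):  # roughly DGX Spark vs. high-end Mac
        for name, active in models:
            print(f"{gbps} GB/s, {name}: ~{ceiling_tps(active, gbps):.1f} t/s")

    # The 2T dense model needs ~1 TB of weights at 4-bit, so it doesn't even
    # fit in 256 GB; the MoE models fit and decode 50-150x faster.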

I feel lucky to have a $3k machine on my shelf that can run DS4-Flash with 1M context at 20t/s while drawing ~150W and making very little noise. The best part? It idles at 30W with DS4 loaded, dropping to 6W after a reboot. There isn't a single GPU on the market that can match that in the same shoebox volume.

[1] https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRlOW0N...


The DGX Spark is also a niche, arbitrarily limited machine that will not displace serious datacenter workloads. It's targeted directly at the homelab LARPers and arguably a waste of money versus similarly priced GPU clusters. A 256 GB Spark at LPDDR5X transfer rates would be a genuine travesty.

You can try to weasel in any sort of edge-case justification you want - these are not industry-grade machines. They are slow, expensive, bandwidth-constrained SoCs that don't hold a candle to either datacenter GPUs or even decade-old gaming GPUs. It's worth criticizing when Apple does it, and worth criticizing when Nvidia does it. The only difference is that Nvidia has natural datacenter buy-in, while Apple can't even justify their own hardware in the face of TPU inference costs: https://9to5mac.com/2026/03/02/some-apple-ai-servers-are-rep...


What even is an industry-grade machine?

Would you own a computer if the smallest computer you were allowed to buy was a $27,000 Supermicro rack that draws 900 watts all the time?


What exactly are you upset about? Someone observing that MLIR is extremely complex and dependent on LLVM...?

The quoted writing is AI slop, and OP is reacting to the fact that they did not write even the introductory text themselves (or at least didn't bother to edit out clear AI/slop indicators).

... Who cares...

Clearly I.



Orange Reddit. Unfortunately that rings a little too true these days. Hopefully it's a stage that reverts at some point.

this dude is a distinguished engineer at Siemens posting the dopiest reddit-level takes. lolol.

Agreed, it's not related to the Rust-to-CUDA compiler, you're right! But I have to say the upcoming new stuff is worth a look, as this is kind of a wow: Rust on good old CUDA.

every GPU related post has a comment which makes my eyes roll all the way back. this is the one for this post.

> get them via SME

I have no idea what this means - AMX was replaced by SME on M4. It's a new unit, not just an "abstract intrinsic" (which would make zero sense).


I’m not sure which part is confusing you or how to word it so it makes more sense to you.

What I’m saying is that instead of using the secret AMX instructions, just use SME, assuming they have the hardware available to them.

AMX isn’t truly gone afaik, at least according to the folks who have been looking at it. It’s just deprecated, and it seems like the architecture treats them somewhat like aliases, preventing concurrent use within a process.
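
If you want to check at runtime, macOS exposes CPU feature flags through sysctl. A minimal sketch, assuming the key name hw.optional.arm.FEAT_SME (that’s what M4-era macOS appears to report; treat the exact key as an assumption):

    import subprocess

    def sysctl(key):
        # returns the key's value, or None if the key doesn't exist
        try:
            out = subprocess.run(["sysctl", "-n", key],
                                 capture_output=True, text=True, check=True)
            return out.stdout.strip()
        except subprocess.CalledProcessError:
            return None

    # hw.optional.arm.FEAT_SME is an assumed key; absent on pre-M4 hardware
    val = sysctl("hw.optional.arm.FEAT_SME")
    print({"1": "SME available", "0": "SME not available"}.get(
        val, "FEAT_SME key not reported (older hardware or macOS?)"))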



