
  • Yeah, the cache hierarchy is behaving kinda wonky lately. Many AI workloads (and that’s what’s driving hardware development right now) are constrained by memory bandwidth, and cache only helps with part of that: it helps with repeated access, not so much with streaming access to datasets much larger than the cache (e.g. many current AI models). There’s a quick sketch of the difference at the end of this comment.

    Intel already tried selling CPUs with both on-package HBM and slotted DDR RAM (the Xeon Max line). No one wanted them, as the performance gains from the expensive HBM evaporated completely as soon as you touched memory out-of-package (assuming workloads bound by memory bandwidth, which currently dominate the compute market).

    To get good performance out of that kind of split, you have to explicitly code the memory transfers so data is prefetched (preferably asynchronously) from the slower memory into the faster, à la classic GPU programming; the second sketch below shows the pattern. YMMV.
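
    First, a minimal sketch of the cache-vs-streaming point from the first paragraph, in plain C. The sizes are illustrative assumptions, nothing is tuned for a particular part: a small buffer that stays cache-resident gets re-read many times, a large one streams through once at whatever rate the DRAM can feed.

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    enum { LINE = 64 };            /* assume 64-byte cache lines */

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static volatile long sink;     /* defeat dead-code elimination */

    /* Touch one byte per cache line; each miss pulls in a whole line,
     * so bytes-spanned / time approximates the bandwidth actually used. */
    static double gbps(const char *buf, size_t len, int passes) {
        long acc = 0;
        double t0 = now_sec();
        for (int p = 0; p < passes; p++)
            for (size_t i = 0; i < len; i += LINE)
                acc += buf[i];
        sink = acc;
        return (double)len * passes / (now_sec() - t0) / 1e9;
    }

    int main(void) {
        size_t small = (size_t)1 << 20;   /* 1 MiB: cached after pass 1  */
        size_t large = (size_t)1 << 30;   /* 1 GiB: bigger than any LLC  */
        char *a = malloc(small), *b = malloc(large);
        if (!a || !b) return 1;
        memset(a, 1, small);
        memset(b, 1, large);              /* also faults the pages in */

        printf("repeated, cache-resident: %6.1f GB/s\n", gbps(a, small, 1024));
        printf("streaming, DRAM-bound:    %6.1f GB/s\n", gbps(b, large, 1));
        free(a); free(b);
        return 0;
    }
    ```

    On a typical desktop part the first number should land well above the second, and that gap is exactly what a bigger cache can’t buy back once the working set stops fitting.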
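
    And second, a sketch of the explicit staging pattern from the last paragraph, using a plain pthread as a stand-in for a real async copy engine. Everything here is a made-up illustration: “slow” is just ordinary malloc’d memory, and on an actual HBM+DDR part the staging buffers would need to come from the fast tier (via memkind, NUMA policy, or whatever the platform offers). The shape is the classic GPU double-buffering loop: copy chunk k+1 while computing on chunk k.

    ```c
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK ((size_t)1 << 20)   /* 1 MiB staging chunks (arbitrary) */

    struct copy_job { const char *src; char *dst; size_t len; };

    /* Helper thread doing the "async memcpy" from slow to fast memory. */
    static void *copy_worker(void *arg) {
        struct copy_job *j = arg;
        memcpy(j->dst, j->src, j->len);
        return NULL;
    }

    /* Stand-in for real compute; reads the chunk out of the fast buffer. */
    static long process(const char *chunk, size_t len) {
        long acc = 0;
        for (size_t i = 0; i < len; i++)
            acc += chunk[i];
        return acc;
    }

    int main(void) {
        size_t total = (size_t)256 << 20;          /* 256 MiB "slow" dataset */
        char *slow = malloc(total);                /* pretend this is DDR    */
        char *stage[2] = { malloc(CHUNK), malloc(CHUNK) }; /* pretend: HBM   */
        if (!slow || !stage[0] || !stage[1]) return 1;
        memset(slow, 1, total);

        size_t nchunks = total / CHUNK;
        long sum = 0;
        pthread_t t;

        memcpy(stage[0], slow, CHUNK);             /* prime the pipeline */
        for (size_t k = 0; k < nchunks; k++) {
            int cur = (int)(k & 1);
            struct copy_job j;
            int pending = (k + 1 < nchunks);
            if (pending) {                         /* prefetch chunk k+1 ... */
                j.src = slow + (k + 1) * CHUNK;
                j.dst = stage[cur ^ 1];
                j.len = CHUNK;
                pthread_create(&t, NULL, copy_worker, &j);
            }
            sum += process(stage[cur], CHUNK);     /* ... while chunk k runs */
            if (pending) pthread_join(&t, NULL);
        }
        printf("sum = %ld\n", sum);
        free(slow); free(stage[0]); free(stage[1]);
        return 0;
    }
    ```

    With copy and compute overlapped, each iteration costs max(copy, compute) instead of their sum. That’s the whole trick, whether the engine is a pthread, a DMA unit, or a CUDA stream.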

  • Apparently AMD couldn’t make the signal integrity work out with socketed RAM. (source: LTT video with Framework CEO)

    IMHO: Up until now, using soldered RAM was lazy and cheap bullshit. But I do think we’re at the limit of what’s reasonable to do over socketed RAM. In high-performance datacenter parts, socketed RAM is on its way out (see: MI300A, Grace-{Hopper,Blackwell}, Xeon Max), with on-package memory gaining ground. I think we’ll see the same trend in consumer hardware as well. Requirements on memory bandwidth and latency keep going up with recent trends like powerful integrated graphics and AI-slop, and socketed RAM simply won’t keep up.

    It’s sad, but in a few generations I think only the lower-end consumer CPUs will still take socketed RAM. I’m betting the high-performance consumer CPUs will require RAM that’s not just soldered to the motherboard, but mounted on the CPU package itself.

    Finally, some Grace Hopper to make everyone happy: https://youtube.com/watch?v=gYqF6-h9Cvg