As I replied to a child comment - this is a nice idea that just isn't tenable in...

c7b · 2026-06-13T12:20:57 1781353257

Could you put some numbers and examples behind the efficiency gap between data-center and consumer-grade AI hardware? Did you include examples like the RTX Spark on the consumer side? I was always amazed at the low power consumption of unified-memory-style architectures, both in absolute terms and even more so compared to consumer-grade GPUs. I'd be genuinely interested in a comparison with data-center-grade hardware.

ux266478 · 2026-06-13T06:03:40 1781330620

AI hardware is for inference, not training. Training uses normal HPC crap. Superpods aren't really power efficient; that's kind of a meme, and it stems from limiting the power draw of other components by having fewer of them. It's more of a rounding error.

> you'd end up consuming so much excess electricity it would be cheaper on net to simply take the money that would have gone to the power bill and spend it on your own datacenter.

With costs spread over a large population, it really doesn't matter. You're not getting hundreds of thousands of people to pitch in half their monthly electric bill to pay for someone else's datacenter. They will pay for the electricity themselves quite happily though, if all they need to do is give you compute. This isn't new.

Interconnect is the bottleneck for distributed training, nothing else really.

sho · 2026-06-13T06:45:15 1781333115

> AI hardware is for inference, not training

Not sure what you are referring to, unless you don't think H100/H200/B200 are "AI hardware".

> Superpods aren't really power efficient

Maybe not compared to a specialized rig with multiple 4090s, but that is the best case for consumer hardware - the vast majority will be dramatically less efficient than that

Anyway, I agree the interconnect is by far the biggest obstacle and seems insurmountable, I should probably have led with that.

rurban · 2026-06-13T10:51:14 1781347874

You got it wrong. Inference can use crap GPUs. Training needs the 100x more expensive big guns. Our training machine is 100x more expensive than our inference machine.

pksebben · 2026-06-13T06:14:34 1781331274

Bit of a doozy though, that one.

I recall getting really excited over Hinton's Forward-Forward (FF) foray, right before he bailed on AI as a societal direction (which, if anyone ever had the right, I suppose he does). If one squints, one can see a backprop-free base being much easier to train on geographically distributed and heterogeneous hardware.

Davidzheng · 2026-06-13T10:05:42 1781345142

Are you sure most of frontier cost isn't inference in RL environments?

dyauspitr · 2026-06-13T06:57:23 1781333843

That makes no sense. It’s basically the same calculations for training as well.

WithinReason · 2026-06-13T07:58:48 1781337528

The efficiency difference between training on GPUs and TPUs is 2x at best. You can get very efficient with Tensor Cores, converging to TPU efficiency. In the end math is math: you can't make a multiplication more efficient than it already is on a GPU.

schobi · 2026-06-13T08:37:33 1781339853

I guess this was more related to syncing GPUs.

If you were to take 500 computers with older 1080 GPUs, you might have enough compute/RAM to be equivalent to an H200 GPU for training such a model. Maybe take 10000.

But if those machines are spread over 10000 homes, wired with residential internet service, training a large model will not get anywhere.

You go from "data in the same HBM memory chip" at 4.8 TB/s or "data in adjacent GPU" with NVLink at 1.2 TB/s down to 25 Mbit/s upload speed. Accessing the next piece of data is going to be about a million times slower. At the same time you will heat a thousand times more, for a million times longer.
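
A rough sketch of that arithmetic, using only the three bandwidth figures quoted above. The 1 GB payload is an assumed size picked for illustration, and latency and protocol overhead are ignored, so treat the output as an order-of-magnitude check rather than a benchmark.

```python
# Bandwidth figures quoted in the comment above, converted to bytes per second.
HBM_BYTES_PER_S = 4.8e12          # 4.8 TB/s, data in the same HBM
NVLINK_BYTES_PER_S = 1.2e12       # 1.2 TB/s, adjacent GPU over NVLink
UPLINK_BYTES_PER_S = 25e6 / 8     # 25 Mbit/s residential upload = 3.125 MB/s


def slowdown(fast_bytes_per_s: float, slow_bytes_per_s: float) -> float:
    """How many times longer the slow link takes to move the same data."""
    return fast_bytes_per_s / slow_bytes_per_s


def transfer_seconds(num_bytes: float, bytes_per_s: float) -> float:
    """Time to move num_bytes over a link, ignoring latency and overhead."""
    return num_bytes / bytes_per_s


hbm_vs_uplink = slowdown(HBM_BYTES_PER_S, UPLINK_BYTES_PER_S)
nvlink_vs_uplink = slowdown(NVLINK_BYTES_PER_S, UPLINK_BYTES_PER_S)

# Hypothetical payload: 1 GB of data to exchange (assumed, not from the thread).
PAYLOAD_BYTES = 1e9
uplink_time = transfer_seconds(PAYLOAD_BYTES, UPLINK_BYTES_PER_S)
nvlink_time = transfer_seconds(PAYLOAD_BYTES, NVLINK_BYTES_PER_S)

print(f"HBM vs uplink:    {hbm_vs_uplink:,.0f}x")
print(f"NVLink vs uplink: {nvlink_vs_uplink:,.0f}x")
print(f"1 GB over uplink: {uplink_time:.0f} s, over NVLink: {nvlink_time * 1e3:.2f} ms")
```

With these inputs the HBM-to-uplink ratio comes out at roughly 1.5 million, which is consistent with the "about a million times slower" estimate.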

incrudible · 2026-06-13T09:21:32 1781342492

You need to train independently and merge rarely. The problem is the merge step. Weights are too entangled; you are not going to get an improvement commensurate with the effort. Otherwise, everyone would do it. It is an open research problem.
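
To make "train independently and merge rarely" concrete, here is a minimal sketch of periodic parameter averaging (often called local SGD) on a toy problem. Everything in it is an assumption for illustration: the one-parameter linear model, the synthetic data, and the function name `local_sgd`. Averaging works here because the problem is convex with a single weight; it says nothing about whether the same merge helps for large, entangled networks, which is exactly the open question.

```python
import random


def local_sgd(num_workers=4, rounds=20, local_steps=50, lr=0.05, seed=0):
    """Each worker fits y = w * x on its own shard; weights are averaged once per round."""
    rng = random.Random(seed)
    true_w = 3.0
    # Each worker gets a private shard of noisy synthetic samples.
    shards = []
    for _ in range(num_workers):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
        shards.append([(x, true_w * x + rng.gauss(0.0, 0.1)) for x in xs])

    w_global = 0.0
    for _ in range(rounds):
        local_weights = []
        for shard in shards:
            w = w_global                      # start from the last merged weight
            for _ in range(local_steps):      # no communication during these steps
                x, y = rng.choice(shard)
                grad = 2.0 * (w * x - y) * x  # derivative of the squared error w.r.t. w
                w -= lr * grad
            local_weights.append(w)
        # The only communication: one number per worker, once per round.
        w_global = sum(local_weights) / num_workers
    return w_global


merged_w = local_sgd()
print(f"merged weight: {merged_w:.3f} (true value 3.0)")
```

The communication cost is one exchange per `local_steps` updates instead of one per update, which is the appeal for slow links.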

filup · 2026-06-13T10:33:03 1781346783

That sounds like the way. Everyone trains their own small problems to maximally compressed weights and then merges.

zozbot234 · 2026-06-13T08:35:21 1781339721

The power-constrained part of compute is data movement, not the elementary arithmetic per se. Anyway, it's very possible to tweak the underlying design to increase throughput a lot for any given power budget at the cost of high latency. This seems especially useful for training workloads where we don't really care about latency as much.

Cider9986 · 2026-06-13T07:31:19 1781335879

What makes you think DeepSeek or GLM won't catch up to Fable level? Why would there be a break in the trend now?

zozbot234 · 2026-06-13T08:40:36 1781340036

DeepSeek and GLM (plus Kimi) are at or above Sonnet level wrt. favorable workloads like coding. They're not close to Opus or the latest GPT yet, and Fable is even higher than that. Other workloads relying more on real-world knowledge have them even further behind, and this can't be mitigated without making the model itself bigger and harder to host locally.

thepasch · 2026-06-13T09:22:15 1781342535

> They're not close to Opus or the latest GPT yet

Disagreed. GLM-5.1 is easily as good as Opus 4.5 for all the coding purposes I could throw at it, which is the model that kicked this entire hype cycle into overdrive in the first place.

Cider9986 · 2026-06-13T08:47:03 1781340423

I've found GLM to be comparable to or better than Opus at writing, and at a fraction of the cost.

zozbot234 · 2026-06-13T08:51:25 1781340685

Writing does not rely on real-world knowledge all that much, other than knowledge of language itself. Even tiny models can achieve that; it's even easier than coding.

metalspot · 2026-06-13T11:00:21 1781348421

The key thing here is that effective intelligence = model capability / cost. If you drive down the cost of inference you can have higher effective capability even with a technically less capable model. There is nothing in Anthropic's or OpenAI's general reasoning capabilities that can't be easily done much better with a purpose-built harness for a domain-specific task.
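
As a toy illustration of that ratio: the scores and costs below are invented purely to show the arithmetic and do not describe any real model or price list.

```python
def effective_intelligence(capability: float, cost: float) -> float:
    """The ratio from the comment: capability per unit cost (arbitrary units)."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return capability / cost


# Hypothetical numbers, made up for illustration only.
frontier = effective_intelligence(capability=100.0, cost=15.0)
cheaper = effective_intelligence(capability=80.0, cost=2.0)

print(f"frontier model: {frontier:.1f} capability per unit cost")
print(f"cheaper model:  {cheaper:.1f} capability per unit cost")
```

Under these made-up inputs the less capable model scores higher on the ratio, which is the trade-off being described.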

kuboble · 2026-06-13T08:14:58 1781338498

I think there are at least a few question marks.

One being that extrapolating from like 3 data points is hardly science. All trends break at some point.

The other is that the measures to prevent distillation of their models (if that was the secret sauce of the Chinese models) could work if nobody is allowed to use them.

incrudible · 2026-06-13T09:16:43 1781342203

> As I replied to a child comment - this is a nice idea that just isn't tenable in reality. AI hardware isn't just hilariously faster than consumer GPUs, it's also hilariously more power-efficient and has hilariously better connectivity. Every one of these dimensions kills the idea.

The first part is not really true though, the chips are not that much faster, the DRAM is not that much faster, and in aggregate it does not matter because there is just so much more consumer hardware out there (although perhaps that is changing as supply shifts toward datacenters).

The interconnect and data locality are the problem. If you could train it the way you can render a scene with Monte Carlo ray tracing, any result from any node could be merged with any other, and the combined result would have converged closer to the limit. I am sure research in that direction exists; it just has not proven effective at the scales where it has been attempted.
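
A small sketch of the Monte Carlo property being described, using an estimate of pi as a stand-in for a rendered pixel. The function names and the per-node sample counts are assumptions for illustration. The point is that each node's result is an independent unbiased estimate, so merging is just a sample-weighted average that can be applied in any order or grouping.

```python
import random


def node_estimate(num_samples: int, seed: int):
    """One node's independent Monte Carlo estimate of pi, plus its sample count."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(num_samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / num_samples, num_samples


def merge(results):
    """Combine estimates from any nodes, in any order, weighted by sample count."""
    total = sum(n for _, n in results)
    return sum(est * n for est, n in results) / total, total


# Nodes of different speeds contribute different sample counts (hypothetical sizes).
sample_counts = [20_000, 50_000, 5_000, 100_000]
results = [node_estimate(n, seed) for seed, n in enumerate(sample_counts)]

merged, total = merge(results)
# Merging partial merges gives the same answer as merging everything at once.
regrouped, _ = merge([merge(results[:2]), merge(results[2:])])

print(f"merged estimate from {total} samples: {merged:.4f}")
print(f"regrouped merge:                      {regrouped:.4f}")
```

Gradient-based training lacks this property in general, because each node's update depends on the weights it started from, which is why the merge step is the hard part.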