As I replied to a child comment - this is a nice idea that just isn't tenable in reality. AI hardware isn't just hilariously faster than consumer GPUs, it's also hilariously more power-efficient and has hilariously better connectivity. Every one of these dimensions kills the idea.
The far, FAR superior power efficiency means that even if you did harness every public GPU or GPU-like device on earth, you'd end up consuming so much excess electricity it would be cheaper on net to simply take the money that would have gone to the power bill and spend it on your own datacenter.
And even if electricity was free, having those GPUs spread over the world with internet-level latency will slow everything down by factors of thousands to millions - if it's feasible at all. Regardless, you're not getting fable-oss this decade, maybe even not this century.
It would be better for governments to buy and own their own datacenters, maybe as a coalition, and dedicate their operation to the public good. I believe that is what we actually have to do.
AI hardware is for inference, not training. Training uses normal HPC crap. Superpods aren't really power efficient, it's kind of a meme, and it stems from limiting the power draw of other components by having less of them. It's more of a rounding error.
> you'd end up consuming so much excess electricity it would be cheaper on net to simply take the money that would have gone to the power bill and spend it on your own datacenter.
Costs spread over a large population, it really doesn't matter. You're not getting hundreds of thousands of people to pitch half their monthly electric bill to pay for someone else's datacenter. They will pay the electricity themselves quite happily though, if all they need to do is give you compute. This isn't new.
Interconnect is the bottleneck for distributed training, nothing else really.
You got it wrong. Inference can use crap GPUs. Training needs the 100x more expensive big guns. Our training machine is 100x more expensive than our inference machine.
How is the result of training stored? How big is that? It seems reasonable to assume we’ll eventually plateau and all we’ll need is relatively infrequent training.
Not so often. The GPUs are running at 100% for 3 weeks for a training run. We do images only, but it's the same process. And then we can use the costly GPUs for inference, local model coding agents.
Training is about 4x a year. But it depends on what ideas the PM or the customers have. If they have more, there are more training tasks. E.g. more viruses to detect.
Not sure what you are referring to, unless you don't think h100/h200/b200 are "AI hardware"
> Superpods aren't really power efficient
Maybe not compared to a specialized rig with multiple 4090s, but that is the best case for consumer hardware - the vast majority will be dramatically less efficient than that
Anyway, I agree the interconnect is by far the biggest obstacle and seems insurmountable, I should probably have led with that.
I recall getting really excited over Hinton's FF foray, right before he bailed on AI as a societal direction (which, if anyone ever had the right, I suppose he does). If one squints, one can see a backprop-free base being much easier to train on geographically distributed and heterogeneous hardware.
Efficiency difference between training on GPUs and TPUs is 2x at best. You can get very efficient with tensorcores, converging to TPU efficiency. In the end math is math, you can't make a multiplication more efficient than it already is on GPU.
If you were to take 500 computers with older 1080 GPUs, you might have enough compute/ram equivalent to an H200 GPU for training such a model. Maybe take 10000.
But if those machines are spread over 10000 homes, wired with residential internet service, training a large model will not get anywhere.
You go from "data in the same HBM memory chip" at 4.8 TB/s or "data in adjacent GPU" with NVLink at 1.2 TB/s down to 25 Mbit/s upload speed. Accessing the next piece of data is going to be about a million times slower.
At the same time you will heat a thousand times more, for a million times longer.
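To put rough numbers on that gap, here is a minimal sketch that uses only the three bandwidth figures quoted above. The 1 GB payload is a hypothetical example, and latency and protocol overhead are ignored, so real transfers would be slower still.

```python
# Back-of-envelope bandwidth comparison using the figures quoted above.
# All three link speeds come from the comment; none are measured here.

HBM_BYTES_PER_S = 4.8e12        # "4.8TB/s" on-package HBM
NVLINK_BYTES_PER_S = 1.2e12     # "1.2 TB/s" to an adjacent GPU
UPLINK_BITS_PER_S = 25e6        # "25 MBit/s" residential upload

uplink_bytes_per_s = UPLINK_BITS_PER_S / 8  # bits -> bytes

hbm_ratio = HBM_BYTES_PER_S / uplink_bytes_per_s
nvlink_ratio = NVLINK_BYTES_PER_S / uplink_bytes_per_s


def seconds_to_send(num_bytes: float, bytes_per_s: float) -> float:
    """Time to push num_bytes through a link, ignoring latency and overhead."""
    return num_bytes / bytes_per_s


# Hypothetical payload: 1 GB of gradients or activations.
payload = 1e9

print(f"HBM vs uplink:    {hbm_ratio:,.0f}x")
print(f"NVLink vs uplink: {nvlink_ratio:,.0f}x")
print(f"1 GB over uplink: {seconds_to_send(payload, uplink_bytes_per_s):,.0f} s")
print(f"1 GB over NVLink: {seconds_to_send(payload, NVLINK_BYTES_PER_S) * 1e3:.2f} ms")
```

On these inputs the HBM-to-uplink ratio comes out a bit above one million, which is consistent with the "about a million times slower" estimate.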
You need to train independently and merge rarely. The problem is the merge step. Weights are too entangled, you are not going to get an improvement commensurate to the effort. Otherwise, everyone would do it. It is an open research problem.
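For readers unfamiliar with the pattern, here is a toy sketch of "train independently, merge rarely" (local SGD with periodic averaging). Everything in it is a hypothetical illustration: each worker minimizes its own 1-D quadratic, and the function names are made up. Because this toy is convex, plain averaging works trivially; the difficulty described above is precisely that real network weights do not merge this cleanly.

```python
# Toy "train independently, merge rarely" loop (local SGD with periodic averaging).
# Hypothetical setup: each worker minimizes its own 1-D quadratic (w - target)^2.

def local_steps(w: float, target: float, lr: float, steps: int) -> float:
    """Run plain gradient descent on (w - target)^2 for a number of local steps."""
    for _ in range(steps):
        grad = 2.0 * (w - target)
        w -= lr * grad
    return w


def train(targets, rounds: int, steps_per_round: int, lr: float = 0.1):
    """Each round: every worker trains alone, then all replicas are averaged once."""
    w_global = 0.0
    syncs = 0
    for _ in range(rounds):
        replicas = [local_steps(w_global, t, lr, steps_per_round) for t in targets]
        w_global = sum(replicas) / len(replicas)  # the only communication step
        syncs += 1
    return w_global, syncs


targets = [1.0, 2.0, 3.0, 6.0]  # each worker sees different "data"
w, syncs = train(targets, rounds=5, steps_per_round=50)
print(f"merged weight = {w:.4f} after {syncs} syncs (250 local steps per worker)")
```

The merged weight settles at the mean of the workers' targets while communicating only once per 50 local steps, which is the communication saving the approach is after.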
The power-constrained part of compute is data movement, not the elementary arithmetic per se. Anyway, it's very possible to tweak the underlying design to increase throughput a lot for any given power budget at the cost of high latency. This seems especially useful for training workloads where we don't really care about latency as much.
Could you put some numbers and examples behind the efficiency gap between data center and consumer-grade AI hardware? Did you include examples like the RTX Spark on the consumer side? I was always amazed at the low power consumption of unified memory style architectures. In absolute terms and even more so compared to consumer-grade GPUs. I'd be genuinely interested in a comparison with data-center-grade hardware.
DGX Spark is effectively prosumer hardware, better than most consumer stuff but still not comparable to actual datacenter gear. You can't just look at TDP in isolation without also comparing performance.
It's more than the raw hardware, it's the interconnect and communication between the hardware at scale. These models are trained on hundreds of thousands of GPUs today. You _will_ start to see cross-datacenter training runs but this needs to efficiently decide when and how to communicate across datacenter, which bears a very high cost compared to intra-datacenter communication.
Dunno, in a sense, torrents came among similar restrictions. Everything at consumer level was just plain awful and at dial up level, mebbe ISDN if you were very lucky, with fiber only available to ridiculously rich people and corps. But with restrictions, came approaches on how to mitigate them.
Yes but not violations of the laws of physics. You need extremely fast communications, memory bandwidth, etc; you cannot get that with distributed training. You're up against the speed of light and the interconnect that powers the internet. You will always have horrifically slow latency compared to if you pack the servers together in the same place with specialized networking.
<< You will always have horrifically slow latency compared to if you pack the servers together in the same place with specialized networking.
Agree about the physics; disagree about the larger point.
I am not questioning that servers packed together may achieve an optimal result in how we are currently doing things, but, and this is my point, what if we didn't.
<< you cannot get that with distributed training
This is entirely the wrong question to ask. The question to ask is: how it could be adapted to distributed training.
You know what, I'm surprised to find that this is far more feasible than I assumed. DiLoCo and the INTELLECT models demonstrate that decentralized training is already feasible; it is very surprising to me that you can get that far with so much less communication bandwidth. Not only that, but distributed training becomes _more_ feasible as you scale, since the compute needed scales as the square of the parameter count while communication scales linearly, so the overhead penalty goes down.
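A rough sketch of that scaling argument follows. The assumptions are mine, not from the thread: training FLOPs of about 6·N·D, a compute-optimal token count of about 20 tokens per parameter, 2 bytes sent per parameter per sync, and a sync count held constant as the model grows. Under those assumptions the compute done per byte communicated grows linearly with parameter count.

```python
# Sketch of why communication overhead shrinks, relatively, as models grow.
# Assumptions (not from the thread): FLOPs ~= 6*N*D, D ~= 20*N tokens,
# 2 bytes per parameter per sync, and a fixed number of syncs per run.

TOKENS_PER_PARAM = 20      # compute-optimal rule of thumb (assumption)
BYTES_PER_PARAM = 2        # bf16 payload per parameter (assumption)
NUM_SYNCS = 1_000          # hypothetical sync count, held constant


def training_flops(n_params: float) -> float:
    """Total training compute; grows as N^2 because tokens scale with N."""
    return 6.0 * n_params * (TOKENS_PER_PARAM * n_params)


def bytes_communicated(n_params: float) -> float:
    """Total bytes each worker sends over the run; grows linearly in N."""
    return BYTES_PER_PARAM * n_params * NUM_SYNCS


def flops_per_byte(n_params: float) -> float:
    """Compute performed per byte communicated."""
    return training_flops(n_params) / bytes_communicated(n_params)


for n in (1e9, 1e10, 1e11):
    print(f"N={n:.0e}  FLOPs={training_flops(n):.2e}  "
          f"bytes={bytes_communicated(n):.2e}  FLOPs/byte={flops_per_byte(n):.2e}")
```

If the sync count also has to grow with the number of training steps, the advantage is smaller than this sketch suggests, so treat it as an upper bound on the effect.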
I think the most important problem is that you have to marshal enough compute to be meaningful, and that is going to be more and more difficult as frontier compute requirements grow.
It is a genuinely interesting problem ( above my mental abilities, but there are people smarter than me that could make it work ). I agree that compute could end up being an issue as things progress. Still, it seems that portions of what would be necessary kinda exists.
But, and it is not a small but, there is no money in it. In fact, big orgs are bound to lose money should something like that succeed.
I used it as an example. I understand the problem is hard. My larger point was that this is exactly how actual progress tends to take place. Well, that and porn.
> It would be better for governments to buy and own their own datacenters, maybe as a coalition, and dedicate their operation to the public good. I believe that is what we actually have to do.
100% agree. The US government basically has to nationalize AI and capture an outsize portion of the revenue from it in order to fix the economy, as the combination of debt burden and interest rate pressure from de-dollarization/global realignment is going to push us into a death spiral, and even if AI is a smash hit, the ~19% federal capture of corporate revenue isn't nearly enough to pull us out of it. The people owning the compute infrastructure and capturing more profit from AI at that layer is the safest, cleanest way to increase revenue capture, a sovereign wealth fund is a mediocre idea because it's possible to play shell game with stocks and redirect profit/debt (venture capital is quite good at this!).
>> The US government basically has to nationalize AI and capture an outsize portion of the revenue from it
Currently AI has generated no profit. And as it sits, is a non viable business.
I refuse to include the sellers of shovels as AI revenue.
If the companies buying the shovels are still losing money, then the tool supplier fortunes have nothing to do with the economics of the AI application layer, who is losing money on every prompt.
It's the most naive opinion that keeps getting shoveled around. You have a product that is viewed as essential by businesses, with revenue growing by 10x a year and geopolitical ramifications that have continued to rear their heads, and your opinion is "this is all an unprofitable shill". It is extraordinary to me that people really believe this. Whether or not labs run at a loss today is absolutely irrelevant. There are of course steady-state economics that make sense, and it's currently not well known what the profitability picture is, so to say "Currently AI has generated no profit" is also just speculation, and not a very insightful one at that.
That businesses view it as essential...is not a profitability argument.
Businesses also bought dot com infrastructure, telecom fiber, crypto platforms, metaverse tools, and overbuilt SaaS. The question is whether the AI application layer can charge more than its full cost and the costs are inference, infrastructure, depreciation, R&D, customer acquisition, support, compliance, security, and error remediation.
The numbers so far do not inspire confidence. OpenAI reportedly did $4.3B in revenue in the first half of 2025 while burning $2.5B, and Microsoft said OpenAI related losses reduced its own quarterly net income by $3.1B. An MIT 2025 enterprise AI study found $30 to 40B spent on GenAI with 95% of organizations seeing zero return.
One of the core technical reasons is that hallucinations destroy enterprise economics. If SAP hallucinated 2% of invoices, or Oracle returned fake rows 2% of the time, nobody would call that early stage friction. They would call it unusable for core operations.
In legal AI, even specialized tools have been measured hallucinating 30% of the time. The problem is that as AI gets better it is confidently, plausibly wrong. That forces humans to verify it.
So the cost does not disappear. It moves from doing the work to checking the work. AI coding has the same issue. If an autopilot got you there faster but one flight in ten became unstable unless the pilot constantly supervised it, that is not productivity.
For the bull case to work, usage must explode, quality must improve, prices must fall, reliability must rise, legal risk must shrink, and margins must expand, all at once. I would say that instead of a business model, this is six miracles stacked on top of each other.
I've heard that the API calls by themselves are ~60% profit if you ignore capital expenditures. The labs haven't generated profit because they're constantly sinking money into the next generation of larger models to stay relevant. Dario has talked about the economics of this a lot, and I do believe him there.
There's clearly also a lot of pent up demand in the corporate world for inference, the problem is that it's currently expensive enough that enterprises are balking at the cost before they've had a chance to refine processes and see projects through to fruition. That's a tractable problem to solve though.
That's true, but if the frontier doesn't advance there's no depreciation or ongoing capital expenditure. If all the frontier labs agreed to stop making stronger AI and just try to sell what they've already trained today, their books would turn green in a hurry.
> The US government basically has to nationalize AI and capture an outsize portion of the revenue from it in order to fix the economy, as the combination of debt burden and interest rate pressure from de-dollarization/global realignment is going to push us into a death spiral, and even if AI is a smash hit, the ~19% federal capture of corporate revenue isn't nearly enough to pull us out of it.
Any actual numbers to back this up? I don't see how nationalizing a very cutting edge technology outside of wartime is going to go super well. The leverage that these companies have is the same leverage that TSMC has: you can't just take over and expect things to rocket at the pace it's going.
WRT government data centers, there is certainly precedent for independent researchers getting HPC time on systems owned by US national labs, research institutions, universities, and then publishing their results as part of the public good.
One would question why this hasn't already happened as the rule and as opposed to the proliferation of private data centers. However, I am sure the answers are plain and perhaps saddening to us all.
DeepSeek and GLM (plus Kimi) are at or above Sonnet level wrt. favorable workloads like coding. They're not close to Opus or the latest GPT yet, and Fable is even higher than that. Other workloads relying more on real-world knowledge have them even further behind, and this can't be mitigated without making the model itself bigger and harder to host locally.
Not true. Big models buy you baked in knowledge and long context cohesion. A model can be trained to use search and knowledge base tools more efficiently to mitigate the former, and harnesses/workflows can be designed to push models into small parallel threads to mitigate the latter.
The thing that big models will always bring to the table is the ability to YOLO weak/under-specified prompts, and spend less time in the loop making sure work gets partitioned correctly. For smaller/simpler tasks the P(success) difference isn't that big.
Knowledge-base access is not very useful in general because a model doesn't have well-defined "known unknowns" that might trigger an agentic search of the outside knowledge base. Plus surfacing knowledge you don't know much about is itself hard.
These things sound plausible, but have they actually been demonstrated? Wouldn't anyone who succeeded in making such a small but useful LLM be raking in the money now?
Cursor's composer 2.5 is a perfect example. It's right on the heels of the frontier (for coding only) for an order of magnitude cheaper. As much as I've shit on Cursor in the past, I do think the company is well positioned to pick up people getting sticker shock on Anthropic tokens, if they can get their marketing down.
It is, but the US labs have been pushing parameters heavily. There was a pullback from big models after GPT4.5 in particular, but with a shift towards emphasis on post training and the good results Google got with scaling Gemini 3, all the labs started to push scaling again, which is the reason the frontier is getting more expensive. So that 1T isn't as big as it sounds, the American frontier is probably sitting at 3-5T at least.
Disagreed. GLM-5.1 is easily as good as Opus 4.5 for all the coding purposes I could throw at it, which is the model that kicked this entire hype cycle into overdrive in the first place.
Writing does not rely on real-world knowledge all that much, other than knowledge of language itself. Even tiny models can achieve that, it's even easier than coding.
The challenge with writing is the lab collapsing the distribution around "tasteful" writing, when the people making decisions about training data aren't able to effectively discriminate it.
The key thing here is that effective intelligence = model capability / cost. If you drive down the cost of inference you can have higher effective capability even with a technically less capable model. There is nothing in Anthropic/OpenAIs general reasoning capabilities that can't be easily done much better with a purpose built harness for a domain specific task.
One being that extrapolating from like 3 data points is hardly science. All trends break at some point.
The other is that the measures to prevent distillation of their models (if it was a secret sauce of Chinese models) could work if nobody is allowed to use them.
> It would be better for governments to buy and own their own datacenters,
I mean that's good, but they'd also have to build their own dataset. Which involves either paying people, or breaking the law.
Plus if they do manage to make it work, they will not get any tax revenue from it, as it'll remove the need for labour, which is where a huge amount of tax revenues come from.
It's a deeply hard problem with lots of second/third order effects.
> As I replied to a child comment - this is a nice idea that just isn't tenable in reality. AI hardware isn't just hilariously faster than consumer GPUs, it's also hilariously more power-efficient and has hilariously better connectivity. Every one of these dimensions kills the idea.
The first part is not really true though, the chips are not that much faster, the DRAM is not that much faster, and in aggregate it does not matter because there is just so much more consumer hardware out there (although perhaps that is changing as supply shifts toward datacenters).
The interconnect and data locality is the problem. If you could train it like e.g. you can render a scene with monte carlo ray tracing, any result from any node could be merged with any other and the combined result would have converged closer to the limit. I am sure research in that direction exists, it just has not proven effective within the scales it has been attempted.
If folding@home is a useful yardstick by which we might estimate the amount of GPU-ish capability that civilians might be coaxed into donating to a shared enterprise, yeah, it doesn't look pretty. This is extremely rough napkin math but comparing to xAI's Colossus 2 for example, for training workflows you're probably looking at 4-5 orders of magnitude the capability of all of folding@home combined. That's up to 100,000 times faster.
Very rough math like I said but I doubt it's directionally wrong.
And even if you did force literally everyone on earth with some sort of GPU to max it out 24/7 in service of an open source AI training enterprise - you would waste so much power trying to use that inefficient consumer hardware with the worst latency imaginable that it would be cheaper and faster to get everyone to instead chip in some cash to buy a datacenter with blackwell chips instead! So the idea has no legs whatsoever.
Plus a scientific project to benefit all of humanity doesn’t have quite the same ring as the thing that's stealing your job, from the volunteer’s perspective.
It's down 99% since that peak. But let's compare to it anyway.
It's pretty useless to compare raw FLOPS, but as a general hand-waving guesstimate, F@H is currently doing about 25 petaflops in a mix of FP16 and 32. AI usually trains at FP8, but to keep things fair the H100 is quoted at 60 FP64 teraflops per unit, so that's 12 FP64 exaflops given its 200k count.
So F@H at its peak did 2.43 exaflops@FP16/32. Colossus 1 does 12@FP64. These numbers are very hand-wavy, but I think the point is made.
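The arithmetic above can be checked with a few lines. This is only a sketch that reuses the numbers exactly as quoted in this comment (including the per-GPU FP64 figure, which I have not verified), and since the precisions differ the ratios are order-of-magnitude at best.

```python
# Rough FLOPS comparison using only the figures quoted in this comment.
# Precisions differ (FP16/32 vs FP64), so treat the ratios as order-of-magnitude only.

FAH_PEAK_FLOPS = 2.43e18       # "2.43 exaflops" at peak
FAH_CURRENT_FLOPS = 25e15      # "about 25 petaflops" today
H100_FLOPS = 60e12             # "60 FP64 teraflops per unit", as quoted
H100_COUNT = 200_000           # "200k count"

cluster_flops = H100_FLOPS * H100_COUNT
ratio_vs_peak = cluster_flops / FAH_PEAK_FLOPS
ratio_vs_today = cluster_flops / FAH_CURRENT_FLOPS
drop_from_peak = 1 - FAH_CURRENT_FLOPS / FAH_PEAK_FLOPS

print(f"cluster:            {cluster_flops / 1e18:.1f} exaflops")
print(f"vs F@H at peak:     {ratio_vs_peak:.1f}x")
print(f"vs F@H today:       {ratio_vs_today:.0f}x")
print(f"F@H drop from peak: {drop_from_peak * 100:.0f}%")
```

The quoted figures are at least self-consistent: 200k units at 60 teraflops each gives 12 exaflops, and 25 petaflops is roughly a 99% drop from the 2.43 exaflop peak.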
By the way, I'm not trying to crap on F@H - I think it's an outstanding project and I've run it in the past. But a volunteer group simply cannot compete with well-funded, concentrated effort like what's going into AI.
I don't think insulting people is a great way to contribute. Not everyone who sees things differently than you has "psychosis".
Your reflexively negative comments on anything relating to AI are as insight-free as they are numerous; it's all just vague shitting-on without even a hook or argument that could be engaged with and debated. It's pretty tiring, honestly. If you really think your point of view is valuable and others should pay attention to it, rather than just filtering it out like the trollish noise it usually is, why don't you put a little more effort in?
An enduring, confounding quality of LLMs is that even minor differences in prompting content and style, harness type and environment can lead to radical differences in the output and perceived performance and ability. In my environment and in my "style", Fable has been a huge step up, to the extent that I am seriously considering paying for a second $200/m account just to get more usage out of the next 10 days. I'm also starting to prepare my organization for what I now see as the completely inevitable end of human-written code.
All that said, considering Anthropic's heavy-handed nerfing I'm not surprised Fable did poorly in a security-focussed benchmark. And this benchmark seems poor anyway - penalising a model for "cheating" by knowing the answer from its training data? That's not the model's fault, that's a lazy benchmark.
Same story with me. To be clear, I am a subscriber, though I tend to hold out for the ultra-cheap last ditch retention deals they throw at you. But I take them with a grain of salt these days. They have a narrative like anywhere else, and they don't let the full facts get in its way.
Michael Crichton said it best:
“Briefly stated, the Gell-Mann Amnesia effect is as follows. You open the newspaper to an article on some subject you know well. In Murray's case, physics. In mine, show business. You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward—reversing cause and effect. I call these the "wet streets cause rain" stories. Paper's full of them.
In any case, you read with exasperation or amusement the multiple errors in a story, and then turn the page to national or international affairs, and read as if the rest of the newspaper was somehow more accurate about Palestine than the baloney you just read. You turn the page, and forget what you know.”
What bleeding? Anthropic wants as much of that "bleeding" as possible. The interaction data gathered from genuine human CC subscription usage of their models goes directly into their RL training, it's invaluable and they are more than happy to lose money on the inference to get it. That data is what xAI was recently willing to pay $10b to cursor to get.
They want you to use Claude Code. They hate other UI surfaces like OpenCode etc purely because they lose control over that data, so they're subsidizing the inference without getting what they actually want, the data (they still get some of it of course, but it's much less ergonomic for them. Those tools often abstract away the subagent calls, for example). OpenCode can collect that data themselves, so by allowing subscription there, Anthropic sees itself as subsidizing another org getting that data. Hard no.
And tools like OpenClaw are useless because they're mechanical and don't represent actual users interacting with the service - again, subsidizing but not getting the reward.
It's all very simple once you understand their motivations.
I am nowhere near as concerned by this as I was a year ago, when I was expecting the axe to fall at any moment before the Chinese labs achieved some sort of escape velocity. I now think it's too late, all the cats are out of all the bags, there's no moat except maybe a temporal one of a few months, the genie is out of the bottle.
There is no secret sauce the US labs have that the Chinese ones don't, or won't have soon enough. Deepseek 4 and Kimi 2.5 are not quite Claude 4.5/GPT5.5 but there's no fundamental principle missing - they are strong evidence that there's no real advantage the "frontier" labs possess that isn't related to scale, which they will gain in time (if they even need to). The RL post-training techniques that work are widely known and easily copied. All Deepseek is really lacking is data, which they're getting - and the harder Anthropic/the USG makes it to access claude in china, the more of that precious data they'll get!
I used to sort of entertain the "fast take-off breakaway" scenario as being plausible but not really anymore. The only genuine moat the frontier labs have is their product take-up, which isn't nothing, far from it, but it's not some unbreakable technological wall. Too late guys - it might have been too late for quite some time.
I wish it was true. I would gladly use a GPT 5.2 high model equivalent for coding (6 months old) if it was offered cheaper by Deepseek or Kimi. And I'm sure that's an extremely prevalent opinion by the millions of Claude and Codex users who are bothered by the costs.
However, they just don't perform that well in practice. That's the real issue. You can actually see it when you move away from open benchmarks. DeepSeek 3.2 is 4% on Arc-AGI 2 [1], while GPT 5.2 high is 52% and GPT 5.5 pro high is 84.6%. That's the real reason why nobody is using these models for serious work. It's incredibly frustrating.
In addition, I already feel the pain myself on the model restriction. I'll ask my Codex 5.5 agent to crawl a website - BOOM, cybersecurity warning on my account. I'll ask it to fix SSH on my local network - another warning. I'm worried about the day my account gets randomly banned and I cannot create a new one. OpenAI already asks you to perform full identification in order to eliminate these warnings - probably exactly for that - so that if they ban you, it's permanent.
I worked extensively on ARC AGI before and one thing is SURE as hell. OpenAI and Gemini in particular use this as marketing material. You can correlate the benchmark release with stock price increase. They feed synthetic datasets of ARC into their models to boost the numbers. There is no doubt in my mind Gemini is no better than DeepSeek other than being specifically fine tuned for ARC AGI. Heck, they even say so and they say they have paid annotations for ARC. Again, economic incentives.
In terms of whether these models are actually better at the benchmarks, likely not. See ARC 3, where the gap is diminishingly small.
I've also worked extensively on ARC AGI 1/2, and I mainly agree. Marketing and training. Performance of LLMs on ARC is most importantly a function of training on grid/table-like data. It doesn't have to be specifically synthetic ARC data though. Training an LLM to be better at perceiving grid-like arrangements of data in a spatial way like an image, rather than just tabular, is hugely useful for things outside of ARC benchmarks, though it's a narrow skill. Hence, I'm sure they do it. I want them to do that. I believe the labs when they say they didn't train specifically for ARC-AGI 1/2 (where did Google say otherwise? I don't see it). But it does not mean the models are getting better at general purpose reasoning. They were already plenty good enough at that. You can describe ARC images in words and reason about it using a level of intelligence LLMs have had for years: they're designed to be easy! LLMs just couldn't reason about image-like grids very well.
Why do you think DeepSeek isn't also fine tuned on ARC AGI? Maybe they're more fine tuned on ARC AGI but still get worse scores. There's no way to know.
My gut feeling is that ARC doesn’t play as big of a role in the Chinese model manufacturer landscape. It’s one byproduct but China is focusing on resource efficiency (for political reasons and low compute). So unlike OpenAI, poor performance on ARC doesn’t hurt as much if the model works well. OpenAI literally hinges on hype so the insane economic bets they make somehow pay off. If you have billions and the future of the company on the line, you ace the exam any way you can. We noticed early on that whenever some ARC dataset was released, GPT would suddenly do well on the classes of problems in that dataset. But it just doesn’t generalise. They fine tune like crazy. I bet they fine tune for raspberry counting at this point. Again, for OpenAI the perception of moat is everything! Keep that in mind.
True, ARC is mostly an artificial "human-like AGI" benchmark that doesn't really reflect any plausible workload. Very different from things like Humanity's Last Exam that reflect real-world knowledge and are now getting closer and closer to saturation even with open models.
Why are you bringing up an outdated Chinese model from 6 months ago to compare to a US model from 6 months ago? The outdated Chinese model will have performance from ~12 months ago, obviously. But today's Chinese model DeepSeek 4 has performance not far from the US model 6 months ago; 46% compared to 52% from 5.2.
Kimi K2.5 has also been superseded by a finer tuned Kimi K2.6 three weeks ago. Moonshot's Kimi models appear to be the favored Chinese model, at least for coding, and not Deepseek V4. z.AI's GLM 5.1 is also worth mentioning as rather competent for coding, also released in April.
Those models too will not be beating US AI labs by your metrics (although for coding, Kimi K2.6 might beat the very uneven Gemini depending on the situation), but in your criticism at least consider the state of the art in your comparisons.
I have been using Deepseek v4 pro for personal projects and home infra related work for the last couple of weeks. Its quality of work is not bad at all, it is fairly fast, and given the fraction of the cost compared to Claude, I can keep going, which makes it a very compelling option. Looking forward to trying out Kimi 2.6, thanks for the recommendation.
Even without the discount, I'll have to think about whether I need the 100 EUR tier of Anthropic Max, or whether downgrading to Pro and using DeepSeek is good enough. And they're also up on OpenRouter and other places.
Been using those models, not quite comparable with Opus 4.6/4.7 but with max reasoning, pretty good for a variety of dev tasks! Only big problem is no ability to process images, so can't really do browser use for some semi-automated testing, I'd have to write Playwright tests even when I don't want to.
I've been using OpenCode Go ($10/month) for personal projects (I have Claude subscription for $DAYJOB) and for the tinkering around that I do for myself the quality of the open weight models and the limits of the OpenCode plan are sufficient. I agree that for a lot of dev tasks they're quite good!
I've been using Deepseek 4 Pro (instead of Sonnet 4.6) as the developer LLM (Opus is the planner) and it's been great. Not super fast, with all the reasoning, but has been writing good code, and I think I paid $5 so far (whereas with Sonnet I'd have run out of the weekly limits on Max for weeks now).
Definitely recommended, though it's crucial that you have GPT 5.5 review the code afterwards.
Hum, I've been using it [0] with my Ollama Cloud subscription for the last two weeks and I love it. Never reached the 5 hours usage limits of the $20 plan (on side projects) where I would reach it sometimes in ONE prompt with Opus.
I 100% agree with you, but I've been convinced over the last year that it's a time and scale issue, not anything fundamental.
The Chinese models right now are in a weird spot. Compared to the frontier labs, both their pre- and post-training are woeful - tiny, resource-constrained in every dimension including human, slow. I'd compare it to OpenAI 5 years ago except I think even then OpenAI had way more!
But they "cheat" quite a lot with distillation and very benchmark-focussed RL, and that's where you get this superficial quality in the leaderboards that doesn't match up when you go off-script. Arc is a great example in that it really betrays an "inferior soul" at the heart of it all.
What gives me great hope though is that those same scaling laws that Altman and others have been hyping forever will absolutely kick in for the Chinese labs just as they did for the US ones, and I don't think anything can stop that process now. So they will catch up. It won't be tomorrow, but it's not going to be 10 years either. 3-5 would be my reasonably educated guess.
And the final risk, that China itself might try to restrict availability of the tsunami of GPU or other AI hardware it will inevitably produce - well, I just can't really imagine a country that has been configuring itself for the last 40 years as a single purpose export machine deciding that actually, no, it doesn't want to export something.
About the model restrictions - absolutely. I've been trying to do security research on my own software and the frontier models immediately get suspicious. I've been playing with the local ones much more this year basically because of this. They have deficiencies, for sure - they feel very "hollow" compared to the major labs. But I've talked to a lot of people, and the consensus is pretty clear - just a matter of time.
Just an observation: constraints often result in creative solutions. I wouldn't be surprised if a smaller lab makes a big breakthrough because they have to.
> I'd compare it to OpenAI 5 years ago except I think even then OpenAI had way more!
Say what? 5 years ago OpenAI had received around $139 million in funding, and they’d just come out with GPT-3 with 175B parameters and a 2048-token context window, trained on 300B tokens on a 10,000 V100 cluster, which would have cost maybe $4-13 million at the time for their training run.
Meanwhile Deepseek V3’s famously frugal training was $5M, and Chinese AI companies are raising billions in funding. Sure American AI companies are raising tens (and maybe hundreds in the case of OpenAI, if you count their circular funding rounds) of billions but they’re grossly inefficient, and we’ve already hit the limits of the scaling laws where there’s little point in increasing the number of parameters of a model.
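For a rough sense of where a range like $4-13 million can come from, here is a back-of-the-envelope sketch using the common approximation that training compute is about 6 x parameters x tokens. The sustained per-GPU throughput and the hourly GPU prices below are illustrative assumptions picked for the sketch, not sourced figures, so treat the output as an order-of-magnitude check only.

```python
# Back-of-the-envelope training cost, using the common C ~= 6 * N * D rule of thumb.
# The throughput and $/GPU-hour values are ASSUMPTIONS for illustration only.

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6.0 * params * tokens

def training_cost_usd(flops: float, sustained_flops_per_gpu: float,
                      usd_per_gpu_hour: float) -> float:
    """Convert total FLOPs into dollars via GPU-hours."""
    gpu_hours = flops / sustained_flops_per_gpu / 3600.0
    return gpu_hours * usd_per_gpu_hour

# Parameter and token counts as quoted in the comment above.
gpt3_flops = training_flops(175e9, 300e9)

# Hypothetical optimistic and pessimistic scenarios.
low = training_cost_usd(gpt3_flops, sustained_flops_per_gpu=30e12, usd_per_gpu_hour=1.50)
high = training_cost_usd(gpt3_flops, sustained_flops_per_gpu=20e12, usd_per_gpu_hour=3.00)

print(f"total compute: {gpt3_flops:.2e} FLOPs")
print(f"estimated cost: ${low / 1e6:.1f}M to ${high / 1e6:.1f}M")
```

The estimate is very sensitive to the assumed utilization and price, which is why quoted training costs tend to come as wide ranges.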
Oh, it was written in a paper, must be correct then, no further investigation required, just believe it at face value! No track record of academic dishonesty, and definitely no incentives to fudge the numbers.
Have you tried the latest DeepSeek v4 Pro inside of the Claude Code harness? It's not listed in that site.
It definitely 'feels like' it is as good as Claude for many regular web app coding tasks (though I don't have real benchmarks). And it is comically cheap.
I'm not suggesting it is better than the latest Claude or codex models, but it seems 'good enough' for a lot of use cases in my limited real world testing.
I'm starting to feel like a parrot, but people seem to forget that software engineering is actually a very narrow slice of the white collar pie. You don't need a mega-model which can reason about 100 000 lines of code when you want to create a nice PPT (which consumed literally hours of your life before) to impress your boss. SOTA models will probably be used for frontier research, complex coding tasks, large scale data analysis, etc. And the average Joe shall be able to buy a pre-configured box with a plug-and-play harness and run medium models air-gapped. Or use such models through cloud APIs dirt cheap if privacy is not a concern.
On the same topic but from a slightly different angle - as SOTA models get more capable, the 'quality' and 'feel' of the experience they provide in each domain is heavily dependent on the reinforcement learning the vendor does for that specific domain. After all, many fields have 100 flavors of "good answers," but the model has to pick one answer.
Benchmarks are not very good at capturing this yet. But it could be the case that DeepSeek v4 Pro is 100% as good as Claude Opus 4.7 at scaffolding a basic Rails app, but absolutely terrible at creating a credible business plan that another businessperson would think is real. That's a made-up example, but you get the point.
The end result will be a lot of people arguing about which model is "better," but "better" depends heavily on the task and how that model was trained to interact with the user for that task. Two users may have very different qualitative experiences using the exact same model, despite the benchmarks.
Creating a nice PPT is actually hard because it requires visual capabilities and so-called "computer use" (really, GUI use) of fiddly proprietary software. The nice thing about the coding case compared to a lot of disparate white-collar work is that it's all plain ASCII text. You can already ask a coding model to create a nice TeX/beamer slideshow (or whatever the Typst-based equivalent is) but whether your boss will be duly impressed by that is anyone's guess.
Tangential, but in our opinion corporate PPTX automation is an unsolved problem, even with Claude for PowerPoint (and it's worse with everything else common out there). Its harness (a) is not tuned very well for corporate use and (b) even if it were, fails to manage the specific business knowledge within each org needed to create effective (i.e. audience tailored) presentations.
They're not even that much cheaper (1/2 price per task according to Artificial Analysis) once you account for lower token usage of GPT-5.5. I can't justify it when factoring in the extra time wasted, and the cheap codex usage I get through the monthly plan. Frontier intelligence is not a commodity product ... yet.
Arc has no predictive power whatsoever. I always use the best models available. So far I haven't found a task that Chinese models cannot solve very quickly and reasonably. Do you have any examples where they failed for you?
If you want something close to Claude, use GLM 5.1 with Claude Code. Their subscription is no longer 10 times cheaper, though (at best 2 times cheaper).
And yet Claude six months ago was amazing and good enough for you.
This shows that AI cloud consumption is just a conspicuous consumption status symbol, nobody knows why they need cloud AI or what problem they are even solving.
Which is why, I believe, the big AI companies are starting to focus and roll out vertical products more. They know that the models themselves aren't sticky, people can easily switch between different models with not much hassle.
I think the big AI companies are trying to transform into the next Microsoft. Completely capture both enterprise and consumers.
"I think the big AI companies are trying to transform into the next Microsoft. Completely capture both enterprise and consumers."
That is going to be a failing strategy though. Whatever OpenAI or Anthropic implement, Microsoft and Google can trivially copy and provide to their existing customers that are already deeply invested in their platforms.
> There is no secret sauce the US labs have that the Chinese ones don't, or won't have soon enough.
Over the last year it seems that the only thing US labs are ahead in is money spent. At least half of the technical innovations, if not more, came from Chinese labs and were published openly.
Broad and deep capital markets are a real competitive moat for the USA. No other country or economic bloc can quickly deploy huge amounts of capital to new opportunities nearly as fast. China can work around that to an extent with a command economy that focuses resources on national strategic priorities but it's slower and less effective over the long term.
Nah. If that was an actual disadvantage then the USA wouldn't already be the world leader in most technology sectors. Capital is only one of several constraints.
All of the reasons in the article also apply to Chinese companies. If a Chinese model becomes good enough to make it significantly easier to hack Chinese government servers, do you think they'll allow random people unfettered access to it?
The economic pressures are the same, too. Currently, Chinese models are offered for cheap or in some cases provide weights for free because that's the only way to gain traction. (That closed-weight releases by Baidu, Bytedance, iFlyTek etc. hardly generate any buzz bears that out, as does the fact that when Alibaba does a closed-weight release, someone always gets confused because they associate the Qwen brand with open models.) At some point, their investors are going to want profits, not just user counts. That means higher prices, or no more new models.
If there's no secret sauce and all you need is scale, that would actually be kind of the worst-case scenario for catching up to the frontier, since scaling is expensive and the frontier model companies have easier access to capital as well as higher revenues.
> If a Chinese model becomes good enough to make it significantly easier to hack Chinese government servers, do you think they'll allow random people unfettered access to it?
They aren't trying to become that good, nor do they need to in order to have real positive impact. Models like Mythos are estimated to be humongous even on a datacenter-wide scale, which is actually a big factor in its limited availability at present. It's mostly helpful as a one-of-a-kind proof of concept, to answer the question of whether AI can still plausibly scale by growing capabilities and what happens to alignment concerns when you do that.
I expect every company to try to make a model as good as they possibly can, especially now that Mythos has served as a proof of concept to demonstrate that there's lots of interest in AI for cybersecurity. But if they don't try, that hardly assuages concerns about not being able to access the very best models, does it?
Harness engineering is a moat. There’s user loyalty and reliance on the chassis that Claude is on, for example, just like there’s more market share by MacOS+WindowsOS over Linux Open Source.
The tooling industry has been moving very much in the direction of "plug in the AI of your choosing" for a while now, and given how much Anthropic fights third-party tools, they are definitely afraid of being left in the dust.
> just like there’s more market share by MacOS+WindowsOS over Linux Open Source.
It's hard to change OS. It's not hard to jump from one AI tool to another.
It's absolutely NOT a moat. Making a harness is the EASY part.
If you had said "marketing is a moat" then yes, I would say you were right. But creating a harness equal to or better than Claude Code is trivial. The CC harness is actually shit. There are tons of open-source harnesses that work better than CC while using Opus via OpenRouter.
But 1) people use other models with that same harness. 2) I moved on from Claude Code and had all the features I cared about up and running in less than a couple of days, without even looking for available plugins or extensions.
I mean, if that’s the case, then Anthropic themselves are currently actively filling in that moat with nice, solid, walkable dirt. Claude Code may have been a moat 6 months ago but these days you’ll want to replace the “m” with a “bl”.
I agree the genie is out of the bottle technologically. I'm less convinced that means access stops being politically and economically important. The bottle may be gone but the best lamps are still expensive
But a “good enough” lamp just got a lot cheaper. The cost of tokens on DeepSeek V4 Pro is so low I don’t even think about it, and I'm currently trying to figure out useful things to do with as many simultaneously running agents as I can. What would have cost $150 less than a year ago now costs 35¢.
Likewise Qwen 3.6 absolutely blows me away and that’s on a 35b 6-bit model on a local 5090. Same thing, busy trying to find stuff to do to keep it busy 24/7.
I can still find some niches for Opus 4.7 but being able to attack problems and not worry about consumption is a game changer.
Even assuming this holds, the utility you gain from the best models depends completely on your workload. If you have tasks that require performance 10 and DeepSeek has 9, you will gladly pay for SotA models.
I would agree, the only thing Kimi is really missing is stability and harness training. For general chat tasks I consider it mostly on par. Occasionally I'll give the same problem to Kimi, Claude, GPT, and Gemini, and it's not unusual to see Kimi correctly figure out some kind of weird extra thing that the others missed, like some kind of mentally unstable savant.
> There is no secret sauce the US labs have that the Chinese ones don't, or won't have soon enough
This is not just about mainland China though. The current US government is extremely selfish and self-centered. Other countries really need to consider their own long-term situation here.
I’ve been switching from GPT to Opus every week for the past 3 months and I always come back.
The point isn’t that GPT is better in general, it’s that it is so much better for my work that it isn’t just sticky, it’s reinforced concrete. I use Opus 1% of the time because it writes better, and it’s sticky there.
Yes, I’ll switch approximately immediately if Opus or Gemini (which I use more than Opus!) is better for what I do, but at this point frontier model tokens are not fungible.
There will always be dataset and training quirks, and the provider’s own biases and focus, granting one model an edge over the others in some specific domain.
The large AI houses arguably ensure that model switching is a natural action for their clients, by switching the default model of their flagship offerings every few months. Such is the price of progress.
Today's tech echoes 1960-1970 mainframe era: very centralized around a handful of companies controlling "massive cloud compute" in bespoke mainframe-like topology.
All of that will be legacy in a couple of years. Today's B200 clusters are tomorrow's e-waste. Decentralization might happen gradually or abruptly. But to me it's obvious that we'll be thinking of high-tech tensor processors and GPUs the way we thought of individual transistors and tube amplifiers in the 1980s.
If AI turns out to be the revolution it purports to be, then the underlying hardware will change much more rapidly than it did with ICs and microprocessors in the late 1970s. Today's hot is tomorrow's junk.
Hardware depreciation timescales are actually getting longer, not shorter, because frontier hardware like B200 clusters is highly bottlenecked. It's not just a RAMpocalypse out there, we're seeing early signs of production bottlenecks with GPUs and maybe even CPUs.
One thing that is potentially different this time is that Moore's Law has stopped scaling. Computers aren't getting smaller exponentially. They're getting bigger with multiple chips glued together to make up for Moore's Law.
It's basically converted sand. Most of that conversion happens in Taiwan at the moment, which China considers one of its provinces and the USA treats as a protectorate. Hence the interest in that region...
The West has been incentivized to build its own fabs for years but still fumbles that effort. All the billions spent hardening the South China Sea and Taiwan's chip manufacturing against a future Chinese invasion would probably have paid for a lot of manufacturing capacity stateside.
Mainland China is growing its own RAM manufacturing capacity. They are too tiny to make a real dent into the RAMpocalypse yet but this can potentially change.
> The thing is that even if I was wrong (I'm not) and AI was somehow helpful for software engineering (it isn't), I still wouldn't want to use it.
So even if you were wrong on the facts (you are) you still wouldn't change your mind? In other words, you're unreasonable and know you're unreasonable and think that's totally fine?
Not sure if the article was edited later, but there are five sentences after that one, expanding on the author's reasoning for their position.
> Next time, lead with that.
The post is titled "I Will Never Use AI to Code"... whether you agree or disagree with the author's position, he's certainly not burying the lede here.
I also can't help but notice the author isn't telling other people not to use AI, he's merely stating his own preferences and articulating his reasoning in depth. Why attack him for expressing his personal preferences in how he goes about his own work, which presumably does not affect you in any way?
> I actually like writing code. Why would I want to give up something I enjoy?
This line was good and lines up well with why I use minimal AI. But indeed the rest of the article shouldn't have really been needed then if this was the point.
That's actually the approach we took with https://gentility.ai/ - we either provide almost-raw SQL query access to the DBs themselves or we synthesize from API into DuckDB via parquet and make that available to the agent to just directly query. It works well - my philosophy is to give agents the sharpest tools you can, and SQL is the best tool there is.
I understand the instinct to try to make a proprietary moat around it all but I think the pattern is useful and obvious enough that all big orgs will be doing something very similar within 5 years or so.
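A minimal sketch of that "give the agent SQL" pattern, for illustration only. To keep it self-contained it swaps in Python's built-in sqlite3 for the DuckDB/parquet setup described above, and the table, the rows, and the helper names (`load_api_rows`, `run_agent_query`) are all hypothetical. It is not how the product mentioned above is implemented.

```python
import sqlite3

def load_api_rows(conn: sqlite3.Connection, rows: list) -> None:
    """Materialize records fetched from some API into a queryable table.
    (Stand-in for the 'synthesize from API into DuckDB via parquet' step.)"""
    conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

def run_agent_query(conn: sqlite3.Connection, sql: str) -> list:
    """Run SQL written by the agent, rejecting anything that is not a SELECT.
    This prefix check is a crude guard for the sketch, not real sandboxing."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
load_api_rows(conn, [(1, "acme", 120.0), (2, "acme", 80.0), (3, "globex", 50.0)])

# The agent gets the schema and writes its own query instead of calling bespoke tools.
result = run_agent_query(
    conn,
    "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer",
)
print(result)
```

The design choice is that one general query tool replaces many narrow ones; in a real deployment the read-only restriction would more plausibly be enforced with database permissions than with string checks.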
The problem description is spot on, but the solution isn't. No-one is going to sit in that chat and "collaborate" on each other's stuff in real time all day. You may as well just all sit around a screen.
I welcome the experimentation, there will definitely be something new, but this ain't it. New primitives are needed, at a higher level of conceptualization, not merely a fancy new interface.