The one thing I feel it underestimates is the likelihood of improvement. Even the authors acknowledge it's not even worth comparing local models from a year ago to what we have now. In fact, people widely see Opus 4.5 in November last year - 8 months ago - as the first time agentic coding became broadly viable even with frontier hosted models.
So why would we lock in hard on any concept at this point of what a local model is and isn't good for? Whatever it is right now, it probably won't be that in a year. It might be naive optimism to think we'll ever get to long horizon tasks with models that run on consumer / pro grade hardware. But so far the naive optimists are winning.
Right. Opus 4.5 8 months ago, good enough for agentic coding. How far behind that are open weight models? More than 8 months? But how much more? When will they reach Opus 4.5 level? A few months from now? A year from now? Never?
I don't know about better but it's certainly different. It's painfully slow through the Claude Code VS Code extension compared to Copilot, but maybe "smarter": I feel like I have to correct it less, using Sonnet on both. I don't use Opus much because of the cost, but coworkers say the difference between harnesses there is also pronounced.
GLM 5.2 came out today and the early reports have been quite good. Very difficult to run except on prosumer hardware, but a small business could manage it quite easily (or use something like OpenRouter).
And a big thing that's missing is ... the harness comparison. It plays a very big role. I use forge, and I have been impressed with what it can do given all the limitations of local models.
Since the author is referring to a specific model, I think it makes sense to ignore how the model (or local models in general) may improve over time.
It's like buying a car: I drive that car and get attuned to its characteristics; I don't think how that car (or similar cars) may improve. That's my tool and I want to make the most of it.
It is true that switching local models is technically very cheap, but there's a considerable time investment in squeezing the most out of one, which may not carry over to a newer version of that model.
> Engineering for the sake of engineering has no value to the economy
I think that's the adventure we're on now. If recreating something is low cost, what is the value in investing in designing it well in the first place? We can empirically discover issues and get the AI to address them.
I certainly routinely find in supervising what the LLM is writing that it's making terrible internal design choices and correct them. Usually things one level up from code. "This will cache every image on the client and cause a huge amount of bloat. Change it to pull the image in real time from the server" kind of stuff. You do slowly build that up in the project documentation - "Never store unnecessary data on the client: we assume they are using low powered devices without substantial storage". But it takes time and the road to discovering that empirically is through a lot of unhappy users.
So I think there is still a lot of room for genuine engineering - that is, at the technical design level. Levels up from that - code structure etc - are much less clear. I am guessing that over time we will heavily optimise code written by AI for maintenance by AI. Which may be mostly about matching the context window to the code module size. Factoring something to 5 modules may be less of a good idea if it means the context window has to hold all of them for the LLM to work. But that is the path of discovery we are on which history tells us is a 20 year journey.
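A rough way to sanity-check the module-size point is to estimate whether the modules the LLM must hold at once fit in its context window. This is only a sketch: the 4-characters-per-token ratio is a crude assumption (real tokenizers vary by language and content), and `estimate_tokens` and `fits_in_context` are hypothetical names, not part of any tool.

```python
# Assumption: ~4 characters per token. This is a rough rule of thumb,
# not a real tokenizer; actual counts depend on the model and the code.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Estimate the token count of a string, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(modules: dict[str, str], context_window: int, reserve: int = 0) -> bool:
    """Check whether all module sources fit in the window at the same time.

    `reserve` is the number of tokens held back for instructions and output.
    """
    total = sum(estimate_tokens(source) for source in modules.values())
    return total <= context_window - reserve


# Two hypothetical modules of 400 characters each (~100 tokens apiece).
modules = {"cache.py": "x" * 400, "server.py": "y" * 400}
print(fits_in_context(modules, context_window=250))               # fits
print(fits_in_context(modules, context_window=250, reserve=100))  # does not fit
```

The point of the check is that splitting code into more modules only helps if the LLM can work on them one at a time; if a change needs all of them loaded together, the total is what counts.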
Now you get not just the 5 LoC to review but a 5 page essay to read in the form of an auto-generated review as well. Which makes the submitter even more indignant when you start nitpicking things about how it's implemented.
I think it's a near universal phenomenon that people with extraordinary amounts of power become victims of their own hubris. Once you get sufficiently decoupled from the consequences of your own actions, it is near impossible to tether yourself to a calibrated sense of reality.
So I genuinely think that Amodei thought here that he was building a moat - set a very high bar for safety at exactly the line Anthropic but nobody else meets, and then declare anything less to be too unsafe to be allowed. That would put a permanent halt to open models, Chinese models and throw a significant barrier in front of competitors - if OpenAI is about to release something competitive with Mythos, they would have to immediately double back and implement at least equivalent safeguards. It might cost them months at the most critical juncture in Anthropic's history, when they are filing for IPO.
Having said this, I am sure they calculated in the possibility of their own model being restricted. They probably still see it as a win because it acts as a strong endorsement of them as the market leader and the model as the most powerful available model. So I think both things are true, but we are in the "plan B" scenario now rather than "plan A".
> I am sure they calculated in the possibility of their own model being restricted.
Doubt. Had they foreseen this, they would have started verifying the identity of their customers. That would have allowed them to keep their US customers when the US government banned foreign persons from accessing Fable. Since they were forced to turn off Fable for everyone, it follows that they were not prepared for that possibility at all.
The big problem Anthropic faces isn't implementing a KYC workflow, but the fact that many if not most of their own employees are no longer allowed to work on Fable/Mythos.
Also their API customers and downstream customers (e.g. Cursor users) would also need similar infra, and probably a decent amount of users would just choose another model that doesn't require ID & an immigration status check.
And API is much more profitable (relatively) than subscribers for them.
I think it will take at least one month to set up a correct flow for user authentication for both their subscribers and their APIs (Cursor etc).
Others think such an export control ban is good for them because it shows the entire world that Anthropic models are the best. I disagree. It will wreck the company, IPO considerations etc., when the models may be the best but there are no users to use them.
If such an export control ban stays in place for this particular model or future Anthropic models, revenue will be affected. My guess is around 50% of subscribers are non-US citizens, meaning a direct 50% revenue loss.
Altman/Musk also used to do moral superiority, "Dangers of AI", lobbying to ban "non-safe" models etc. But they largely stopped doing it after 2025. Meanwhile Anthropic/Dario increased the intensity of the moral authority, and got hit back with exactly what they were asking for.
> they would have started verifying the identity of their customers.
Very good point. Yes, I think this part goes to hubris. Amodei probably didn't think the ban would cut along those lines if it happened. And in fact it wouldn't surprise me if the government specifically made it that way (singling out foreign nationals) as a way of punishing Anthropic for putting them in this position. It's clear they absolutely hate being dictated to by anybody, but especially Amodei, and they probably thought through what would hurt Anthropic a lot to implement and deliberately made it that way.
Unless, conspiracy hat on, they wanted to be prevented from serving Fable because it's too expensive for them to run, and they want some external authority to blame for shutting down access.
Seems rather stingy - 6 months is barely longer than you will get on a free signup deal for a lot of online products anyway. Kind of worse than nothing if it causes you to adopt work patterns that aren't sustainable for the project after the offer ends.
JetBrains gives away for free infinity years of a $180+ per year subscription (it's more expensive in the first year or for orgs)[1] to open source authors, students, and more. Sure, the per-month price tag is not as high, but after year 4 you have saved much more.
After using JetBrains IDEs for years I can hardly really get into anything that isn't vertically integrated. Language servers are THE WORST -- I love Zed but only use it for things that don't require language integration at all.
It's like how after using Apple hardware for years I couldn't put up with most Windows laptops -- either they were HiDPI ultrabooks with no performance or they were sloppy gamer machines with no class.
An online product that was brought into existence by processing all the open source software in the world and makes money by selling the resulting knowledge base, should be accessible free of charge by the producers of that open source software.
> OpenAI may reject, suspend, or revoke any Program benefit for any reason in its sole discretion, including without limitation if it reasonably believes that an applicant or recipient: [...] (ii) used multiple identities or accounts to obtain more than one benefit
Using a plus sign is subaddressing [1] and most ESPs[2] will route to the main address (multiple@adress.es). So you can use multiple+email@adress.es and multiple+xyz@adress.es, and both will route the email to you.
In my experience most SaaS apps do not filter this out and allow re-sign ups with sub-addresses.
Gmail has an additional behavior: the dot character is ignored in the local component of the address. multiple@gmail.com, mult.iple@gmail.com and mult.ip.le@gmail.com all route to the same inbox as well.
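For a SaaS app that does want to catch these re-sign-ups, a minimal sketch of the normalization, assuming the plus and dot behavior described above. `canonical_address` and `DOT_INSENSITIVE_DOMAINS` are hypothetical names, the domain list is an assumption, and lowercasing the local part is a simplification rather than something the mail standards guarantee.

```python
# Assumption: only these domains ignore dots in the local part.
DOT_INSENSITIVE_DOMAINS = {"gmail.com", "googlemail.com"}


def canonical_address(address: str) -> str:
    """Collapse sub-addressed variants of an address into one canonical form."""
    local, _, domain = address.strip().rpartition("@")
    domain = domain.lower()
    # Drop the subaddress tag: everything from the first "+" onward.
    local = local.split("+", 1)[0]
    # Remove dots only where the provider is assumed to ignore them.
    if domain in DOT_INSENSITIVE_DOMAINS:
        local = local.replace(".", "")
    # Simplification: treat the local part as case-insensitive.
    return f"{local.lower()}@{domain}"


print(canonical_address("multiple+email@adress.es"))
print(canonical_address("mult.ip.le@gmail.com"))
```

Comparing the canonical form instead of the raw string at sign-up would treat all of the variants above as one account, while dots on other domains are left alone.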
This is not true (anymore?). I have a rather unfortunate exact naming collision with a family member. They use the full name without dot for the local gmail component, I use a dot between the first and last name.
Two or three mails have been misplaced in a decade.
It would not be feasible to change something like that now without breaking security.
Google can hardly start allowing/routing a new account for first.last@gmail.com when you have been getting its mail for years, even though your account is firstlast@gmail.com, and sensitive communication, say from your bank, would be routed there.
The only thing I can think of that would give Amazon reasons to dislike Mythos / Fable is that Anthropic really ruined their Bedrock story by imposing data retention requirements that cross a red line in regulatory compliance. It's just possible that Jassy would rather have nobody use Fable than doing it on the basis of, effectively, a direct data trust relationship with Anthropic.
It is hard to square this with it still being in Amazon's interest in the long run, but I could see a potential scenario where there was some bad blood with Dario over it, if he previously committed to completely air gapped processing from a data point of view and has now gone back on it.
Amazon’s AWS and core delivery business are fairly mature, and with consumer sentiment poor and non-AI tech contracting, having a growth vertical like Bedrock is good for shareholders. Without their own core tech, Amazon will be paying rent on AVs in a couple of years - or worse, they will lose all of the benefits of their logistics monopoly because an AV semi can afford to be inefficient.
I'm just curious how this is going to play out. Will Anthropic just stop releasing its stuff on Bedrock now? Will they try to start moving their operations out of the US? If so, to where?
I know everyone's excited about Football and the Knicks, but this is far more exciting and interesting than any sport could be.
In theory, there currently isn't any technical mechanism for the retention Anthropic is requiring here to happen. Either Amazon is not being truthful (unlikely) or Anthropic is requiring them to significantly alter the architecture they have built around hosting of their models. But doing so would, I imagine, put Amazon in strict violation of the commitments they have made to enterprises - many of which are themselves bound to these requirements by their own legal or regulatory obligations.
There are missing pieces I can't reconcile here - but one answer is simply that there aren't answers and that would be why Jassy is EXTREMELY pissed at Anthropic right now.
It feels like it's mostly just tuned to up its level of capability on long horizon tasks - stop context rot and keep persisting at all costs until a goal is done.
The base intelligence does not feel much greater to me.
Listen - that's the sound of millions of companies and users doubling down on Chinese models.
It might be a national security problem for other nations to have access to these models. But it's equally now a national security problem for any other nation to depend on them. Or US tech in general.
As it happens, the current number-two article on HN is about a similar consequence of Chinese export controls--a car manufacturer developing electric motors that do not use rare earths:
The incentives around OSS become stronger the further down in the list of market leaders a company is. The #1 company has no particular incentive to push open software apart from a belief that the market is going to become commoditised anyway. But the 2nd or 3rd largest player has actual incentives to break the market up and remove software quality as a consideration. The #10 may as well not bother with a proprietary option, since if they make it a software quality battle they're going to lose each customer 9 times anyway.
Just because the Chinese are running export controls in one market doesn't mean that they're going to close off access to AI. They might, but each market should be considered in isolation.
And it is nearly always hubris - the people making these decisions are surrounded by yes-men who built their whole career pumping up the egos of their superiors.
A sample of one, but I was getting more stuff done despite Fable using tokens twice as fast as Opus, because it understood the goals so well and worked to achieve them.
Can you give an example of what those "toughest problems/great code" are? I don't need to know the prompt nor the output, but the general idea, what it is about.
Some very tough computational geometry problems I couldn't solve on my own, nor with the assistance of other AIs or my colleagues. Fable did them all first go. The most impressive built a custom optimizer with a ludicrous number of adaptive switches that absolutely crunched through an error surface with a bunch of nontrivial nullspaces and some wild curvature. That optimizer is of independent interest; it's not totally novel in theory, but the implementation is an impressive piece of engineering.
Was it a better prompt? Have you tried giving the same prompt to other models?
I have found out that the mistakes of other models (which I choose first to save money) help me refine the prompt more and more, until I am fed up and pick Opus 4.8 (for example) which magically seems to get it right, but there is a lot of pre-work there...
Yes I have tried giving these same prompts to other models. The difference has been painfully clear. Switching back to Opus, it is completely unable to do anything that I had asked of Fable without significant conceptual and engineering errors. Functioning code, sure, but not even remotely capable of accomplishing the task to the accuracy I need. Sonnet, GPT 5.5, Gemini, DeepSeek, it's all the same deal. I accepted this in the past because that was just how it was. Now it's tremendously irritating.
I wish Fable really was only a minor upgrade so that it wouldn't be missed, but this feels like the difference between having a post-doctoral colleague and educating a student that I have to constantly guide and correct. It's so profound for me that some of the reactions in this thread feel like they come from another reality entirely. Or maybe they just got instantly diverted to Opus, who knows.
Given the same usage limits, I was able to get more stuff done and not even hit the usage limits, because I wasn't working on constantly fixing what Opus was trying to do, Fable just understands the task correctly and works great with the given context.
Same here, not n=3, plus the above 3 reporting, so n=6 and rising
Fable was definitely better for a variety of tasks, even accounting for using 2X the token rate. It's like the way it used the tokens faster reduced the wasted tokens, at least for the subset of those who already knew at least some optimizations...?
Yep. I love open source but there isn’t a model that comes close still to the closed source options like Opus 4.8 and that’s obvious from most people I see across the software industry as well. There are at least another few models after Opus from OpenAI and Anthropic most would go down the list using before any of the Chinese models at this point.
Opus 4.8 has taken such a beating over the last couple of days since the release of Fable, with videos online of people referring to it like the “redheaded stepchild” (is there a better way of saying this? This sounds racist). At this point, everyone is going to be seriously disappointed to fall back to that.
Yeah, not sure where the phrase originated but it does sound bad when you put some thought into it. My sister is a redhead and people loved to make fun of her growing up, telling her there's no way two parents with brown hair could have a kid with red hair, so the mailman (who also had red hair) was obviously her dad.
MiniMax M3 is surprisingly powerful, and open weight (or is about to be). There's others in this space too: MiMo v2.5, GLM 5.1. There's quite a few to pick from if you want strong models running on "your" hardware.
For starters, there's a C++ application written with MFC and an absolute ton of inline assembly and threading (yes, in a 1990s C++ application). I'm porting it to macOS/Linux currently.
Opus 4.6+ is able to make slow progress, but it takes several revisions per workstream. It requires constant supervision as it often creates convoluted solutions that expand the code in bloated ways. It works, but still requires my constant input.
Fable was able to almost one shot most of the big migrations with very few bugs, and was able to fix those bugs with 1 review pass. I almost didn't believe it. I was able to put it on a task (with dangerous permissions) and come back hours later to see it done, working, and clean.
I tried DeepSeek v4 and it wasn't able to make any meaningful progress at all. It kept creating dangling pointers and had trouble understanding that the inline assembly needed to be replaced if we were to compile for 64 bit. It kept getting stuck and looping on the same problem, without making progress.
What I do use DeepSeek for is lots of the automations on my websites. I find DeepSeek is fantastically cheap and fast and effective at summarization, collation, generating reports, finding and reporting issues from logs, etc. But I haven't found a way to get it to effectively port 90's C++ code to modern, cross-platform standards. But I want to be clear - I really like DeepSeek and use it wherever I can. I mean, it's so affordable!
I was building a cli tool that showed a graph of git commits, kind of like git log --graph, and deepseek v4 simply could not figure out a specific ui quirk where things weren't lining up correctly. I spent like $0.10 and 30 minutes trying to figure it out on deepseek.
Then I had deepseek summarize the bug, gave it to Opus, and it solved it in $1.12 and five minutes.
Fun fact, I was trying this afternoon Deepseek vs Opus 4.8 high, and I was surprised at how good Deepseek was. It outperformed Opus 4.8 on multiple occasions.
I found out only later that I was using v4 flash and not pro (I had mistakenly set the model to deepseek-chat instead of v4-pro).
There are aspects of Deepseek I don't like though: when pushed back on, it will eagerly bend instead of reasoning and advocating for its points, something Opus 4.7 and later models started doing a lot (even when wrong).
Which models? I'm curious what kind of more specific hypothesis you're willing to put forth. Anthropic going to lose 20-30-40-50% of users to Deepseek? What?
I quit paying for Claude Code to buy z.ai's coding plan for use with OpenCode. I'm not a power user, but I don't regret switching away from Claude. OpenCode is generally nicer for my work.
Why z.ai and not an ollama pro plan that can use all the open models? Real question, not snark. I've only ever done ollama and wonder what I'm missing.
Because I bought a year's subscription in December, when it was still $6/mo :P
I have decently capable hardware, but stuff like Qwen 3.6 and Gemma 4 still don't compare to agentic editing with a frontier model. Right now, OpenCode's $10/mo "Go" plan is what I'd be looking to try once my year expires.
As a non-US person, I will use whatever is the best and reasonably priced. I could not give one iota about who makes or hosts these models. The origin or political leanings of these models mean nothing in my usage calculus.
Chinese models are next, the whole reason this is happening is because they don't want China to steal their tech. It is no secret anymore that they have been distilling US models. That's why it is explicitly aimed at foreign nationals.
TikTok wasn’t open source and able to be run on your own hardware. Banning Alibaba’s models (or even a personal fork of them) running on your machines seems hard to defend in court.
The two main bills I'm aware of are the Decoupling America's AI Capabilities from China Act and No Adversarial AI Act. The former would have made it illegal for any American citizen to simply use DeepSeek. I couldn't find any lobbying data, but the obvious effect is that Americans would be forced to pay for more expensive domestic alternatives.
A House committee also recently probed Cursor and Airbnb for using Chinese models, rather than more expensive American alternatives. A sexagenarian Congressman gave a nonsense quote that he certainly did not come up with himself,[1] which sounds very similar to language Anthropic uses in its marketing materials.[2][3]
Moolenaar's quote: "The AI models these companies use are trained by China’s censorship regime and introduce hidden vulnerabilities that put Americans’ data and businesses at risk." That is, Americans using Chinese-trained AI models are exposed to some form of cybersecurity risk.
That's not really a threat model described in either of the Anthropic posts you share, which mainly talk about the risks of allowing authoritarian regimes to use powerful US-trained models, and the geopolitical risks of authoritarian countries developing strong AI before democratic/liberal countries do.
They'll have to remove sections like this from their AI Action Plan
> We need to ensure America has leading open models founded on American values. Open-source and open-weight models could become global standards in some areas of business and in academic research worldwide. For that reason, they also have geostrategic value. While the decision of whether and how to release an open or closed model is fundamentally up to the developer, the Federal government should create a supportive environment for open models.
The US will ban American companies from using Chinese models and also ban them from dealing with companies who use Chinese models. “Code produced by Chinese models may deliberately introduce backdoors and vulnerabilities”, that kind of thing.
To do what? I mean they’re good models, but frankly, they fucking suck (relatively speaking). I’m not looking to go back to a week of back-and-forth with the LLM once I’ve gotten used to all this one shotting.
works really well with pi for small to medium sized coding tasks for me - C++ is an interesting case since it's probably more challenging just due to the complexity of the syntax. But it works great with Groovy which is another slightly off-mainstream language (these days).
I use DSv4 through opencode. I use it from deepseek directly, not through a third-party platform.
I mostly do C# and some frontend. I was starting to feel really depressed and unengaged at work because I was starting to use AI far too much like a magic slot machine. I'm now making a conscious effort to go back to using it as a tool used a bit more deliberately.
I'm not even using the pro model. The flash version is fast so I can keep it interactive rather than context switching to reddit while the model is working, and it turns out using my brain means I don't really need the model to be that smart.