You're obviously just a loser trying to rage bait. But China outclasses the West on "brains" and hours worked by an order of magnitude. In under 10 years they've started entire industries where the West had half a century of R&D as a moat, and they've reached parity with the West in almost all cases.
You do understand that China copied the R&D from the West? It cannot be that China invented all the technology in < 10 years when other countries had to research for decades. One example: Germany led research in solar panels, and China was able to replicate and mass-produce them, but without the initial German investment in R&D, it would have taken China decades longer to obtain that technology.
It's clear to me these models are useless on any real-world task: a 4% pass rate on $20-30/hr Upwork tasks. This whole trend of agentic engineering is a giant money grab.
Missing some recent models on that list, but I think most crucially, the harness is fixed. One of the major learnings of the last few months is that harness and eval (“looping” and the support / tooling around it) are really critical. I would guess these numbers are the floor.
For instance, some of these tasks include creating videos, and one of the commonly reported failure modes is truncated videos, or not all videos being created. This sort of failure mode is currently best managed by an outer evaluation loop; no frontier model will, when managed by an eval loop, submit work like this right now.
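To make the "outer evaluation loop" idea concrete, here's a minimal sketch in Python. Everything in it is an assumption for illustration: `Spec`, `Video`, `find_problems`, `run_with_eval_loop` and the one-second tolerance are made-up names and values, not anything from the benchmark's harness, and `fake_agent` is a stand-in for a real model call. The point is only that the deliverables get checked for missing or truncated videos, and the problems are fed back, before anything is submitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Spec:
    name: str
    expected_duration_s: float  # what the task asked for


@dataclass
class Video:
    name: str
    duration_s: float  # what was actually rendered


def find_problems(specs: List[Spec], videos: List[Video],
                  tolerance_s: float = 1.0) -> List[str]:
    """Return a list of problems; an empty list means the work passes."""
    by_name = {v.name: v for v in videos}
    problems = []
    for spec in specs:
        video = by_name.get(spec.name)
        if video is None:
            problems.append(f"missing video: {spec.name}")
        elif video.duration_s + tolerance_s < spec.expected_duration_s:
            problems.append(
                f"truncated video: {spec.name} "
                f"({video.duration_s}s of {spec.expected_duration_s}s)"
            )
    return problems


def run_with_eval_loop(specs: List[Spec],
                       generate: Callable[[List[str]], List[Video]],
                       max_attempts: int = 3) -> Tuple[List[Video], int]:
    """Call generate(feedback) until the checks pass or attempts run out."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        videos = generate(feedback)
        feedback = find_problems(specs, videos)
        if not feedback:
            return videos, attempt  # only now would the work be submitted
    raise RuntimeError(f"still failing after {max_attempts} attempts: {feedback}")


def fake_agent(feedback: List[str]) -> List[Video]:
    # Hypothetical stand-in for a model: the first attempt is broken,
    # the retry (which receives feedback) is complete.
    if not feedback:
        return [Video("intro", 4.0)]  # truncated, and "outro" is missing
    return [Video("intro", 10.0), Video("outro", 20.0)]


specs = [Spec("intro", 10.0), Spec("outro", 20.0)]
videos, attempts = run_with_eval_loop(specs, fake_agent)
print(f"accepted {len(videos)} videos after {attempts} attempts")
```

A fixed single-pass harness would have submitted the first, broken attempt; the loop catches it and retries with the problem list as feedback.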
Did you even read the release? It wasn't broadly released to anyone.
> At their request, we are starting with a limited preview for a small group of trusted partners whose participation has been shared with the government, before releasing more broadly.
This comment is an excellent example of why the average LLM user is basically a slot machine user who thinks "this one is hot, this one is lucky, this one is better than the others" and constantly switches between models on the whim of some occult understanding that only they possess.
Also, who cares about some 80% benchmark? They train on these public benchmarks in order to impress people like yourself who ascribe meaning to them. How come they only get a 4% pass rate on $20-30/hr Upwork tasks? It seems to me like these benchmarks are basically useless... There's a thing called variance; I'm not sure why higher scores on a few tests would lead you to believe you have access to a model that they say you don't have access to.
Don't appreciate the slander, but I'll respond anyhow.
Contrary to your predisposition, we're actually quite peeved that we might be seeing results from 5.6 instead of 5.5, as it's muddying our own internal data.
We've run the tasks on this benchmark hundreds of times for our own internal harness. It got magically better yesterday. Last week we were seeing worse performance (sub-80%).
I agree that benchmarks don't mean much for real world use, and I'm a bit disappointed at the lack of variety in the published benchmarks so far.
With that said, 88.8% is higher than Mythos, and the highest I've seen from vanilla Codex. If 5.6 is any better than 5.5, you'd think they would avoid publishing just one coding-related benchmark with a score that equals their previous model.
> I'm not sure why a higher scores on a few tests [..]
It's not just higher scores; the API is no longer flagging tests with the cybersecurity warnings it had been raising for weeks.
One-shot prompting/tooling is the only reasonable way to use an LLM, in my opinion. You should not have an LLM operating for hours, creating thousands of lines of new code that you can never review or maintain. You can actually be highly productive modifying a single file or two at a time, ideally with as focused and as little context as possible, without the LLM being given full permission to add as much context as possible along the way to maximize revenue for the developers of the harness.
The agentic engineering paradigm is just a narrative trend pushed by AI companies to get people to 10x their token consumption per prompt. It plays into people's laziness and addiction to dopamine too, causing addict-like behavior in people who fall prey to this trend.
If I do that, I'm literally slower than if I just made the change myself, without having to specify it sufficiently for the model.
I can see how a junior dev or generally someone that's not particularly knowledgeable about the language or framework they're working with may benefit from such usage, but for experienced people there is very little value in that approach.
I say this because I've just had to face this decision this month, with Copilot introducing usage-based billing. I attempted to scale back my usage, first by switching to non-Opus models: the output essentially became discardable, as it continually hallucinated nonexistent fields in API responses, etc. Then by scoping the changes smaller and smaller, until I ultimately gave up and reduced usage to just generating tests.
I agree. And at work it has been producing some of the worst GUI test cases I have ever seen.
What is tested often makes no sense at all, completely implausible edge cases are tested on internals, while it doesn't create tests for the overall application using user events.
And some things in these test cases are downright ridiculous: instead of instantiating your classes, it sets up some barebones fake objects reimplementing some of the behavior of your actual class, then ignores the TypeScript errors via force cast or similar.
Then it proceeds to slap some test ids on the output, stub components and dependencies more or less randomly, add some assertions on test ids and call it a day.
Apparently that's good enough for many colleagues to open a MR for that garbage.
That said, at home with SOTA models I happily hand large units of work to it, outsource much of the thinking, and get workable results. I think this is the future.
I see little value in throwing a ton of context at an LLM and waiting 10-20 minutes for a coin flip on whether or not it's going to produce junk. I'd rather do quick 60-second turns, get most of the way there and fix the rest myself if I have to. Honestly, I'd rather just not use them.
Well, the point was that I'd rather spend 30 seconds doing it myself than formulate a prompt with enough context for the model to implement it within 60 seconds. Also, these numbers are unrealistic.
Everyone I've ever interacted with who claims to prompt in "seconds" actually needs multiple minutes to think about the solution they want the model to implement, and then needs twice as long to formulate that into a sentence which provides the model enough context to actually do that.
So the more realistic estimates are "I'd rather spend the 2 minutes just implementing the minor change myself, instead of spending 1.5 minutes thinking about it, then 2.5 minutes writing the prompt and then waiting 1 minute for it to finish."
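Spelling that estimate out as arithmetic, using only the numbers from the comment above (the function name is just for illustration, and the figures are the commenter's guesses, not measurements):

```python
def prompt_route_minutes(think: float, write_prompt: float, wait: float) -> float:
    """Total wall-clock minutes for the 'just prompt it' route."""
    return think + write_prompt + wait


manual_minutes = 2.0  # implementing the minor change by hand
prompted_minutes = prompt_route_minutes(think=1.5, write_prompt=2.5, wait=1.0)

overhead = prompted_minutes - manual_minutes
ratio = prompted_minutes / manual_minutes

print(f"manual: {manual_minutes} min, prompted: {prompted_minutes} min")
print(f"prompting costs {overhead} min more ({ratio}x the manual time)")
```

On those numbers the prompt route takes 5 minutes against 2, so the model's 1 minute of generation is the smallest part of the cost.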
I would agree with all those points, and my numbers are a little off. I really just don't want to use any of it. I'm more excited about fast FIM autocomplete that works well, something like Cursor Tab without Cursor. If something can increase my wpm and take strain off my fingers, that would be nice. At this point latency and accuracy are terrible, though.
The trick is to do something else in those 20 minutes (or, ideally, even longer).
That's the main value I've been getting out of coding agents. I have them do (comparatively) simpler tasks or explorative tasks in the background while I'm in a meeting, doing code reviews, or otherwise working on something else.
I think that LLMs will stay, but I also think we've plateaued, that big companies will fail and fall, and that we will have another years-long "halt" of any real advancements coming to the public.
Similar to how ML was all the hype about 12 years ago and then it submerged again for a couple of years.
> we will have another years long "halt" of any real advancements coming to the public
One can hope. Probably an unpopular take here but I'm tired boss.
The software world has a huge backlog of things that can all be done with the tech we currently have, no breakthrough advancements needed, but none of it will get prioritized when we're all forced to run on the new and shiny treadmill. Ever since the LLM hype, it's like the JavaScript culture of a new framework every 10 minutes has infected every other vertical of software development, and I'm exhausted.
This is probably the dumbest possible way to do it. Just buy tokens through OpenRouter and you could run it all month 24/7 at 100tps for practically nothing. There are tons of ways to pay for things without giving your personal information.
Vibe coders need to be forced to spend one day learning basic CSS before they're allowed to use an LLM to make a website, and the internet would be a lot more pleasant as we move forward with slopification. It doesn't have to be sloppy, and it doesn't take all that much studying to at least be able to steer an LLM in the right direction to make something look nice. At this point everything is just the same 3 colors and a centered flex column with weird spacing.
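For a sense of scale on "all month 24/7 at 100tps", here's the token count worked out. The volume follows directly from the numbers in the comment; the price is not given anywhere in the thread, so `usd_per_million_tokens` is a parameter and the rate used below is a made-up placeholder, not a real OpenRouter price.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tokens_per_month(tokens_per_second: int, days: int = 30) -> int:
    """Tokens generated when running nonstop for `days` days."""
    return tokens_per_second * SECONDS_PER_DAY * days


def monthly_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost at a flat per-million-token rate (rate supplied by the caller)."""
    return tokens / 1_000_000 * usd_per_million_tokens


total_tokens = tokens_per_month(100)
print(f"{total_tokens:,} tokens in a 30-day month")

# Hypothetical placeholder rate; substitute the real price of your model.
hypothetical_rate = 0.10
print(f"~${monthly_cost_usd(total_tokens, hypothetical_rate):.2f} at "
      f"${hypothetical_rate}/M tokens")
```

That is roughly 259 million tokens a month, so whether it's "practically nothing" depends entirely on the per-million rate of the model you pick.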