I think it can't be improved because it's measuring the wrong thing. A junior engineer becomes a senior when they stop being told what code to write and start solving business needs. That's why the highest paid engineers often aren't the ones who would do the best on LeetCode - or SWE-Bench Pro Verified.
Maybe AGI is possible and we'll have software-defined human intelligence that's completely autonomous, but that's not coming in the next slightly better RL-trained LLM, and if it existed it likely wouldn't be under our control anyway.
Google's machine translation team wrote the "Attention Is All You Need" paper that introduced transformers, specifically to solve the problem that you can't just model language by mapping one word to another. I'd be floored if they weren't using the tech they invented for its intended purpose.
Yeah. LLMs, machine translation, CJK keyboards, they are all the same technology; faster cars relative to each other, not cars vs horse-drawn carriages. It'd be surprising if they didn't directly apply any applicable learnings back to Google Translate.
I think it's because they're running out of ideas too, BUT the current generation of foldables (Galaxy Fold 7 for example) are essentially indistinguishable from non-folding phones when closed. Yes, that means they could have made a thinner phone overall - the Galaxy Fold is the same thickness as the iPhone 17 Pro Max, but both are twice as thick as the Air - but I think consumers have gotten used to thick, heavy phones - it's why the SE and Air don't sell as well IMO.
Rescheduler is impractical because scheduling is environment specific. You might have, for example, a database that needs three nodes and you have three servers; there's nowhere to reschedule those pods to in that case.
In the cloud, however, you can use cluster autoscaler or Karpenter to automatically handle the unhomed pods.
What happens is that when one node goes down, the pod gets removed and moves to another node. The node comes back up, but k8s doesn't rebalance that pod despite it having an anti-affinity.
To me, it's bad that this feature doesn't exist in core k8s. It should be able to do this. Controllable, for sure.
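For anyone who hasn't hit this, here's a toy sketch of the behaviour described above. It is not the real scheduler: `place` and `simulate_outage` are made-up helpers, the "spreading rule" is just "pick the node with the fewest pods", and it assumes a soft (preferred) anti-affinity so the evicted pod is allowed to double up. The point is only that placement is decided when a pod is scheduled, and nothing moves when the node returns.

```python
def place(pod, ready_nodes, placement):
    """Toy spreading rule: put the pod on the ready node with the fewest pods."""
    counts = {node: 0 for node in ready_nodes}
    for node in placement.values():
        if node in counts:
            counts[node] += 1
    placement[pod] = min(ready_nodes, key=lambda node: counts[node])


def simulate_outage():
    """Schedule three pods, lose one node, bring it back, return the placement."""
    nodes = ["node-a", "node-b", "node-c"]
    placement = {}
    for pod in ("db-0", "db-1", "db-2"):
        place(pod, nodes, placement)

    # node-c goes down: only the pods that were on it get re-placed.
    ready = [node for node in nodes if node != "node-c"]
    evicted = [pod for pod, node in placement.items() if node == "node-c"]
    for pod in evicted:
        del placement[pod]
        place(pod, ready, placement)

    # node-c comes back: it is schedulable again, but running pods
    # are not re-evaluated, so nothing is moved onto it.
    ready.append("node-c")
    return placement


final_placement = simulate_outage()
for pod, node in sorted(final_placement.items()):
    print(pod, "->", node)
```

In this toy run one node ends up with two of the three pods and the recovered node sits empty until something deletes or restarts a pod.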
Insert the "No god no" meme here - you really shouldn't be updating nodes in place and thus shouldn't be restarting nodes.
I'm aware bare metal exists and it's not always practical to just provision more servers, yet I think for most workloads you're not getting the benefit of Kubernetes if you have say 3 servers and lose 1/3 of your capacity to do software updates.
I’ve never understood the gatekeeping people wrap around kubernetes.
Even with a small 3-node cluster of raspberry pis, you can run anything you can run in simple docker, and have it survive outages/reboots/etc.
At home, I have a few raspberry pis, orangepi RV (riscv nodes), and my main nodes are large high core and RAM VMs running on Proxmox.
Each one has different capabilities. Some have lots of fast storage attached for longhorn, some have 10Gb/25Gb networking, etc.
And the great part is if I wanted to collapse down to just the SBCs? I would just need to scale down some replicas of high mem or high cpu stuff I’m testing.
Of course at my job, I just pick the node shape and capabilities I need and don’t think about it.
Yeah, I’m probably the exception for running kubernetes at home, but I would argue if you are running more than a handful of docker containers, you should probably be using kubernetes anyway.
Especially if you care about things being up, or want to be able to seamlessly shuffle stuff around for maintenance. Not to mention my entire infrastructure is repeatable with just a small git repo of fluxcd stuff
I'm not personally trying to gatekeep kubernetes, everyone should do what works for them. However, if I'm putting my professional credibility and/or my sleep schedule on the line, I would not advise anyone to do this.
Even at home, I run stuff that needs to be highly available enough that I wouldn't go this route when there are better options.
Okay, well you’ve still not highlighted what the preferred method is here, since even with a 42u rack and a cluster in GCP, you still wouldn’t run kubernetes.
It just takes time to design your hardware/software stack to be able to survive reboots and recover back to ideal states. I guess nobody really enjoys rebooting machines, but at the same time, I don't think people should be afraid of doing it.
It makes sense when you consider LLMs don't generalize very well, so they're heavily dependent on how good (how varied as well as how high quality) the training data is
Well it might explain why pro-Claude vs pro-Codex people keep talking past each other on this forum. I see people all the time assuming that anybody who likes Codex must be some sort of bot because of their own biases, but I work almost exclusively in Rust and find Codex extremely competent (and a much better overall engineer), don't trust Claude/Opus at all... but I see in this bench it scores lower on TypeScript etc. than Opus does.
Essentially, the more turns you have, the more likely the agent is to fail, since the error compounds per turn. Agentic models are tuned for “long horizon tasks”, i.e. being able to go many, many turns on the same problem without failing.
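A back-of-the-envelope sketch of the compounding, assuming (unrealistically) that every turn succeeds independently with the same probability. The 0.99 figure and the `success_probability` helper are made up for illustration, not taken from the benchmark.

```python
def success_probability(per_turn_success: float, turns: int) -> float:
    """Chance that all `turns` turns succeed, assuming independent turns."""
    return per_turn_success ** turns


# Even a 99% per-turn success rate erodes quickly over a long task.
for turns in (1, 10, 50, 100):
    print(turns, round(success_probability(0.99, turns), 3))
```

Under that assumption a 100-turn task completes cleanly only about a third of the time, which is why per-turn reliability matters so much for long horizon work.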
For the tasks in SWE-Bench Pro they obtained a distribution of agent turns, summarized as the box plot. The box likely describes the inter-quartile range while the whiskers describe some other range. You'd have to read their report to be sure. https://en.wikipedia.org/wiki/Box_plot
That's a box plot, so those are not error bars but a visualization of the distribution of a metric (min, max, median, 25th percentile, 75th percentile).
The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken over all those tasks.
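If it helps, this is roughly how the five numbers behind a box plot are computed. The turn counts below are invented, `five_number_summary` is a hypothetical helper, and the quartile method ("inclusive") is my choice; the report may well use a different convention, particularly for the whiskers.

```python
import statistics


def five_number_summary(values):
    """Return min, quartiles, and max for a list of numbers."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }


# Hypothetical number of turns taken on each of nine tasks.
turns_per_task = [9, 1, 8, 2, 7, 3, 6, 4, 5]
print(five_number_summary(turns_per_task))
```

The box spans q1 to q3 with a line at the median; none of it says anything about measurement error.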