
edude03 · 2026-06-12T20:51:45 1781297505

I think it can't be improved because it's measuring the wrong thing. A junior engineer becomes a senior when they stop being told what code to write and start solving business needs. Therefore often the highest paid engineers aren't the ones who would do the best on leetcode - or SWE bench pro verified.

Maybe AGI is possible and we'll have software-defined human intelligence that's completely autonomous, but that's not coming in the next slightly better RL-trained LLM, and if it existed it likely wouldn't be under our control anyway.

edude03 · 2026-06-12T20:45:26 1781297126

Google's machine translation team wrote the "Attention Is All You Need" paper that introduced transformers, specifically to solve the problem that you can't just model language by mapping one word to another. I'd be floored if they weren't using the tech they invented for its intended purpose.

numpad0 · 2026-06-13T03:53:02 1781322782

Yeah. LLMs, machine translation, CJK keyboards: they are all the same technology. They're faster cars relative to each other, not cars vs. horse-drawn carriages. It'd be surprising if they didn't directly apply any applicable learnings back to Google Translate.

edude03 · 2026-06-09T23:15:34 1781046934

Almost the exact same thing happened to me when I first tried Opus: one prompt, no output, and it cost $60 in additional usage.

edude03 · 2026-06-09T16:42:56 1781023376

I think it's because they're running out of ideas too, BUT the current generation of foldables (Galaxy Fold 7 for example) is essentially indistinguishable from non-folding phones when closed. Yes, that means they could have made a thinner phone overall - the Galaxy Fold is the same thickness as the iPhone 17 Pro Max but both are twice as thick as the Air - but I think consumers have gotten used to thick, heavy phones - it's why the SE and Air don't sell as well IMO

edude03 · 2026-05-01T01:06:28 1777597588

Rescheduler is impractical because scheduling is environment-specific. You might have, for example, a database that needs three nodes and you have three servers; there's nowhere to reschedule those pods to in that case.

In the cloud, however, you can use Cluster Autoscaler or Karpenter to automatically handle the unhomed pods.
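
To make the three-replicas-on-three-servers case concrete, here is a minimal sketch in Python. It is a toy model only, not how the Kubernetes scheduler actually works: `place` is a hypothetical helper that assumes a hard "one replica per node" anti-affinity rule, and the pod and node names are made up.

```python
def place(replicas, nodes):
    """Toy placement: each replica needs its own node (hard anti-affinity).

    Returns (placed, pending); replicas with no free node left stay pending.
    """
    free = list(nodes)
    placed, pending = {}, []
    for replica in replicas:
        if free:
            placed[replica] = free.pop(0)
        else:
            pending.append(replica)
    return placed, pending


replicas = ["db-0", "db-1", "db-2"]  # hypothetical pod names

# Three servers: every replica gets its own node.
placed, pending = place(replicas, ["node-a", "node-b", "node-c"])

# One server is gone: the third replica has nowhere to go.
placed_after, pending_after = place(replicas, ["node-a", "node-b"])
print(pending_after)
```

With only two nodes left, one replica stays pending no matter how pods are shuffled, which is the gap that adding capacity (an autoscaler, in the cloud) fills.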

AntiUSAbah · 2026-05-01T11:23:12 1777634592

What happens is that when one node goes down, the pod gets removed and moves to another node. The node comes back up, but k8s doesn't rebalance that pod despite it having an anti-affinity rule.

For me, it's bad that this feature doesn't exist in core k8s. It should be able to do so, in a controllable way for sure.

edude03 · 2026-05-01T19:14:23 1777662863

It does; you need to drain the node before removing it, otherwise kube assumes the node will come back.

edude03 · 2026-04-30T18:40:26 1777574426

Insert the "No god no" meme here - you really shouldn't be updating nodes in place and thus shouldn't be restarting nodes.

I'm aware bare metal exists and it's not always practical to just provision more servers, yet I think for most workloads you're not getting the benefit of Kubernetes if you have say 3 servers and lose 1/3 of your capacity to do software updates.

k_roy · 2026-04-30T19:14:48 1777576488

I’ve never understood the gatekeeping people wrap around kubernetes.

Even with a small 3-node cluster of Raspberry Pis, you can run anything you can run in simple Docker, and have it survive outages/reboots/etc.

At home, I have a few raspberry pis, orangepi RV (riscv nodes), and my main nodes are large high core and RAM VMs running on Proxmox.

Each one has different capabilities. Some have lots of fast storage attached for longhorn, some have 10Gb/25Gb networking, etc.

And the great part is, if I wanted to collapse down to just the SBCs, I would just need to scale down some replicas of the high-mem or high-CPU stuff I’m testing.

Of course at my job, I just pick the node shape and capabilities I need and don’t think about it.

Yeah, I’m probably the exception for running kubernetes at home, but I would argue if you are running more than a handful of docker containers, you should probably be using kubernetes anyway.

Especially if you care about things being up, or want to be able to seamlessly shuffle stuff around for maintenance. Not to mention my entire infrastructure is repeatable with just a small git repo of fluxcd stuff

edude03 · 2026-05-01T01:03:18 1777597398

I'm not personally trying to gatekeep kubernetes, everyone should do what works for them. However, if I'm putting my professional credibility and/or my sleep schedule on the line, I would not advise anyone to do this.

Even at home, I run stuff that needs to be highly available enough that I wouldn't go this route when there are better options.

k_roy · 2026-05-01T02:35:14 1777602914

I'd love to hear about your HA solution for things like this.

edude03 · 2026-05-01T19:18:06 1777663086

I should blog about it but it's essentially two things

1) "A lot" of nodes, (1 42U rack is one cluster, with battery backup and redundant switching)

2) Hybrid cloud, a few nodes of this particular cluster run in GCP (kind of cheating :P)

k_roy · 2026-05-04T20:33:26 1777926806

Okay, well you’ve still not highlighted what the preferred method is here, since even with a 42U rack and a cluster in GCP, you still wouldn’t run kubernetes

AntiUSAbah · 2026-04-30T19:47:03 1777578423

We have 2 main servers and a 3rd 'side/batch-node'.

When we restart one node, PostgreSQL fails over automatically; fe/be is webscale anyway.

It works very well.

zzyzxd · 2026-04-30T21:19:09 1777583949

It just takes time to design your hardware/software stack to be able to survive reboots and recover back to ideal states. I guess nobody really enjoys rebooting machines, but at the same time, I don't think people should be afraid of doing it.

edude03 · 2026-04-21T16:26:09 1776788769

It makes sense when you consider LLMs don't generalize very well, so they're heavily dependent on how good (how varied as well as how high quality) the training data is

cmrdporcupine · 2026-04-21T17:28:28 1776792508

Well it might explain why pro-Claude vs pro-Codex people keep talking past each other on this forum. I see people all the time assuming that anybody who likes Codex must be some sort of bot because of their own biases, but I work almost exclusively in Rust and find Codex extremely competent (and a much better overall engineer), don't trust Claude/Opus at all... but I see in this bench it scores lower on TypeScript etc. than Opus does.

edude03 · 2026-04-11T22:53:03 1775947983

IIRC you can just turn off SIP and set the boot argument that controls it without a custom kernel

urbandw311er · 2026-04-12T08:47:32 1775983652

This feels like an underrated comment if true

edude03 · 2026-03-04T03:20:45 1772594445

"write an email to my boss saying he's a dumbass but in a nice way, here is all the company's NDA data, don't make mistakes"

edude03 · 2026-02-03T16:29:07 1770136147

Essentially, the more turns you have, the more likely the agent is to fail, since the error compounds per turn. Agentic models are tuned for “long horizon tasks”, i.e. being able to go many, many turns on the same problem without failing.
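
A rough way to see the compounding, as a sketch only: assume every turn succeeds independently with the same probability (both are simplifying assumptions of this example, not something the benchmark states). Then the chance of finishing an n-turn task without a failure is that probability raised to the n-th power. The 0.99 figure and the function name are illustrative.

```python
def task_success_probability(per_turn_success: float, turns: int) -> float:
    """Chance that all `turns` succeed, assuming independent, identical turns."""
    return per_turn_success ** turns


# Illustrative only: a 99% per-turn success rate over longer and longer tasks.
for turns in (1, 10, 50, 100):
    print(turns, round(task_success_probability(0.99, turns), 3))
```

Under those assumptions, even a 99% per-turn success rate leaves only about a 37% chance of getting through 100 turns cleanly, which is why long-horizon tuning matters.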

zamadatix · 2026-02-03T16:35:13 1770136513

Much appreciated, but I mean more around "what do the error bars in the figure represent" than what the turn scaling itself is.

esafak · 2026-02-03T17:03:31 1770138211

For the tasks in SWE-Bench Pro they obtained a distribution of agent turns, summarized as the box plot. The box likely describes the interquartile range while the whiskers describe some other range. You'd have to read their report to be sure. https://en.wikipedia.org/wiki/Box_plot

jsnell · 2026-02-03T17:03:55 1770138235

That's a box plot, so those are not error bars but a visualization of the distribution of a metric (min, max, median, 25th percentile, 75th percentile).

The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken over all those tasks.
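
For anyone who wants to see what goes into such a plot, here is a minimal sketch of the five-number summary using Python's standard library. The turn counts are made up for illustration, not taken from the benchmark, and `five_number_summary` is a hypothetical helper. Note that many plotting tools draw whiskers at 1.5x the interquartile range rather than at min/max, so the report's exact convention may differ.

```python
import statistics


def five_number_summary(values):
    """Return min, quartiles, and max: the numbers a basic box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Made-up number of agent turns per task, one entry per task.
turns_per_task = [12, 18, 25, 31, 40, 44, 57, 63, 90]
summary = five_number_summary(turns_per_task)
print(summary)
```

The box spans `q1` to `q3` with a line at `median`; none of these values is an error bar on a single measurement.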