
nsingh2 · 2026-06-12T23:39:10 1781307550

Could you provide some details, if possible, like what model & thinking effort, what kinds of tasks? I used to swap between Claude Code and Codex often, and these days use Codex more because of the usage limits. Wondering if I should go to Claude for a month, I get a strange FOMO when I read vague comments like this.

The one major difference I noticed is that the GPT models are more analytical (e.g. better at mathematical analysis and code review), while Claude models tend to write more straightforward code. Besides that, I don't really see any significant differences.

There are a few gotchas with swapping, like being careful with AGENTS.md/CLAUDE.md naming (Claude Code only recognizes CLAUDE.md, and I think Codex only works with AGENTS.md), and updating skill files to match the tool.

colechristensen · 2026-06-12T23:58:14 1781308694

I just symlink AGENTS.md and CLAUDE.md
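A minimal sketch of that symlink approach, assuming AGENTS.md is the real file and CLAUDE.md should point at it. The helper name `link_agent_files` and its defaults are illustrative only, not part of either tool:

```python
from pathlib import Path


def link_agent_files(repo_dir: Path, source: str = "AGENTS.md",
                     alias: str = "CLAUDE.md") -> Path:
    """Make `alias` a relative symlink to `source` inside `repo_dir`.

    Hypothetical helper: which file is the source and which is the alias
    is an assumption, so swap the arguments if you prefer the reverse.
    """
    src = repo_dir / source
    dst = repo_dir / alias
    if not src.exists():
        raise FileNotFoundError(src)
    if dst.is_symlink():
        # Replace a stale link so the call is safe to repeat.
        dst.unlink()
    elif dst.exists():
        # Refuse to clobber a real file the user may have edited.
        raise FileExistsError(dst)
    # A relative target keeps the link valid if the repo is moved.
    dst.symlink_to(source)
    return dst
```

Calling `link_agent_files(Path("."))` from the repository root would leave both tools reading the same instructions. On Windows, creating symlinks may need extra privileges.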

I was using gpt-5.5 high. Writing terraform code for GCP, debugging app launch and Dockerfile issues, that sort of thing. It was going in loops hallucinating features of GCP, looking things up in strange ways, running terraform apply after being explicitly told in the last interaction not to, and overall not solving problems. These were very straightforward tasks and it couldn't be trusted for five minutes. It's the difference in what I would trust an early senior engineer to do vs what I would trust an unreliable high school intern to do.

nsingh2 · 2026-06-10T01:53:50 1781056430

This isn't about training on the output tokens from Anthropic models, it's just about using their models to build things like pretraining pipelines, etc. Even if you train on your own data.

From the phrasing, it sounds like any ML or infra-related work that even incidentally looks like it could be used to train LLMs may trigger a silent nerf.

nsingh2 · 2026-06-10T01:30:33 1781055033

It's such an obviously bad policy, it's mind-boggling that they thought this was a good idea. It just breeds paranoia and mistrust, especially when people are already a bit paranoid about silent model quantification for cost cutting reasons.

SXX · 2026-06-10T05:39:30 1781069970

It's not paranoia when the entity you are dealing with can't be trusted and will do everything to abuse your trust.

llelouch · 2026-06-10T05:40:58 1781070058

What's the alternative? Not release the model at all?

"Make the guardrails better" isn't very hard and probably not worth the effort.

hagbarth · 2026-06-10T06:06:06 1781071566

The alternative is to be explicit when you nerf, so users know what they are working with.

port11 · 2026-06-10T07:13:20 1781075600

I guess people would just game the system and find ways around these guardrails.

rootlocus · 2026-06-10T10:11:07 1781086267

They have enough info on you and your sessions to eventually catch you, label you as a bad-faith actor, and ban you automatically. I don't think many would risk it.

schnitzelstoat · 2026-06-10T12:51:34 1781095894

That seems to be working well for Mythos. Just never release it and keep talking about how 'dangerous' it is to pump up the IPO price.

SamvitJ · 2026-06-10T13:15:56 1781097356

Do you mean "quantization" not quantification?

nsingh2 · 2026-06-10T14:10:25 1781100625

Yup, I meant to write quantization there.

KennyBlanken · 2026-06-10T02:57:02 1781060222

Another "knob" is reducing the thinking time...

nsingh2 · 2026-05-21T15:21:08 1779376868

Works for me, Firefox 151.0 on Linux

nsingh2 · 2026-05-17T01:55:31 1778982931

Proof search isn't new, but I don't think that captures the value of LLMs.

They act as a learned proposal mechanism on top of hard search. Things like suggesting relevant lemmas, tactics, turning intent into formal steps, and ranking branches based on trained knowledge.

Maybe a kind of learned "intuition engine", from a large corpus of mathematical text, that still has to pass a formal checker. This is not really something we've had to this extent before.
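A toy sketch of that split, with assumptions stated up front: the "tactics" are simple integer operations, `propose` is a hand-written distance heuristic standing in for a trained model, and `check` plays the role of the formal checker. None of this reflects a real prover's API.

```python
import heapq
from typing import Callable, Optional

# Hypothetical "tactics": each one transforms the current state.
TACTICS: dict[str, Callable[[int], int]] = {
    "add1": lambda n: n + 1,
    "double": lambda n: n * 2,
    "sub3": lambda n: n - 3,
}


def propose(state: int, goal: int) -> list[tuple[int, str]]:
    """Stand-in for the learned model: rank tactics, lower score first."""
    return sorted((abs(goal - fn(state)), name) for name, fn in TACTICS.items())


def check(start: int, goal: int, steps: list[str]) -> bool:
    """Stand-in for the formal checker: replay every step exactly."""
    state = start
    for name in steps:
        state = TACTICS[name](state)
    return state == goal


def search(start: int, goal: int, max_nodes: int = 10_000) -> Optional[list[str]]:
    """Best-first search that expands the best-ranked proposals first."""
    frontier: list[tuple[int, int, list[str]]] = [(abs(goal - start), start, [])]
    seen = {start}
    expanded = 0
    while frontier and expanded < max_nodes:
        _, state, steps = heapq.heappop(frontier)
        expanded += 1
        if state == goal:
            # Only the checker decides whether the candidate is accepted.
            return steps if check(start, goal, steps) else None
        for score, name in propose(state, goal):
            nxt = TACTICS[name](state)
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (score, nxt, steps + [name]))
    return None
```

The proposer only affects which branches get explored first. A bad ranking costs time, but it cannot make `check` accept a wrong sequence.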

> They do not think

That claim seems less useful, unless “think” is defined in a way that predicts some difference in capability. If the objection is that LLMs are not conscious, fine, but that doesn't say much about whether they can help produce correct formal proofs.

nsingh2 · 2026-05-02T20:36:04 1777754164

I've been using GPT-5.4, and more recently 5.5, with Codex CLI + Ghidra MCP for reverse engineering a game without many issues. Injecting code is where it usually balks, but I'm just trying to discover and parse structures from game memory.

I did get a refusal when trying to read in-game currency, even though modifying it would do nothing. It has some strange boundaries.

nsingh2 · 2026-04-27T20:22:10 1777321330

> a business that puts employees first and profits for owners last can often have a shit ton of profits for owners.

Owners can make 100x that shit ton if they put profits for owners first, so why wouldn’t they do that instead? Out of the goodness of their own hearts?

nsingh2 · 2026-04-26T03:57:56 1777175876

> The solution, if there is one, has to come from innovation from the private economy.

Why? The problems you described (offshoring, consolidation, automation) came from private sector incentives, not to mention debt-driven consumption and turning basics like housing, healthcare, and education into profit centers.

Why would those same incentives magically fix the problem on their own?

> And there isn't too much the US government can do to revert this economic decline

This is ahistorical. The post-Great Depression economy that led to the “American Dream” was supported by huge public spending and actions by the government [1]. Revitalization happened before; it can happen again.

So much came from FDR/New Deal, social security, labor law, housing finance, banking regulation, securities regulation. Saying the US government can't really do that much is ridiculous.

[1] https://www.archives.gov/milestone-documents/president-frank...

nsingh2 · 2026-04-23T21:51:16 1776981076

These plots are terrible. Why is categorical data connected across categories with lines? Why not just use bar plots?

Like in the "Web Vulns in OSS" plot, white box data for Opus 4.7 is not available, but the absurd linear interpolation across categories implies it should be near 60.

scottyah · 2026-04-23T22:14:53 1776982493

It's just an ad thinly disguised as useful data.

wmf · 2026-04-23T22:31:15 1776983475

I think the x axis is meant to be time but they screwed it up.

nsingh2 · 2026-04-16T15:56:46 1776355006

It's a combination of factors. Anthropic implemented rate-limiting where the 5hr usage limit would be burned through faster at peak hours. I was personally bitten by this multiple times before one guy from Anthropic announced it publicly via Twitter. Terrible communication. It wasn't small either: ~15 minutes of work ended up burning the entire 5hr limit. That annoyed me enough to switch to Codex for the month at that point.

Now people are saying the model response quality went down. I can't vouch for that since I wasn't using Claude Code, but I don't think this many people saying the same thing is total noise.