What's dangerous is Opus 4.8's proclivity to create backdoors and no-op critical security code. Claude Web counted 27 instances of this I had cataloged over the last few months, and Fable 5 found more. Fable 5 may do this too, but I didn't get a long enough chance to test it since it kept downgrading to Opus 4.8 on every prompt saying, "This model has safety measures that flagged something in this session", even when asking Fable 5 to fix the security issues it found that Opus 4.8 created. You have a model that presumably can write secure code and identify security vulnerabilities, but as a security measure, they say we're going to force you to use a model that creates security holes. This is backwards. Considering the scale, Opus 4.8 is creating more issues than Mythos or Fable 5 is patching.
What you're describing only applies to security or biotech downgrades. A downgrade related to the model believing that you're doing something related to model development is invisible and silent and internal.
When I reported this, Anthropic sent me an email on Tuesday saying, "You have been approved into the Cyber Verification Program", but it's still downgrading. Is this a bug? What's the point of the Cyber Verification Program if Fable 5 downgrades when you tell it to write secure code?
They've publicly apologised for the invisible PEFT that deliberately makes the model dumb on some tasks. Whether they still do it, or will once again do it in future in more subtle ways, is something we can't verify.
Personally I think they have proven themselves to be the stewards of AI in the same way Exxon Mobil are the stewards of petroleum.
There is in /config "Switch models when a message is flagged" now which can be set to false, but I had no chance to see what happens then, does it just stop or what.
Fable 5 has safety measures that flag messages on most cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Send feedback with
/feedback or learn more
1. Switch to Opus 4.8
2. Edit prompt and retry with Fable 5
Yes, telling Fable 5 to write secure code triggers a downgrade to Opus 4.8. This is doubly bad because Opus 4.8 keeps no-oping critical security code. Is this a bug or by design? I have been approved for the Cyber Verification Program: Fable 5 keeps downgrading to Opus 4.8 even when approved for Cyber Verification Program #67107 https://github.com/anthropics/claude-code/issues/67107
That's approaching the problem from the worst possible angle. If your security depends on you catching 1 message in a sea of output and quickly rotating the credential everywhere before someone has a chance to abuse it then you were never secure to begin with.
Not just because it requires constant attention which will eventually lapse, but because the agent has an unlimited number of ways to exfiltrate the key, for example it can pretend to write and run a "test" which reads your key, sends it to the attacker and you'll have no idea it's happening.
I sent email to Anthropic (usersafety@anthropic.com, disclosure@anthropic.com) on January 8, 2025 alerting them to this issue: Claude Code Exploit: Claude Code Becomes an Unwitting Executor. If I hadn't seen Claude Code read my ssh file, I wouldn't have known the extent of the issue.
To improve the Claude model, it seems to me that any time Claude Code is working with data, the first step should be to use tools like genson (https://github.com/wolverdude/GenSON) to extract the data model and then create why files (metadata files) for data. Claude Code seems eager to use the /tmp space so even if the end user doesn't care, Claude Code could do this internally for best results. It would save tokens. If genson is reading the GBs of data, then claude doesn't have to. And further, reading the raw data is a path to prompt injection. Let genson read the data, and claude work on the metadata.
I agree with you but I think there's a "defense in depth" angle to this. Yes, your security shouldn't depend on noticing which files Claude has read, since you'll mess up. But hiding the information means your guaranteed to never notice! It's good for the user to have signals that something might be going wrong.
There's no defense "in depth" here, it's like putting your SSH key in your public webroot and watching the logs to see if anyone's taken your key. That's your only layer of "defense" and you don't stand any chance of enforcing it. Real defense is rooted in technical measures, imperfect as they may be, but this is just defense through wishful thinking.
Obviously, don't put your SSH keys in a public webroot. But let's say you're managing a web server and have a decent security mindset. But don't you think it's better to regularly check the logs for evidence of an attack vs delete all the logs so they can't be checked?
reply