Anthropic on Tuesday launched Claude Fable 5, its most capable public model. But within two days, users began reporting that its safety system was blocking benign or legitimate prompts.
Fable 5 is the first public model derived from Anthropic’s Mythos family, whose original iteration showed unusual skill during training at finding software bugs and exploiting them to disrupt or take control of systems. That raised enough concern inside Anthropic that the company grouped cybersecurity with other high-risk domains, including biology and chemistry, when setting limits on Mythos-derived public models.
For Fable 5, that means prompts flagged as sensitive in those areas are routed to Claude Opus 4.8, a less capable model with its own guardrails. Anthropic says the fallback affects about 0.05% of queries and notifies users when it happens.
But reports of false positives quickly mounted. That’s because Anthropic erred on the side of caution when it designed the classifiers used to detect and downgrade potentially dangerous uses of its model. It also faced the challenge of balancing accuracy with transparency.
Try telling that to developers. Across social media, people have complained about Claude Fable 5 rejecting queries about everything from RNA sequencing data for sheep to résumé editing to shopping lists.
“The word ‘cancer’ is flagged as a biosecurity risk by Claude Fable 5!” said scientist Derya Unutmaz on X. “Our Anthropic overlords deciding which prompts the peasants are allowed to use,” added founder and developer Bojan Tunguz on X.
Anthropic now says it’s working on the problem. “A hidden safeguard is harder to probe and work around,” Anthropic says in a statement emailed to Fast Company. “This means the safeguards can be targeted much more narrowly. A visible safeguard needs to cast a wider net to be more robust, resulting in more requests being incorrectly flagged.”
“We made the wrong tradeoff and we apologize for not getting the balance right,” the company adds.
Anthropic says it is refining the classifiers so that fewer queries trigger false positives. For Claude subscribers, query downgrades (to Opus 4.8) will be more obvious. Developers accessing Fable 5 via the Claude API will see a reason for the model’s refusal of a prompt, the company says.
Meanwhile, at least one AI researcher appears to have coerced Fable 5 into responding to a banned prompt. Pliny the Liberator claimed on X to have bypassed Fable 5’s filters roughly 24 to 48 hours after launch. Pliny described using a multi-agent approach involving a previously jailbroken Claude Opus 4.8, along with techniques including query decomposition, long-context framing, fiction and narrative structures, and academic taxonomies.
Before launch, Anthropic said more than 1,000 hours of internal and external red-teaming, including bug bounty efforts, had identified no universal jailbreaks. The company has acknowledged that stopping all sophisticated, multi-turn, or agentic attacks is likely not possible and says it continues to refine its classifiers.

