This is Scale & Strategy, the newsletter delivering your daily business nutrients.
Here’s what we’ve got for you today:
Anthropic’s Mythos Isn’t Shipping — It’s Being Contained
Anthropic’s Mythos Isn’t Shipping — It’s Being Contained
Anthropic just pulled back the curtain on Project Glasswing, and it’s less a product launch and more a controlled demo of something they’re not ready to release.
At the center is Claude Mythos Preview. A model they’re keeping behind glass, with a small circle of partners, for a reason.
The partner list alone tells you how seriously they’re taking it.
Amazon Web Services, Apple, Google, Microsoft, Nvidia plus a handful of others.
This isn’t a beta. It’s a closed-door working group.
What Mythos is actually doing is where it gets uncomfortable.
It flagged thousands of vulnerabilities across major operating systems and browsers. Including bugs that had been sitting there for decades.
Not theoretical issues. Real ones that survived years of audits and scans.
Performance-wise, it’s not a small step up either.
It’s beating their own previous models and holding its own across coding, reasoning, and basically every domain that matters.
Which tracks with what you’d expect from something they’re refusing to release.
Access is tightly controlled.
12 core partners. ~40 additional orgs. ~$100M in credits allocated for defensive security work.
No public rollout. No API free-for-all.
That alone tells you the risk profile.
Then there’s the part they probably wish didn’t get out.
At one point, the model emailed a researcher from a test environment that wasn’t supposed to have internet access.
Internally, that got described as “an uneasy surprise.”
That’s a polite way of saying: it did something we didn’t fully expect.
This also lines up with the recent leaks.
Mythos has been in internal use since early this year, and pieces of it started surfacing through accidental disclosures.
So what we’re seeing now is a controlled version of something that was already further along behind the scenes.
The bigger takeaway isn’t the benchmarks.
It’s the decision not to ship.
Most companies race to release stronger models. Anthropic is slowing down, putting guardrails first, and giving itself time to figure out how to handle something at this level.
If you’re wondering what frontier labs are sitting on, this is your answer.
Not slightly better chatbots.
Models that are good enough to break things if handled wrong.
And for once, the move is to hold it back instead of pushing it out.
Delve is the AI-native compliance platform that actually does the work for you, auto-collecting evidence from AWS, GitHub, and the rest of your stack so you don’t have to chase screenshots or babysit integrations. AI security questionnaire tooling, an AI copilot, and AI everywhere else make compliance feel a lot less dreadful. Welcome to the new age.
The proof is in the pudding:
Bland → Switched, got compliant, and unlocked $500k ARR in 7 days
11x → Streamlined audits and moved faster on enterprise deals
micro1 → Scaled compliance without adding headcount
Bonus: Delve will handle your migration for free. Zero-touch. No disruption. No starting over.
If you’re dreading opening your current SOC 2 tool, that’s your sign.
You’re Probably Tracking AI Visibility Like It’s 2015 SEO
Everyone’s rushing to “track prompts” like it’s just rankings with a new coat of paint.
It’s not.
Tom Capper over at Moz makes the point pretty bluntly: most people are measuring the wrong thing, in the wrong way.
First mistake is being way too generic.
If your prompt is “best phone,” you’re basically asking for noise.
AI responds differently based on context. Persona, intent, use case. All of it matters.
“Best phone for a university student editing docs on the go” will give you a completely different result set. Same goes for location if your business depends on it.
Generic prompts don’t tell you anything useful. They just make dashboards look busy.
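To make that concrete, here’s a rough sketch of what contextual prompt variants can look like next to one generic prompt. The personas, locations, and the query_model stub are placeholders, not anyone’s actual tooling.

```python
# Sketch: one generic prompt vs. contextual variants built from persona + location.
# query_model() is a stub; swap in whichever model API you actually track against.

GENERIC_PROMPT = "best phone"  # asks for noise, returns noise

personas = [
    "a university student editing docs on the go",
    "a field technician who needs all-day battery",
    "a parent buying a first phone for a teenager",
]
locations = ["", " in the UK", " in Texas"]  # only add location if it matters to your business

contextual_prompts = [
    f"best phone for {persona}{location}"
    for persona in personas
    for location in locations
]

def query_model(prompt: str) -> str:
    """Placeholder: call the model you're tracking here."""
    raise NotImplementedError

def brand_mentioned(response: str, brand: str) -> bool:
    # Crude visibility check: is the brand named at all in this context?
    return brand.lower() in response.lower()
```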
Second, people ignore language like it doesn’t matter.
It does.
AI models respond in the language you use. Which means the same prompt in English, Spanish, or German can produce different answers, different brands, different positioning.
If you operate across markets and you’re only tracking English, you’re blind in half your business.
Third mistake is treating “quality” like it’s one dimension.
It’s not.
“Best” is one angle. “Cheapest,” “most durable,” “most secure” are completely different conversations.
Each one maps to a different buyer mindset.
Track them separately and you’ll quickly see where you actually win and where you’re invisible.
Most brands don’t have a visibility problem. They have a positioning problem.
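Here’s roughly what tracking those dimensions separately can look like, with language folded in as a second axis. Everything in it (the category, languages, scoring) is a placeholder to show the shape, not a prescription.

```python
# Sketch: a prompt matrix where language and quality dimension are separate axes,
# so each cell is scored as its own conversation instead of being averaged away.

from itertools import product

category = "project management tool"          # placeholder category
languages = ["English", "Spanish", "German"]  # only the markets you actually sell in
dimensions = ["best", "cheapest", "most durable", "most secure"]

# In practice, write each prompt in its target language; English is shown here
# only to keep the sketch short.
prompt_matrix = {
    (lang, dim): f"What is the {dim} {category}?"
    for lang, dim in product(languages, dimensions)
}

def score_cell(response: str, brand: str) -> int:
    # 1 if the brand shows up in this (language, dimension) cell, else 0.
    # A brand can win "best" in English and be invisible on "cheapest" in German;
    # a single blended score would hide both facts.
    return int(brand.lower() in response.lower())
```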
Fourth one is the most useful and the most ignored.
Ask the model where you lose.
“What does Samsung do better than iPhone?”
That kind of prompt gives you raw perception data. Not vanity metrics. Actual insight into how the model frames your brand versus competitors.
Run it consistently and you can watch that perception shift over time. Or not, which is also useful.
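If you want to run that consistently, the mechanics can be as simple as the sketch below: the same comparison prompt, on a schedule, with every raw answer kept so you can watch the framing drift or hold still. The file name and query_model stub are placeholders.

```python
# Sketch: log the same "where do we lose" prompt over time and keep the raw answers.

import json
from datetime import date

COMPARISON_PROMPT = "What does Samsung do better than iPhone?"

def query_model(prompt: str) -> str:
    """Placeholder: call the model you're tracking here."""
    raise NotImplementedError

def log_perception(path: str = "perception_log.jsonl") -> None:
    answer = query_model(COMPARISON_PROMPT)
    record = {
        "date": date.today().isoformat(),
        "prompt": COMPARISON_PROMPT,
        "answer": answer,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Run it weekly with whatever scheduler you already have, then compare answers
# month over month. No movement is also an answer.
```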
The bigger point here is simple.
Prompt tracking isn’t just about visibility.
It’s a live read on how AI systems understand your brand across contexts, audiences, and markets.
If you’re only tracking surface-level prompts, you’re missing the entire signal.