Follow me on this: if the Mythos benchmark numbers and model card info are correct (and I have little reason to doubt them), Anthropic can easily synthesize a better Opus, given that they don't want to release Mythos. So I bet we'll soon see Opus 4 at the level of GPT 5.4 or greater.
Latest Posts by antirez
One of the most powerful automatic coding (autocoding) tricks that almost nobody uses: "Create this new project as specified in SPEC.md, using as a guide for coding style, design sensibility, comments, ..., the code at /foo/bar/". Style transfer is very powerful.
I'm still trying to post stuff here. But people seem to have moved back to X, regardless of the terrible ownership. At the very least, Bluesky should try to offer some real advantage on the product side, but the strict character limit alone is already discouraging.
Super odd that the Gemma 4 release specifically highlights the Elo score: the least meaningful benchmark ever. We should ask AI labs to stop doing this.
I bet the small Gemma 4 models are going to kill all the specialized ASR models for audio to text transcription tasks.
Me› Sorry if I ended my last question with a double "??", I'm not the kind of person who puts more than a single "?" at the end of a question.
GPT› Understood. I did not read it as emphasis from you in any meaningful way.
On the 1st of April, 12 years ago, I released this post introducing HyperLogLog in Redis. Most people thought it was a joke, because of the name of the data structure and because it improved on the state of the art from the Google paper. antirez.com/news/75
Indeed, that's why I say "at the level we can understand it". However, what I mean is not just knowing mechanically how a Transformer works: you can imagine the kind of Bayesian inference happening block after block, that is, you can have a mental model of what could happen.
It does not matter that I know how an LLM works (at least at the level we understand NNs in general and Transformers specifically): in many ways, talking with AI is beneficial in order to get a different POV and advice, even on the non-technical side of my life.
What's the logical reason for Wikipedia not automatically integrating information from other-language versions of the same page, when the local version lacks the same quality/quantity of information? This could be done purely "online", marking the imported text in a different color, which would remove most of the issues.
Sometimes when a model is released it has stellar performance, and then a few weeks later it is somewhat less shiny. Often it is just human bias. However, when the AI provider swears it is the same model checkpoint, are you sure it didn't turn on some aggressive KV cache quantization?
This isn't the LLM itself using a partial compaction tool (which is what I'm referring to). Imagine this:
1) ... good work done ...
2) ... a lot of tokens on a dead attempt ...
You want the LLM to automatically jump back to after "1", summarizing why "2" was bad, and continue.
I mean: the ability of the LLM to jump back to save context when a bad path was taken, returning to an earlier context point with a small summary of the discarded wrong path.
I mean, this should be made *by the model itself* generating a "Jump back" tool call.
One thing agent harnesses should be able to do is to jump back in history, trimming what follows and injecting some self-steering text. I wonder if they can already do it. It looks very useful.
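A minimal sketch of what such a harness-side mechanism could look like, assuming a hypothetical "jump_back" tool call emitted by the model itself (all names here are invented for illustration, not an existing API):

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str

def jump_back(history, checkpoint, summary):
    """Trim everything after `checkpoint`, replacing the discarded
    dead-end attempt with a short self-steering summary of why it failed."""
    kept = history[:checkpoint + 1]
    kept.append(Message("assistant", f"[jumped back] {summary}"))
    return kept

# The model did good work (index 1), then burned tokens on a dead
# attempt (indices 2..4); it emits a jump_back tool call targeting 1.
history = [
    Message("user", "Implement the parser."),
    Message("assistant", "Wrote the tokenizer; tests pass."),    # good work
    Message("assistant", "Trying a recursive approach..."),      # dead end
    Message("assistant", "Still failing on left recursion."),
    Message("assistant", "This approach can't work."),
]
history = jump_back(history, 1,
                    "Recursive descent hit left recursion; try an explicit stack.")
```

The key design choice is that the jump target and the summary both come from the model (as tool-call arguments), so the harness only mechanically trims the list and re-sends the shortened context.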
Please write to your MEPs to oppose untargeted mass scanning of private communications (Chat Control). On March 11 Parliament voted 458-103 to limit scanning. Today they vote again (!!!) and those protections could be watered down. Act now: fightchatcontrol.eu
I'm thinking that, because of automatic programming, certain complicated and elegant programs we wrote in the past may basically become a form of art for future generations. Like the manuscript books of the Middle Ages are for us.
For this reason you can't replace code with a specification that is compiled into code each time. The specification would end up as a pile of details that are better captured on the code side. This accumulation of fine details (in the code) is what creates the quality of real-world systems.
Biggest mistake of the AI coding era: believing that specifications should be either natural language OR something else. The best combo is a natural language high-level specification (the intent), plus code (as it gets written) documenting the finer behaviors.
Day 2 of Monster Scale Summit! Today, we have @skamille.themanagerswrath.com, @antirez.bsky.social, @tigerbeetle.com (Joran), @dominiktornow.bsky.social, @teivah.dev, Avi Kivity, and lots more. Guaranteed 🔥🔥🔥 www.scylladb.com/monster-scal...
Who thinks "clean room" is needed to reimplement and put it into a new license does NOT understand copyright. Clean room is a trick to make litigation simpler, it is not mandated by law: rewrites are allowed. The new code just must not copy protected expressions. Linus was Unix-aware.
Every morning there are two pieces of news: the release of some major upgrade to some AI, and the continued lack of a DeepSeek v4 release.
Users report that when Sonnet 4.6 is asked, via the Anthropic API, "What's your name", it replies "DeepSeek" with high frequency. Labs consistently cross-fine-tune on chains of thought and use other models as RL signal. Also the pretraining, a *key* step, is mostly on public data. Anthropic disappoints.
The post: code & methodology, and the anti-contamination steps taken.
A clean room emulator of the Z80, the Spectrum, and CP/M written by Claude Code. Very skeptical about the compiler experiment (more complex, for sure) made by Anthropic, because of the fundamental flaw of not providing the agent with the specifications / papers.
You see, I'm ok with AI usage in many ways. Yet I can't understand why people who don't have deficiencies in written expression use it for writing. Emails. Blog posts. Comments. Why? We only lose something this way. We want your voice.
Today I had to fight with GPT 5.3 to defend my position on the complexity of a specific command of the new Redis type I'm adding (released soon, I hope). It had a great point about the worst case, but the typical case was as I claimed. We reached an agreement that mentioned both... :D