VECTOR LAB

WEEKLY UPDATE 2026
BY ANDREW MEAD

Kimi K2.5

Can Kimi topple the closed source giants? Do skills files actually work?


tl;dr

  • Can Kimi K2.5 beat GPT 5.2 and Opus 4.5?
  • Do skills files actually help your agents?
  • Google’s world model Genie 3 gets released to the public

Releases

Kimi K2.5

Moonshot AI has released an updated version of its 1 trillion parameter open source model, Kimi K2.5. This version departs from its predecessor (and from most Chinese models in general) by being multimodal, meaning it supports both text and image inputs.

Benchmarks show it competes with the frontier closed-source models

Kimi has been known for its interesting personality and writing style, something that was unique compared to all other LLMs. That personality has been degraded a bit (it now sometimes says “You’re absolutely right!”), but the trade-off is better expressiveness in agentic tasks, which we can see as it sits at the top of the Design Arena leaderboard.

Design Arena leaderboard

For coding tasks it still lags behind Opus 4.5 and GPT 5.2, the two top-tier models right now. From what I have seen, this is actually the case for most tasks: on benchmarks it is in the top tier, but in the real world it sits in the tier below, alongside models like GLM 4.7, Gemini 3 Flash, and Sonnet 4.5.

| Model | $ per million (input) | $ per million (output) | Tokens per second |
| --- | --- | --- | --- |
| Kimi K2.5 Thinking | $0.60 | $3 | 30 |
| Gemini 3 Flash | $0.50 | $3 | 75 |
| GLM 4.7 | $0.60 | $2.20 | 90 |
| Claude Sonnet 4.5 | $3 | $15 | 57 |
| GPT 5.2 | $1.75 | $14 | 34 |
| Claude Opus 4.5 | $5 | $25 | 64 |
Numbers from OpenRouter.

GPT 5.2 and Opus 4.5, although both top models, earn that spot for different reasons. GPT 5.2 is cold and very literal, but it is thorough and extremely smart. Opus, on the other hand, understands user intent very well and is great to talk to, but it makes more mistakes.

I feel like the comparison is very similar for Kimi K2.5 and Gemini 3 Flash: Kimi is the cheaper version of Opus, and Gemini is the cheaper version of GPT 5.2.

For cheap coding, I think I will still turn to GLM 4.7, but for all other tasks Kimi beats it out, which means it’s a top 5 model in the world right now. I highly recommend checking it out if you haven’t already.

The Artificial Analysis benchmarks also corroborate its similarity to Gemini 3 Flash

Research

Skills are not enough

If you have been using any agentic coding tool (Claude Code, Cursor, etc.), you have probably heard of skills. Skills are markdown files that contain instructions for LLMs on how to do specific tasks or use certain libraries that the model may not have been trained on.
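To make that concrete, here is a minimal sketch of what a skill file can look like; the path, skill name, and instructions below are hypothetical, not taken from any particular framework's spec.

```markdown
<!-- Hypothetical skill file, e.g. .agent/skills/pdf-generation/SKILL.md -->
# PDF Generation

Use this skill whenever the user asks to create, merge, or edit PDF files.

## Instructions
1. Generate new PDFs with the `reportlab` library.
2. Read and merge existing PDFs with `pypdf`.
3. Write all output files to the `./output` directory.
```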

What Vercel found out is that just because you have these skills doesn't mean the models will use them. By default, most frameworks just tell the LLM that the skills exist, but it's up to the model to actually read them when needed.

Score breakdown

What they found was that models will not call skills on their own unless specifically told to, and even when you tell them directly in your AGENTS.md, they still won't use them when needed.

They found that to get agents to call skills properly when needed, they had to add this to their AGENTS.md file:

IMPORTANT: Prefer retrieval-led reasoning over pre-training-led reasoning for any {your skill content} tasks. {List of paths to skills files here}

This bypasses the skill-loading and skill-calling tools that frameworks provide and instead gives the model the direct file paths to look at, which it understands much better.
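As a concrete example of that pattern, with the placeholders filled in, the AGENTS.md entry ends up looking roughly like this; the skill topic and file paths are made up for illustration:

```markdown
<!-- AGENTS.md (hypothetical skill topic and paths) -->
IMPORTANT: Prefer retrieval-led reasoning over pre-training-led reasoning
for any PDF-generation tasks.

Skills files:
- .agent/skills/pdf-generation/SKILL.md
- .agent/skills/pdf-generation/examples.md
```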

This is most likely because models are much more used to reading and looking at files, since that is a general coding task they have to do all the time, than they are to using the custom skill-calling tools in their harnesses. It goes to show the importance of leaning on things the model has already seen a lot of, rather than inventing a new abstraction of your own and asking it to use that.

Quick Hits

Genie 3 Public Release

Google’s world model Genie 3 has been released publicly.

An AI world model is basically a video game engine that generates each frame on the fly based on your inputs, except there is no actual engine, code, or any other stored state; it is purely an AI model. You can give it a starting frame, or just a text description of the world you want, and from there you can interact with the world while it generates everything for you in real time as you go.

A discarded pack of cigarettes in Penn Station — From Riley Goodside on Twitter

Note: to access the model you will need Google’s AI Ultra subscription, which is $125 a month for the first 3 months and then $250 a month after that.

Finish

I hope you enjoyed the news this week. If you want to get the news every week, be sure to join our mailing list below.

ASCII art by Design by Aron on Twitter

Stay Updated

Subscribe to get the latest AI news in your inbox every week!

