I often get asked which models are best for a given task, and which models and tools I am using myself. This is meant to be a living document that tracks the models at the top of my mind right now.

Coding

Coding is by far my number 1 use case for AI, and I try all of the new models that show promise in the agentic coding space.

Agentic coding is when models perform autonomous software engineering: navigating your codebase to understand where to make changes, editing the files, and then optionally running your code or tests to verify the changes. This differs from regular coding ability, since non-agentic coding is (usually) just how well a model can write a script in one shot.

For instance, Claude Code, Cursor, and Codex CLI are all agentic coding tools, while using the ChatGPT website is a more conventional coding use case.

My coding style with these models is not the typical vibe coding that most other people use. I am a software engineer from the pre-LLM takeover world, so I tend to be a bit more opinionated about how I want my codebase to be structured. I also like to have nontrivial control over the design of the product that I am building.

Because of this, my coding setup has a lot of documentation about the specific project that I am working on in CLAUDE.md, .cursorrules, or AGENTS.md files. I have templates for webapps to guide LLMs on the frameworks that I want to use and the rules for how and where to add code. My prompts also tend to be very involved, especially for newer projects or larger features. I have had many prompts that are thousands of words long, describing in detail the pages and interactions that I want my app to have.

Another major thing I do before the agent starts coding is go through a design phase, where it can ask questions and then outline how it will build the features that I requested. That way I can see where the LLM is misunderstanding what I am asking of it, and potentially tweak features based on its feedback.

In my mind, there are 3 major players in the agentic coding framework space to consider:

Cursor
Claude Code
Codex

The rest of the companies are derivative of these 3 and may have a slightly different user experience, but will be around the same performance as the three I listed above.

I think AI “wrappers” like Cursor will end up losing in the long term, since Anthropic and OpenAI will be able to train their models specifically for their own frameworks (Claude Code and Codex), making the framework fit the model like a glove.

With all that being said, what are my recommendations for what to use?

The best model and framework out there by far is GPT-5 Codex in OpenAI’s Codex platform. It has an unprecedented level of attention to detail and is very thorough in the work that it does. It has the best bug-fixing ability and very rarely introduces new bugs into the codebase. The one downside is that you can run into rate limits if you use it heavily on the $20 plan (this is the same plan that you would have for the ChatGPT website, so you probably already have one!). It is not as bad as Anthropic’s rate limits on their $20 plan, and if you step up to the $200 OpenAI plan, the rate limits are basically non-existent.

The GPT-5-codex model is a finetune of GPT-5 meant specifically for agentic coding in Codex.

Another benefit of Codex is that it is available everywhere. There is a CLI similar to Claude Code and a VSCode extension, and you can also use it from the ChatGPT website and mobile app, allowing you to code on the go.

With these rate limits though, you will often need another LLM to fall back on, which is where the second model I use, GLM 4.6, comes in.

GLM 4.6 is from the Z.ai lab in China, and is the best open source coding model right now. Z.ai offers a $3/month plan that gives you 4x the usage limits of the Anthropic $20/month plan, while being almost 7x cheaper. It also integrates directly into Claude Code instead of needing another separate framework, so you get all the niceties and integration that Anthropic has added to their product.

I use GPT-5 for building out large features and initial mockups using the large prompts that I write, and then I will open 4 instances of Claude Code using GLM 4.6 and have each model address a separate bug fix, tweak, or small feature for the app. If GLM is unable to fulfill my request (usually weird bugs) then I escalate the issue to Codex to fix. I have yet to run into an issue GPT-5 was unable to solve.

Many of you are probably thinking “What about Claude? Where does it stack up against GPT-5 and GLM 4.6?”

My opinion (and also the popular consensus from what I’ve seen) gives the following rankings:

GPT-5 Codex
Claude Sonnet 4.5 and Opus 4.1 (use Sonnet please, Opus will chew through credits/money much faster and isn’t any better)
GLM 4.6
Claude Sonnet 4

GPT-5 is a noticeable, nontrivial bump in quality over Sonnet 4.5, and GLM 4.6 is good enough, especially considering the much better rate limits. Because of this, Sonnet sits in no man’s land: it is noticeably worse than GPT-5 and not a better value than GLM 4.6.

Monthly cost breakdown:

$20 for ChatGPT Plus
$3 for GLM Coding Plan

Day to day use

General

For most of my other, non-coding queries, I need the AI to do some form of research or comparison, and for that I turn to GPT-5 on the ChatGPT website. I am not the biggest OpenAI fan, but GPT-5 for both coding and general use is the best model out there by far. The experience on ChatGPT is seamless, and is what I would build if I were to do it myself.

For pretty much any use case, GPT-5 will be the best model. What makes it so great, outside of its raw intelligence, is its very low hallucination rate. Hallucinations are a killer for most AI systems, especially in live production environments, and with GPT-5 they are finally at a low enough rate where I feel comfortable using it for customer facing applications. There have been numerous queries that I have given where I would expect any other LLM to hallucinate an answer, but GPT-5 does not. It is the only model that I have used where it told me it didn’t know the answer to my question, without my having to prompt it to do so.

Multimodal

The one place that GPT-5 falls short is image and video understanding. For image understanding you can look at Gemini 2.5 Pro and Qwen3 VL (the big one, not the 30B one).

For any other modalities (video and audio), Gemini 2.5 Pro is the go-to; nothing else is close (for an LLM).

Local

At home, I run Qwen3 VL 30B. This model is a mixture of experts model with 3B active parameters, making it very fast (>150 tokens per second on my 3090). It is the best small model right now that can do both image and text inputs.

Usually, especially with these smaller models, the regular text performance is degraded nontrivially when finetuning the model for image understanding, but the Qwen team has managed to maintain most of the model’s text abilities. For its size, it is the best image and text model out there right now.

The one downside with Qwen3 VL 30B, however, is that it uses all of the available VRAM on my system when I have it loaded, so I am unable to run any other models at the same time, such as automatic speech recognition or text-to-speech models.

For this case, I run Qwen3 4B, giving a good balance of size and quality.
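As a rough illustration of that trade-off, the sketch below estimates how much memory the weights alone need for a 30B and a 4B model. The quantization levels are my assumption (the text above does not say which one is used), and the estimate ignores the KV cache, the vision encoder, and runtime overhead, so real usage will be higher. Note that a mixture of experts model still has to keep all 30B parameters in memory even though only 3B are active per token, which is why it is fast but not small.

```python
def weight_memory_gb(total_params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters * bytes per parameter = gigabytes
    return total_params_billion * bytes_per_param


GPU_VRAM_GB = 24.0  # an RTX 3090 has 24 GB of VRAM

# Total (not active) parameter counts, in billions.
models = {"Qwen3 VL 30B": 30.0, "Qwen3 4B": 4.0}

for name, params_b in models.items():
    for bits in (16, 8, 4):  # assumed quantization levels, for illustration only
        needed = weight_memory_gb(params_b, bits)
        left = GPU_VRAM_GB - needed
        status = f"{left:.1f} GB left" if left > 0 else "does not fit"
        print(f"{name} at {bits}-bit: ~{needed:.1f} GB of weights ({status})")
```

Under these assumptions, the 30B model only fits on a 24 GB card when quantized to around 4 bits, and even then it leaves little room for anything else, while the 4B model leaves most of the card free for speech models.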

For the Qwen models, be sure you are using the 2507 versions; they are the updated releases and have much better chat post-training. Only the 4B, 30B, and 235B variants have been updated.

Image generation

Overall

The best image generation model out there right now by a decent margin is Seedream 4. It can generate images up to 4k, and only costs 3 cents per image on sites like Fal or Replicate.

To see the benchmarks for the best image generation models, check out Artificial Analysis.

Local

For myself, I prefer to generate images locally using ComfyUI, using loras to control the style. Also at the scale I generate images at (over 12k so far) it saves me a bunch of money to do so at home instead of paying to use a model.
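To put a number on the savings, here is a back-of-the-envelope calculation. It assumes a hosted model priced at the 3 cents per image quoted above and ignores electricity and hardware costs for the local setup, so treat it as an upper bound on what is saved, not an exact figure.

```python
def hosted_cost_usd(num_images: int, cents_per_image: int) -> float:
    """Cost of generating num_images through a pay-per-image API, in dollars."""
    # Work in whole cents first to avoid floating point rounding surprises.
    return num_images * cents_per_image / 100


images_generated = 12_000  # "over 12k so far"
price_cents = 3            # assumed hosted price per image

print(f"${hosted_cost_usd(images_generated, price_cents):.2f}")  # $360.00
```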

I have switched back and forth between Qwen Image and Flux for my local setup, and have settled (for now) on using Flux for the majority of my workflows. I think Qwen is the better base image generation model, but it just doesn’t have the community support that Flux has (very few loras), and it is also about 25% slower to run.

Also with large finetunes that make a better Flux base model like FluxMania and PixelWave, I am able to close the gap in terms of performance compared to Qwen.

Current base model:
FluxMania Legacy

Some of my favorite loras:
Synthwave
Retro Anime
Daubrez Style
Luminous Shadowscape
Illustration Concept

Video generation

Closed Source

For video generation, Sora 2 from OpenAI is the best out there right now. It has some of the best physics understanding, native audio generation, and competitive pricing at $0.10 per second. The one downside is that it is rather restrictive to use via the API; for instance, you cannot use an image of a person as the first frame of the video.

Other notable models to check out if the Sora restrictions are too harsh for you: Veo3 and Kling 2.5 Turbo.

Once again, similar to the image models, Artificial Analysis has a leaderboard to check out the models and their rankings. Of note, Sora 2 is not a part of the models evaluated on the Artificial Analysis site.

Open Source

For open source models, the Wan 2.2 series of models from Alibaba is the king. There are many different variants, including speech to video, camera control models, and also animation.

Most of them, while being the best we have, are a bit lacking compared to the closed source models. The only one that isn’t is the Wan 2.2 Animate model.

This model allows you to take an input video and a reference image, and add the reference image to the video. It is very good at doing this, giving extremely realistic and detailed outputs.

Automatic Speech Recognition

There are not really any closed source models per se in the ASR space, mostly just companies selling systems that have ASR models. Two of these companies you can check out are Deepgram and AssemblyAI. They can do ASR and speaker identification (diarization), among many other audio-related tasks.

Diarization is the main selling point of these systems when compared to the open source landscape, since we have many strong ASR models, but our diarization models are very poor in comparison.

For local models you have Whisper and Nvidia’s Parakeet and Canary. I personally think the Whisper models are a bit silly, using a poor architecture and a brute force training scheme. They are, however, very well supported in the open source community, making them easy to use.

Parakeet and Canary are both much faster and more accurate models from Nvidia, but they do lack the large number of languages that Whisper supports.

I personally run the Parakeet 0.6b model, which is 15x faster than the fastest Whisper variant, while also having 15% lower WER.
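For readers unfamiliar with the metric, WER (word error rate) is the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. Below is a minimal sketch of that calculation. The function name and example sentences are my own, and real evaluations normally normalize text (casing, punctuation) first, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the transcripts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[len(hyp)] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Lower is better, and a perfect transcript scores 0.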

Text to Speech

ElevenLabs is the king for closed source models.

For open source, there are a ton of options, with more being added every week. The ones that have stood out to me tend to be the small and fast ones, since they are the most practical for real world deployments where latency matters.

Because of this, I use Kokoro for voice generation on the edge and NeuTTS for voice cloning and cases where high quality audio is needed.
