Clarifying “Medium”
The most confusing part of this topic is that medium has more than one meaning.
As of 2026-04-08, OpenAI's public documentation says GPT-5.4's `reasoning.effort` supports `none`, `low`, `medium`, `high`, and `xhigh`, with a default of `none`. Within the same documentation, `verbosity` also has `low`, `medium`, and `high`, and GPT-5.4's default `verbosity` is `medium`.
So, if you see something online that says “the default is medium,” don’t immediately assume it refers to the “thinking level.” Often, they are talking about completely different things.
If you are using it directly in Codex and see that the default is medium, I am more inclined to read that as a preset supplied by the product layer rather than the underlying default from the model documentation. If we don't keep this distinction, every subsequent discussion will contradict itself.
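To make the two `medium`s concrete, here is a minimal Python sketch of how the two knobs would sit in separate fields of a request payload. The field layout (`reasoning.effort` for thinking depth, a `text.verbosity` setting for output length) and the defaults follow the documentation as summarized above; the exact nesting is my assumption, and the code only builds a plain dict rather than calling any API.

```python
# Sketch only: field names and defaults follow the docs as summarized above;
# the "text" nesting for verbosity is an assumption. No request is sent.
API_DEFAULTS = {"reasoning_effort": "none", "verbosity": "medium"}

def build_request(prompt, reasoning_effort=None, verbosity=None):
    """Assemble a request payload, filling in the documented defaults.

    The two "medium"s live in different fields: reasoning.effort controls
    thinking depth, verbosity controls output length.
    """
    return {
        "model": "gpt-5.4",
        "input": prompt,
        "reasoning": {"effort": reasoning_effort or API_DEFAULTS["reasoning_effort"]},
        "text": {"verbosity": verbosity or API_DEFAULTS["verbosity"]},
    }

req = build_request("Fix the off-by-one bug in pagination.py")
assert req["reasoning"]["effort"] == "none"   # thinking-level default
assert req["text"]["verbosity"] == "medium"   # output-length default
```

The point of the sketch: a sentence like "the default is medium" only tells you which field it refers to once you know both fields exist.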
The Official Documentation Doesn’t Fully Explain the Gap
Let’s look at the official documentation. The public documents currently confirm a few things:
- `gpt-5.4` is the officially recommended default model for general coding tasks.
- In the code generation guide, the examples provided by the official team for `gpt-5.4` directly use `reasoning: high`.
- Codex-oriented models like `gpt-5.3-codex` explicitly support `low`, `medium`, `high`, and `xhigh` on their public pages.
- `gpt-5.4-pro` is another line entirely; it is not simply regular `gpt-5.4` with the dial turned up, but an independent model designed to "think longer with more compute."
However, the official documentation hasn’t provided a particularly useful chart, such as:
- Exactly how much lower the success rate is with `medium` compared to `high`.
- How much extra time or how many extra tokens `xhigh` spends compared to `high`.
- In coding scenarios, which kinds of tasks are worth jumping straight to `xhigh`.
In other words, the official team gave you the knobs, but they didn’t draw out the experience curve for you.
What’s Actually Useful is How the Tiers are Separated on the Leaderboard
A flash of inspiration struck me, so I went to check the code leaderboard on Arena, and now it’s much clearer.
The code leaderboard on arena.ai separates the tiers. The page was last updated on 2026-04-01, and as of this writing:
- `gpt-5.4-high (codex-harness)` ranks 6th with a score of 1457
- `gpt-5.4-medium (codex-harness)` ranks 16th with a score of 1427
- `gpt-5.3-codex (codex-harness)` ranks 18th with a score of 1407

Read together, these numbers are very direct. For the same GPT-5.4, the difference between `high` and `medium` isn't a "slight experience gap"; it is a noticeable tier separation. The statement "GPT-5.4 is very strong" is insufficient on its own, because the leaderboard itself lists `high` and `medium` as two distinct entries. When people say it is "very strong," they are most likely describing the high tier's performance, not speaking for `medium`.

Of course, the leaderboard isn't the absolute truth for your project. It measures agentic coding + harness scenarios, not your particular local repository. But the direction is clear: for coding, the reasoning tier genuinely changes the results, not just the speed.
How Should I Choose Now?
Simply put, my current usage is very straightforward.
Use `medium` for these scenarios:
- Changing a few small files
- Fixing obvious bugs
- Getting the model to spit out a draft first
- When speed is key and I don’t want to wait too long
Use `high` as the daily default:

- Modifying multiple interconnected files
- When there are slightly vague parts in the requirements
- When I need to read the code before making changes
- When judgment is required, not just code completion
I reserve `xhigh` for the tough nuts:

- High-risk refactoring
- Troubleshooting long chains of issues
- Architectural changes
- When `high` fails to solve the problem after two rounds

The most crucial point here is not how amazing `xhigh` is, but rather not treating `medium` as a cure-all. The real problem with `medium` isn't that it's weak; it's that on complex tasks it too easily gives you the illusion of "good enough." You save a little time in the first round, then spend far more on rework later.
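The rules of thumb above can be collapsed into a tiny decision helper. Everything here is my own encoding: the trait names (`files_touched`, `high_risk`, and so on) are hypothetical, not from any documentation; the sketch just makes the defaults explicit.

```python
def pick_effort(files_touched: int, requirements_clear: bool,
                high_risk: bool = False, prior_high_failures: int = 0) -> str:
    """Map task traits to a reasoning tier, per the heuristics above."""
    if high_risk or prior_high_failures >= 2:
        return "xhigh"   # risky refactors, long debug chains, or high already failed twice
    if files_touched <= 2 and requirements_clear:
        return "medium"  # small, obvious, speed-sensitive edits
    return "high"        # the daily default: multi-file, vague, judgment-heavy work

assert pick_effort(1, True) == "medium"                      # small obvious fix
assert pick_effort(5, False) == "high"                       # multi-file, vague spec
assert pick_effort(2, True, high_risk=True) == "xhigh"       # risky refactor
assert pick_effort(4, False, prior_high_failures=2) == "xhigh"
```

Note that the fall-through case is `high`, not `medium`: in this scheme `medium` has to be earned by the task being small and clear, which matches the "don't use medium as a cure-all" point.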
Back to GPT-5.4, What Level is Actually Powerful?
So, finally back to that question: When people say “GPT-5.4” online is very powerful, what level are they referring to?
My judgment is that if the tier isn’t specified, it’s more reliable to assume they mean the higher thinking level when saying “GPT-5.4 is powerful.” At least in coding scenarios, don’t directly interpret it as medium. If the other party is talking about gpt-5.4-pro, that’s an entirely different matter; that’s a separate, more computationally intensive version.
I previously wrote about AI Coding Interaction Based on Command Line, which was more about changes in interaction methods. Looking back now, the change in interaction is one thing, but what level the model is actually running at has become another, more practical issue.
I am now clear on this: `high` is sufficient for daily use; if that fails, try `xhigh`. That balance of speed, cost, and success rate feels about right.
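That "high first, xhigh as fallback" habit can be sketched as a small escalation wrapper. The `run` callable and its result shape are placeholders I made up for illustration; the two-round threshold mirrors the earlier rule of thumb.

```python
def solve_with_escalation(task, run, max_high_rounds=2):
    """Try "high" for a couple of rounds; escalate to "xhigh" only on failure."""
    for _ in range(max_high_rounds):
        result = run(task, effort="high")
        if result["ok"]:
            return result
    return run(task, effort="xhigh")

# Stub runner for illustration: pretend "high" fails on hard tasks.
def fake_run(task, effort):
    return {"ok": effort == "xhigh" or task != "hard", "effort": effort}

assert solve_with_escalation("easy", fake_run)["effort"] == "high"
assert solve_with_escalation("hard", fake_run)["effort"] == "xhigh"
```

The design choice worth noting: escalation is bounded and explicit, so `xhigh` stays a deliberate fallback rather than a silent default.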
References
- Using GPT-5.4 | OpenAI API
- Code generation | OpenAI API
- GPT-5.4 Model | OpenAI API
- GPT-5.4 pro Model | OpenAI API
- GPT-5.3-Codex Model | OpenAI API
- Code AI Leaderboard - Best AI Models for Coding
Writing Notes
Original Prompt
While using $blog-writer codex, I have a question: the default thinking level is medium. What is the difference between the remaining 'high' and 'xhigh' capabilities? Which one should I use for daily use? The official documentation doesn't provide clear instructions, and online sources say that GPT-5.4 is very powerful—which level of thinking are they referring to? Suddenly, I remembered the large model ranking: https://arena.ai/leaderboard/code. Here, it clearly states the thinking levels of large models. It seems that gpt-5.4-high (codex-harness) ranks sixth. Using 'high' by default should be sufficient. If it still can't handle it, I can try 'xhigh' to balance cost and speed.
Writing Approach Summary
- Use “daily use high, xhigh as fallback” as the main judgment point, rather than creating a tier encyclopedia.
- Separate `reasoning` and `verbosity` to avoid confusing the two `medium` levels mentioned in public documentation.
- Official materials are used mainly to confirm supported tiers, default values, and code generation examples; do not fabricate an ability-gap table that the official sources don't provide.
- The Arena leaderboard uses rankings and scores from the 2026-04-01 page as factual anchors supporting the claim that "high is clearly superior to medium."
- Structurally: first explain the source of confusion, then define the boundaries according to official statements, and finally close with practical daily selection advice.