Clarifying “Medium”
The most confusing part of this topic is that medium has more than one meaning.
As of 2026-04-08, OpenAI's public documentation says GPT-5.4's `reasoning.effort` supports `none`, `low`, `medium`, `high`, and `xhigh`, with a default of `none`. Within the same documentation, `verbosity` also has `low`, `medium`, and `high`, and GPT-5.4's default `verbosity` is `medium`.
So, if you see something online that says “the default is medium,” don’t immediately assume it refers to the “thinking level.” Often, they are talking about completely different things.
If you are using it directly in Codex and see that the default is medium, I am more inclined to read that as a preset supplied by the product layer rather than the underlying default from the model documentation. If we don't keep this distinction, every subsequent discussion will contradict itself.
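To make the two `medium`s concrete, here is a minimal Python sketch of how the two knobs would sit in separate fields of a request payload. The field layout (`reasoning.effort` for thinking depth, a `text.verbosity` setting for output length) and the defaults follow the documentation as summarized above; the exact nesting is my assumption, and the code only builds a plain dict rather than calling any API.

```python
# Sketch only: field names and defaults follow the docs as summarized above;
# the "text" nesting for verbosity is an assumption. No request is sent.
API_DEFAULTS = {"reasoning_effort": "none", "verbosity": "medium"}

def build_request(prompt, reasoning_effort=None, verbosity=None):
    """Assemble a request payload, filling in the documented defaults.

    The two "medium"s live in different fields: reasoning.effort controls
    thinking depth, verbosity controls output length.
    """
    return {
        "model": "gpt-5.4",
        "input": prompt,
        "reasoning": {"effort": reasoning_effort or API_DEFAULTS["reasoning_effort"]},
        "text": {"verbosity": verbosity or API_DEFAULTS["verbosity"]},
    }

req = build_request("Fix the off-by-one bug in pagination.py")
assert req["reasoning"]["effort"] == "none"   # thinking-level default
assert req["text"]["verbosity"] == "medium"   # output-length default
```

The point of the sketch: a sentence like "the default is medium" only tells you which field it refers to once you know both fields exist.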
The Official Documentation Doesn’t Fully Explain the Gap
Let’s look at the official documentation. The public documents currently confirm a few things:
- `gpt-5.4` is the officially recommended default model for general coding tasks.
- In the code generation guide, the examples provided by the official team for `gpt-5.4` directly use `reasoning: high`.
- Codex-oriented models like `gpt-5.3-codex` explicitly support `low`, `medium`, `high`, and `xhigh` on their public pages.
- `gpt-5.4-pro` is another line entirely; it is not simply regular `gpt-5.4` with the dial turned up, but an independent model designed to "think longer with more compute."
However, the official documentation hasn’t provided a particularly useful chart, such as:
- Exactly how much lower the success rate is with `medium` compared to `high`.
- How much extra time or how many extra tokens `xhigh` spends compared to `high`.
- In coding scenarios, which kinds of tasks are worth jumping straight to `xhigh`.
In other words, the official team gave you the knobs, but they didn’t draw out the experience curve for you.
What’s Actually Useful is How the Tiers are Separated on the Leaderboard
A flash of inspiration struck me, so I went to check the code leaderboard on Arena, and now it’s much clearer.
The code leaderboard on arena.ai separates the tiers. The page was last updated on 2026-04-01, and as of this writing:
- `gpt-5.4-high (codex-harness)` ranks 6th with a score of 1457
- `gpt-5.4-medium (codex-harness)` ranks 16th with a score of 1427
- `gpt-5.3-codex (codex-harness)` ranks 18th with a score of 1407

Read together, these numbers are very direct. For the same GPT-5.4, the difference between `high` and `medium` isn't a "slight experience gap"; it is a noticeable tier separation. The statement "GPT-5.4 is very strong" is insufficient on its own, because the leaderboard itself lists `high` and `medium` as two distinct entries. When people say it is "very strong," they are most likely describing the high tier's performance, not speaking for `medium`.

Of course, the leaderboard isn't the absolute truth for your project. It measures agentic coding + harness scenarios, not your particular local repository. But the direction is clear: for coding, the reasoning tier genuinely changes the results, not just the speed.
How Should I Choose Now?
Simply put, my current usage is very straightforward.
Use `medium` for these scenarios:
- Changing a few small files
- Fixing obvious bugs
- Getting the model to spit out a draft first
- When speed is key and I don’t want to wait too long
Use `high` as the daily default:

- Modifying multiple interconnected files
- When there are slightly vague parts in the requirements
- When I need to read the code before making changes
- When judgment is required, not just code completion
I reserve `xhigh` for the tough nuts:

- High-risk refactoring
- Troubleshooting long chains of issues
- Architectural changes
- When `high` fails to solve the problem after two rounds

The most crucial point here is not how amazing `xhigh` is, but rather not treating `medium` as a cure-all. The real problem with `medium` isn't that it's weak; it's that on complex tasks it too easily gives you the illusion of "good enough." You save a little time in the first round, then spend far more on rework later.
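The rules of thumb above can be collapsed into a tiny decision helper. Everything here is my own encoding: the trait names (`files_touched`, `high_risk`, and so on) are hypothetical, not from any documentation; the sketch just makes the defaults explicit.

```python
def pick_effort(files_touched: int, requirements_clear: bool,
                high_risk: bool = False, prior_high_failures: int = 0) -> str:
    """Map task traits to a reasoning tier, per the heuristics above."""
    if high_risk or prior_high_failures >= 2:
        return "xhigh"   # risky refactors, long debug chains, or high already failed twice
    if files_touched <= 2 and requirements_clear:
        return "medium"  # small, obvious, speed-sensitive edits
    return "high"        # the daily default: multi-file, vague, judgment-heavy work

assert pick_effort(1, True) == "medium"                      # small obvious fix
assert pick_effort(5, False) == "high"                       # multi-file, vague spec
assert pick_effort(2, True, high_risk=True) == "xhigh"       # risky refactor
assert pick_effort(4, False, prior_high_failures=2) == "xhigh"
```

Note that the fall-through case is `high`, not `medium`: in this scheme `medium` has to be earned by the task being small and clear, which matches the "don't use medium as a cure-all" point.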
Back to GPT-5.4, What Level is Actually Powerful?
So, finally back to that question: When people say “GPT-5.4” online is very powerful, what level are they referring to?
My judgment is that if the tier isn’t specified, it’s more reliable to assume they mean the higher thinking level when saying “GPT-5.4 is powerful.” At least in coding scenarios, don’t directly interpret it as medium. If the other party is talking about gpt-5.4-pro, that’s an entirely different matter; that’s a separate, more computationally intensive version.
I previously wrote about AI Coding Interaction Based on Command Line, which was more about changes in interaction methods. Looking back now, the change in interaction is one thing, but what level the model is actually running at has become another, more practical issue.
I am now clear on this: `high` is sufficient for daily use; if that fails, try `xhigh`. That balance of speed, cost, and success rate feels about right.
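That "high first, xhigh as fallback" habit can be sketched as a small escalation wrapper. The `run` callable and its result shape are placeholders I made up for illustration; the two-round threshold mirrors the earlier rule of thumb.

```python
def solve_with_escalation(task, run, max_high_rounds=2):
    """Try "high" for a couple of rounds; escalate to "xhigh" only on failure."""
    for _ in range(max_high_rounds):
        result = run(task, effort="high")
        if result["ok"]:
            return result
    return run(task, effort="xhigh")

# Stub runner for illustration: pretend "high" fails on hard tasks.
def fake_run(task, effort):
    return {"ok": effort == "xhigh" or task != "hard", "effort": effort}

assert solve_with_escalation("easy", fake_run)["effort"] == "high"
assert solve_with_escalation("hard", fake_run)["effort"] == "xhigh"
```

The design choice worth noting: escalation is bounded and explicit, so `xhigh` stays a deliberate fallback rather than a silent default.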
References
- Using GPT-5.4 | OpenAI API
- Code generation | OpenAI API
- GPT-5.4 Model | OpenAI API
- GPT-5.4 pro Model | OpenAI API
- GPT-5.3-Codex Model | OpenAI API
- Code AI Leaderboard - Best AI Models for Coding
Writing Notes
Original Prompt
While using $blog-writer codex, I have a question: the default thinking level is medium. What is the difference between the remaining 'high' and 'xhigh' capabilities? Which one should I use for daily use? The official documentation doesn't provide clear instructions, and online sources say that GPT-5.4 is very powerful—which level of thinking are they referring to? Suddenly, I remembered the large model ranking: https://arena.ai/leaderboard/code. Here, it clearly states the thinking levels of large models. It seems that gpt-5.4-high (codex-harness) ranks sixth. Using 'high' by default should be sufficient. If it still can't handle it, I can try 'xhigh' to balance cost and speed.
Writing Approach Summary
- Use “daily use high, xhigh as fallback” as the main judgment point, rather than creating a tier encyclopedia.
- Separate `reasoning` and `verbosity` to avoid confusing the two `medium` levels mentioned in public documentation.
- Official materials are used mainly to confirm supported tiers, default values, and code generation examples; do not fabricate an ability-gap table that the official sources don't provide.
- The Arena leaderboard uses rankings and scores from the 2026-04-01 page as factual anchors supporting the claim that "high is clearly superior to medium."
- Structurally: first explain the source of confusion, then define the boundaries according to official statements, and finally close with practical daily selection advice.