Fewer tokens, so why is GPT-5.5 in Codex actually more expensive?

Stunned.

The official ChatGPT side doesn’t make it easy to track tokens and costs directly, so I found a third-party platform and ran a round of similar tasks through Codex with GPT-5.4 and GPT-5.5, with the thinking mode set to high. The result was clear-cut: on simple questions the difference was relatively mild, with GPT-5.5 costing about 30% more than GPT-5.4; but once complex tasks were involved, the cost shot up to 2.6 times, with both the request count and the token consumption rising at the same time.

My current assessment is equally straightforward: this can’t be explained away by the single statement that “5.5 has a higher unit price.” In simple tasks, the cost mainly comes from the unit price; in complex tasks, what is actually expensive is the entire execution chain. Looked at another way, though, 5.5 genuinely feels like it is absorbing your rework costs for you. The model is more willing to think through multiple steps, take more actions, and check things more thoroughly. In the end, you are not billed for a single answer; you are billed for the complete set of actions, which also minimizes the number of back-and-forth cycles required from you.

Unit Price Breakdown

Let’s look at the surface numbers first.

OpenAI released GPT-5.5 on April 23, 2026, positioning it as a model with enhanced capabilities for coding and professional tasks. According to the official API pricing page as of April 27, 2026, GPT-5.5 costs $5 / 1M tokens for input and $30 / 1M tokens for output, while GPT-5.4 costs $2.5 / 1M tokens for input and $15 / 1M tokens for output. In other words, compared with 5.4, both the input and output unit prices roughly double with 5.5.
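Under those list prices, the arithmetic behind “fewer tokens, higher bill” is easy to sketch. The token counts below are hypothetical, chosen only to illustrate how a roughly 35% reduction in tokens can still leave 5.5 about 30% more expensive:

```python
# Cost of a single request, given token counts and per-1M-token prices.
def request_cost(input_tokens, output_tokens, input_price, output_price):
    return input_tokens / 1e6 * input_price + output_tokens / 1e6 * output_price

# Hypothetical simple-question workload: 5.5 uses ~35% fewer tokens,
# but its unit prices ($5 in / $30 out) are double 5.4's ($2.5 / $15).
cost_54 = request_cost(8_000, 2_000, 2.5, 15)   # ~$0.050
cost_55 = request_cost(5_200, 1_300, 5.0, 30)   # ~$0.065

print(f"5.4: ${cost_54:.3f}  5.5: ${cost_55:.3f}  ratio: {cost_55 / cost_54:.2f}")
# ratio: 1.30 -- about 30% more, despite fewer tokens
```

The doubling of unit prices dominates: the token savings would have to reach 50% just to break even.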

With that in mind, it’s not hard to see why simple tasks only cost 30% more: on these tasks 5.5 may indeed be slightly more convergent, take fewer detours, and consume fewer tokens, offsetting part of the price increase. OpenAI itself emphasizes in its release notes that 5.5 is more capable on complex coding and real-world work tasks, and more token-efficient in certain scenarios. That claim doesn’t conflict with my tests; it only means the model itself performs better, not that your total bill will be cheaper.

To put it simply, my appetite has shrunk a bit, but the dish prices have doubled. It’s only natural that the total bill will go up.

The Dark Side of Complex Tasks

What truly struck me was not the 30% premium on simple problems, but the jump to 2.6 times on complex tasks.

If you focus only on the per-token price here, you will misread what is happening. Complex tasks in Codex are inherently not a simple question-and-answer exchange. When you set the thinking mode to high, the model’s goal is not to produce output quickly; it is to complete the task. It will break down steps, review the context, rewrite information, double-check results, and may even run several extra rounds on its own. One point in OpenAI’s official description of GPT-5.5 struck me as particularly critical: it is not merely better at chatting; it is better at planning for itself, using tools, verifying results, and pressing on even in the face of ambiguity.

This means that the cost structure of complex tasks has changed.

Previously, when you evaluated a model, you mainly looked at “how much it costs per 1M tokens.” Now you have to look at something else: “how many actions it will trigger in total to complete this task.” A more proactive, more persistent model may well deliver better results, but it also makes it easier for the request count, the context rereading, and the output length to all increase at once. What you ultimately see is a combination of a rising unit price, rising token counts, and rising request counts.
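That multi-step cost structure can be sketched with a toy model. Every number below is hypothetical (step counts, context sizes, growth per step); the point is only to show how a model that runs more steps over a growing context, at double the unit price, lands well past 2× rather than exactly 2×:

```python
# Toy model of an agentic execution chain: each step re-reads the accumulated
# context as input, then the context grows (tool results, model output)
# before the next step.
def chain_cost(steps, base_context, growth_per_step, output_per_step,
               input_price, output_price):
    total, context = 0.0, base_context
    for _ in range(steps):
        total += context / 1e6 * input_price + output_per_step / 1e6 * output_price
        context += growth_per_step
    return total

# Hypothetical complex task: 5.4 converges in 4 steps at $2.5/$15;
# 5.5 runs 5 steps (more planning and verification) at $5/$30.
cost_54 = chain_cost(4, 10_000, 3_500, 1_500, 2.5, 15)
cost_55 = chain_cost(5, 10_000, 3_500, 1_500, 5.0, 30)
print(f"ratio: {cost_55 / cost_54:.2f}")  # ~2.68
```

Because each extra step re-reads an ever-larger context at input prices, one additional step compounds with the doubled unit price, which is how a 2.6x-style ratio emerges without anything being “overcharged.”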

This is also why I observed a rather odd pattern in this test: on simple problems, 5.5 merely looked “slightly more expensive”; on complex tasks, it didn’t just cost more, it changed the shape of the cost curve entirely.

More Expensive, Not Necessarily Worse Value

There is another cost here that isn’t visible just by looking at the platform billing statement: the rework cost.

If 5.4 is cheaper but you have to add two extra rounds of prompts, manually correct course several times, and go back to fix the structure once more, that time is also a cost, even if it never shows up in the token bill. 5.5 being more expensive is, in a sense, the price of what it is actually selling: it bundles some of the work that humans would otherwise have to patch up into the model’s own reasoning and checking process.

Therefore, in complex tasks, 5.5’s elevated fee shouldn’t be read simply as “the platform overcharging.” More precisely, it consolidates the rework, review cycles, supplementary explanations, and re-validation that used to be scattered across human effort into a single, higher model cost. If a high first-draft success rate matters to you, or the task context is extensive and switching back and forth is costly, then this fee may actually be worth it.
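A crude break-even check makes the trade concrete. All numbers here are hypothetical (API costs, rework rounds, time per round, hourly rate); the only point is that once human rework time is priced in, a pricier model that needs fewer correction rounds can come out cheaper overall:

```python
# All-in cost of a task = API spend + the human time spent reworking it.
def total_cost(api_cost, rework_rounds, minutes_per_round, hourly_rate):
    return api_cost + rework_rounds * (minutes_per_round / 60) * hourly_rate

# Hypothetical complex task at a $60/hr engineering rate:
# 5.4 is cheap on tokens but needs two 15-minute correction rounds;
# 5.5 costs ~2.6x in tokens but lands the first draft.
total_54 = total_cost(0.25, rework_rounds=2, minutes_per_round=15, hourly_rate=60)
total_55 = total_cost(0.65, rework_rounds=0, minutes_per_round=15, hourly_rate=60)
print(f"5.4 all-in: ${total_54:.2f}  5.5 all-in: ${total_55:.2f}")
# 5.4 all-in: $30.25  5.5 all-in: $0.65
```

Under these assumptions the API cost difference is noise next to the human time; the calculation flips the other way only when the task is simple enough that neither model needs rework.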

Of course, there is a boundary here. Whether rework costs actually decrease depends on the type of task. If you are just asking simple questions, or if the matter is inherently quick and straightforward, then this extra 5.5 capability might not be necessary. When the price goes up but the return doesn’t follow suit, it can feel disproportionately expensive at that moment.

How Should I Choose Now?

There is also a highly practical question: why do so many people turn to third-party platforms for these measurements?

OpenAI’s help center explains it quite clearly: chatgpt.com and platform.openai.com are two independent platforms, and billing and historical costs are viewed separately. The API side has a Usage Dashboard where you can export cost and usage data; but if you spend your time in ChatGPT or Codex, there is no per-model token-cost view of the kind API users are used to. That conclusion is my inference from the official help documentation. And precisely because of this, much of the time we can only measure things ourselves or rely on third-party platforms for an indirect view.

Therefore, my attitude towards GPT-5.5 is also quite clear:

It’s not that 5.5 can’t be used; rather, we shouldn’t take the phrase “a more powerful model” as a default assumption that it will also be more cost-effective.

For simple problems and low-risk tasks, cheaper models like 5.4 are probably still the better daily workhorse. For complex, high-value tasks, consider bringing in 5.5, provided you are genuinely willing to pay for the stronger autonomy and lower rework cost it offers. Otherwise, it’s easy to end up in a situation where the results are slightly better but the costs spiral out of control first, leaving you hesitant even to repeat or scale the workflow.

To be honest, this is the most useful conclusion from this round of testing. A model upgrade is not simply a performance upgrade; often it is closer to a billing-logic upgrade. We used to choose based on answer quality; now we also have to choose a reasoning path we can afford.

Taking the platform’s perspective one step further, this pricing structure seems to be rewriting users’ price anchors for AI programming. Stronger agents will no longer be sold on “how much per conversation,” but on “how much rework they save you.” OpenAI closed a new large funding round in April 2026 and continues to adjust its corporate structure and expand its commercial scale. As for the talk of an “IPO push within the year”: as of April 27, 2026, I have not seen OpenAI state this directly; it is mostly outside media and analysts speculating in that direction. So if you need to write about this angle, the safer phrasing is not “OpenAI is deliberately raising prices for its IPO,” but that the company is clearly using its stronger agent capabilities to gradually train users to accept a higher price band.

Writing Notes

Original Prompt

ChatGPT’s official platform makes it hard to track tokens and costs directly. I used a third-party platform for testing, calling different ChatGPT models inside Codex with the reasoning mode set to high. On simple questions, 5.5 cost about 30% more than 5.4: fewer tokens, but a higher unit price per token. On complex tasks, the cost jumped 2.6 times: more API requests and higher token consumption, which, combined with the higher token prices, produced the 2.6x total increase. The 5.5 upgrade mainly reduces your rework costs. OpenAI is also taking the opportunity to get users used to higher pricing, especially since it aims for an IPO within the year.

Writing Concept Summary

  • Center the article’s main narrative on the “changing cost structure of complex tasks,” rather than writing a general model evaluation.
  • In the first half, use official pricing to explain why simple problems can result in “fewer tokens but a higher overall cost.”
  • In the middle section, analyze how the 2.6x increase for complex tasks is driven by rising unit prices, token counts, and request volumes simultaneously.
  • Include a dedicated segment on the “rework cost” perspective.