
How to Reduce Codex Token Usage

Why Codex burns tokens under per-token pricing, how to measure it, and a concrete checklist of reductions from AGENTS.md size to reasoning effort.

jordan · June 2, 2026 · 6 min read

The most-commented open issue on the Codex repo is titled, plainly, "Burning tokens very fast." It has 594 comments and counting. The original poster watched 20% of a business account's tokens vanish in two hours over one or two prompts. The thread underneath is just a long line of people asking the same thing: where did all my credits go? If you want to reduce Codex token usage, that question is the right place to start, and the answer is more concrete than most of the thread assumes.

It got sharper on April 2, 2026. That's when OpenAI moved Codex to API token-based rates. You still buy credits, but what you spend is now metered by tokens. A heavy session used to count as a number of "messages" against a usage window. Now every token of input, output, and reasoning draws down a balance you can watch shrink in real time. That's why the complaint jumped from background grumbling to the top of the issue tracker.

So let's deal with it directly. Here is why Codex eats tokens, how to actually see the damage, and a checklist of cuts that work.

Why Codex burns tokens

Codex runs an agent loop, so one task is never one API call. It reads files, runs tools, reasons about the output, reads more files, and each of those round trips carries the entire accumulated context back to the model. By the time a moderately complex task finishes, the total token count is often three to five times what you'd guess from a single prompt. The context window does not reset between steps inside a task. It grows.
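To see why the total outruns the prompt, here is a toy model of the loop in Python. It assumes every step resends the full accumulated context, ignores any caching discount, and uses made-up sizes. It illustrates the mechanism, not Codex's real accounting, and `cumulative_input_tokens` is a hypothetical helper.

```python
def cumulative_input_tokens(initial_context: int, added_per_step: list[int]) -> int:
    """Total input tokens billed when every step resends the whole context.

    initial_context: tokens in the first request (instructions plus your prompt).
    added_per_step: tokens each step appends (file reads, tool output, replies).
    """
    context = initial_context
    total = 0
    for added in added_per_step:
        total += context   # this round trip ships the full accumulated context
        context += added   # and the context only grows afterwards
    return total


if __name__ == "__main__":
    first_request = 8_000     # hypothetical size of the opening request
    steps = [3_000] * 10      # hypothetical: 10 round trips, ~3k new tokens each
    print(f"single prompt: {first_request:,} tokens")
    print(f"whole task:    {cumulative_input_tokens(first_request, steps):,} input tokens")
```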

Four things inflate that growth more than anything else, and reasoning effort is the big one. Codex exposes a model_reasoning_effort setting, and the gap between levels is steep. At xhigh, Codex burns three to five times more tokens than medium on the identical prompt. Set high reasoning globally and forget about it, and that one slip explains a doubled or tripled bill on its own.

Then there's your AGENTS.md. Every Codex turn reloads the AGENTS.md files in scope and pays for them as input tokens, every single turn, forever. Codex caps the combined size at 32 KB by default (project_doc_max_bytes), and the people who hit that ceiling are shipping the equivalent of a short novel to the model before they type anything. A bloated AGENTS.md isn't a one-time cost. It's a tax you pay continuously.
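A back-of-the-envelope sketch of that tax. It assumes roughly 4 bytes per token (a common rule of thumb for English text, not Codex's tokenizer) and that anything past project_doc_max_bytes is cut off; `agents_md_tax` is a hypothetical helper.

```python
BYTES_PER_TOKEN = 4              # rough assumption, not a Codex constant
DEFAULT_CAP_BYTES = 32 * 1024    # the 32 KB project_doc_max_bytes default


def agents_md_tax(doc_bytes: int, turns: int, cap_bytes: int = DEFAULT_CAP_BYTES) -> int:
    """Approximate input tokens spent re-sending AGENTS.md across a session."""
    loaded = min(doc_bytes, cap_bytes)   # assume content past the cap is not loaded
    per_turn = loaded // BYTES_PER_TOKEN
    return per_turn * turns


if __name__ == "__main__":
    turns = 50  # hypothetical session length
    print(f"32 KB AGENTS.md: {agents_md_tax(32 * 1024, turns):,} tokens")
    print(f"2 KB AGENTS.md:  {agents_md_tax(2_000, turns):,} tokens")
```

The point is the multiplication by `turns`: whatever you put in the file, you pay for again on every turn.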

MCP servers are the quiet one. Every connected server injects its tool definitions into context at the start of each turn, so a single GitHub MCP server exposing 93 tools can consume roughly 55,000 tokens per turn before you've said a word. You don't need one giant server to get there, either: stack five smaller ones and you can still burn more than 50,000 tokens of pure overhead per turn.
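The arithmetic is simple but sobering. In this sketch only the 55,000-token GitHub figure comes from above; the other servers, their sizes, and the 30-turn session are hypothetical, and `mcp_overhead` and `savings_if_disabled` are illustrative helpers, not Codex features. It also ignores any discount your plan may apply to repeated input.

```python
# Tool-definition tokens each connected MCP server adds to every turn.
# 55,000 for GitHub is the figure cited above; the others are hypothetical.
SERVERS = {"github": 55_000, "docs": 6_000, "browser": 9_000}


def mcp_overhead(servers: dict[str, int], turns: int) -> int:
    """Input tokens spent on tool definitions alone across a session."""
    return sum(servers.values()) * turns


def savings_if_disabled(servers: dict[str, int], disabled: set[str], turns: int) -> int:
    """Tokens saved by turning off the named servers for the whole session."""
    return sum(tokens for name, tokens in servers.items() if name in disabled) * turns


if __name__ == "__main__":
    turns = 30  # hypothetical session length
    print(f"overhead: {mcp_overhead(SERVERS, turns):,} tokens")
    print(f"saved by disabling github: {savings_if_disabled(SERVERS, {'github'}, turns):,} tokens")
```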

The last one is verbose tool output. When Codex dumps an entire 4,000-line log, or cats a whole file it only needed ten lines from, that output lands in context and rides along on every turn after. Runaway reads compound.

How to see your token usage

You cannot cut what you cannot measure, and the measurement tools live inside the session.

Run /status inside an active Codex session. It displays your current model, token usage, git branch, sandbox mode, and your remaining tokens for both the rolling 5-hour window and the weekly window. When a bill surprises you, this is the first place to look, because the count it shows includes all that overhead, not just the messages you can see.

For something you don't have to remember to run, configure /statusline. That command sets what shows up in the footer: model, context, limits, token counters, git, session, and directory. Think of it this way. /status is a check you call on demand, while /statusline is a meter that stays on screen the whole time you work. The Codex TUI also shows a persistent context-usage percentage, and once it climbs past 80%, you're heading into compaction territory, where Codex starts summarizing history to make room.

One honest caveat. These commands report what Codex sees from inside the session. They are not an account-wide billing statement. For that, check the Codex usage settings, or the OpenAI Platform usage page if you're on an API key. Don't compare the two directly, because they meter different things.

The reduction checklist

Start with reasoning effort, because one setting moves the bill more than any other. Set model_reasoning_effort = "medium" as your default in ~/.codex/config.toml and reserve high or xhigh for the handful of tasks that really need it, like an architectural decision or a security pass. Most edits, refactors, and bug fixes do not. If you turned reasoning up once for a hard problem and never turned it back down, you've been paying the xhigh premium on every trivial rename since.

Cut your AGENTS.md down. The practical targets that circulate among Codex users are tight: under 20 lines for your global ~/.codex/AGENTS.md, under 50 for a repo-root AGENTS.md, under 30 for module-level files. Delete anything the model already knows. "Write idiomatic Python" is wasted tokens. The only lines that earn their place are project-specific facts the model cannot infer. If you genuinely need more room, split guidance into nested files per directory instead of raising the byte cap, so a given turn only pays for the rules in scope.
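To check your files against those targets, a short script can walk the repo and count non-blank lines in every AGENTS.md. The limits are the community targets above; treating every AGENTS.md below the repo root as "module-level" and ignoring blank lines are simplifying assumptions of this sketch.

```python
from pathlib import Path

# Community targets from the checklist above.
LIMITS = {"global": 20, "repo_root": 50, "module": 30}


def count_lines(text: str) -> int:
    """Count non-blank lines (a simplification; blank lines still cost a little)."""
    return sum(1 for line in text.splitlines() if line.strip())


def limit_for(path: Path, repo: Path) -> int:
    """Repo-root file gets the root limit; anything nested counts as module-level."""
    return LIMITS["repo_root"] if path.parent == repo else LIMITS["module"]


def report(repo: Path, global_file: Path) -> None:
    repo = repo.resolve()
    checks = [(global_file, LIMITS["global"])] if global_file.exists() else []
    checks += [(p, limit_for(p, repo)) for p in sorted(repo.rglob("AGENTS.md"))]
    for path, limit in checks:
        n = count_lines(path.read_text(encoding="utf-8"))
        status = "OVER" if n > limit else "ok"
        print(f"{status:4} {n:3}/{limit} lines  {path}")


if __name__ == "__main__":
    report(Path("."), Path.home() / ".codex" / "AGENTS.md")
```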

This is where a well-scoped skill beats a sprawling AGENTS.md. A skill loads only when its trigger fires, so the instructions for, say, your deploy process don't sit in context during unrelated work the way a monolithic AGENTS.md does. The catch is that writing a tight skill is its own craft, and most people overstuff them the same way they overstuff AGENTS.md. Knack is built for exactly that problem. It turns a short interview into a single, well-scoped SKILL.md that travels to Codex, so non-coders can author guidance that stays lean instead of growing into a context tax.

Prune your MCP servers. Open ~/.codex/config.toml and disable any server you aren't actively using this session. If you only need three tools out of a 93-tool server, that server is still charging you the full tool-definition payload every turn. Turn it off when you don't need it.

Cap verbose output. Set tool_output_token_limit to something like 12,000 so a runaway file read can't flood your context, and set model_verbosity = "low" to trim output tokens without hurting code quality. When you do need to inspect a big file, ask for the specific lines, or grep first, rather than reading the whole thing.
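To make the idea of a cap concrete, here is a hypothetical head-and-tail truncation in Python, the kind of guard you might put in front of your own scripts' output. It is not how Codex implements tool_output_token_limit, and the 4-characters-per-token estimate is an assumption.

```python
CHARS_PER_TOKEN = 4  # rough stand-in for a real tokenizer (assumption)


def truncate_output(text: str, token_limit: int = 12_000) -> str:
    """Keep the head and tail of oversized tool output and drop the middle.

    A hypothetical client-side guard, not Codex's own truncation logic.
    """
    budget = max(token_limit, 0) * CHARS_PER_TOKEN
    if len(text) <= budget:
        return text
    half = budget // 2
    omitted = len(text) - 2 * half
    head = text[:half]
    tail = text[len(text) - half:]  # avoids text[-0:], which would return everything
    return f"{head}\n[... {omitted:,} characters omitted ...]\n{tail}"
```

Keeping both ends is a judgment call: the command and setup usually appear at the top of a log, and the error that ended it at the bottom.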

Compact on purpose. Don't wait for the automatic trigger near 95%, which can degrade quality because it summarizes aggressively under pressure. Run /compact manually at around 60% usage to fold conversation history into a summary while the context is still clean. Better still, start a fresh session between major phases. The investigation context from debugging does not need to ride along into implementation.
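Those thresholds fit in a few lines. `context_advice` is a hypothetical helper that encodes this article's rule of thumb (compact around 60%, treat 80% as a warning, never coast to the ~95% automatic trigger); in practice you read the percentage off the TUI rather than computing it.

```python
def context_advice(used_tokens: int, window_tokens: int) -> str:
    """Map context usage to the habits above; thresholds are this article's rule of thumb."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    pct = 100 * used_tokens / window_tokens
    if pct >= 95:
        return "auto-compaction imminent: wrap up and start a fresh session"
    if pct >= 80:
        return "compaction territory: run /compact now"
    if pct >= 60:
        return "good moment to run /compact"
    return "keep working"


if __name__ == "__main__":
    window = 200_000  # hypothetical window size
    for used in (50_000, 125_000, 170_000, 192_000):
        print(f"{used:>7,}: {context_advice(used, window)}")
```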

Pick a smaller model for routine work. GPT-5.4-mini runs at roughly 30% of GPT-5.4's token cost and handles most mechanical edits fine. For batch or CI runs, codex exec --model gpt-5.4-mini is the cheap default. Save the expensive model for the work that actually rewards it.
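Here is what routing buys, in units of full-model token cost. The 0.30 ratio is the rough figure above; the 800k/200k split is hypothetical, and the `gpt-5.4` and `gpt-5.4-mini` slugs in `codex_exec_cmd` are assumptions based on the model names in this article.

```python
MINI_RELATIVE_COST = 0.30  # "roughly 30% of GPT-5.4's token cost"


def blended_cost(routine_tokens: int, hard_tokens: int,
                 mini_ratio: float = MINI_RELATIVE_COST) -> float:
    """Cost in full-model token units when routine work runs on the mini model."""
    return routine_tokens * mini_ratio + hard_tokens


def codex_exec_cmd(prompt: str, routine: bool) -> list[str]:
    """Build a `codex exec` argv; model slugs are assumptions, check your account."""
    model = "gpt-5.4-mini" if routine else "gpt-5.4"
    return ["codex", "exec", "--model", model, prompt]


if __name__ == "__main__":
    # Hypothetical split: 800k tokens of routine edits, 200k of hard reasoning.
    print(f"all on GPT-5.4: {blended_cost(0, 1_000_000):,.0f} units")
    print(f"routed:         {blended_cost(800_000, 200_000):,.0f} units")
    print(codex_exec_cmd("rename fooBar to foo_bar across src/", routine=True))
```

In a CI job you would hand that list to `subprocess.run`; printing it first is a cheap way to confirm which model a batch will use.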

The shape of a cheap session

Put those together and a token-disciplined Codex session looks specific. Reasoning sits at medium. AGENTS.md is forty lines of facts the model couldn't guess. Two MCP servers are connected, not nine. Output is capped, the model is the mini for routine passes, and you /compact or restart before context bloats. That setup can cut a session's token draw by more than half against a default-everything, high-reasoning, 32-KB-AGENTS.md, every-MCP-on configuration. If you want a place to keep that discipline once you've found it, folding your repeated instructions into a lean skill with Knack is how it sticks instead of creeping back into a bloated AGENTS.md.

I think the credit meter mostly rewards restraint. The teams burning through balances in two hours are almost never doing harder work. They're carrying more context than the task needs, turn after turn, and paying for all of it. Trim the context and the bill follows. If you only do one thing today, run /status mid-task and read the number. Most people have never looked, and that number is usually the whole argument.