knack
← all posts

Codex CLI vs Claude Code: The Honest 2026 Comparison

Six months running both as daily drivers. Claude thinks better, Codex executes more cheaply, and the people who pretend one of them is dead are not actually shipping code.

I have been running both Codex and Claude Code as daily drivers since January. Same machine, same projects, mostly the same prompts. After six months I have a take, and it matches what the rest of the senior-dev internet has converged on: Claude thinks better, Codex executes more cheaply, and the people who pretend one of them is dead are not actually shipping code.

I pay for both subscriptions out of pocket. If you only read one paragraph, read the next one.

The take, stated plainly

Use Claude Code when the cost of a wrong answer is high, the problem requires multi-step reasoning, or you are reviewing other people's work. Use Codex when the cost of a wrong answer is low, the task is well-scoped, and you are going to run twenty of them today. Claude as thinker, Codex as executor. That framing is now the consensus on r/ChatGPTCoding and the conclusion of the 500-comment Reddit survey from April. Which one wins for you depends on which side of that line you spend more time on.

Surfaces and where you actually run them

Both tools want to be everywhere you write code. Codex has more touch points; Claude Code is tighter across the ones it ships.

Codex ships across five primary surfaces: an open-source Rust CLI at v0.130.0 (May 8, 2026), macOS and Windows desktop apps with parallel agents and worktrees, a cloud web agent at chatgpt.com/codex, an IDE extension for VS Code, Cursor, Windsurf, and JetBrains, and a Chrome extension that launched in May. A GitHub Action exists too. No surface is the "lead." OpenAI markets the desktop app as the coordination hub, but the CLI has the developer mindshare and 82k+ GitHub stars.

Claude Code covers fewer surfaces but knits them together harder. The Node-based CLI, the VS Code extension, a JetBrains plugin, the Claude Desktop app for macOS, Windows, and Windows ARM64, the web surface at claude.ai/code, an iOS app, and Remote Control for handing a terminal session off to your phone. CLAUDE.md files, settings, MCP servers, and skills follow you across all of them. Codex doesn't match that yet.

If your team lives in Slack-driven triage and ticket-to-PR pipelines, Codex's surface count helps. For IDE-plus-terminal people who occasionally check on a job from a laptop in another room, Claude Code's portability does.

Models, and what "default" actually means

Codex defaults to GPT-5.5, released April 23, 2026. GPT-5.4 and GPT-5.4-mini are still selectable. The 5.x-Codex line that powered the tool through 2025 has been folded into the generalist GPT-5.5 model rather than maintained as a separate variant, and it inherits much of the 1,000+ tok/s throughput from GPT-5.3-Codex Spark.

Claude Code uses whichever Sonnet or Opus is current. Opus 4.7 shipped April 16, 2026 as the heavy-lift default; Sonnet 4.6 handles most edits. The /model switcher is one keystroke, which matters when you want Opus for a refactor and Sonnet for the 200 file renames that follow.

The model difference shows up as a vibes thing. GPT-5.5 writes confidently and runs longer chains of tool calls before checking in. Claude asks clarifying questions when the brief is thin, which is annoying in scaffolding tasks and a lifesaver in migrations.

Permissions, autonomy, and the "danger" flags

The tools look very different in shape but converge on the same outcomes.

Codex has three sandbox modes (read-only, workspace-write, danger-full-access) crossed with four approval policies (untrusted, on-request, on-failure, never). The --full-auto flag is the daily-driver shortcut. --dangerously-bypass-approvals-and-sandbox, aliased as --yolo, is the escape hatch. The matrix reads like a security product, because OpenAI shipped Codex into enterprises with paranoid CISOs.

Claude Code has six permission modes: default, acceptEdits, plan, auto (a Q1 2026 research-preview classifier that approves tool calls when it can verify they match the request), dontAsk, and bypassPermissions. The CLI flag for the last one is --dangerously-skip-permissions. Plan mode is the killer feature: Claude reads your codebase, drafts a plan, and won't touch a file until you say go. Codex's /plan slash command does the same job, but Claude's plan-then-execute flow has more polish.

For OS-level isolation, Claude Code shipped a sandbox that combines filesystem and network restrictions for Bash commands. Codex's sandbox is built into the agent runtime itself. Both stop rm -rf / even in YOLO modes. Neither will save you from a malicious npm install. The agents execute exactly the dangerous thing you let them execute.

Extensibility, and the surprise convergence

A year ago this was the biggest gap between these two products. As of May 2026 it is the most boring section of this article.

Both tools speak Skills, the open standard for SKILL.md folders with YAML frontmatter that started in Claude Code and got adopted by Codex in December 2025. Both speak MCP. Both support subagents. Both shipped plugin systems within weeks of each other in March 2026. Both read project-instruction files at the repo root, with different filenames: Codex reads AGENTS.md (with AGENTS.override.md precedence), Claude Code reads CLAUDE.md plus auto-loaded .claude/skills/ directories at every parent level. Different conventions, same capability.

Whichever tool you pick, your SKILL.md investments port between them. Knack outputs the format both agents read, so a skill you author in one workflow drops cleanly into the other. The progressive-disclosure model (name plus description always loaded, body only on invocation) is identical across both platforms, which is the entire reason teams run multi-tool stacks without rewriting their internal tooling.

For the longer Claude-specific version of this discussion, see our Claude Skills marketplace overview and how-to-use guide. For the Codex side, our Codex Skills walkthrough covers the same territory.

Benchmarks, with the asterisks

Benchmark numbers in this category age in weeks rather than months. Here are the ones that matter as of mid-May 2026.

Benchmark Claude (Sonnet 4.6 / Opus 4.7) Codex (GPT-5.5) Source
SWE-bench Verified 87.6% 76.5% Anthropic, OpenAI
Terminal-Bench 2.0 71.2% 77.3% Terminal-Bench
Blind 12-round dev test 67% wins 25% wins (8% ties) Blake Crosley
Reddit daily-driver pref 25-30% 65.3% (79.9% upvote-weighted) Dev.to survey of 500+ comments

Read this table twice. The same population that picks Claude two-to-one in blind quality tests picks Codex two-to-one as their daily driver. One explanation fits the data: rate limits. Anthropic's $20 plan burns tokens on long tool-using sessions faster than ChatGPT Plus does, and the modal heavy user hits the wall around lunchtime. Codex rebuilt their pricing precisely so this would not happen.

Caveat on the SWE-bench number: Anthropic's 87.6% uses a tuned evaluation setup. OpenAI publishes Codex against the standard rig. Apples to oranges if you read it strictly. Apples to apples enough if you care which model is sharper on real defects in real repos.

Pricing, and the thing that actually drives the switching

Claude Code

Plan Price What you get
Pro $17/mo annual, $20 monthly Sonnet + limited Opus, single seat
Max 5x $100/mo 5x Pro usage, Opus access
Max 20x $200/mo 20x Pro usage
Team $20-25/seat (standard), $100/seat (premium) Multi-seat, admin controls
Enterprise Custom + API usage SSO, audit, managed settings

Codex (bundled with ChatGPT)

Plan Price 5-hour limit (GPT-5.5)
Free $0 A handful of messages, no cloud tasks
Go $8/mo Low single-digit message cap
Plus $20/mo 15-80 messages
Pro 5x $100/mo 80-400 messages
Pro 20x $200/mo 300-1,600 messages
Business $30/seat/mo Pay-as-you-go credits
Edu / Enterprise Custom Custom

Codex shifted to credit-based metering on April 2, 2026. GPT-5.5 burns 125 credits per million input tokens (12.5 cached) and 750 per million output. GPT-5.4 is half that. GPT-5.4-mini is roughly an eighth. Cached input at 10% of standard matters more than people realize on long sessions.

The April change is the loudest pricing event in this category all year. The two-month promotional doubling rolled back alongside the credit shift, and Plus users on heavy workflows started reporting "seven prompts used my entire 5-hour limit" within days. OpenAI launched the $100 Pro tier on April 9 to catch the Plus users who outgrew the cap, with a "2x Codex through May 31" promo to soften the landing.

Claude Max users have not been spared. The $20 Pro tier hits its limit fast on Opus-driven sessions, and Max $100 is the floor for anyone running long plan-mode workflows daily. Both companies charge $100 for "you can actually work all day." That is the real price of a serious AI coding subscription in 2026.

Cost-per-task at the Pro/Max-100 tier is roughly comparable for most workflows. Codex is cheaper for the all-day light-touch loop. Claude is cheaper for the once-an-hour deep refactor.

Switching narratives, both directions

Q1 and Q2 2026 had two opposing migration flows. Reading both helps you place yourself.

Jonathan Fulton, a staff software engineer at Datadog, published the most-shared switching post of the year in early May. He used Claude Code as his primary tool for nearly a year, then moved to Codex citing four things: better skill-usage decisions without hand-holding, sharper multi-codebase request tracing across service boundaries, a 1,000-line RFC plus implementation plan generated in one shot, and computer-control for end-to-end UI debugging. He also reported a coworker watching GPT-5.5 one-shot a task that would have taken multiple iterations with Opus 4.7. His framing was cautious: better right now for his workflow, not permanently better.

The opposite flow is quieter but real. Codex Plus users hit the April rate-limit reset, looked at their workflow, and moved to Claude Max $100. Threads on r/ChatGPTCoding and the OpenAI community rate-limits thread are dense with these. The pattern: developers doing daily code review and small features find Claude's $20 Pro burns up too fast, but Max $100 with plan mode and the new auto classifier feels like a working professional setup.

The meta keeps flipping. Six months ago Claude was the "obvious choice." Today Codex has the wind. Three months from now it will probably be whoever ships the next model first. Lock-in is shallow, switching costs are real but not huge if your skills are portable, and "one tool forever" is not a strategy any senior dev I know actually pursues.

What to actually do, by reader

Solo dev shipping side projects. Pick Codex Pro $100 (or Plus $20 if your projects are small). The credit model rewards lots of small turns, the cloud tasks let you queue overnight work, and the desktop app's worktree feature pays for itself on parallel exploration. Skip Claude Code unless you specifically want plan mode.

Senior engineer at a small team doing daily code review. Run both. Pay $20 for Claude Pro, $20 for ChatGPT Plus, and route work by task type. Use Claude for the "review this PR carefully" loop and the architectural conversations. Use Codex for "rename this thing across six services," "write a migration script," and the ten small CI failures a day. If your time is worth more than $40/mo, this is not a real expense.

Heavy refactor or migration person. Claude Max $100, full stop. Plan mode plus Opus 4.7 plus the sandbox is the right combination for the work you do. Codex's sandbox is good, but you don't need raw throughput, you need a tool that thinks before it writes. Keep a Codex Plus subscription around for the executor role on the mechanical bits.

Engineering manager paying out of an L&D budget. Buy your team Claude Code Team seats and pick up Codex Business as the cheaper-per-seat shared executor pool. Audit logs on Claude, credit-based metering on Codex, less to explain to finance.

Vibe-coder shipping prototypes for clients. Codex Pro $100 with --full-auto and the desktop app. You'll move faster than with Claude, which keeps stopping to ask for confirmations you don't want to give.

Pick the workflow that costs you the most hours this week, install whichever tool above maps to it, and run it on three real tasks before you decide. Six months of forum threads will not tell you what one afternoon of head-to-head use will.