A field guide to agent harnesses
Claude Code, Codex, Gemini CLI, Cursor, OpenClaw, OpenCode, Hermes, Pi. What each one is, who it is for, and how we think about picking between them.
An agent harness is the runtime that wraps a model and turns it into a tool-using loop. You feed it a goal. It plans, calls tools, observes results, and keeps going until it is done or it escalates to you. The model inside is what does the reasoning. The harness around it is what lets the reasoning actually touch the world.
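The loop described above can be sketched in a few lines. This is an illustrative skeleton, not any particular product's API: the message shapes, the `tool`/`finish`/`escalate` action types, and the step budget are all assumptions made for the sketch.

```python
# Minimal sketch of a harness loop: the model reasons, the harness
# executes tool calls and feeds results back, until the model says
# it is done or hands control back to a human.

def run_agent(goal, model, tools, max_steps=20):
    """Plan, call tools, observe results, repeat."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = model(history)            # the model does the reasoning
        if action["type"] == "finish":     # done: return the final answer
            return action["content"]
        if action["type"] == "escalate":   # the loop stops and asks you
            return f"needs human input: {action['content']}"
        # Otherwise the harness runs the named tool and appends the
        # observation, which is what lets the reasoning touch the world.
        result = tools[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": str(result)})
    return "step budget exhausted"
```

Everything a harness adds on top of this skeleton, such as file editing, checkpointing, or codebase indexing, shows up as richer tools and smarter loop control, not a different loop.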
In 2026 there are eight serious agent harnesses available as command-line tools. We wrote adapters for all of them, because the right harness for a task depends on the task. Here is the field guide we wish someone had handed us when we started.
Claude Code · Anthropic
What it is. The agent runtime from Anthropic. Runs Claude Sonnet 4.5 or Opus 4.6 under the hood. Ships as a native CLI and a VS Code extension.
Who it is for. Engineers who want the strongest reasoning model with tight tool integration. Claude Code has the best instruction following of any harness in the survey, and its file editing behavior is unusually clean: it plans a change, writes it, re-reads the file, and verifies its own work without being asked.
Strengths. Instruction following. Long-horizon planning. Clean rollback when it notices it made a mistake. Handles repo-sized context without getting confused.
Weaknesses. Token pricing is the highest in the survey. The loop can get expensive fast if you do not cap it.
When to pick it. Default choice for engineering work, code reviews, complex refactors, and anything that needs the model to hold a large mental model of a codebase at once.
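The cost caveat above, that the loop gets expensive fast if you do not cap it, is worth making concrete. One simple shape for a cap is a spend budget checked after every step. The prices, the per-step token count, and the `step_fn` interface here are illustrative assumptions, not Anthropic's actual pricing or API:

```python
# Hedged sketch of a spend cap on an agent loop: track estimated
# cost as steps complete, and stop rather than burn past the budget.

def run_with_budget(step_fn, max_usd=5.00, usd_per_1k_tokens=0.015):
    """Run agent steps until done or the estimated spend hits the cap.

    step_fn performs one plan/act/observe cycle and returns
    (done, tokens_used) for that step.
    """
    spent = 0.0
    while True:
        done, tokens_used = step_fn()
        spent += tokens_used / 1000 * usd_per_1k_tokens
        if done:
            return "done", spent
        if spent >= max_usd:
            # Escalate to a human instead of continuing to spend.
            return "budget exceeded", spent
```

The same pattern works for step counts or wall-clock time; the point is that the cap lives in the harness, outside the model's control.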
Codex · OpenAI
What it is. OpenAI's agent runtime. Runs GPT-5 under the hood.
Who it is for. Teams already deep in the OpenAI ecosystem, or anyone who wants the fastest iteration time on short tasks. Codex has excellent latency on small steps.
Strengths. Speed. Low latency per step, which adds up when a loop runs hundreds of steps. Good at structured output (JSON, YAML) without ceremony.
Weaknesses. Shorter effective context window than Claude. Loses track of the plan on very long runs.
When to pick it. High-frequency loops where each step is small and the total count is what matters. Structured data generation. Fast feedback loops during development.
Gemini CLI · Google
What it is. Google's Gemini agent runtime. Has a 2 million token context window, the largest of any commercially available harness.
Who it is for. Anyone whose work involves a lot of reference material. Crawl output. Long PDFs. Entire codebases. Hour-long meeting transcripts.
Strengths. That context window. We have given it 1.2 million tokens of real input in a single run and it held the thread. Multimodal handling is best in class: it can take in a URL, extract the content, read the images, and work from both.
Weaknesses. Sometimes over-apologizes and softens output. Needs system prompts to hold a stronger stance.
When to pick it. Research and enrichment tasks. Anything that needs to absorb a mountain of source material before acting. Image analysis. Multi-language work.
Cursor · Cursor
What it is. The agent runtime from the Cursor editor, available as a standalone CLI outside the editor.
Who it is for. People who already live in Cursor and want its agent behavior everywhere, not just in the editor. Strong integration with codebase indexing.
Strengths. Fastest codebase search in the survey. Indexes your repo once and then searches it in milliseconds. Makes a big difference on large codebases where grep is slow.
Weaknesses. Index has to be built and kept fresh. Loses its edge on very small codebases.
When to pick it. Large monorepos. Work that requires fast lookup across a hundred thousand files.
OpenClaw · OpenClaw
What it is. Open-source agent harness focused on long-running research tasks. Designed for jobs that take hours or days to complete.
Who it is for. Research teams, deep investigation work, anyone whose agent needs to run for a long time without a human touching it.
Strengths. Built-in checkpointing. If the process dies, it resumes from the last checkpoint without starting over. Stable memory usage on long runs. Generous open-source license.
Weaknesses. Slower per step than the commercial harnesses. Newer than most of the alternatives, which means fewer battle-tested integrations.
When to pick it. Multi-hour or multi-day runs. Data gathering jobs. Any task where resumability matters more than raw speed.
OpenCode · SST
What it is. SST's open-source coding CLI. Bring your own model, bring your own tools, zero lock-in.
Who it is for. Developers who want the harness pattern but do not want to commit to a specific provider. Runs on top of any model that supports function calling.
Strengths. Complete model independence. Plugin architecture that makes it trivial to add new tools. Lean runtime, fast startup.
Weaknesses. You bring your own everything, which means more setup. Fewer batteries included than Claude Code or Codex.
When to pick it. When you want to run multiple models behind the same harness. When you want full control over the tool set. When you do not want to depend on any single provider.
Hermes · Nous Research
What it is. Function-calling focused agent runtime from Nous Research. Tuned for structured output and reliable tool calling.
Who it is for. Teams building automations where the agent has to produce a correct tool call every time, not most of the time. Hermes has unusually low failure rates on structured output compared to general-purpose harnesses.
Strengths. Discipline with tool calls. Strong at producing valid JSON on the first try. Makes very few function-signature mistakes.
Weaknesses. Less natural on open-ended reasoning tasks. Excels in narrow domains, feels mechanical in wide ones.
When to pick it. Automation pipelines where correctness of the tool call matters more than the quality of the writing between the calls. Integration bots. Workflow executors.
Pi · Pi.dev
What it is. Privacy-first local agent. Ships with a small, fast model baked in and runs entirely on your machine.
Who it is for. Anyone whose work cannot leave the laptop. Regulated industries. Security-sensitive code. Workflows where calling a cloud model is a no-go.
Strengths. Fully local. Zero network traffic after install. Fast for its size. Predictable cost (zero per token).
Weaknesses. Smaller model means weaker reasoning on complex tasks. You trade capability for privacy.
When to pick it. Local code review. Offline work. Data you cannot send to a third party under any circumstances.
How we actually pick between them
At Company Agents, we do not pick one. Every agent can be assigned a different harness. Your CEO agent might run on Claude Code because it coordinates everything else. Your image-sourcing agent runs on Gemini CLI for the multimodal context window. Your fast QA sweep runs on Codex because latency per step matters. Your long research job runs on OpenClaw because it might need four hours.
Picking a single harness for the whole company is like picking a single programming language for every service. You can, but you will regret it the first time the choice does not fit the problem.
The Company Agents platform exists to make mixing them boring. You set the adapter on each agent and the rest of the system stops caring which model is running where.
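The per-agent assignment described above can be as boring as a lookup table. This is a hypothetical sketch, not the actual Company Agents configuration surface; the agent names and adapter identifiers are assumptions made for illustration:

```python
# Illustrative per-agent harness assignment. Each agent is mapped
# to the adapter whose strengths fit its job; everything else in
# the system goes through pick_harness and stops caring.

HARNESS_BY_AGENT = {
    "ceo":            "claude-code",  # coordination, long-horizon planning
    "image-sourcing": "gemini-cli",   # multimodal, huge context window
    "qa-sweep":       "codex",        # low latency per step
    "deep-research":  "openclaw",     # checkpointing for multi-hour runs
}

def pick_harness(agent_name, default="claude-code"):
    """Resolve the adapter for an agent, falling back to a default."""
    return HARNESS_BY_AGENT.get(agent_name, default)
```

Because the mapping is data rather than code, swapping an agent onto a different harness is a one-line change, which is what makes mixing them boring.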