Field Notes from 11 Weeks Living Inside Five AI Agents (2026 Edition)

If you ship code or wrestle long PDFs: Claude 4 Opus. Not by a hair, by a margin you can feel on Friday.
If you want one assistant that does almost everything: ChatGPT, knowing it will lie about a number once a day.
If you live in Google Docs, Sheets and Calendar: Gemini 2.5 Pro. The 2-million-token window is the real deal.
If you write reported pieces with citations: Perplexity Pro. It is a researcher in a trench coat, not an agent.
If you stalk a particular social network all day: Grok 3, and pretend the other 90% of the product isn't there.

§ 01

The cast we lived with

Three years ago the question was whether a chatbot could write a passable best-man speech. By the spring of 2026 it is whether a piece of software should be allowed to read your inbox, rewrite your migration script, and book the wrong return flight to Lisbon on your behalf. We gave nine staffers a single agent each for eleven weeks and a rule: no peeking at the others. At the end of every week they filed a 200-word note about what the tool earned, what it broke, and what they refused to ever let it touch again. This essay is what those notes added up to.

We tested the five products that, together, ate roughly 90% of the consumer agent market last quarter ^[1]. Enterprise-only suites (Microsoft 365 Copilot, AWS Q, Vertex Enterprise) were deliberately excluded — they are different animals, and they get their own dispatch in July. Every product was tested on the top paid consumer tier as of April 28, 2026.

Claude

Anthropic · Claude 4 Opus

A−

The careful one. Refactors entire repositories without breaking the type checker.

Newsroom favorite

ChatGPT

OpenAI · GPT-5o + Operator

B+

The generalist. Will do anything you ask, occasionally with a confident factual error.

Widest feature set

Gemini

Google · Gemini 2.5 Pro

B+

The librarian. Reads the whole filing cabinet in one breath, opinions optional.

2M context window

Perplexity

Perplexity Pro · routed

Not really an agent. A research desk that keeps its receipts.

Citation discipline

Grok

xAI · Grok 3

C+

The loud one. Wins on real-time social, loses on basically everything else we asked.

Use with verification

Inside our workbench

We assembled a fixed bench of 211 tasks across seven crafts: factual reporting, long-document triage (300+ page filings), greenfield engineering, refactor surgery, multi-step autonomy with browser access, longform editorial revision, and voice/multimodal scenarios. Each task ran three times per agent on staggered days, all outputs were scored blind against a rubric by two staff editors, and ties were broken by a third reader who had not seen the original prompt.

211

Bench tasks per agent

3,165

Outputs scored blind

77 d

Daily-use window
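The headline figures follow from the bench design. A quick check in Python, assuming every one of the five agents ran every task three times and the daily-use window covered all eleven weeks (the constant names are ours, chosen for illustration):

```python
# Figures taken from the bench description above.
TASKS_PER_AGENT = 211
RUNS_PER_TASK = 3
AGENTS = 5
WEEKS = 11

# One scored output per task, per run, per agent.
outputs_scored = TASKS_PER_AGENT * RUNS_PER_TASK * AGENTS
# Daily-use window expressed in days.
daily_use_days = WEEKS * 7

print(outputs_scored)  # 3165
print(daily_use_days)  # 77
```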

From the vendors

Disclosure: The Long Take has no commercial relationship with the vendors involved. Every subscription was paid out of editorial budget. The full task catalog, rubric and per-task scoring matrix are mirrored in Appendix M; we publish raw scores so other reviewers can replicate.

§ 02

Scoreboard, by craft

Nobody wins outright. We treat margins under 0.3 points as a tie; otherwise the highest score in a row wins the category. On the final row, the rate of confident factual error across the reporting and document-triage tasks, lower is better.

Craft	Claude	ChatGPT	Gemini	Perplexity	Grok
Reported research	8.5	8.1	8.7	9.4	6.7
Long-document triage	9.5	8.4	9.2	7.6	6.3
Greenfield code	9.6	9.0	8.5	7.0	7.5
Refactor surgery	9.7	8.5	8.0	6.3	6.9
Multi-step autonomy	8.9	9.2	7.9	7.5	6.4
Longform revision	8.9	9.1	7.6	7.1	8.3
Voice & multimodal	7.9	9.4	9.0	7.8	7.0
Confident-error rate (lower wins)	3.2%	5.5%	4.5%	3.8%	9.9%
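One reading of the tie rule, sketched in Python: any agent within 0.3 points of a row's top score shares the category. That interpretation, the function name `leaders`, and the choice to compare in integer tenths (so that margins landing exactly on 0.3 are not distorted by floating-point rounding) are assumptions for illustration, not the rubric's own definition. The confident-error row is left out because it is a percentage where lower wins.

```python
AGENTS = ["Claude", "ChatGPT", "Gemini", "Perplexity", "Grok"]

# Scores copied from the scoreboard, one list per craft, in AGENTS order.
SCORES = {
    "Reported research":    [8.5, 8.1, 8.7, 9.4, 6.7],
    "Long-document triage": [9.5, 8.4, 9.2, 7.6, 6.3],
    "Greenfield code":      [9.6, 9.0, 8.5, 7.0, 7.5],
    "Refactor surgery":     [9.7, 8.5, 8.0, 6.3, 6.9],
    "Multi-step autonomy":  [8.9, 9.2, 7.9, 7.5, 6.4],
    "Longform revision":    [8.9, 9.1, 7.6, 7.1, 8.3],
    "Voice & multimodal":   [7.9, 9.4, 9.0, 7.8, 7.0],
}

# "Under 0.3 points" expressed in tenths of a point.
TIE_MARGIN_TENTHS = 3


def leaders(row):
    """Return every agent whose score is within the tie margin of the row's best."""
    tenths = [round(score * 10) for score in row]
    best = max(tenths)
    return [agent for agent, t in zip(AGENTS, tenths) if best - t < TIE_MARGIN_TENTHS]


for craft, row in SCORES.items():
    print(f"{craft}: {', '.join(leaders(row))}")
```

Under this reading only Longform revision comes out shared (Claude and ChatGPT, 0.2 apart). Long-document triage and Multi-step autonomy sit at a margin of exactly 0.3, which is not under the threshold, so each keeps a single winner.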

Claude's lead on refactor surgery is the single most consistent finding in the bench. Across forty-two tasks where we handed it a 2,000-line file and asked for a structural change, it produced a working pull request on the first attempt 88% of the time. ChatGPT did so 64% of the time. Nobody else cleared half.

§ 03

Spec sheet, demystified

Sticker prices conceal as much as they show. Gemini's "free" tier handles more real work than ChatGPT's $20 plan during US peak hours, where Operator throttles every fourth run. Grok's $8 looks generous until you read the data-retention defaults out loud. Numbers below are observed behavior on our test days, not the marketing page.

	Claude	ChatGPT	Gemini	Perplexity	Grok
Paid plan / mo	$20	$20	$20	$20	$8
Free tier usable	Limited	Yes	Yes	Limited	Yes
Context	200K	128K	2M	routed	128K
Tool / browser use	Yes	Yes	Yes	Yes	Partial
Code execution	Yes	Yes	Yes	No	Partial
Voice mode	Beta	Excellent	Strong	No	Partial
Image generation	No	Built-in	Built-in	No	Built-in
Developer API	Anthropic	OpenAI	Vertex	Beta	xAI
Compliance posture	SOC 2 II · GDPR	SOC 2 II · GDPR	SOC 2 II · GDPR	SOC 2 II · GDPR	SOC 2 I
Retention default	30 days	30 days	18 months	30 days	Indefinite

Per-agent notes

§ 04

On Claude, the cautious craftsman

Claude is the agent we kept reaching for when something had to actually work by 5pm. Across the fifty-six engineering tasks it produced fewer regressions than any competitor; on the seventeen long-document briefs, it stayed coherent at page 280 while two of the others had begun to invent quotes by page 90. The reasoning trace, written in a calm internal voice that reads almost like a senior reviewer thinking out loud, is also the single feature that most changed how our engineering desk uses an assistant.

Refactor win rate

88%

It was the only assistant we trusted with an unattended hour. We came back to a clean pull request, three rejected lint warnings, and a polite question waiting in the chat.

— newsroom test log, week 6

The catch: no native image generation, voice that is still listed as beta on the company's own status page, and web access that occasionally rate-limits during evening hours on the US west coast. The safety tuning is also occasionally over-cautious — Claude refused 3.9% of our intentionally benign edge cases, roughly double ChatGPT's rate. None of these defects cost us a deliverable; together they explain why this is an A−, not an A.

What it earns

Refactor surgery without breaking the test suite
Lowest confident-error rate on factual tasks
Preserves author voice in editorial rewrites
Reasoning trace you can actually read and trust

What it costs you

No native image or video generation
Voice mode visibly behind ChatGPT and Gemini
Occasional over-cautious refusals on benign prompts
Message caps on the $20 plan hit sooner than on OpenAI's

§ 05

On ChatGPT, the eager generalist

OpenAI's surface area is bewildering: a voice that interrupts politely, an Operator mode that spins up its own browser, native image and short-form video, and the deepest connector list on the market (Slack, Drive, Notion, Linear, Figma, Stripe). If we had to keep one agent on the dock for the whole desk to share, it would still be this one. ChatGPT does the most things, and most of them work.

The price is reliability. GPT-5o's confident-error rate on numeric tasks (currency math, date arithmetic, cited-figure recall) sat at 5.5% — too high to use unsupervised on finance or compliance work. Operator amplified those errors: a long run chains its own mistakes, and we lost a half-day reconciling an expense report it had cheerfully invented categories for.
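Why a long run amplifies errors can be shown with a back-of-envelope model. The sketch below assumes each step in a chain fails independently with the same probability, which real agent runs do not guarantee. The function name `chain_success` is hypothetical, and the 5.5% input is borrowed from the figure above purely as an illustrative value: it was measured per task, not per step.

```python
def chain_success(step_error_rate: float, steps: int) -> float:
    """Probability that every step in a chain is error-free.

    Assumes steps fail independently with the same error rate.
    """
    if not 0.0 <= step_error_rate <= 1.0:
        raise ValueError("step_error_rate must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return (1.0 - step_error_rate) ** steps


# Illustrative only: clean-run odds shrink as the chain gets longer.
for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps: {chain_success(0.055, steps):.1%} chance of a clean run")
```

The point is the shape, not the exact numbers: a small per-step error rate becomes a large end-to-end failure rate once the steps stack up.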

Operator long-run success

71%

What it earns

The most ambitious agentic feature shipping today
Voice mode that holds a conversation, not a transcript
Native image and short video without the round trip
Largest connector and plugin surface

What it costs you

Highest confident-error rate of the three top scorers
Operator compounds errors on long unattended runs
Throttling on Pro during US weekday afternoons
UI overhaul still feels unfinished six months in

§ 06

On Gemini, the quiet librarian

The two-million-token context window is not marketing varnish. We loaded the full SEC 10-K of a mid-cap retailer (about 340 pages), three years of earnings transcripts and the entire proxy statement, and asked twenty cross-document questions. Gemini answered 92% of them correctly. Claude, forced through chunking into its 200K window, managed 81% on the same questions. That gap matters if your job is reading filings.
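To make the chunking penalty concrete, here is a minimal sketch of splitting an oversized corpus into pieces that fit a fixed window. It is not how the bench ran: the function `chunk_tokens`, the `reserve` held back for the question and answer, the 450,000-token corpus size, and the whitespace stand-in for a real tokenizer are all assumptions for illustration.

```python
from typing import List


def chunk_tokens(tokens: List[str], window: int, reserve: int = 0) -> List[List[str]]:
    """Split a token list into consecutive chunks that each fit the usable window.

    `reserve` is the part of the window kept free for the prompt and the reply.
    """
    usable = window - reserve
    if usable <= 0:
        raise ValueError("reserve must be smaller than window")
    return [tokens[i:i + usable] for i in range(0, len(tokens), usable)]


# Stand-in corpus: whitespace-separated words, not real model tokens.
corpus = ("filing " * 450_000).split()

chunks = chunk_tokens(corpus, window=200_000, reserve=20_000)
print(len(chunks))               # 3
print([len(c) for c in chunks])  # [180000, 180000, 90000]
```

A cross-document question whose evidence sits in two different chunks never sees both at once, which is one plausible reason a chunked model trails a model that reads everything in a single pass.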

Workspace integration is the other under-told story. Asking Gemini to draft a reply that cites your last three Drive folders and last week's calendar invites actually works — no copy-paste choreography. The trade is personality. Gemini writes like a competent assistant who has decided not to bother you with opinions, and the resulting copy is usable but rarely memorable.

The reporter assigned to Gemini described the experience as "having a really gifted intern who never makes eye contact." We treat that as praise.

What it earns

2M-token context that handles real filings end-to-end
Best free tier of the five
Native Workspace integration that actually works
Lowest median latency in our trials

What it costs you

Prose with the personality of a settings panel
Refactor results trail Claude and ChatGPT
18-month default retention raises eyebrows in security review

§ 07

On Grok, the loud minor

Grok shines in exactly one corner: real-time discussion on its parent network. It summarizes a breaking thread with attribution faster than any other agent in the test — it is, in effect, watching the firehose. Outside that one trick, the model lags. Refactor quality is mediocre, the confident-error rate is the worst in the group at 9.9%, and the irreverent voice that was a novelty in 2024 now reads as careless.

At $8 a month it is the cheapest paid option here, and for a journalist tracking real-time events or a marketer monitoring a launch, that price is defensible. For everyone else, the cheapness reflects the gap.

What it earns

Real-time access to its native social firehose
Cheapest paid plan in the comparison
Surprisingly capable on casual prose

What it costs you

Worst confident-error rate measured
Weak on engineering and structured reasoning
Indefinite default data retention
Tool use feels grafted on, not designed in

§ 08

On Perplexity, the receipts desk

Calling Perplexity an "AI agent" stretches the word. It is a research front-end that routes every question to whichever frontier model handles that question best, and footnotes its answer with the documents it consulted. But it is so good at the one thing — citing its sources — that we kept it in the bench. On reported factual queries with cited evidence, Perplexity scored 9.4, higher than any of the general-purpose models.

Beyond research, it disappoints. No code execution, no autonomous access to local files, no voice. If your job involves browser tabs, footnotes, and writing things up, it pays for itself. If you need an assistant to also fix a YAML file, look elsewhere.

What it earns

Best-in-class citation hygiene
Routes to multiple frontier models per query
Clean, distraction-free reading interface

What it costs you

Not really an agent — no autonomous task execution
No code execution, no voice, no local files
Developer API still labeled beta in May 2026

The prescription

§ 09

What you should actually buy

We refused to file one of those "it depends" conclusions because the readers asking this question are buying a subscription this week, not next quarter. So:

Ship code or read filings: Claude 4 Opus. Meaningfully ahead, not marginally.
One assistant to do everything (write, talk, image, browse): ChatGPT. Pay the hallucination tax with one eye open.
Live in Workspace or wrestle very long documents: Gemini 2.5 Pro.
Write reported pieces with citations: Perplexity Pro.
Cover one specific social network for a living: Grok 3 — and only then.

Two of these vendors will ship new flagships before the calendar reaches August (Anthropic and OpenAI both run public cadences). Our verdicts will move; this article will move with them. Every revision is logged in the changelog below, and the underlying scoring spreadsheet is mirrored at Appendix M.

Changelog

2026-05-14 — Added Operator long-run figure (71%) after additional unattended-run data landed.
2026-05-09 — Gemini Workspace integration upgraded from "promising" to "actually works" after sustained testing.
2026-05-02 — Initial publication.

Sources & appendices

The Long Take reader-panel survey, March 2026 (n = 5,103); methodology published alongside the dataset.
"Frontier Capability Snapshot Q1 2026," Center for Empirical AI Evaluation, March 2026.
Anthropic model card, Claude 4 family release, April 2026.
OpenAI system card, GPT-5o with Operator, April 2026.
Google DeepMind, "Gemini 2.5 Pro technical report," March 2026.
Perplexity Engineering, "Notes on routing," internal blog, February 2026.
xAI release notes, Grok 3, January 2026.
Appendix M — full task catalog, rubric, and per-task scoring matrix (mirrored on our research page).
