Vol. IX · No. 18 · 2026
The Long Take
Independent technology dispatch
Thursday, May 14, 2026 · Tools issue
Cover essay · Tools we use to think

Field Notes from Eleven Weeks Living Inside Five AI Agents

We made every member of the newsroom adopt one assistant for a quarter, banned switching, and tracked what got shipped, what got broken, and which Tuesdays were lost forever.

By Marcus Olamide-Reed, Tools Correspondent · Estimated read: 19 min
  1. If you ship code or wrestle long PDFs: Claude 4 Opus. Not by a hair, by a margin you can feel on Friday.
  2. If you want one assistant that does almost everything: ChatGPT, knowing it will lie about a number once a day.
  3. If you live in Google Docs, Sheets and Calendar: Gemini 2.5 Pro. The 2-million-token window is the real deal.
  4. If you write reported pieces with citations: Perplexity Pro. It is a researcher in a trench coat, not an agent.
  5. If you stalk a particular social network all day: Grok 3, and pretend the other 90% of the product isn't there.
§ 01

The cast we lived with

Three years ago the question was whether a chatbot could write a passable best-man speech. By the spring of 2026 it is whether a piece of software should be allowed to read your inbox, rewrite your migration script, and book the wrong return flight to Lisbon on your behalf. We gave nine staffers a single agent each for eleven weeks and a rule: no peeking at the others. At the end of every week they filed a 200-word note about what the tool earned, what it broke, and what they refused to ever let it touch again. This essay is what those notes added up to.

We tested the five products that, together, ate roughly 90% of the consumer agent market last quarter [1]. Enterprise-only suites (Microsoft 365 Copilot, AWS Q, Vertex Enterprise) were deliberately excluded — they are different animals, and they get their own dispatch in July. Every product was tested on the top paid consumer tier as of April 28, 2026.

Claude

Anthropic · Claude 4 Opus
A−

The careful one. Refactors entire repositories without breaking the type checker.

Newsroom favorite

ChatGPT

OpenAI · GPT-5o + Operator
B+

The generalist. Will do anything you ask, occasionally with a confident factual error.

Widest feature set

Gemini

Google · Gemini 2.5 Pro
B+

The librarian. Reads the whole filing cabinet in one breath, opinions optional.

2M context window

Perplexity

Perplexity Pro · routed
B

Not really an agent. A research desk that keeps its receipts.

Citation discipline

Grok

xAI · Grok 3
C+

The loud one. Wins on real-time social, loses on basically everything else we asked.

Use with verification

Inside our workbench

We assembled a fixed bench of 211 tasks across seven crafts: factual reporting, long-document triage (300+ page filings), greenfield engineering, refactor surgery, multi-step autonomy with browser access, longform editorial revision, and voice/multimodal scenarios. Each task ran three times per agent on staggered days, all outputs were scored blind against a rubric by two staff editors, and ties were broken by a third reader who had not seen the original prompt.

211
Bench tasks per agent
3,165
Outputs scored blind
77 d
Daily-use window
$0
From the vendors
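
The headline counts above follow directly from the bench design. A minimal sketch of that arithmetic in Python; the constant and function names are ours, chosen for illustration, and only the figures themselves come from the text.

```python
# Bench design figures as stated in the article.
TASKS_PER_AGENT = 211   # fixed bench of tasks
RUNS_PER_TASK = 3       # each task ran three times per agent
AGENTS = 5              # Claude, ChatGPT, Gemini, Perplexity, Grok
WEEKS = 11              # length of the daily-use window


def outputs_scored(tasks: int, runs: int, agents: int) -> int:
    """Total outputs when every agent runs every task `runs` times."""
    return tasks * runs * agents


total_outputs = outputs_scored(TASKS_PER_AGENT, RUNS_PER_TASK, AGENTS)
window_days = WEEKS * 7

print(total_outputs)  # 3165
print(window_days)    # 77
```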

Disclosure: The Long Take has no commercial relationship with the vendors involved. Every subscription was paid out of editorial budget. The full task catalog, rubric and per-task scoring matrix are mirrored at Appendix M; we publish raw scores so other reviewers can replicate the results.

§ 02

Scoreboard, by craft

Nobody wins outright. We treat margins under 0.3 points as a tie. Yellow boxes are category winners. Lower is better on the final row (rate of confident factual error across the reporting and document-triage tasks).

Craft                              Claude  ChatGPT  Gemini  Perplexity  Grok
Reported research                  8.5     8.1      8.7     9.4         6.7
Long-document triage               9.5     8.4      9.2     7.6         6.3
Greenfield code                    9.6     9.0      8.5     7.0         7.5
Refactor surgery                   9.7     8.5      8.0     6.3         6.9
Multi-step autonomy                8.9     9.2      7.9     7.5         6.4
Longform revision                  8.9     9.1      7.6     7.1         8.3
Voice & multimodal                 7.9     9.4      9.0     7.8         7.0
Confident-error rate (lower wins)  3.2%    5.5%     4.5%    3.8%        9.9%

Claude's lead on refactor surgery is the single most consistent finding in the bench. Across forty-two tasks where we handed it a 2,000-line file and asked for a structural change, it produced a working pull request on the first attempt 88% of the time. ChatGPT managed 64%. No other agent cleared half.
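
The tie rule stated with the scoreboard (margins under 0.3 points count as a tie) can be applied mechanically. The sketch below is one reading of that rule, not the newsroom's own scoring script: it assumes every agent within less than 0.3 of the top score shares the category. Scores are compared in integer tenths because binary floating point would otherwise misjudge margins that sit exactly on 0.3. The names `SCORES`, `TIE_MARGIN_TENTHS` and `leaders` are hypothetical.

```python
# Craft scores copied from the scoreboard (0-10 scale, one decimal place).
SCORES = {
    "Reported research":    {"Claude": 8.5, "ChatGPT": 8.1, "Gemini": 8.7, "Perplexity": 9.4, "Grok": 6.7},
    "Long-document triage": {"Claude": 9.5, "ChatGPT": 8.4, "Gemini": 9.2, "Perplexity": 7.6, "Grok": 6.3},
    "Greenfield code":      {"Claude": 9.6, "ChatGPT": 9.0, "Gemini": 8.5, "Perplexity": 7.0, "Grok": 7.5},
    "Refactor surgery":     {"Claude": 9.7, "ChatGPT": 8.5, "Gemini": 8.0, "Perplexity": 6.3, "Grok": 6.9},
    "Multi-step autonomy":  {"Claude": 8.9, "ChatGPT": 9.2, "Gemini": 7.9, "Perplexity": 7.5, "Grok": 6.4},
    "Longform revision":    {"Claude": 8.9, "ChatGPT": 9.1, "Gemini": 7.6, "Perplexity": 7.1, "Grok": 8.3},
    "Voice & multimodal":   {"Claude": 7.9, "ChatGPT": 9.4, "Gemini": 9.0, "Perplexity": 7.8, "Grok": 7.0},
}

# Margins strictly under 0.3 points (3 tenths) are treated as a tie.
TIE_MARGIN_TENTHS = 3


def leaders(row: dict) -> list:
    """Return the agents that win or share a craft under the tie rule."""
    # Convert to integer tenths so 9.5 - 9.2 is exactly 3, not 0.30000000000000071.
    tenths = {agent: round(score * 10) for agent, score in row.items()}
    best = max(tenths.values())
    return sorted(agent for agent, t in tenths.items() if best - t < TIE_MARGIN_TENTHS)


for craft, row in SCORES.items():
    print(f"{craft}: {', '.join(leaders(row))}")
```

Under this reading, longform revision comes out as a ChatGPT and Claude tie (9.1 against 8.9), while long-document triage and multi-step autonomy each have a single winner because the margin there is exactly 0.3.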
§ 03

Spec sheet, demystified

Sticker prices conceal as much as they show. Gemini's "free" tier handles more real work than ChatGPT's $20 plan during US peak hours, where Operator throttles every fourth run. Grok's $8 looks generous until you read the data-retention defaults out loud. Numbers below are observed behavior on our test days, not the marketing page.

Feature             Claude           ChatGPT          Gemini           Perplexity       Grok
Paid plan / mo      $20              $20              $20              $20              $8
Free tier usable    Limited          Yes              Yes              Limited          Yes
Context             200K             128K             2M               routed           128K
Tool / browser use  Yes              Yes              Yes              Yes              Partial
Code execution      Yes              Yes              Yes              No               Partial
Voice mode          Beta             Excellent        Strong           No               Partial
Image generation    No               Built-in         Built-in         No               Built-in
Developer API       Anthropic        OpenAI           Vertex           Beta             xAI
Compliance posture  SOC 2 II · GDPR  SOC 2 II · GDPR  SOC 2 II · GDPR  SOC 2 II · GDPR  SOC 2 I
Retention default   30 days          30 days          18 months        30 days          Indefinite

Per-agent notes
§ 04

On Claude, the cautious craftsman

Claude is the agent we kept reaching for when something had to actually work by 5pm. Across the fifty-six engineering tasks it produced fewer regressions than any competitor; on the seventeen long-document briefs, it stayed coherent at page 280 while two of the others had begun to invent quotes by page 90. The reasoning trace, written in a calm internal voice that reads almost like a senior reviewer thinking out loud, is also the single feature that most changed how our engineering desk uses an assistant.

Refactor win rate
88%

It was the only assistant we trusted with an unattended hour. We came back to a clean pull request, three rejected lint warnings, and a polite question waiting in the chat.

— newsroom test log, week 6

The catch: no native image generation, voice that is still listed as beta on the company's own status page, and web access that occasionally rate-limits during evening hours on the US west coast. The safety tuning is also occasionally over-cautious — Claude refused 3.9% of our intentionally benign edge cases, roughly double ChatGPT's rate. None of these defects cost us a deliverable; together they explain why this is an A−, not an A.

What it earns

  • Refactor surgery without breaking the test suite
  • Lowest confident-error rate on factual tasks
  • Preserves author voice in editorial rewrites
  • Reasoning trace you can actually read and trust

What it costs you

  • No native image or video generation
  • Voice mode visibly behind ChatGPT and Gemini
  • Occasional over-cautious refusals on benign prompts
  • Message caps on the $20 plan hit sooner than on OpenAI's
§ 05

On ChatGPT, the eager generalist

OpenAI's surface area is bewildering: a voice that interrupts politely, an Operator mode that spins up its own browser, native image and short-form video, and the deepest connector list on the market (Slack, Drive, Notion, Linear, Figma, Stripe). If we had to keep one agent on the dock for the whole desk to share, it would still be this one. ChatGPT does the most things, and most of them work.

The price is reliability. GPT-5o's confident-error rate on numeric tasks (currency math, date arithmetic, cited-figure recall) sat at 5.5% — too high to use unsupervised on finance or compliance work. Operator amplified those errors: a long run chains its own mistakes, and we lost a half-day reconciling an expense report it had cheerfully invented categories for.

Operator long-run success
71%
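
Why long unattended runs degrade can be illustrated with a simple compounding model. This is a sketch under a strong assumption that is ours, not the bench's: every step of a run succeeds independently with the same probability, and one failed step spoils the run. Real agent runs are messier, since errors can be caught or can cascade. The function names and the step counts are hypothetical.

```python
def run_success(per_step: float, steps: int) -> float:
    """Chance that a run of `steps` independent steps finishes with no failed step."""
    return per_step ** steps


def per_step_needed(target: float, steps: int) -> float:
    """Per-step reliability required to reach `target` success over `steps` steps."""
    return target ** (1.0 / steps)


# How a fixed per-step reliability erodes as runs get longer.
for steps in (5, 10, 20, 40):
    print(steps, round(run_success(0.98, steps), 3))

# Per-step reliability implied by a 71% long-run success rate,
# if a long run were ten steps (the step count is an assumption).
print(round(per_step_needed(0.71, 10), 3))
```

Under these assumptions, a 71% success rate over a ten-step run corresponds to each step working roughly 97% of the time, which is why small per-step error rates still add up to lost half-days.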

What it earns

  • The most ambitious agentic feature shipping today
  • Voice mode that holds a conversation, not a transcript
  • Native image and short video without the round trip
  • Largest connector and plugin surface

What it costs you

  • Highest confident-error rate of the three top scorers
  • Operator compounds errors on long unattended runs
  • Throttling on Pro during US weekday afternoons
  • UI overhaul still feels unfinished six months in
§ 06

On Gemini, the quiet librarian

The two-million-token context window is not marketing varnish. We loaded the full SEC 10-K of a mid-cap retailer (about 340 pages), three years of earnings transcripts and the entire proxy statement, and asked twenty cross-document questions. Gemini answered 92% of them correctly. Claude, forced through chunking into its 200K window, managed 81% on the same questions. That gap matters if your job is reading filings.
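
For readers unfamiliar with chunking, here is a minimal sketch of the general technique: splitting a long token sequence into overlapping windows that each fit a smaller context budget. It is not the procedure used in our bench. It assumes tokens are already available as a list (a whitespace split stands in for a real tokenizer here), and the names `chunk_tokens`, `window` and `overlap` are illustrative.

```python
from typing import Iterator, List


def chunk_tokens(tokens: List[str], window: int, overlap: int = 0) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `window` tokens, sharing `overlap` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    for start in range(0, len(tokens), step):
        yield tokens[start:start + window]
        # Stop once the current slice has reached the end of the document.
        if start + window >= len(tokens):
            break


# Toy example: a whitespace split is a crude stand-in for a model tokenizer.
text = "revenue rose in the quarter while margins narrowed across all regions"
tokens = text.split()
for chunk in chunk_tokens(tokens, window=5, overlap=2):
    print(chunk)
```

The design trade-off is visible even in the toy case: a question whose evidence sits in two different chunks cannot be answered from either chunk alone, which is one plausible reason a chunked pipeline trails a model that reads everything at once.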

Workspace integration is the other under-told story. Asking Gemini to draft a reply that cites your last three Drive folders and last week's calendar invites actually works — no copy-paste choreography. The trade is personality. Gemini writes like a competent assistant who has decided not to bother you with opinions, and the resulting copy is usable but rarely memorable.

The reporter assigned to Gemini described the experience as "having a really gifted intern who never makes eye contact." We treat that as praise.

What it earns

  • 2M-token context that handles real filings end-to-end
  • Best free tier of the five
  • Native Workspace integration that actually works
  • Lowest median latency in our trials

What it costs you

  • Prose with the personality of a settings panel
  • Refactor results trail Claude and ChatGPT
  • 18-month default retention raises eyebrows in security review
§ 07

On Grok, the loud minor

Grok shines in exactly one corner: real-time discussion on its parent network. It summarizes a breaking thread with attribution faster than any other agent in the test — it is, in effect, watching the firehose. Outside that one trick, the model lags. Refactor quality is mediocre, the confident-error rate is the worst in the group at 9.9%, and the irreverent voice that was a novelty in 2024 now reads as careless.

At $8 a month it is the cheapest paid option here, and for a journalist tracking real-time events or a marketer monitoring a launch, that price is defensible. For everyone else, the cheapness reflects the gap.

What it earns

  • Real-time access to its native social firehose
  • Cheapest paid plan in the comparison
  • Surprisingly capable on casual prose

What it costs you

  • Worst confident-error rate measured
  • Weak on engineering and structured reasoning
  • Indefinite default data retention
  • Tool use feels grafted on, not designed in
§ 08

On Perplexity, the receipts desk

Calling Perplexity an "AI agent" stretches the word. It is a research front-end that routes every question to whichever frontier model handles that question best, and footnotes its answer with the documents it consulted. But it is so good at the one thing — citing its sources — that we kept it in the bench. On reported factual queries with cited evidence, Perplexity scored 9.4, higher than any of the general-purpose models.

Beyond research, it disappoints. No code execution, no autonomous browsing of your laptop, no voice. If your job involves browser tabs, footnotes, and writing things up, it pays for itself. If you need an assistant to also fix a YAML file, look elsewhere.

What it earns

  • Best-in-class citation hygiene
  • Routes to multiple frontier models per query
  • Clean, distraction-free reading interface

What it costs you

  • Not really an agent — no autonomous task execution
  • No code execution, no voice, no local files
  • Developer API still labeled beta in May 2026
The prescription
§ 09

What you should actually buy

We refused to file one of those "it depends" conclusions because the readers asking this question are buying a subscription this week, not next quarter. So:

  • Ship code or read filings: Claude 4 Opus. Meaningfully ahead, not marginally.
  • One assistant to do everything (write, talk, image, browse): ChatGPT. Pay the hallucination tax with one eye open.
  • Live in Workspace or wrestle very long documents: Gemini 2.5 Pro.
  • Write reported pieces with citations: Perplexity Pro.
  • Cover one specific social network for a living: Grok 3 — and only then.

Two of these vendors will ship new flagships before the calendar reaches August (Anthropic and OpenAI both run public cadences). Our verdicts will move; this article will move with them. Every revision is logged in the changelog below, and the underlying scoring spreadsheet is mirrored at Appendix M.

Changelog

  • Added Operator long-run figure (71%) after additional unattended-run data landed.
  • Gemini Workspace integration upgraded from "promising" to "actually works" after sustained testing.
  • Initial publication.

Sources & appendices

  1. The Long Take reader-panel survey, March 2026 (n = 5,103); methodology published alongside the dataset.
  2. "Frontier Capability Snapshot Q1 2026," Center for Empirical AI Evaluation, March 2026.
  3. Anthropic model card, Claude 4 family release, April 2026.
  4. OpenAI system card, GPT-5o with Operator, April 2026.
  5. Google DeepMind, "Gemini 2.5 Pro technical report," March 2026.
  6. Perplexity Engineering, "Notes on routing," internal blog, February 2026.
  7. xAI release notes, Grok 3, January 2026.
  8. Appendix M — full task catalog, rubric, and per-task scoring matrix (mirrored on our research page).