- If you ship code or wrestle long PDFs: Claude 4 Opus. Not by a hair, by a margin you can feel on Friday.
- If you want one assistant that does almost everything: ChatGPT, knowing it will lie about a number once a day.
- If you live in Google Docs, Sheets and Calendar: Gemini 2.5 Pro. The 2-million-token window is the real deal.
- If you write reported pieces with citations: Perplexity Pro. It is a researcher in a trench coat, not an agent.
- If you stalk a particular social network all day: Grok 3, and pretend the other 90% of the product isn't there.
The cast we lived with
Three years ago the question was whether a chatbot could write a passable best-man speech. By the spring of 2026 it is whether a piece of software should be allowed to read your inbox, rewrite your migration script, and book the wrong return flight to Lisbon on your behalf. We gave nine staffers a single agent each for eleven weeks and a rule: no peeking at the others. At the end of every week they filed a 200-word note about what the tool earned, what it broke, and what they refused to ever let it touch again. This essay is what those notes added up to.
We tested the five products that, together, ate roughly 90% of the consumer agent market last quarter [1]. Enterprise-only suites (Microsoft 365 Copilot, AWS Q, Vertex Enterprise) were deliberately excluded — they are different animals, and they get their own dispatch in July. Every product was tested on the top paid consumer tier as of April 28, 2026.
- Claude (Newsroom favorite). The careful one. Refactors entire repositories without breaking the type checker.
- ChatGPT (Widest feature set). The generalist. Will do anything you ask, occasionally with a confident factual error.
- Gemini (2M context window). The librarian. Reads the whole filing cabinet in one breath, opinions optional.
- Perplexity (Citation discipline). Not really an agent. A research desk that keeps its receipts.
- Grok (Use with verification). The loud one. Wins on real-time social, loses on basically everything else we asked.
Inside our workbench
We assembled a fixed bench of 211 tasks across seven crafts: factual reporting, long-document triage (300+ page filings), greenfield engineering, refactor surgery, multi-step autonomy with browser access, longform editorial revision, and voice/multimodal scenarios. Each task ran three times per agent on staggered days, all outputs were scored blind against a rubric by two staff editors, and ties were broken by a third reader who had not seen the original prompt.
Disclosure: The Long Take has no commercial relationship with the vendors involved. Every subscription was paid out of editorial budget. The full task catalog, rubric and per-task scoring matrix are mirrored at Appendix M; we publish raw scores so other reviewers can replicate our results.
Scoreboard, by craft
Nobody wins outright. We treat margins under 0.3 points as a tie. The category winner is the highest score in each row. Lower is better on the final row (rate of confident factual error across the reporting and document-triage tasks).
| Craft | Claude | ChatGPT | Gemini | Perplexity | Grok |
|---|---|---|---|---|---|
| Reported research | 8.5 | 8.1 | 8.7 | 9.4 | 6.7 |
| Long-document triage | 9.5 | 8.4 | 9.2 | 7.6 | 6.3 |
| Greenfield code | 9.6 | 9.0 | 8.5 | 7.0 | 7.5 |
| Refactor surgery | 9.7 | 8.5 | 8.0 | 6.3 | 6.9 |
| Multi-step autonomy | 8.9 | 9.2 | 7.9 | 7.5 | 6.4 |
| Longform revision | 8.9 | 9.1 | 7.6 | 7.1 | 8.3 |
| Voice & multimodal | 7.9 | 9.4 | 9.0 | 7.8 | 7.0 |
| Confident-error rate (lower wins) | 3.2% | 5.5% | 4.5% | 3.8% | 9.9% |
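A minimal sketch of how the tie rule plays out on the table above, assuming a "tie" means any agent whose score sits less than 0.3 below the best score in that row. The names `category_winners`, `SCORES` and `TIE_MARGIN` are illustrative, and the error-rate row is left out because lower wins there.

```python
# Scores copied from the scoreboard above, in column order.
AGENTS = ["Claude", "ChatGPT", "Gemini", "Perplexity", "Grok"]
SCORES = {
    "Reported research":    [8.5, 8.1, 8.7, 9.4, 6.7],
    "Long-document triage": [9.5, 8.4, 9.2, 7.6, 6.3],
    "Greenfield code":      [9.6, 9.0, 8.5, 7.0, 7.5],
    "Refactor surgery":     [9.7, 8.5, 8.0, 6.3, 6.9],
    "Multi-step autonomy":  [8.9, 9.2, 7.9, 7.5, 6.4],
    "Longform revision":    [8.9, 9.1, 7.6, 7.1, 8.3],
    "Voice & multimodal":   [7.9, 9.4, 9.0, 7.8, 7.0],
}
TIE_MARGIN = 0.3  # assumption: a gap strictly under this counts as a tie


def category_winners(scores, margin=TIE_MARGIN):
    """Return {craft: [agents less than `margin` below the row's best score]}."""
    margin_tenths = round(margin * 10)
    winners = {}
    for craft, row in scores.items():
        # Compare in integer tenths: in floating point, 9.2 - 8.9 comes out
        # slightly under 0.3 and would be miscounted as a tie.
        tenths = [round(score * 10) for score in row]
        best = max(tenths)
        winners[craft] = [
            agent for agent, t in zip(AGENTS, tenths) if best - t < margin_tenths
        ]
    return winners


if __name__ == "__main__":
    for craft, names in category_winners(SCORES).items():
        print(f"{craft}: {' / '.join(names)}")
```

Run on the table, this gives a single winner for every craft except longform revision, where Claude (8.9) and ChatGPT (9.1) fall inside the margin. Long-document triage and multi-step autonomy are decided by exactly 0.3, which is why the comparison is done in whole tenths rather than floats.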
Spec sheet, demystified
Sticker prices conceal as much as they show. Gemini's "free" tier handles more real work than ChatGPT's $20 plan during US peak hours, when Operator throttles every fourth run. Grok's $8 looks generous until you read the data-retention defaults out loud. The numbers below are observed behavior on our test days, not figures from the marketing page.
| | Claude | ChatGPT | Gemini | Perplexity | Grok |
|---|---|---|---|---|---|
| Paid plan / mo | $20 | $20 | $20 | $20 | $8 |
| Free tier usable | Limited | Yes | Yes | Limited | Yes |
| Context | 200K | 128K | 2M | routed | 128K |
| Tool / browser use | Yes | Yes | Yes | Yes | Partial |
| Code execution | Yes | Yes | Yes | No | Partial |
| Voice mode | Beta | Excellent | Strong | No | Partial |
| Image generation | No | Built-in | Built-in | No | Built-in |
| Developer API | Anthropic | OpenAI | Vertex | Beta | xAI |
| Compliance posture | SOC 2 II · GDPR | SOC 2 II · GDPR | SOC 2 II · GDPR | SOC 2 II · GDPR | SOC 2 I |
| Retention default | 30 days | 30 days | 18 months | 30 days | Indefinite |
On Claude, the cautious craftsman
Claude is the agent we kept reaching for when something had to actually work by 5pm. Across the fifty-six engineering tasks it produced fewer regressions than any competitor; on the seventeen long-document briefs, it stayed coherent at page 280 while two of the others had begun to invent quotes by page 90. The reasoning trace, written in a calm internal voice that reads almost like a senior reviewer thinking out loud, is also the single feature that most changed how our engineering desk uses an assistant.
It was the only assistant we trusted with an unattended hour. We came back to a clean pull request, three rejected lint warnings, and a polite question waiting in the chat.
The catch: no native image generation, voice that is still listed as beta on the company's own status page, and web access that occasionally rate-limits during evening hours on the US west coast. The safety tuning is also occasionally over-cautious — Claude refused 3.9% of our intentionally benign edge cases, roughly double ChatGPT's rate. None of these defects cost us a deliverable; together they explain why this is an A−, not an A.
What it earns
- Refactor surgery without breaking the test suite
- Lowest confident-error rate on factual tasks
- Preserves author voice in editorial rewrites
- Reasoning trace you can actually read and trust
What it costs you
- No native image or video generation
- Voice mode visibly behind ChatGPT and Gemini
- Occasional over-cautious refusals on benign prompts
- Message caps on the $20 plan hit sooner than on the OpenAI plan
On ChatGPT, the eager generalist
OpenAI's surface area is bewildering: a voice that interrupts politely, an Operator mode that spins up its own browser, native image and short-form video, and the deepest connector list on the market (Slack, Drive, Notion, Linear, Figma, Stripe). If we had to keep one agent on the dock for the whole desk to share, it would still be this one. ChatGPT does the most things, and most of them work.
The price is reliability. GPT-5o's confident-error rate on numeric tasks (currency math, date arithmetic, cited-figure recall) sat at 5.5% — too high to use unsupervised on finance or compliance work. Operator amplified those errors: a long run chains its own mistakes, and we lost a half-day reconciling an expense report it had cheerfully invented categories for.
What it earns
- The most ambitious agentic feature shipping today
- Voice mode that holds a conversation, not a transcript
- Native image and short video without the round trip
- Largest connector and plugin surface
What it costs you
- Highest confident-error rate of the three top scorers
- Operator compounds errors on long unattended runs
- Throttling on Pro during US weekday afternoons
- UI overhaul still feels unfinished six months in
On Gemini, the quiet librarian
The two-million-token context window is not marketing varnish. We loaded the full SEC 10-K of a mid-cap retailer (about 340 pages), three years of earnings transcripts and the entire proxy statement, and asked twenty cross-document questions. Gemini answered 92% of them correctly. Claude, forced through chunking into its 200K window, managed 81% on the same questions. That gap matters if your job is reading filings.
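For readers unfamiliar with the term, chunking means splitting a corpus into pieces that each fit the model's window. The sketch below shows one generic way to do it with a small overlap between neighbors. It is an assumption-laden illustration, not the procedure used in this test: `chunk_tokens` is a hypothetical helper, and tokens are modeled as plain list items rather than real model tokens.

```python
def chunk_tokens(tokens, window, overlap=0):
    """Split `tokens` into chunks of at most `window` items.

    Each chunk after the first repeats the last `overlap` items of the
    previous chunk, so a passage cut at a boundary appears whole in one of them.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this chunk already reaches the end of the input
    return chunks


if __name__ == "__main__":
    # Toy input: ten "tokens", a window of four, one token of overlap.
    print(chunk_tokens(list(range(10)), window=4, overlap=1))
    # → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Whatever the splitting strategy, a cross-document question then has to be answered from pieces seen separately and stitched back together, which is the handicap a window large enough to hold everything avoids.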
Workspace integration is the other under-told story. Asking Gemini to draft a reply that cites your last three Drive folders and last week's calendar invites actually works — no copy-paste choreography. The trade is personality. Gemini writes like a competent assistant who has decided not to bother you with opinions, and the resulting copy is usable but rarely memorable.
What it earns
- 2M-token context that handles real filings end-to-end
- Best free tier of the five
- Native Workspace integration that actually works
- Lowest median latency in our trials
What it costs you
- Prose with the personality of a settings panel
- Refactor results trail Claude and ChatGPT
- 18-month default retention raises eyebrows in security review
On Grok, the loud minor
Grok shines in exactly one corner: real-time discussion on its parent network. It summarizes a breaking thread with attribution faster than any other agent in the test — it is, in effect, watching the firehose. Outside that one trick, the model lags. Refactor quality is mediocre, the confident-error rate is the worst in the group at 9.9%, and the irreverent voice that was a novelty in 2024 now reads as careless.
At $8 a month it is the cheapest paid option here, and for a journalist tracking real-time events or a marketer monitoring a launch, that price is defensible. For everyone else, the cheapness reflects the gap.
What it earns
- Real-time access to its native social firehose
- Cheapest paid plan in the comparison
- Surprisingly capable on casual prose
What it costs you
- Worst confident-error rate measured
- Weak on engineering and structured reasoning
- Indefinite default data retention
- Tool use feels grafted on, not designed in
On Perplexity, the receipts desk
Calling Perplexity an "AI agent" stretches the word. It is a research front-end that routes every question to whichever frontier model handles that question best, and footnotes its answer with the documents it consulted. But it is so good at the one thing — citing its sources — that we kept it in the bench. On reported factual queries with cited evidence, Perplexity scored 9.4, higher than any of the general-purpose models.
Past research, it disappoints. No code execution, no autonomous browsing of your laptop, no voice. If your job involves browser tabs, footnotes, and writing things up, it pays for itself. If you need an assistant to also fix a YAML file, look elsewhere.
What it earns
- Best-in-class citation hygiene
- Routes to multiple frontier models per query
- Clean, distraction-free reading interface
What it costs you
- Not really an agent — no autonomous task execution
- No code execution, no voice, no local files
- Developer API still labeled beta in May 2026
What you should actually buy
We refused to file one of those "it depends" conclusions because the readers asking this question are buying a subscription this week, not next quarter. So:
- Ship code or read filings: Claude 4 Opus. Meaningfully ahead, not marginally.
- One assistant to do everything (write, talk, image, browse): ChatGPT. Pay the hallucination tax with one eye open.
- Live in Workspace or wrestle very long documents: Gemini 2.5 Pro.
- Write reported pieces with citations: Perplexity Pro.
- Cover one specific social network for a living: Grok 3 — and only then.
Two of these vendors will ship new flagships before the calendar reaches August (Anthropic and OpenAI both run public cadences). Our verdicts will move; this article will move with them. Every revision is logged in the changelog below, and the underlying scoring spreadsheet is mirrored at Appendix M.
Changelog
- Added Operator long-run figure (71%) after additional unattended-run data landed.
- Gemini Workspace integration upgraded from "promising" to "actually works" after sustained testing.
- Initial publication.
Sources & appendices
- [1] The Long Take reader-panel survey, March 2026 (n = 5,103); methodology published alongside the dataset.
- [2] "Frontier Capability Snapshot Q1 2026," Center for Empirical AI Evaluation, March 2026.
- [3] Anthropic model card, Claude 4 family release, April 2026.
- [4] OpenAI system card, GPT-5o with Operator, April 2026.
- [5] Google DeepMind, "Gemini 2.5 Pro technical report," March 2026.
- [6] Perplexity Engineering, "Notes on routing," internal blog, February 2026.
- [7] xAI release notes, Grok 3, January 2026.
- Appendix M — full task catalog, rubric, and per-task scoring matrix (mirrored on our research page).