The Benchmark
Contract intelligence platform Ivo published the results of a third-party benchmark conducted in April 2026 comparing its contract review AI, Anthropic's Claude for Word (Opus 4.6), and a practicing Special Counsel at an AmLaw 25 firm on 19 real, anonymized contracts. Outputs were scored in a blind review by three judges — all technology transaction attorneys.
The scores, out of 10: Human attorney: 4.56. Ivo: 4.52. Claude for Word: 3.50. The headline finding: Ivo came within 0.04 points of the human attorney. Claude for Word trailed both by more than a full point.
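The article does not say how the three judges' ratings were combined into a single score per system; a common approach in blind-review benchmarks is to average each judge's ratings across all contracts and then average across judges. A minimal sketch of that aggregation, using invented per-judge numbers purely for illustration (the study's raw data was not published):

```python
from statistics import mean

# Hypothetical 1-10 ratings: one inner list per judge, one entry per contract.
# All numbers below are invented for illustration only.
scores = {
    "human_attorney": [[5, 4, 5], [4, 5, 4], [5, 4, 4]],
    "ivo":            [[5, 4, 4], [4, 5, 4], [4, 4, 5]],
    "claude_word":    [[3, 4, 3], [4, 3, 3], [3, 4, 4]],
}

def aggregate(judge_scores):
    """Average each judge's ratings across contracts, then average the judges."""
    return mean(mean(per_contract) for per_contract in judge_scores)

for system, judge_scores in scores.items():
    print(f"{system}: {aggregate(judge_scores):.2f}")
```

Averaging per judge first, then across judges, keeps one prolific or harsh judge from dominating the composite; averaging all ratings in a single pool is the other common choice and gives the same result only when every judge rates every contract.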
What the Numbers Actually Mean
The absolute scores deserve scrutiny. A 4.56 for a human attorney on a 10-point scale suggests either a very demanding rubric or a very difficult contract set — or both. The study designers noted that the contracts were selected for complexity, not as a representative sample of typical commercial agreements.
Ivo CEO Min-Kyu Jung commented: "We designed this benchmark to change the conversation by putting real tools against real work, judged by real attorneys. What's emerging is not a replacement for lawyers, but a new way to scale high-quality legal work."
The Implications for Procurement
For in-house counsel deciding between a general-purpose AI tool and a purpose-built legal AI platform, the Ivo benchmark offers a data point worth incorporating into the evaluation. The core claim, that purpose-built legal AI outperforms general-purpose AI on legal tasks, is consistent with what practitioners report anecdotally.
The caveat is important: this benchmark was designed and published by Ivo. The blind judging design is credible. But any benchmark produced by a vendor with a commercial interest in the outcome should be interpreted with appropriate skepticism.
The General vs. Specialized AI Debate
The Claude for Word result will not surprise practitioners who have tried both. General-purpose language models — even excellent ones — lack the domain-specific fine-tuning, playbook integration, and training on legal-specific failure modes that purpose-built legal AI tools provide.
The more interesting question is the trajectory. Base models will improve; Claude Opus 5, when released, will almost certainly score higher than 3.50 on the same benchmark. The open question is whether purpose-built legal AI can maintain its quality advantage as the foundation models underpinning both products improve.