The Test Design
We submitted an identical set of 50 legal research questions to seven platforms: six specialized legal AI products (Harvey, Lexis+ AI, Thomson Reuters CoCounsel, Casetext, Spellbook in research mode, and Westlaw Edge AI) and a general-purpose GPT-4 deployment with a legal system prompt. Questions spanned federal case law citation, statutory interpretation, current regulatory status, and legal standard articulation.
Hallucination was defined broadly: any citation that does not exist, any holding that mischaracterizes a case, any factual claim that is materially wrong, and any regulatory status that is out of date. We are not publishing the platform-by-platform rankings because the differences at the top are small enough to be within the margin of evaluator disagreement.
Structural Findings
Finding one: specialized legal AI dramatically outperforms general-purpose AI. The GPT-4 deployment scored worst on citation accuracy by a wide margin, with a hallucination rate of approximately 31% on citation-specific questions. The best specialized platforms came in at 3-6%.
Finding two: recency is the most common failure mode. Platforms that rely on static training data rather than live database connections hallucinate most frequently on regulatory status questions, where the law has changed since training.
Finding three: platform confidence does not predict accuracy. The platforms that presented answers with the highest confidence were not the most accurate. In two cases, the most confident platforms were among the least accurate.
The Operational Implication
A hallucination rate that ranges from 3% to 31% means the appropriate verification protocol depends heavily on which platform you are using. A 3% rate might support light-touch verification on routine research. A 31% rate requires independent verification of every citation, at which point the tool no longer delivers a productivity gain.
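To make the gap concrete, the sketch below estimates how likely a work product is to contain at least one hallucinated citation at a given per-citation rate. It is a rough illustration, not part of the test methodology: it assumes each citation fails independently at the same rate, which real research memos may not satisfy, and the function names are hypothetical.

```python
def prob_at_least_one_error(rate: float, n_citations: int) -> float:
    """Probability that at least one of n citations is hallucinated.

    Assumption (not from the test data): each citation is wrong
    independently with the same probability `rate`.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if n_citations < 0:
        raise ValueError("n_citations must be non-negative")
    return 1.0 - (1.0 - rate) ** n_citations


def expected_errors(rate: float, n_citations: int) -> float:
    """Expected number of hallucinated citations among n."""
    return rate * n_citations


if __name__ == "__main__":
    # Rates taken from the findings above; 10 citations is an arbitrary example size.
    for rate in (0.03, 0.31):
        p = prob_at_least_one_error(rate, 10)
        e = expected_errors(rate, 10)
        print(f"rate={rate:.0%}: P(at least one bad citation)={p:.1%}, expected bad={e:.1f}")
```

Under that independence assumption, a ten-citation memo has roughly a one-in-four chance of containing a bad citation at a 3% rate, and is almost certain to contain one at 31%.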
The honest advice: measure the hallucination rate of the specific platform you are using, on the specific task types you use it for. A platform that is excellent at case law citation may be mediocre at regulatory status.
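One way to run that measurement in-house is to log each reviewed answer with its task type and whether a reviewer flagged a hallucination, then compute a rate per task type with an uncertainty range. The following is a minimal sketch under stated assumptions: the input format, the function names, and the choice of a Wilson score interval are illustrative and are not the method used in the test described here.

```python
from collections import defaultdict
from math import sqrt


def wilson_interval(errors: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = errors / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


def rates_by_task(results):
    """Group reviewed answers by task type.

    `results` is an iterable of (task_type, hallucinated) pairs, where
    `hallucinated` is True if a reviewer flagged the answer.
    Returns {task_type: (observed_rate, (low, high))}.
    """
    counts = defaultdict(lambda: [0, 0])  # task_type -> [errors, total]
    for task_type, hallucinated in results:
        counts[task_type][1] += 1
        if hallucinated:
            counts[task_type][0] += 1
    return {
        task: (errors / total, wilson_interval(errors, total))
        for task, (errors, total) in counts.items()
    }
```

With only a handful of reviewed answers per task type, the interval will be wide, so a platform's observed rate on a small sample should be read as a range rather than a point estimate.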