May 04, 2026 | AI and product innovation
Why Legal AI Needs a New Standard: Inside CoCoBench
By Tyler Alexander, Director of CoCounsel AI Reliability
A lawyer submits a filing supported by a citation that doesn’t exist. The system produced a polished answer. It just wasn’t grounded in reality.
This is the gap facing legal AI today. Not whether systems can generate sophisticated answers, but whether those answers are actually good enough for real legal work.
In practice, there is a consistent and measurable gap between how systems perform on traditional benchmarks and how they perform on real legal work.
Most evaluations still rely on benchmarks that were never designed for how legal work actually happens: bar exam questions, clause extraction, single-turn prompts. These tests evaluate discrete components of the work, but they fail to capture how a system performs across the iterative spectrum of tasks that make up real legal work.
As a result, systems are often optimized to perform well on benchmarks that do not reflect how legal work is actually done.
And critically, they fail in ways those benchmarks are not designed to catch. As agentic systems proliferate, those small errors cascade into failures that are more frequent and even harder to identify.
Starting with the work
When we set out to build the next generation of CoCounsel Legal, we didn’t start with models or features. We started with the work itself: what does legal work actually look like in practice?
“This isn’t build first, ask later. It’s ask first, build second,” our teams often reiterate.
CoCounsel Legal has been on the market since August, already supporting legal professionals in research, drafting, and review. But as we looked ahead to the next generation, now in beta, a clear shift emerged. The focus is moving from point-in-time assistance to systems capable of handling longer unaided task horizons and more end-to-end workflows. That shift required us to rethink not only how we build CoCounsel, but how we evaluate it.
From single tasks to work, completed
Through research with hundreds of legal professionals and over 100 Practical Law attorney editors, a consistent pattern emerged. The challenge was not that any single task was too difficult. It was the number of steps required and the effort of keeping them coherent.
Legal work doesn’t happen in isolated prompts. It moves across research, drafting, review, and revision. Context builds, decisions compound, and small errors early can affect everything that follows. That is not what traditional benchmarks are designed to measure.
A different kind of system
The next generation of CoCounsel Legal reflects that shift. A single instruction can now trigger a complete workflow.
Ask it to draft a motion to dismiss. It plans the work, reviews the relevant documents, conducts legal research, pulls secondary sources, and produces a draft grounded in authority, validating the citations behind its conclusions throughout and returning a final output anchored in those facts.
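To make the shape of that workflow concrete, here is a minimal sketch of the orchestration pattern in Python. Every step name and piece of logic below is an illustrative assumption for this sketch, not CoCounsel’s actual implementation.

```python
# Hypothetical sketch of a single-instruction, multi-step legal workflow.
# Step names and logic are illustrative only, not CoCounsel's code.

def plan_work(ctx):
    ctx["plan"] = ["review documents", "research", "draft", "validate"]

def review_documents(ctx):
    ctx["facts"] = ["key fact drawn from the pleadings"]

def conduct_research(ctx):
    ctx["authorities"] = ["Placeholder v. Example, 000 F.3d 000"]

def produce_draft(ctx):
    ctx["draft"] = "Motion to dismiss, citing " + "; ".join(ctx["authorities"])

def validate_citations(ctx):
    # Validate every citation against the research record before returning;
    # an unsupported citation early on would cascade into the final output.
    ctx["unsupported"] = [a for a in ctx["authorities"] if a not in ctx["draft"]]

def run_workflow(instruction: str) -> dict:
    ctx = {"instruction": instruction}
    for step in (plan_work, review_documents, conduct_research,
                 produce_draft, validate_citations):
        step(ctx)  # context accumulates: each step sees all earlier results
    return ctx

result = run_workflow("Draft a motion to dismiss.")
print(result["draft"], result["unsupported"])
```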
That’s not a task. It’s a complete workflow. And it’s exactly where traditional benchmarks break down.
It also raises a different question: how do you comprehensively evaluate something like that?
Introducing CoCoBench
We needed a way to measure performance at the level of real legal work. That’s why we built CoCoBench, a framework designed to evaluate AI systems at exactly that level, and one we are now making more visible externally.
CoCoBench measures whether an AI system can complete real legal tasks to a fiduciary-grade standard. It is built around hundreds of attorney-authored benchmark tasks, with a fixed core dataset used to track performance over time. More than 100 legal subject matter experts have contributed to the legal dataset, alongside research and engineering teams at Labs who developed the evaluation infrastructure, representing over 15,000 hours of practitioner and engineering work.
Each test reflects real practice: a query written the way a practitioner would ask it, supporting materials drawn from representative contracts, pleadings, or correspondence, and a gold-standard response drafted and reviewed by attorneys. This approach is grounded in what we internally refer to as ideal-response evaluation: defining what correct, complete legal work actually looks like and measuring system output against that standard.
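As a rough illustration of that structure, each benchmark task could be represented as a record like the one below. The schema is our own assumption for this sketch, not CoCoBench’s actual format.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkTask:
    """One attorney-authored evaluation task (hypothetical schema)."""
    task_id: str
    category: str                 # e.g. "research", "drafting", "review"
    query: str                    # phrased the way a practitioner would ask it
    supporting_materials: list[str] = field(default_factory=list)
    gold_response: str = ""       # attorney-drafted, attorney-reviewed ideal response

task = BenchmarkTask(
    task_id="drafting-001",
    category="drafting",
    query="Draft a motion to dismiss for failure to state a claim.",
    supporting_materials=["complaint.pdf", "client_correspondence.eml"],
    gold_response="(attorney-authored ideal response)",
)
```

Scoring then compares system output against the gold_response field, which is the ideal-response standard described above.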
The goal is not to measure whether a system can produce a response. It is to measure whether that response, and the sequence of work behind it, constitutes complete, accurate legal work.
Evaluating how the work gets done
Legal workflows are multi-step, which means evaluation cannot stop at the final output. A system can produce a coherent answer even while relying on flawed reasoning, and traditional benchmarks often fail to detect this as a failure mode.
In agentic systems, an error in one step carries forward. A result may appear coherent while being built on an upstream error. CoCoBench addresses this by evaluating the final deliverable alongside the citation record the system produced along the way: what it cited, where it sourced it, and whether the source actually supports the claim.
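A minimal sketch of what auditing that citation trail might look like follows. The supports check here is a crude keyword heuristic standing in for attorney review or an entailment model; none of these names reflect CoCoBench’s internals.

```python
from dataclasses import dataclass

@dataclass
class CitationRecord:
    claim: str        # the proposition the system asserted
    citation: str     # the authority it cited for that proposition
    source_text: str  # the passage it actually retrieved from that authority

def supports(source_text: str, claim: str) -> bool:
    # Stand-in for the real check (attorney review or an entailment model):
    # require that most substantive claim terms appear in the source passage.
    terms = {w.lower().strip(".,;") for w in claim.split() if len(w) > 4}
    hits = {t for t in terms if t in source_text.lower()}
    return bool(terms) and len(hits) >= len(terms) / 2

def audit_citation_trail(records: list[CitationRecord]) -> list[str]:
    # Flag citations that were never sourced, or whose source
    # does not actually support the claim they were attached to.
    return [r.citation for r in records
            if not r.source_text or not supports(r.source_text, r.claim)]
```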
These evaluations span core categories of legal work, including research, drafting, review, and multi-step reasoning across workflows.
A higher standard
Every output is evaluated against what a practicing attorney would consider acceptable. That includes correct application of the law, completeness of analysis, accurate use of sources, and work product that meets fiduciary-grade standards and is usable in practice.
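One way to picture that bar is as a conjunctive rubric, where every criterion must pass. This encoding is hypothetical, not CoCoBench’s actual scoring schema.

```python
from dataclasses import dataclass

@dataclass
class AttorneyReview:
    """Hypothetical pass/fail rubric mirroring the criteria above."""
    correct_application_of_law: bool
    complete_analysis: bool
    accurate_source_use: bool
    usable_in_practice: bool

    def acceptable(self) -> bool:
        # Fiduciary-grade is conjunctive: an output that is almost right,
        # failing any single criterion, does not pass.
        return all([
            self.correct_application_of_law,
            self.complete_analysis,
            self.accurate_source_use,
            self.usable_in_practice,
        ])
```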
No capability is considered ready until it demonstrates improvement against that standard. Progress is measured through real-world performance, evaluated by the attorneys best positioned to judge it.
What we’re seeing so far
In practice, we are seeing a consistent gap between how systems perform on traditional benchmarks and how they perform on real legal tasks. Systems optimized for general-purpose benchmarks often struggle when evaluated against real workflows, revealing gaps in completeness, source fidelity, and multi-step reasoning that are not visible in standard benchmark results.
When evaluation shifts from task-level performance to the workflow level, the bar changes: what counts as good changes, and so does which systems actually meet it.
More detailed findings will be shared as CoCoBench continues to evolve. The direction is clear. Evaluating AI at the level of real legal work changes not only how performance is measured, but what needs to be built.
In the next post in this series, we’ll share what happens when you apply this standard in practice, and how different approaches to legal AI perform when evaluated against real legal work.
Building what comes next
The next generation of CoCounsel Legal, currently in beta, is being built on this foundation. The focus is not on isolated capabilities. It is on helping attorneys complete their work reliably, efficiently, and to a fiduciary-grade standard.
As AI systems take on more of that work, how they are evaluated becomes as important as what they can do, because without the right standard, progress can be overstated.
Because in legal work, almost right is not good enough.