Legal Archives - Thomson Reuters Institute https://blogs.thomsonreuters.com/en-us/innovation-topics/legal/ Thomson Reuters Institute is a blog from Thomson Reuters, the intelligence, technology and human expertise you need to find trusted answers.

From Legal AI Experiments to Execution: What CoCounsel Changes /en-us/posts/innovation/from-legal-ai-experiments-to-execution-what-cocounsel-changes/ Tue, 10 Mar 2026 11:59:27 +0000

Legal AI has moved beyond the stage where it can be considered an experiment.

We have watched AI transition from pilots and into the core of real legal work that is relied on, billed, filed, and defended. As this shift occurs, the AI conversation changes. It is no longer about curiosity or efficiency; it becomes about whether the system is truly dependable when the stakes are high.

That’s why the fact that more than one million professionals now use CoCounsel marks such an inflection point. This is not simply a vanity metric, but rather evidence that AI is being trusted with real workflows. Firms are no longer just testing solutions; they’re reshaping how work moves through their organizations, how expertise scales, and what clients experience as value. The gap between law firms that embrace that model and firms that don’t is already apparent, and it’s not going to close.

What’s becoming clear is that not all AI belongs in legal work.

The AI Line That Matters

Speed and fluency are easy to demo. They’re also the wrong bar. Legal professionals don’t operate in a world where “close enough” is acceptable. They operate in a world where answers have consequences for their clients, their firms, and the justice system itself.

General‑purpose AI is designed to sound plausible across almost anything. That’s impressive, but it’s also exactly the risk. In legal work, plausibility without grounding is a liability. Vertical AI startups focus on a specific domain, but are missing critical components including proprietary content depth, trusted workflows and enterprise-grade infrastructure.

This is where the difference between interesting AI and reliable AI shows up. At Thomson Reuters, we use the term fiduciary‑grade AI very intentionally. It means the system is built for environments where accuracy, accountability, and trust aren’t optional. Where sources matter. Where outputs need to be explainable. Where professionals have to stand behind the work. That’s the standard we built CoCounsel to meet. CoCounsel Legal is grounded in authoritative legal content refined over decades, not scraped public data. It draws directly from trusted sources like Westlaw and Practical Law, and it can connect to a law firm’s own knowledge and documents. Its outputs are transparent and citation‑backed because that’s what professionals need in order to rely on the result.

Equally important, the system itself is shaped and validated by domain experts who understand how legal work must be done, and what standards it must meet. Customer data is protected by design, not retrofitted through policy. Governance, accountability and oversight are built into the architecture.

CoCounsel Legal Reimagined: Built for How Legal Work Really Happens

The next generation of CoCounsel Legal reflects a simple belief: lawyers shouldn’t have to adapt themselves to AI.

For too long, legal AI has asked professionals to manage the system and choose solutions, craft prompts, stitch together workflows, and move work between platforms. That’s busywork. A lawyer’s value is not in learning how to operate software. It is in judgment, experience and strategy.

CoCounsel Legal removes that burden entirely.

With CoCounsel Legal, legal professionals describe what they need in plain language. The system determines how to get there by pulling the right sources, analyzing the relevant documents, applying jurisdiction‑specific guidance, and delivering work product that’s ready to be reviewed and used. Tasks that once took hours across multiple systems can now happen in a single workflow.

This is not an interface upgrade. It is a re-architecture of legal work execution.

“Lawyers don’t want to just operate software, and that’s not what great AI should do. CoCounsel keeps them in the analytical mindset they were trained for: going back and forth, challenging answers, and steering the work. With sourcing directly from Westlaw and Practical Law, they’re not wasting time second-guessing the results. We’re seeing adoption from associates to partners across every practice area. When it spreads that quickly, the experience just works.”

– Andrew P. Medeiros, Managing Director of Innovation, Troutman Pepper Locke

Trust Is What Turns AI into Infrastructure

What we hear consistently from customers reflects that shift. They’re no longer asking whether to use AI. They’re asking which AI they’re willing to trust when the work actually matters.

Trust is what allows AI to move from the edges of practice into daily execution. It’s what enables law firms to embed AI into workflows that carry real legal, financial and reputational risk. And it is why one million professionals across more than 100 countries and territories now have access to CoCounsel.

That milestone isn’t the destination. It’s a signal that the profession is converging on a new operating model where AI becomes infrastructure, not novelty, and where advantage compounds for law firms that adopt dependable systems early.

The Future Is Already Taking Shape

The next chapter of legal work is not theoretical. It is already taking shape inside law firms that are using fiduciary-grade AI every day and moving faster, reducing friction and focusing more of their time on judgment rather than mechanics.

That’s the shift underway. And CoCounsel Legal is built for it.

Sign Up Now

The new CoCounsel Legal is entering beta in the United States soon, with general availability planned for later this year. More regions and territories will follow.

Sign up for early beta access to the fully reimagined CoCounsel Legal experience.

Fewer Steps. Bigger Outcomes. A fundamentally better way to practice law.

CoCounsel Monthly Insider: Sharpening Your Competitive Edge /en-us/posts/innovation/cocounsel-monthly-insider-sharpening-your-competitive-edge-oct-2025/ Tue, 14 Oct 2025 15:38:15 +0000 Our commitment to innovation continues through the latest enhancements to CoCounsel Legal and additional solutions. These updates optimize workflows, deliver more comprehensive insights, and strengthen your competitive advantage. This month, we are introducing several integrations and new features aimed at providing a seamless and cohesive workflow experience, led by our beta of Deep Research in Practical Law, the integration of HighQ and CoCounsel, and the expanded offering of CoCounsel in French, German, Portuguese, Spanish and Japanese.

Deep Research in Practical Law (beta)

Legal research is consistently one of the most important AI use cases for legal professionals. CoCounsel’s Deep Research is being built on authoritative Thomson Reuters content in Westlaw and Practical Law, giving lawyers the power to tackle complex, multi-step research through an agentic experience that is built on the legal authority they use and trust every day.

Deep Research on Practical Law, currently in beta with select customers, is a significant advancement toward the comprehensive, trusted, and seamless CoCounsel Legal research solution of the future. Deep Research on Practical Law plans the research steps, retrieves the most relevant guidance and templates from Practical Law, and presents clear, supported conclusions. It adapts as follow-up questions are asked, enabling deeper, more nuanced analysis.

This streamlined approach saves time, reduces friction, and builds confidence in the resulting work product. Built on Practical Law, the leading resource for legal know-how content, Deep Research complements Westlaw’s primary-law expertise and supports the evolving needs of legal professionals.

Deep Research on Practical Law

 

HighQ and CoCounsel integration

With GenAI from Thomson Reuters CoCounsel embedded, HighQ users can maximize and build on AI-generated insights and outputs through new waves of integrations that bring AI directly into their workflows and deliver greater client service.

Document Insights

This integration embeds CoCounsel’s document review and summarize skills directly into HighQ, enabling users to understand documents faster, gain critical insights, and pinpoint and extract information at the point of need.

HighQ Document Insights with CoCounsel

 

Integration with CoCounsel drafting capabilities

Within their HighQ workflow, users can seamlessly access drafting capabilities in CoCounsel to review a document, edit, redline it against a playbook and more. The new integration allows users to leverage their documents in HighQ and eliminate versioning risks and manual uploads, saving significant time on drafting and review tasks.

 

 

Integration with CoCounsel drafting capabilities

 

Self-Service Q&A

A new AI-powered chat experience within modernized HighQ dashboards leverages CoCounsel’s “search a database” skill. Clients and stakeholders can now ask natural-language questions about curated document sets and receive summarized, highly relevant answers in minutes, transforming static repositories into dynamic knowledge hubs.

Streamlined workflows, connected experiences and performance improvements

Practice area specific prompts in the CoCounsel Library

New expert-crafted prompts by Practical Law editors are now available directly in the CoCounsel Library, providing a precise starting point for tasks across various practice areas including criminal law, personal injury, data privacy, litigation, and more. Developed by legal experts, these prompts ensure accuracy and efficiency, helping move work forward faster.

CoCounsel Library with practice-area-specific prompts

 

Expanded CoCounsel Library accessibility

As the CoCounsel Library becomes even more accessible, users will be able to access CoCounsel directly from Westlaw and Practical Law in the U.S., and Microsoft Word and Outlook internationally. This reduces context switching and accelerates workflow with faster start times, improved document analysis, and quick access to frequently used prompts and skills without switching platforms.

Syncly Google Drive connector for CoCounsel Legal

For law firms using Google Drive, CoCounsel Legal now connects directly via Syncly. The connection allows users to upload documents from Google Drive into CoCounsel while preserving metadata and permissions. This eliminates manual downloads and uploads, ensuring you’re always working with the most current, auditable document version.

Syncly Google Drive connector for CoCounsel Legal

 

AI Overview for Statutes Compare

When comparing two versions of a statute, you can choose to receive an AI Overview of the substantive changes. This intelligent feature filters out minor formatting adjustments and highlights the changes that truly matter, saving you valuable time in reviewing lengthy statutory documents.

Moving Faster, Working Smarter

Platform updates led to a 97% reduction in CoCounsel’s average file download time. Additionally, the average time required to complete tasks using skills rooted in our industry-leading content improved by 31%, reflecting AI performance gains on features including Ask Practical Law AI, Summarize Documents, and users’ open documents within CoCounsel.

In addition, through testing and optimization efforts, we have improved the quality and reliability of the draft skill in CoCounsel by migrating the underlying LLM to GPT-5. Users should experience reduced application startup times and higher-precision output, enabling them to complete first drafts more completely and accurately.

CoCounsel grows internationally adding multiple languages

CoCounsel is expanding its footprint internationally, adding new languages including French, German, Portuguese, Spanish and Japanese. The professional-grade legal AI assistant will be available in France, Benelux/Brussels, Luxembourg and Quebec (French), Germany, Austria and Switzerland (German), Brazil (Portuguese), Argentina (Spanish), and Japan (Japanese), allowing more legal professionals to benefit from CoCounsel’s capabilities.

CoCounsel is also available in the U.S., UK, Canada, New Zealand, Hong Kong, Southeast Asia and United Arab Emirates.

A deeper dive into Deep Research on Westlaw Advantage is also available.

Sign up to stay abreast of newly added features, monthly releases, and more.

Thomson Reuters Best Practices for Benchmarking AI for Legal Research /en-us/posts/innovation/thomson-reuters-best-practices-for-benchmarking-ai-for-legal-research/ Wed, 12 Feb 2025 15:38:20 +0000 At Thomson Reuters, we do an enormous amount of AI testing in our efforts to improve our customers’ ability to move through legal work faster and more effectively. We’ve noticed an increase in interest in AI testing generally, and in benchmarking AI applications for legal research specifically. We’ve learned a lot in our thousands of hours of AI testing, and we offer the following best practices for those interested in an updated or differentiated approach to testing or benchmarking AI for legal research.

1. Test for the results you care about most.

This may seem obvious, but we’ve seen a lot of confusion about it, and if we could make only one recommendation, this would be it. It’s foundational for all the other recommendations.

If you cared most about determining how long it takes to drive from one place to another, you wouldn’t just measure highway time; you’d measure total door-to-door time. Likewise, if you cared most about total car maintenance costs, you wouldn’t measure only the cost and frequency of brake repairs.

When using AI for legal research, no LLM or LLM-based solution offers 100% accuracy. Because of that, all answers generated by large language models or LLM-based solutions, even those using Retrieval Augmented Generation (RAG), must be independently verified.

Some assume verification is a simple matter of checking the sources cited in an AI answer, but this is incorrect. We’ve seen plenty of examples where an AI-generated answer is wrong, and the cited sources simply corroborate the wrong answer. Verification requires using additional tools (like a citator, statute annotations, etc.) to ensure the answer is correct.

This means every time an AI-generated answer is used for research, there is a three-step process the researcher must engage in: (1) review the answer, (2) review the cited material from the answer, (3) use traditional research tools to make sure the answer and cited material are correct.

When we talk with researchers about research generally and this process specifically, what they care about most is (a) getting to a correct answer or understanding of the relevant law, and (b) the time it takes to get to that correct answer or understanding.

Because of this, the two most important measures are:

  • Percentage of times using this three-step process the user can get to the right answer, and
  • Time it takes to complete all three steps

Surprisingly, the percentage of errors in answers at step 1 can have very little impact on the percentage of correct answers the researcher reaches using all three steps, or on the time to complete those steps (unless errors are excessive), as long as citations and links to primary law are good and those primary resources are current and easily verified. Focusing on step one alone is like trying to figure out door-to-door times by measuring highway speeds only. It’s not very useful.

For instance, which of the following systems would you rather use?

  • System where the initial AI answer is 92% accurate, but verification, on average, takes 18 minutes, and post-verification accuracy is 97%, or
  • System where the initial AI answer is 89% accurate, but verification, on average, takes 10 minutes, and post-verification accuracy is 99.9%

It’s a clear choice, but there is often a misplaced focus on measurement of the first step in the process to the exclusion of steps two and three. Measure what you care about most.
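One way to make that trade-off concrete is to estimate the expected verification time spent per verified-correct answer, i.e., verification time divided by post-verification accuracy. This is an illustrative sketch using the figures above, not a metric from the article:

```python
def expected_time_per_correct_answer(verify_minutes: float, post_accuracy: float) -> float:
    """Average minutes of verification effort per answer that ends up
    correct after the full three-step review-verify-confirm process."""
    return verify_minutes / post_accuracy

# System 1: 18 min verification, 97% post-verification accuracy
t1 = expected_time_per_correct_answer(18, 0.97)
# System 2: 10 min verification, 99.9% post-verification accuracy
t2 = expected_time_per_correct_answer(10, 0.999)

print(round(t1, 1))  # 18.6 minutes per correct answer
print(round(t2, 1))  # 10.0 minutes per correct answer
```

The second system wins on both time and end-to-end accuracy, even though its initial answers are less accurate, which is the point of measuring the whole process rather than step one alone.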

2. Use realistic, representative questions in your testing.

Presumably you want to evaluate AI for the typical legal research you or your organization does. For instance, if you look at the research your organization does and find the questions are roughly 20% simple questions, 60% medium complexity, and 20% very complex or difficult, and that roughly half are questions about IP law and half are about federal civil procedure, then a benchmark testing 90% simple questions about criminal law would not be very helpful to you.

At ¶¶ŇőłÉÄę, we model our testing based on the real-world questions we see from our customers every month. For your own testing, focus on the question types that best represent the researchers you’re focused on.

Testing mostly simple questions with clear-cut answers is easiest, but if those question types don’t represent what your users do most (they don’t represent most AI usage in Westlaw well), then the results are not particularly helpful. Similarly, overly complex, extremely difficult, nuanced, or trick questions can be useful for probing the limits of a system, but they tend not to be very helpful for most real-world decision making.

3. Test a lot of questions.

In our own testing, we’ve found that testing small sets of questions is rarely representative of actual performance with a larger set. Large language models can generate different responses each time, even with identical inputs. Additionally, if responses are long and complex, graders may disagree, even when judging identical responses. For just a quick general sense of direction, it’s fine to test with a sample of questions as small as 100 or so, but for comparing algorithms/LLMs against each other, we strongly recommend checking the results as you grade and testing until the measure of interest stabilizes. For example, if you are running a comparison between two systems to see which is preferred, you would test until the rate at which one system is preferred over the other stops changing dramatically with each new batch of questions. Another guide to the number of questions you should test is the confidence level and interval you want (see next section).
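The "test until the measure stabilizes" approach can be sketched as a simple loop. Here the 60% true preference rate, batch size, and stopping tolerance are all made-up illustration values, with simulated grades standing in for real head-to-head comparisons:

```python
import random

random.seed(42)

TRUE_PREFERENCE = 0.60      # hypothetical: system A preferred 60% of the time
BATCH_SIZE = 50             # questions graded per batch
STABILITY_TOLERANCE = 0.02  # stop when the estimate moves < 2 points per batch

wins, total = 0, 0
previous_rate = None
while True:
    # Grade one more batch of head-to-head comparisons (simulated here)
    wins += sum(random.random() < TRUE_PREFERENCE for _ in range(BATCH_SIZE))
    total += BATCH_SIZE
    rate = wins / total
    # Keep testing until a new batch no longer shifts the estimate much
    if previous_rate is not None and abs(rate - previous_rate) < STABILITY_TOLERANCE:
        break
    previous_rate = rate

print(f"preference rate stabilized near {rate:.2f} after {total} comparisons")
```

A real evaluation would also want a minimum number of batches before allowing the loop to stop, since a small sample can look stable by luck.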

4. Calculate and report confidence levels and intervals.

Even with a relatively large set of questions, measurements of accuracy are only so precise. When using these measurements to make decisions, it’s important to understand the degree or range of accuracy of the measurement, often expressed as a confidence level and confidence interval. You can think of confidence intervals and levels like the margin of error in surveys: they let you know how reliable or repeatable the measurement is expected to be.

For instance, if you tested AI accuracy based on 200 questions and then ran the test again with the same questions and answers but different evaluators, or with the same evaluators but a different random, representative sample of 200 questions, would you expect the exact same result? Typically, you wouldn’t. You’d expect the result to fall within a certain range, so it’s important to report that range along with the results, so decision makers understand which differences between algorithms/LLMs are meaningful and which are not. The proper way to report this is with confidence intervals and levels. Using standard assumptions, when measuring an error rate of 10% from a sample of only 100 questions, you can be about 95% confident that the true error rate is between 4.1% and 15.9%. This is called a 95% confidence level, and the “+/- 5.9%” is the margin of error. If you measure an error rate of 10% from a sample of 500 questions, the 95% confidence interval would be between 7.4% and 12.6%, or 10% +/- 2.6%.

The basic power analysis to estimate a confidence interval assumes a perfect means of detecting the outcome you are trying to measure. If there is some uncertainty in that detection, e.g., if two independent evaluators disagree about the outcome some percentage of the time, then the margin of error increases. A grading process or measurement that’s unreliable ~5% of the time might increase the margin of error from 5.9% to 7.3% in our example above with 100 questions. It’s important to note that there are various methods for calculating standard error, and these examples make simplifying assumptions that likely underestimate the confidence intervals observed in practice.
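The margins of error quoted above follow from the standard normal approximation for a binomial proportion. The grader-noise adjustment below is one simple way to reproduce the ~7.3% figure, treating grader unreliability as independent added variance (that modeling choice is our assumption; the article does not show its formula):

```python
import math

Z95 = 1.96  # z-score for a 95% confidence level

def margin_of_error(p: float, n: int) -> float:
    """Normal-approximation margin of error for an observed rate p over n questions."""
    return Z95 * math.sqrt(p * (1 - p) / n)

# A 10% error rate measured on 100 vs. 500 questions
m100 = margin_of_error(0.10, 100)
m500 = margin_of_error(0.10, 500)
print(round(m100 * 100, 1))  # 5.9  -> CI of 4.1% to 15.9%
print(round(m500 * 100, 1))  # 2.6  -> CI of 7.4% to 12.6%

def margin_with_grader_noise(p: float, g: float, n: int) -> float:
    """Margin of error when grading itself is wrong with probability g,
    modeled as independent variance added to the measurement."""
    return Z95 * math.sqrt((p * (1 - p) + g * (1 - g)) / n)

# ~5% unreliable grading widens the 100-question margin from 5.9% to ~7.3%
print(round(margin_with_grader_noise(0.10, 0.05, 100) * 100, 1))  # 7.3
```

Wider intervals for smaller samples are exactly why comparing two systems on 100 questions often cannot distinguish them: a 3-point accuracy gap is well inside a +/- 5.9% margin.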

5. Use a combination of automated and manual evaluation efforts.

Having human evaluators pore through lengthy answers to complex questions can be difficult and time-consuming. Ideally, we would just have AI evaluate the accuracy and quality of answers generated by AI. This is sometimes referred to as LLM-as-judge. But in the same way AI makes mistakes when generating an answer, it can also make mistakes when evaluating the quality of an answer against a gold-standard answer written by a human. In our experience, modern LLMs are pretty good at evaluating AI-generated answers against gold-standard answers when answers are clear and relatively short. As length and complexity grow, we’ve found the LLM-as-judge approach to be very unreliable.

For instance, research has shown that LLMs tend to struggle when evaluating responses to complex and challenging questions, like those requiring expert knowledge, reasoning, and math.

Since most test sets will contain a sample of simple/easy/clear questions and answers, it makes sense to use AI for automated evaluation of these, then use human evaluators for the rest, at least until AI improves to the point where more can be automated.

6. For human grading, use two separate human evaluators for each answer, and have a third (ideally more experienced) evaluator to resolve conflicts.

For assessments like these, inter-rater disagreement can be a real issue. In our own testing, we’ve found attorneys evaluating AI-generated answers to more complex legal research questions can disagree about the accuracy or quality of answers about 25% of the time, which makes single-grader evaluation unreliable. To improve reliability, we have two evaluators separately grade each answer, and where there are conflicts, a third, more experienced evaluator resolves the conflict.
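The two-grader-plus-tiebreaker workflow described above can be sketched as a small function; the grade labels here are hypothetical:

```python
from typing import Callable

def adjudicate(grade_a: str, grade_b: str, tiebreaker: Callable[[], str]) -> str:
    """Return the agreed grade, escalating to a third, more experienced
    evaluator only when the two independent graders disagree."""
    if grade_a == grade_b:
        return grade_a
    return tiebreaker()

# Agreement between the two graders: no escalation needed
print(adjudicate("accurate", "accurate", lambda: "accurate"))    # accurate
# Conflict: the senior evaluator's judgment decides
print(adjudicate("accurate", "inaccurate", lambda: "inaccurate"))  # inaccurate
```

A useful side effect of this design is that the escalation rate itself (how often the tiebreaker is invoked) directly measures the ~25% disagreement rate mentioned above.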

7. When answers are wrong, investigate to see if the gold-standard answer might be wrong.

In the same way people make mistakes in evaluating answers, they can also make mistakes in coming up with the gold-standard answer for testing. In our experience, we’ve found some instances where the AI-generated answer was evaluated as incorrect when compared to the gold-standard answer, but when we dug into it further, it turned out the AI was correct and the person who put together the gold-standard answer was wrong. Sometimes AI makes mistakes and sometimes humans make mistakes – you should check both.

8. If evaluating multiple algorithms/LLMs/solutions, make sure the evaluators are blind to which algorithm/LLM/solution the answer was generated by.

In our evaluations we try to avoid human bias in grading. Sometimes an evaluator has had bad experiences or great experiences with a certain product or LLM in the past, and we don’t want them to bring that bias to the current evaluation, so when evaluating different solutions, we first strip away anything that would identify the source of the solution, so results are not biased by past positive or negative experiences.

9. Grade the value of answers in addition to making a binary determination of whether the answer has an error.

What’s right or wrong in an answer can vary enormously in terms of positive value and negative impact. For instance, consider the following answers:

A. Answer is correct in every way but is short and high level. It just gives a basic description of the legal issue as it relates to the question but doesn’t provide any references to primary or secondary law for verification, nor any nuance regarding exceptions or other considerations.

B. Answer is lengthy and nuanced, addressing multiple aspects of the question and discussing important exceptions that might apply, and it provides references with citations and links for verification, and it’s correct in every way except in one of the citations, the date is incorrect, but that’s easily verified and corrected when clicking the link from the citation.

C. Answer is incorrect in every way and all its linked references point to primary law that simply corroborate the wrong answer.

If the evaluation is simply a binary view of the number of answers that contain an error, then answer A looks good and answers B and C look equally bad. In reality, answer C is far worse and more harmful than answer B, and Answer B is likely much more valuable to the researcher than answer A.

In our evaluations, we’re looking for answer attributes that are helpful to researchers, like depth of the answer and quality of the references, and we don’t just evaluate errors in a binary way. We consider answers that are totally wrong to be far worse than answers with erroneous statements in otherwise correct and helpful answers. Similarly, we consider erroneous statements in answers based on whether they address the core questions or are tangential to it, and whether they’re contradicted in the answer or easily verified with the linked references. We’d like to eradicate all errors, of course, but some are more harmful than others.

10. Look for errors beyond gold-standard answers.

Often LLMs generate answers with information beyond the scope of a gold-standard answer. For instance, the gold-standard answer might specify that the correct answer to the question is no, that the explanation should cover X, Y, and Z, and that it should specifically cite cases A & B and statute C.

The LLM-generated answer might state the answer is no and explain X, Y, and Z with references to A, B, and C, but it might also add a few statements about exceptions or related issues or an additional case or statute. Sometimes these additional statements are incorrect, even when everything else is correct. So, if an LLM-as-judge or human evaluator only looks at the gold-standard answer to see if the AI-generated answer is correct, that evaluation can miss errors in the additional material. This means evaluators need to do independent research beyond simply looking at the gold-standard answers to determine if an answer has an error.

11. Consider testing reliability.

LLMs often have some randomness built into them. Many have a temperature setting that can be used to minimize or eliminate this, making answers more consistent when asking the same question multiple times.

But some LLMs are better at this than others, and some integrated solutions that use LLMs in conjunction with other techniques, like RAG, don’t set temperature low to allow for more creativity in answers.

For big decisions you might be making, consider testing reliability by running the same question 20 times and seeing if any of the answers are substantially worse than the other answers to the same question.
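A reliability check like the one above can be as simple as asking the same question repeatedly and measuring how often the answers agree. In this sketch, `ask_system` is a hypothetical stub for whatever solution is under test, hard-coded so the example runs; a real check would call the actual product each time and compare answers for substantive differences, not just exact matches:

```python
from collections import Counter

def ask_system(question: str, run: int) -> str:
    """Hypothetical stand-in for the AI system under test.
    Simulates occasional inconsistency: every 10th run answers differently."""
    return "answer-B" if run % 10 == 9 else "answer-A"

RUNS = 20  # the article suggests running the same question ~20 times
answers = [ask_system("Is the limitations period tolled by filing?", i)
           for i in range(RUNS)]

most_common, count = Counter(answers).most_common(1)[0]
consistency = count / RUNS
print(f"most common answer returned {consistency:.0%} of the time")  # 90%
```

Low consistency is not automatically disqualifying, but it tells you how many of the repeated answers need individual review to spot any that are substantially worse than the rest.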

The above are our best practices and learnings from our extensive experience with AI, generative AI, and LLMs over the past 30 years. At Thomson Reuters we put the customer at the heart of every decision we make, and we are transparent that, at the point of use, all our AI-generated answers must be checked by a human.

As we test our AI products, our teams do not follow every one of these steps for every test; sometimes we prioritize speed over testing rigor, or vice versa, but we make sure we clearly understand the trade-offs in prioritizing some steps over others and communicate them across our teams. The bigger and more important the decision we’re trying to make, the more of these steps we follow.

This is a guest post from Mike Dahn, head of Westlaw Product, and Dasha Herrmannova, senior applied scientist, from Thomson Reuters.
