How to properly evaluate generative AI technology for productive use by lawyers
The legal industry is currently flooded with GenAI pilot programs, but many of them, like the recent trials conducted by Ashurst, suffer from a critical flaw: they prioritise change management over scientific measurement. Ashurst's approach was excellent for building culture, engaging lawyers with "friendly competition" and "art of the possible" sessions, but it was weak on data rigour. The most glaring issue was a lack of statistical power. The foundation of their quantitative data was a blind study in which a panel of four lawyers reviewed four cases. In statistical terms, an n of 4 renders the findings anecdotal at best; it is impossible to extrapolate reliable estimates of error rates or efficiency gains for a global firm from such a tiny sample. Furthermore, the trial relied heavily on self-selection, recruiting "enthusiastic users" and nominees rather than a random cross-section of the firm. This introduces significant selection bias: tech-forward lawyers are naturally more likely to rate "usability" and "confidence" higher than the average partner would.
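To put the sample-size problem in concrete terms, a standard power calculation shows how far a four-person panel falls short. The sketch below is illustrative only: it assumes a two-arm comparison, a medium effect size (Cohen's d = 0.5) and the conventional significance and power thresholds; none of these figures come from the Ashurst trial.

```python
# A minimal power-analysis sketch using statsmodels, showing why a
# four-lawyer panel cannot support firm-wide conclusions.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Assumption: the firm wants to detect a "medium" efficiency gain
# (Cohen's d = 0.5) at the conventional 5% significance level
# with 80% statistical power.
required_n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"Participants needed per arm: {required_n:.0f}")  # roughly 64

# Conversely, with only n = 4 reviewers per arm, achievable power collapses:
power_at_4 = analysis.power(effect_size=0.5, nobs1=4, alpha=0.05)
print(f"Power with n = 4: {power_at_4:.2f}")  # roughly 0.10
```

Under these assumptions, roughly 64 lawyers per arm are needed; a four-person panel would have only about a one-in-ten chance of detecting a genuine medium-sized effect.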
The methodology also fell victim to the "Hawthorne effect", the phenomenon where participants improve their performance simply because they know they are being observed. Ashurst noted that imposing time limits and creating competition increased engagement, but this likely biased the data. Were lawyers working faster because the GenAI tool was effective, or because they were racing against the clock and their colleagues? Without a control group doing the same work under normal conditions, it is impossible to tell. Finally, the metrics themselves were too subjective. The trial relied on Likert scales (1–5) to capture perceived "accuracy" and "completeness". There was a notable absence of hard measurements, such as keystroke logging, hallucination rates, or precise time-on-task data. Ultimately, this approach tells a firm how its lawyers feel about GenAI, not whether it actually saves money or reduces risk.
To move from "experimentation" to true "validation", future trials must shift to a rigorous A/B testing model, treating the pilot like a clinical trial. Instead of relying on volunteers, a robust design would use stratified random sampling to select 100+ participants, ensuring an even split across practice areas and seniority levels (from junior associates to senior partners). This removes the enthusiast bias. Crucially, the design introduces a concurrent control group: while Group A performs a set of standardised legal tasks (such as contract review or clause generation) using GenAI, Group B performs the same tasks with standard tools. Because both groups are observed under identical conditions, the Hawthorne effect applies to each equally and cancels out of the comparison, enabling a direct, statistically valid measurement of differences in output quality and speed.
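As a concrete illustration, stratified random assignment is simple to implement. The sketch below uses only the Python standard library; the roster fields, practice areas and seniority bands are hypothetical placeholders rather than a prescribed schema.

```python
# A sketch of stratified random assignment into GenAI and control arms.
# All names, practice areas and seniority labels are illustrative assumptions.
import random
from itertools import groupby

def assign_groups(participants, seed=42):
    """Split participants into arms A (GenAI) and B (control), balancing
    the split within each (practice_area, seniority) stratum."""
    rng = random.Random(seed)  # fixed seed so the assignment is auditable
    key = lambda p: (p["practice_area"], p["seniority"])
    assignments = {}
    for _stratum, members in groupby(sorted(participants, key=key), key=key):
        members = list(members)
        rng.shuffle(members)
        half = len(members) // 2  # odd strata give the extra lawyer to control
        for p in members[:half]:
            assignments[p["name"]] = "A (GenAI)"
        for p in members[half:]:
            assignments[p["name"]] = "B (control)"
    return assignments

# Hypothetical roster: each lawyer is tagged by practice area and seniority.
roster = [
    {"name": "Lawyer 1", "practice_area": "M&A", "seniority": "Junior Associate"},
    {"name": "Lawyer 2", "practice_area": "M&A", "seniority": "Junior Associate"},
    {"name": "Lawyer 3", "practice_area": "Disputes", "seniority": "Senior Partner"},
    {"name": "Lawyer 4", "practice_area": "Disputes", "seniority": "Senior Partner"},
]
print(assign_groups(roster))
```

Each (practice area, seniority) stratum is shuffled and split in half, so neither arm ends up over-weighted with, say, tech-keen junior associates.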
The measurement criteria in this improved approach would also shift from sentiment to hard metrics. Rather than asking lawyers whether a draft was "useful", the trial would measure edit distance (Levenshtein distance): how many edits were required to transform the AI output into a client-ready document or presentable advice. If a lawyer has to rewrite 60% of the text, the tool's utility is objectively low, regardless of how "confident" they felt. Additionally, replacing self-reported surveys with background time-motion logging gives a precise calculation of efficiency gains, measured in minutes rather than feelings. Finally, extending the trial to 90 days filters out the novelty hype: by measuring usage in month three, when the initial excitement has faded, the firm can see whether the tool has truly integrated into the workflow. This data-driven approach moves beyond "my eyes are more open" and provides the concrete ROI calculations needed to justify multimillion-dollar investment and procurement decisions.
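To make the edit-distance metric concrete, the sketch below computes a character-level Levenshtein distance and normalises it by the length of the final document; a word- or token-level variant works the same way. The sample strings are invented for illustration, not taken from any trial.

```python
# A sketch of the "edit distance" metric: the share of an AI draft that had
# to change before it was client-ready. Pure standard library.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions,
    substitutions), computed row by row to keep memory at O(len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def rewrite_fraction(ai_draft: str, final_doc: str) -> float:
    """Edits needed per character of the final document: 0.0 means the draft
    was used verbatim; values near 1.0 mean it was effectively rewritten."""
    return levenshtein(ai_draft, final_doc) / max(len(final_doc), 1)

# Illustrative comparison of a hypothetical AI draft against the filed version.
draft = "The supplier shall indemnify the client for any losses."
final = "The Supplier shall indemnify and hold harmless the Client against all losses."
print(f"Rewrite fraction: {rewrite_fraction(draft, final):.0%}")
```

A firm could log this rewrite fraction for every AI-assisted document and compare its distribution in Group A against conventionally produced baselines in Group B, turning "usefulness" into a number rather than a feeling.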

