
Lawyer v large language model (LLM): can AI actually answer Australian legal questions?


I am a legal counsel and IP specialist with expertise in software, machine learning and Web3 technologies. I also have extensive experience with medical and mechanical devices.

The last few years have seen explosive growth in generative AI technology. Large language models (LLMs) have advanced so rapidly that they are already reshaping how many law firms operate. But for the legal profession, a critical question remains: how good are they, really, at being a lawyer?

On paper, the premise is sound. Lawyers spend their days identifying patterns in large volumes of text and generating optimised sentences, precisely the tasks AI excels at. However, legal advice requires judgment, nuance, and critical reasoning, areas where AI has historically struggled.

Until now, assessments of AI's effectiveness in Australian law have rested on anecdote and speculation. To move beyond the hype, an AI Australian Law Benchmark has been developed.

Market-leading LLMs were systematically tested on 30 complex legal questions across 10 practice areas, each set at a level of difficulty that would typically require a competent mid-level lawyer to answer.

Here is what was found, and what it means for your business.

1. No model should be used without expert supervision

The most critical finding from our testing is clear: none of the models tested should be used for Australian legal advice without expert human supervision.

While the tools are impressive, they pose real risks if you do not already know the answer. The models frequently produced answers that were legally incorrect or missed the point entirely, all while expressing their conclusions with falsely inflated confidence.

Strategic impact: If your legal or business teams are using public AI tools to check regulatory requirements or interpret statutes, you are operating in a high-risk zone. These tools require a "human in the loop" who is qualified to verify the output.

2. OpenAI’s GPT-4 takes the lead

Among the general-purpose models tested (publicly available chatbots implementing GPT-4, Gemini 1, Claude 2, Perplexity, and LLaMa 2), there was a clear hierarchy in performance:

  • Top Performer: GPT-4 was the strongest overall performer, followed by Perplexity.

  • The rest: LLaMa 2, Claude 2, and Gemini 1 tested relatively similarly, trailing the leaders.

However, even the "best" performers were not consistently reliable. While they may have a practical role in summarising well-understood areas of law, their output is not robust enough to stand alone.

3. The "infection" problem

A significant issue identified in smaller jurisdictions, such as Australia, is "jurisdictional infection."

LLMs are trained on vast datasets dominated by content from the US, UK, and EU. Our testing revealed that even when explicitly asked to answer from an Australian legal perspective, many models conflated the law. They frequently cited UK or EU authorities or applied legal analysis from those jurisdictions that is simply incorrect in Australia.

Strategic impact: This is a subtle but dangerous error. An answer might look logically sound and cite real legal principles, but if those principles belong to English law rather than Australian law, the advice is worthless and potentially harmful.

4. Hallucinations and citation failures

Poor citation remains a major stumbling block. A reliable lawyer must back up their advice with authoritative sources (legislation and case law). The AI models struggled significantly here:

  • Hallucination: Some tools invented case names entirely.

  • Misattribution: Others named real cases but attributed fictional quotes to them.

  • Laziness: Many models cited entire pieces of legislation without specifying the relevant section, rendering the citation unhelpful.

  • Authority bias: Some models failed to distinguish between binding authority (like a High Court judgment) and non-binding commentary (like a law firm blog post).

Key takeaway

Generative AI is a powerful tool, but it is not yet as competent as a qualified and experienced lawyer.

The ability to generate a succinct, grammatically correct answer is only a fraction of legal practice. The modern Australian lawyer is a strategic adviser, a role that requires critical reasoning and judgment that current public LLMs cannot replicate.

For now, lawyers, IT teams, and procurement departments must ensure safeguards are in place. AI should be viewed as an assistant to be supervised, not a replacement for qualified legal counsel.

Key stakeholders impacted:

  • General Counsel and Legal Teams

  • Chief Information Officers (CIOs)

  • Innovation & Procurement Leads

James Wan & Co
