Ori Katz & Eyal Zamir, 2026
Judges and other decision-makers often face a tension between adhering to rules and taking into account the specific circumstances of the case at hand. Increasingly, such decisions are supported, and may soon even be made, by artificial intelligence, including large language models (LLMs). How LLMs resolve this tension is therefore of paramount importance, yet little is known about how they balance the application of legal rules against considerations of justice. This study compares the decisions of GPT-4o, Claude Sonnet 4, and Gemini 2.5 Flash with those of laypersons and legal professionals, including judges, across six vignette-based experiments comprising about 50,000 decisions.
We find that, unlike humans, LLMs do not balance law and equity: when instructed to follow the law, they largely ignore justice in both their decisions and their reasoning; when instructed to decide based on justice, they disregard legal rules. Moreover, in contrast to humans, requiring them to give reasons or providing them with precedents has little effect on their responses. Prompting LLMs to consider sympathy for the litigants, or asking them to predict judicial decisions rather than make them, somewhat reduces their formalism, but they remain far more rigid than humans.
Beyond their formalism, LLMs exhibit far less variability (“noise”) than humans. While greater consistency is generally a virtue in decision-making, the article discusses its shortcomings as well. The study introduces a methodology for evaluating current and future LLMs in settings where no single demonstrably correct answer exists.
Preprint. The full paper is available on SSRN.