A new benchmark evaluating how large language models perform in real accounting workflows highlights both the rapid progress of AI systems and the limitations that still prevent fully autonomous finance operations.
Released by DualEntry, the benchmark tested 19 AI models across 101 accounting tasks designed to mirror real operational scenarios. The evaluation included activities such as journal entry preparation, accounts payable and receivable processing, reconciliations, and financial reporting tasks—areas increasingly targeted by AI-enabled ERP and finance automation tools.
Among the models tested, OpenAI’s GPT-5.4 achieved the highest score with 77.3% accuracy, outperforming other models evaluated in the study. The findings were detailed in DualEntry’s latest analysis of AI performance in accounting workflows.

Moving Beyond Generic AI Benchmarks
Most existing AI benchmarks focus on general reasoning tasks, academic exams, or synthetic datasets. However, enterprise finance workflows introduce constraints that differ significantly from typical AI testing environments.
Accounting processes require strict adherence to rules, structured data interpretation, and consistent outputs. Even minor inaccuracies can have compliance or audit implications. For this reason, DualEntry designed the benchmark to simulate real operational steps within an accounting workflow rather than isolated questions.
The test environment included standardized charts of accounts, deterministic grading criteria, and repeat runs to measure model consistency. According to the developers, the goal was to evaluate how models perform when executing tasks that resemble those handled by modern finance teams.
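The evaluation setup described above — deterministic pass/fail grading plus repeat runs to measure consistency — can be sketched roughly as follows. This is a minimal illustration, not DualEntry's actual harness; the task format, grading function, and example journal entries are all hypothetical stand-ins.

```python
from collections import Counter

def grade(expected: dict, actual: dict) -> bool:
    """Deterministic grading: the model's journal entry must match the
    expected debit/credit lines exactly -- no partial credit."""
    return expected == actual

def consistency(outputs: list[dict]) -> float:
    """Consistency across repeat runs: the share of runs that agree
    with the most common answer."""
    counts = Counter(frozenset(o.items()) for o in outputs)
    return counts.most_common(1)[0][1] / len(outputs)

# Hypothetical task: one rent payment, graded over three repeat runs.
expected = {"debit:Rent Expense": 1200, "credit:Cash": 1200}
runs = [
    {"debit:Rent Expense": 1200, "credit:Cash": 1200},
    {"debit:Rent Expense": 1200, "credit:Cash": 1200},
    {"debit:Rent Expense": 1200, "credit:Prepaid Rent": 1200},
]
accuracy = sum(grade(expected, r) for r in runs) / len(runs)
```

Exact-match grading is what makes the benchmark deterministic: two graders (or two runs of the grader) can never disagree on whether an answer passed.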
Readers interested in the underlying methodology and model comparisons can review the full benchmark results, which include performance data across all tested models and task categories.
AI Progress — But Not Yet Autonomous Finance
Despite strong performance from the top model, the benchmark also highlights the gap between current AI capabilities and fully automated finance operations.
Even the highest-scoring model failed on more than one in five tasks. Many models struggled with complex accounting scenarios that require multi-step reasoning, contextual interpretation, or the application of financial rules.
These findings reinforce a broader trend in enterprise finance: AI is increasingly effective at augmenting accounting workflows, but human oversight remains essential for validation and compliance.
AI adoption in finance has expanded rapidly in recent years, particularly in areas such as invoice processing, reconciliations, financial reporting, and audit preparation. These processes are well suited to automation because they rely heavily on structured data and clearly defined rules.
However, benchmarks like this suggest that the industry is still in a transitional phase between workflow assistance and end-to-end autonomous finance.
Implications for ERP and Finance Platforms
For ERP vendors and finance technology providers, the results underline the importance of domain-specific evaluation when deploying AI capabilities inside enterprise systems.
Generic AI performance metrics do not necessarily translate into reliable execution in regulated financial environments. Platforms embedding AI into accounting workflows must consider factors such as audit trails, determinism, reproducibility, and error management.
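One way those factors combine in practice is to gate every AI-suggested posting behind a deterministic validation check and write an audit-trail record either way. The sketch below is illustrative only — the entry shape and log format are invented for this example, not any vendor's schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def validated_post(entry: dict, audit_log: list) -> bool:
    """Accept an AI-suggested journal entry only if it balances
    (total debits == total credits), and append an audit-trail
    record whether it is posted or flagged for human review."""
    debits = sum(l["amount"] for l in entry["lines"] if l["side"] == "debit")
    credits = sum(l["amount"] for l in entry["lines"] if l["side"] == "credit")
    ok = debits == credits
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        # Hash of the canonicalized entry, so the logged record is
        # reproducible and tamper-evident.
        "entry_hash": hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest(),
        "balanced": ok,
        "action": "posted" if ok else "flagged_for_review",
    })
    return ok
```

The key design choice is that the validation rule is deterministic code, not another model call, so the same entry always produces the same decision and the same audit record.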
This is particularly relevant as ERP providers increasingly integrate generative and agentic AI features into financial modules, including general ledger automation, reconciliation assistance, and intelligent financial analysis.
Benchmarks grounded in operational workflows could therefore become an important tool for evaluating how AI performs in enterprise finance environments.
A Growing Focus on Domain-Specific AI Evaluation
The emergence of accounting-focused AI benchmarks reflects a broader shift across enterprise technology: evaluating AI systems based on real business workflows rather than abstract tasks.
As AI adoption expands across ERP systems, finance teams, and accounting platforms, organizations will likely demand more transparent measurements of reliability, consistency, and domain-specific performance.
For enterprise finance leaders, the key takeaway is clear: AI capabilities in accounting are improving rapidly, but practical deployment still requires careful governance, validation, and workflow design.
In that context, benchmarks that mirror real accounting operations may play an increasingly important role in guiding how AI is adopted inside the modern finance stack.
ERP News Editorial Team
The ERPNews Editorial Team covers global developments in ERP (Enterprise Resource Planning), enterprise software, cloud platforms, AI, automation, and digital transformation, providing independent news and editorial analysis for senior business and technology leaders. Our reporting focuses on market signals, strategic shifts, and enterprise impact across the ERP and enterprise technology ecosystem.
For editorial inquiries, please contact:
📩 [email protected]











