Towards a Benchmark for Causal Business Process Reasoning with LLMs
Authors: Fabiana Fournier, Lior Limonad, Inna Skarbovsky
Year: 2024
Source:
https://arxiv.org/abs/2406.05506
TLDR:
The paper by Fabiana Fournier, Lior Limonad, and Inna Skarbovsky of IBM Research, Haifa, Israel, introduces a benchmark for evaluating the ability of Large Language Models (LLMs) to reason about causally-augmented business processes (BP^C). The authors argue that although LLMs are increasingly applied across many tasks, their capacity for complex cognitive work such as reasoning and decision-making over business processes is not yet well understood. The proposed benchmark comprises a set of BP^C-related situations, questions about those situations, and deductive rules that resolve the ground-truth answers; it can be used both to test and to train LLMs. The study finds that LLMs show promise in reasoning about BP^C but leave clear room for improvement, and the authors present the benchmark as a tool for both assessing and enhancing LLM reasoning in business process management.
In short, the paper presents a benchmark for evaluating and potentially enhancing the ability of Large Language Models to reason about causally-augmented business processes, highlighting both the potential and the current limitations of LLMs in this domain.
Abstract
Large Language Models (LLMs) are increasingly used for boosting organizational efficiency and automating tasks. While not originally designed for complex cognitive processes, recent efforts have further extended to employ LLMs in activities such as reasoning, planning, and decision-making. In business processes, such abilities could be invaluable for leveraging the massive corpora LLMs have been trained on to gain a deep understanding of such processes. In this work, we plant the seeds for the development of a benchmark to assess the ability of LLMs to reason about causal and process perspectives of business operations. We refer to this view as Causally-augmented Business Processes (BP^C). The core of the benchmark comprises a set of BP^C related situations, a set of questions about these situations, and a set of deductive rules employed to systematically resolve the ground truth answers to these questions. With the power of LLMs, the seed is then instantiated into a larger-scale set of domain-specific situations and questions. Reasoning on BP^C is of crucial importance for process interventions and process improvement. Our benchmark could be used in one of two possible modalities: testing the performance of any target LLM and training an LLM to advance its capability to reason about BP^C.
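To make the benchmark's core idea concrete, the sketch below illustrates how a deductive rule might resolve a ground-truth answer for a BP^C situation that combines a process perspective (ordering of activities) with a causal one (execution dependencies). The `Situation` class, the rule, and the activity names are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative sketch only: the class and rule below are assumptions, not the
# paper's benchmark format. They show how a ground-truth answer could be
# resolved deductively from a causally-augmented process situation.
from dataclasses import dataclass, field

@dataclass
class Situation:
    # Process perspective: (a, b) means activity a precedes activity b.
    precedes: set = field(default_factory=set)
    # Causal perspective: (a, b) means executing a causally affects b.
    causes: set = field(default_factory=set)

def affected_by_skipping(situation: Situation, skipped: str, target: str) -> bool:
    """Hypothetical deductive rule: skipping an activity affects every
    activity it directly or transitively causes."""
    frontier, reached = {skipped}, set()
    while frontier:
        a = frontier.pop()
        for cause, effect in situation.causes:
            if cause == a and effect not in reached:
                reached.add(effect)
                frontier.add(effect)
    return target in reached

# Example situation: "Approve order" precedes and causes "Ship goods".
s = Situation(precedes={("Approve order", "Ship goods")},
              causes={("Approve order", "Ship goods")})
# Question: "If 'Approve order' is skipped, is 'Ship goods' affected?" -> True
print(affected_by_skipping(s, "Approve order", "Ship goods"))
```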
Method
The authors created a core set of template situations and questions for testing the reasoning capabilities of LLMs on causally-augmented business processes (BP^C). The templates cover different causal structures and process perspectives, and each is paired with a set of deductive rules used to resolve the ground-truth answer for the corresponding question. These template questions were then instantiated into domain-specific questions using an open-source LLM, Mixtral 8x7B Instruct, with a filtering step that discarded instantiations that did not remain faithful to their templates. The benchmark was used to evaluate five LLMs across three perspectives: process, causal, and their combination, on both template questions and domain-specific questions. Evaluation consisted of running each question multiple times with each LLM and computing the proportion of correct answers, as sketched below.
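A minimal evaluation sketch of that last step, assuming a generic `ask(model, question)` helper that returns the model's answer as text; the helper, question fields, and repetition count are assumptions rather than the authors' exact harness.

```python
# Sketch of the evaluation loop: repeat each question per model and report
# the proportion of answers matching the ground truth, per perspective.
from collections import defaultdict

def evaluate(models, questions, ask, runs: int = 5):
    """questions: iterable of dicts with 'text', 'ground_truth', 'perspective'."""
    scores = defaultdict(lambda: {"correct": 0, "total": 0})
    for model in models:
        for q in questions:
            for _ in range(runs):
                answer = ask(model, q["text"])
                key = (model, q["perspective"])
                scores[key]["total"] += 1
                if answer.strip().lower() == q["ground_truth"].strip().lower():
                    scores[key]["correct"] += 1
    # Proportion of correct answers per (model, perspective) pair.
    return {k: v["correct"] / v["total"] for k, v in scores.items()}
```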
Main Finding
The authors found that while Large Language Models (LLMs) show potential in reasoning about causally-augmented business processes (BP^C), their performance is not yet reliable. They note that LLMs can reach a level of predictive accuracy aligned with logical reasoning through training on a large set of deductive textual statements, yet clear room for improvement remains in handling complex cognitive tasks such as reasoning and decision-making over business processes. The authors also observed that LLMs may suffer from "causal hallucination" when dealing with causal relationships, indicating the need for further advances in the models' reasoning capabilities.
Conclusion
The paper concludes that the benchmark can be used in two modalities: testing the performance of LLMs in reasoning about causally-augmented business processes (BP^C), and training LLMs to improve their reasoning in this domain. The authors propose the benchmark as a standardized tool for assessing the suitability of LLMs for specific tasks and expect it to be gradually expanded, with the community's help, to cover a broader range of domains. They acknowledge that although LLMs may not inherently possess reasoning abilities, they can be trained to produce outputs reliably aligned with logical reasoning, and the benchmark can support both this training and its assessment.
Keywords
Large Language Models, Business Processes, Causally-augmented Business Processes, Reasoning, Benchmark