Arthur, an AI performance platform trusted by some of the largest organizations in the world to ensure that their AI systems are well-managed and deployed in a responsible manner, today introduced Arthur Bench, an open-source evaluation tool for comparing large language models (LLMs), prompts, and hyperparameters for generative text models. This open-source tool will enable businesses to evaluate how different LLMs will perform in real-world scenarios so they can make informed, data-driven decisions when integrating the latest AI technologies into their operations.
In conjunction with Arthur Bench, Arthur also unveiled The Generative Assessment Project (GAP), a research initiative ranking the strengths and weaknesses of language model offerings from industry leaders like OpenAI, Anthropic, and Meta. Notably, Arthur’s research suggests that Anthropic may be gaining a slight competitive edge over OpenAI’s GPT-4 on measures of “reliability” within specific domains. For example, while GPT-4 was the most successful at answering math questions, Anthropic’s Claude-2 model was stronger at avoiding hallucinated factual errors and at responding “I don’t know” when appropriate on history questions. Through GAP, Arthur will continue to share discoveries about behavior differences and best practices with the public in its journey to make LLMs work for everyone.
“As our GAP research clearly shows, understanding the differences in performance between LLMs can have an incredible amount of nuance. With Bench, we’ve created an open-source tool to help teams deeply understand the differences between LLM providers, different prompting and augmentation strategies, and custom training regimes,” said Adam Wenchel, co-founder and CEO of Arthur.
Arthur Bench is the newest addition to Arthur’s suite of LLM-centered products, following the launch of Arthur Shield in May. Arthur Bench helps businesses in multiple ways:
- Model Selection & Validation: The AI landscape is rapidly evolving. Keeping abreast of advancements and ensuring that a company’s LLM choice remains the best fit in terms of performance and viability is crucial. Arthur Bench helps companies compare the different LLM options available using a consistent metric so they can determine the best fit for their application.
- Budget & Privacy Optimization: Not all applications require the most advanced or expensive LLMs. In some cases, a less expensive AI model might perform the required tasks just as well. For instance, if an application is generating simple text, such as automated responses to common customer queries, a less expensive model could be sufficient. Additionally, bringing some models in-house can offer greater control over data privacy.
- Translating Academic Benchmarks to Real-World Performance: Companies want to evaluate LLMs using standard academic benchmarks, such as those measuring fairness or bias, but have trouble translating the latest research into real-world scenarios. Bench helps companies quantitatively test and compare the performance of different models against a set of standard metrics, so evaluations are accurate and consistent. Additionally, companies can configure customized benchmarks, enabling them to focus on what matters most to their specific business and their customers.
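The kind of side-by-side comparison described above can be sketched in a few lines of plain Python. To be clear, this is an illustrative example, not Arthur Bench’s actual API: the model names, canned outputs, and the simple exact-match metric are all hypothetical stand-ins for real LLM calls and scorers.

```python
# Illustrative sketch of comparing candidate models on a shared metric.
# NOTE: hypothetical example -- not Arthur Bench's API; model outputs
# are stubbed with canned strings instead of real LLM responses.

def exact_match(reference: str, candidate: str) -> float:
    """Score 1.0 when outputs match after light normalization, else 0.0."""
    return 1.0 if reference.strip().lower() == candidate.strip().lower() else 0.0

def run_suite(model_outputs: dict[str, list[str]], references: list[str]) -> dict[str, float]:
    """Average the metric over a shared test set for each candidate model."""
    return {
        name: sum(exact_match(ref, out) for ref, out in zip(references, outputs)) / len(references)
        for name, outputs in model_outputs.items()
    }

# Hypothetical test set and canned responses from two candidate models.
references = ["paris", "4", "blue"]
model_outputs = {
    "model_a": ["Paris", "4", "green"],
    "model_b": ["Paris", "4", "blue"],
}

scores = run_suite(model_outputs, references)
print(scores)  # one comparable score per model, e.g. model_b scores higher here
```

Because every model is scored against the same references with the same metric, the resulting numbers are directly comparable, which is the core idea behind consistent, suite-based LLM evaluation.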
“Arthur Bench helped us develop an internal framework to scale and standardize LLM evaluation across features, and to describe performance to the Product team with meaningful and interpretable metrics,” said Priyanka Oberoi, Staff Data Scientist at Axios HQ, an Arthur customer with early access to Arthur Bench.