Alongside GPT-4, OpenAI has released an open source software framework for evaluating the performance of its AI models. Called Evals, the toolkit will let anyone report shortcomings in OpenAI's models to help guide improvements, the company says.
It’s kind of a crowdsourced approach to model testing, OpenAI explains in a blog post.
“We use Evals to guide the development of our models (both to identify deficiencies and to prevent regressions), and our users can apply it to track performance across model versions and the evolution of product integrations,” OpenAI writes. “We expect Evals to become a vehicle for sharing crowdsourced benchmarks, representing a maximum set of failure modes and difficult tasks.”
OpenAI created Evals to develop and run benchmarks for models like GPT-4 while inspecting their performance. With Evals, developers can use data sets to generate prompts, measure the quality of the completions an OpenAI model returns, and compare performance across different data sets and models.
In addition to supporting several popular AI benchmarks, Evals lets developers write new classes to implement custom evaluation logic. As an example to follow, OpenAI created a logic puzzle test that contains 10 prompts where GPT-4 fails.
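To make the dataset-and-scoring idea concrete, here is a minimal sketch of a match-style evaluation: prompts paired with ideal answers, a model scored by exact match. The sample format, the `fake_model` stub, and the function names are illustrative assumptions, not the actual Evals API.

```python
import json

# Hypothetical sample set in the spirit of an Evals dataset: each record
# pairs a prompt with the ideal completion. (Illustrative format only,
# not the exact schema the Evals repo uses.)
samples = [
    {"input": "What is 2 + 2?", "ideal": "4"},
    {"input": "Capital of France?", "ideal": "Paris"},
    {"input": "Largest planet in the solar system?", "ideal": "Jupiter"},
]

def fake_model(prompt: str) -> str:
    # Stand-in for a real call to an OpenAI model; deliberately
    # answers only two of the three prompts correctly.
    canned = {
        "What is 2 + 2?": "4",
        "Capital of France?": "Paris",
    }
    return canned.get(prompt, "I don't know")

def run_match_eval(model, samples):
    """Score a model by exact match against the ideal completions."""
    correct = sum(1 for s in samples if model(s["input"]).strip() == s["ideal"])
    return {"accuracy": correct / len(samples)}

report = run_match_eval(fake_model, samples)
print(json.dumps(report))
```

A real custom eval in the framework would replace `fake_model` with API calls and register the dataset and scoring class with the Evals runner; the accuracy report is what lets results be compared across model versions.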
It’s all unpaid work, unfortunately. But to incentivize the use of Evals, OpenAI plans to grant GPT-4 access to those who contribute “high quality” benchmarks.
“We believe that Evals will be an integral part of the process for using and building on our models, and we welcome direct contributions, questions, and feedback,” the company wrote.
With Evals, OpenAI, which recently said it would stop using customer data to train its models by default, is following in the footsteps of others who have turned to crowdsourcing to harden AI models.
In 2017, the University of Maryland Laboratory for Computational Linguistics and Information Processing launched a platform called Break It, Build It, which lets researchers submit models to users tasked with finding examples that defeat them. And Meta maintains a platform called Dynabench that has users “trick” models designed to analyze sentiment, answer questions, detect hate speech, and more.