Benchmarks

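GLUE - General Language Understanding Evaluation, a collection of nine English natural language understanding tasks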
GPQA - 448 multiple choice questions in biology, chemistry and physics. Experts at PhD level in the relevant domain reach 65% accuracy.
MMLU - about 16,000 multiple choice questions spanning 57 academic subjects
HumanEval - 164 programming problems
MATH - 12,500 challenging competition mathematics problems, each with a step-by-step solution
GSM8K - Grade School Math 8K, a set of 8,500 high-quality grade school math word problems
IFEval - Instruction-Following Eval, which focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". The authors identified 25 types of verifiable instructions and constructed around 500 prompts, each containing one or more of them.
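
The IFEval entry describes instructions that a program can check without human judgment. As a rough illustration of what "verifiable" means, here is a minimal sketch of checkers for the two quoted examples. The function names, the whitespace-based word count, and the case-insensitive whole-word keyword match are assumptions made for this note, not the benchmark's actual implementation.

```python
import re
from typing import Callable, Iterable


def has_more_than_n_words(response: str, n: int) -> bool:
    """Return True if the response has strictly more than n whitespace-separated words."""
    # Assumption: a "word" is any run of non-whitespace characters.
    return len(response.split()) > n


def mentions_keyword_at_least(response: str, keyword: str, k: int) -> bool:
    """Return True if keyword occurs as a whole word at least k times."""
    # Assumption: matching is case-insensitive and limited to whole words,
    # so "AI" does not match inside "said" or "AIM".
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= k


def follows_all(response: str, checks: Iterable[Callable[[str], bool]]) -> bool:
    """A prompt may carry several instructions; the response must satisfy every one."""
    return all(check(response) for check in checks)


# Hypothetical prompt carrying both example instructions.
checks = [
    lambda r: has_more_than_n_words(r, 400),
    lambda r: mentions_keyword_at_least(r, "AI", 3),
]

long_response = "AI " * 3 + "word " * 400   # 403 words, "AI" appears 3 times
short_response = "AI is useful."            # too short, "AI" appears once

print(follows_all(long_response, checks))   # True
print(follows_all(short_response, checks))  # False
```

Because each check is a plain function returning a boolean, scoring a response reduces to counting how many instructions pass, with no model or human grader in the loop.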

Created: September 6 2024.
Modified: September 6 2024.