Benchmarks
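- GLUE - General Language Understanding Evaluation, a collection of nine English language understanding tasks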
- GPQA - 448 multiple choice questions in biology, chemistry and physics. Experts who hold or are pursuing a PhD in the relevant domain reach 65% accuracy.
- MMLU - 16,000 multiple choice questions spanning 57 academic subjects
- HumanEval - 164 programming problems
- MATH - 12,500 challenging competition mathematics problems, each with a step-by-step solution
- GSM8K - Grade School Math 8K, a set of 8,500 high-quality grade school math word problems
- IFEval - Instruction-Following Eval focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". Its authors identified 25 types of such instructions and constructed around 500 prompts, each containing one or more verifiable instructions.
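The two IFEval example instructions quoted above are "verifiable" because a program can check them without a human judge. Below is a minimal sketch of that idea. The function names (`word_count_over`, `keyword_at_least`, `follows_all`) are hypothetical and are not taken from the IFEval code. The word-splitting and whole-word matching rules are also assumptions.

```python
import re


def word_count_over(response: str, minimum: int) -> bool:
    """True if the response has strictly more than `minimum` whitespace-separated words."""
    return len(response.split()) > minimum


def keyword_at_least(response: str, keyword: str, times: int) -> bool:
    """True if `keyword` occurs as a whole word at least `times` times (case-insensitive)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= times


def follows_all(response: str, checks) -> bool:
    """A prompt may carry several instructions; the response must satisfy every one."""
    return all(check(response) for check in checks)


# Hypothetical prompt with two instructions:
# "write in more than 400 words" and "mention the keyword of AI at least 3 times".
checks = [
    lambda r: word_count_over(r, 400),
    lambda r: keyword_at_least(r, "AI", 3),
]

# Synthetic response: 3 mentions of "AI" followed by 400 filler words (403 words total).
sample = "AI " * 3 + "word " * 400
result = follows_all(sample, checks)
```

With about 500 prompts, a benchmark score could then be the fraction of responses for which `follows_all` returns true, though the exact scoring used by IFEval is not described in these notes.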
Created: September 6 2024.
Modified: September 6 2024.