A Benchmark of Expert-Level Academic Questions to Assess AI Capabilities

Phan, Long; Gatti, Alice; Li, Nathaniel; Khoja, Adam; Kim, Ryan; Ren, Richard; Scaramuzza, Davide; Park, Jongee

Article, open access (English). Published 2026-03-05 (2026).
ISSN: 0028-0836; eISSN: 1476-4687
DOI: 10.1038/s41586-025-09962-4 (https://doi.org/10.1038/s41586-025-09962-4)
Handle: https://hdl.handle.net/20.500.14411/11178
ORCID iDs: Zekry, Mohamed (0000-0002-4594-8749); Yuan, Michelle (0000-0002-9937-2108); Lo, Eve (0000-0002-3270-7786); Kuchkin, Aleksey (0009-0004-3287-0948); Moyano, Alejano José (0000-0002-4976-7611); Kang, Timothy (0009-0008-8138-3264); Petruzella, Gerol (0009-0000-3018-9391); Lee, Kwok-Hao (0009-0006-7435-0240); Zhelnov, Pavel (0000-0003-2767-5123)

Abstract: Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding (1), limiting informed measurement of state-of-the-art LLM capabilities. Here, in response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, the humanities and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable but cannot be quickly answered by internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a marked gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
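The abstract notes that HLE questions are suited to automated grading and that state-of-the-art models show low accuracy and poor calibration. As a minimal illustration only (not the paper's actual evaluation code), the sketch below computes accuracy and a binned RMS calibration error from graded answers and model-reported confidences; the `GradedAnswer` record, the bin count, and the sample data are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradedAnswer:
    # Hypothetical record: whether the model's answer matched the reference
    # solution, and the confidence (0-100%) the model reported for it.
    correct: bool
    confidence: float


def accuracy(results: List[GradedAnswer]) -> float:
    """Fraction of questions answered correctly."""
    return sum(r.correct for r in results) / len(results)


def rms_calibration_error(results: List[GradedAnswer], n_bins: int = 10) -> float:
    """Root-mean-square gap between stated confidence and observed accuracy,
    computed over equal-width confidence bins (one common calibration metric)."""
    bins = [[] for _ in range(n_bins)]
    for r in results:
        idx = min(int(r.confidence / 100 * n_bins), n_bins - 1)
        bins[idx].append(r)
    total, sq_err = len(results), 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(r.confidence for r in b) / len(b) / 100
        bin_acc = sum(r.correct for r in b) / len(b)
        sq_err += (len(b) / total) * (mean_conf - bin_acc) ** 2
    return sq_err ** 0.5


# Example: two confidently wrong answers and one correct, less confident one.
sample = [GradedAnswer(False, 90.0), GradedAnswer(False, 80.0), GradedAnswer(True, 60.0)]
print(f"accuracy={accuracy(sample):.2f}, rms_calibration_error={rms_calibration_error(sample):.2f}")
```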