Leaderboard
Coverage of LLM-generated heuristics on the IPC 2023 Learning Track domains.
This leaderboard shows the performance of LLM-generated heuristics on traditional PDDL planning domains. Our original paper only covered models released up to mid-2025. We created this leaderboard to evaluate frontier models as they are released.
Coverage (number of solved tasks) of LLM-generated heuristic functions evaluated using greedy best-first search in Pyperplan. Each heuristic is selected from a pool of 25 candidates based on training set performance.
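The setup described above can be illustrated with a minimal sketch. It is not Pyperplan's API or our actual pipeline. The toy state space (integers reached by +1/-1 steps), the function names `greedy_best_first_search`, `coverage`, and `select_best_heuristic`, and the three candidate heuristics are all assumptions made for illustration. An expansion budget stands in for the real time and memory limits, and ties between candidates are broken by pool order, which is also an assumption.

```python
import heapq
from itertools import count


def greedy_best_first_search(start, is_goal, successors, heuristic,
                             max_expansions=1_000):
    """Always expand the open state with the lowest heuristic value.

    Returns the list of states from start to a goal, or None if the
    expansion budget runs out or the open list becomes empty.
    """
    tie = count()  # tie-breaker so the heap never compares states directly
    open_list = [(heuristic(start), next(tie), start)]
    parent = {start: None}
    expansions = 0
    while open_list and expansions < max_expansions:
        _, _, state = heapq.heappop(open_list)
        if is_goal(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        expansions += 1
        for nxt in successors(state):
            if nxt not in parent:  # skip states that were already generated
                parent[nxt] = state
                heapq.heappush(open_list, (heuristic(nxt), next(tie), nxt))
    return None


def coverage(heuristic_factory, goals):
    """Number of toy tasks (start 0, reach `goal`) solved within the budget."""
    solved = 0
    for goal in goals:
        plan = greedy_best_first_search(
            start=0,
            is_goal=lambda s, g=goal: s == g,
            successors=lambda s: (s + 1, s - 1),
            heuristic=heuristic_factory(goal),
        )
        solved += plan is not None
    return solved


def select_best_heuristic(candidates, training_goals):
    """Pick the candidate with the highest coverage on the training tasks."""
    return max(candidates,
               key=lambda name: coverage(candidates[name], training_goals))


# Hypothetical candidate pool: each entry maps a task (its goal) to a heuristic.
candidates = {
    "blind": lambda goal: (lambda s: 0),
    "misleading": lambda goal: (lambda s: s),
    "distance": lambda goal: (lambda s: abs(goal - s)),
}
training_goals = [10, 100, 900]

best = select_best_heuristic(candidates, training_goals)
print(best)  # prints: distance
```

In this toy pool, the blind candidate solves two of the three training tasks, the misleading one solves none, and the distance-based one solves all three, so it is the one that would be carried over to the test tasks.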
All domains have 90 test tasks (720 total), following the easy/medium/hard split used in IPC 2023. Each run is limited to 30 minutes and 8 GiB of memory. Test tasks are out-of-distribution. See our NeurIPS 2025 paper for more details.
Best values per domain are shown in bold. Entries in italics are baseline planners (not LLM-generated heuristics).
Last Update (2026-02-21)
Added results for GPT-5.2, Gemini 3 Pro (Preview), and Gemini 3.1 Pro (Preview).
| Model | Total | Blocksworld | Childsnack | Floortile | Miconic | Rovers | Sokoban | Spanner | Transport |
|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 460 | **84** | 58 | **30** | 83 | 53 | 31 | 66 | 55 |
| GPT-5 | 450 | 79 | **61** | 12 | **90** | 40 | 30 | 68 | **70** |
| *Fast Down. Stone Soup'23* | 448 | 58 | 45 | 23 | **90** | 64 | **40** | 64 | 64 |
| Gemini 3 Pro (Preview) | 426 | **84** | 60 | 12 | 88 | 33 | 27 | 66 | 56 |
| GPT-5.2 | 407 | 73 | 58 | 10 | 75 | 35 | 31 | 66 | 59 |
| *LAMA* | 396 | 55 | 35 | 12 | **90** | **68** | **40** | 30 | 66 |
| o1 | 373 | 29 | 60 | 9 | **90** | 40 | 32 | 69 | 44 |
| DeepSeek R1 | 373 | 66 | 22 | 4 | **90** | 32 | 30 | **70** | 59 |
| Gemini 2.5 Pro | 361 | 52 | **61** | 4 | 89 | 38 | 30 | 30 | 57 |
| o3 | 354 | 36 | 45 | 11 | 76 | 36 | 27 | 66 | 57 |
| DeepSeek V3 | 343 | 45 | 55 | 3 | 64 | 34 | 31 | 69 | 42 |
| GPT-4.1 | 327 | 54 | 59 | 11 | 38 | 30 | 27 | 63 | 45 |
| Gemini 2.0 Flash Think. | 305 | 37 | 14 | 8 | 88 | 39 | 32 | 30 | 57 |
| Gemini 2.0 Flash | 296 | 35 | 32 | 4 | **90** | 32 | 31 | 30 | 42 |
| GPT-4o | 258 | 35 | 24 | 3 | 63 | 32 | 28 | 30 | 43 |
| DeepSeek R1 DistQwen14B | 214 | 34 | 16 | 3 | 30 | 32 | 24 | 30 | 45 |
| Qwen3 4B Instruct | 208 | 28 | 13 | 2 | 48 | 19 | 25 | 30 | 43 |
| Gemma 3 12B | 187 | 30 | 12 | 1 | 30 | 19 | 31 | 30 | 34 |
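For readers who want to work with these numbers, here is a small sketch of how a row relates its Total to the per-domain coverage. It uses the values of three rows from the table. The helper names `parse_row` and `coverage_fraction` are hypothetical and not part of any released tooling.

```python
# Values of three rows from the table (model, total, eight per-domain counts).
ROWS = """\
| Gemini 3.1 Pro (Preview) | 460 | 84 | 58 | 30 | 83 | 53 | 31 | 66 | 55 |
| GPT-5 | 450 | 79 | 61 | 12 | 90 | 40 | 30 | 68 | 70 |
| LAMA | 396 | 55 | 35 | 12 | 90 | 68 | 40 | 30 | 66 |
"""

TASKS_PER_DOMAIN = 90  # every domain has 90 test tasks


def parse_row(line):
    """Split one Markdown table row into (name, total, per-domain counts)."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    name, total, *per_domain = cells
    return name, int(total), [int(value) for value in per_domain]


def coverage_fraction(total, n_domains=8):
    """Share of all test tasks solved (8 domains x 90 tasks = 720)."""
    return total / (TASKS_PER_DOMAIN * n_domains)


results = [parse_row(line) for line in ROWS.splitlines()]

for name, total, per_domain in results:
    # The Total column is the sum of the eight per-domain coverage values.
    assert total == sum(per_domain), name
    # No domain can exceed its 90 test tasks.
    assert all(0 <= value <= TASKS_PER_DOMAIN for value in per_domain), name
    print(f"{name}: {total}/720 = {coverage_fraction(total):.1%}")
```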
If you would like us to test another model, please contact us directly via email.