leaderboard
Coverage of LLM-generated heuristics on the IPC 2023 Learning Track domains.
Coverage (number of solved tasks) of LLM-generated heuristic functions evaluated using greedy best-first search in Pyperplan. Each heuristic is selected from a pool of 25 candidates based on training set performance.
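Best values per domain are shown in bold. Entries in italics are baseline planners (not LLM-generated heuristics). Column headers abbreviate the domains Blocksworld, Childsnack, Floortile, Miconic, Rovers, Sokoban, Spanner, and Transport. Each domain has 90 test tasks (720 in total). Each run is limited to 30 minutes of runtime and 8 GiB of memory. Test tasks are out-of-distribution. See our NeurIPS 2025 paper for more details.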
| Model | Block. | Child. | Floor. | Mico. | Rov. | Soko. | Span. | Trans. | Total |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5 | **79** | **61** | 12 | **90** | 40 | 30 | 68 | **70** | **450** |
| *Fast Down. Stone Soup'23* | 58 | 45 | **23** | **90** | 64 | **40** | 64 | 64 | 448 |
| *LAMA* | 55 | 35 | 12 | **90** | **68** | **40** | 30 | 66 | 396 |
| o1 | 29 | 60 | 9 | **90** | 40 | 32 | 69 | 44 | 373 |
| DeepSeek R1 | 66 | 22 | 4 | **90** | 32 | 30 | **70** | 59 | 373 |
| Gemini 2.5 Pro | 52 | **61** | 4 | 89 | 38 | 30 | 30 | 57 | 361 |
| o3 | 36 | 45 | 11 | 76 | 36 | 27 | 66 | 57 | 354 |
| DeepSeek V3 | 45 | 55 | 3 | 64 | 34 | 31 | 69 | 42 | 343 |
| GPT-4.1 | 54 | 59 | 11 | 38 | 30 | 27 | 63 | 45 | 327 |
| Gemini 2.0 Flash Think. | 37 | 14 | 8 | 88 | 39 | 32 | 30 | 57 | 305 |
| Gemini 2.0 Flash | 35 | 32 | 4 | **90** | 32 | 31 | 30 | 42 | 296 |
| *FF* | 27 | 25 | 12 | **90** | 34 | 36 | 30 | 41 | 295 |
| GPT-4o | 35 | 24 | 3 | 63 | 32 | 28 | 30 | 43 | 258 |
| DeepSeek R1 DistQwen14B | 34 | 16 | 3 | 30 | 32 | 24 | 30 | 45 | 214 |
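
For readers who want to work with these numbers programmatically, the following is a minimal Python sketch, not part of the original evaluation code. It copies the table values into a dictionary, checks that each row's per-domain coverage sums to its reported total, and finds the best entry (or tied entries) per domain. The names `ROWS`, `check_totals`, and `best_per_domain` are illustrative assumptions.

```python
# Domain abbreviations as used in the table header.
DOMAINS = ["Block.", "Child.", "Floor.", "Mico.", "Rov.", "Soko.", "Span.", "Trans."]

# name -> (per-domain coverage, reported total), copied from the table above.
ROWS = {
    "GPT-5": ([79, 61, 12, 90, 40, 30, 68, 70], 450),
    "Fast Down. Stone Soup'23": ([58, 45, 23, 90, 64, 40, 64, 64], 448),
    "LAMA": ([55, 35, 12, 90, 68, 40, 30, 66], 396),
    "o1": ([29, 60, 9, 90, 40, 32, 69, 44], 373),
    "DeepSeek R1": ([66, 22, 4, 90, 32, 30, 70, 59], 373),
    "Gemini 2.5 Pro": ([52, 61, 4, 89, 38, 30, 30, 57], 361),
    "o3": ([36, 45, 11, 76, 36, 27, 66, 57], 354),
    "DeepSeek V3": ([45, 55, 3, 64, 34, 31, 69, 42], 343),
    "GPT-4.1": ([54, 59, 11, 38, 30, 27, 63, 45], 327),
    "Gemini 2.0 Flash Think.": ([37, 14, 8, 88, 39, 32, 30, 57], 305),
    "Gemini 2.0 Flash": ([35, 32, 4, 90, 32, 31, 30, 42], 296),
    "FF": ([27, 25, 12, 90, 34, 36, 30, 41], 295),
    "GPT-4o": ([35, 24, 3, 63, 32, 28, 30, 43], 258),
    "DeepSeek R1 DistQwen14B": ([34, 16, 3, 30, 32, 24, 30, 45], 214),
}


def check_totals(rows):
    """Return the names of rows whose per-domain values do not sum to the reported total."""
    return [name for name, (cov, total) in rows.items() if sum(cov) != total]


def best_per_domain(rows):
    """Map each domain to (best coverage, sorted names of all entries achieving it)."""
    best = {}
    for i, domain in enumerate(DOMAINS):
        top = max(cov[i] for cov, _ in rows.values())
        winners = sorted(name for name, (cov, _) in rows.items() if cov[i] == top)
        best[domain] = (top, winners)
    return best


if __name__ == "__main__":
    print("Rows with inconsistent totals:", check_totals(ROWS))
    for domain, (top, winners) in best_per_domain(ROWS).items():
        print(f"{domain:7s} {top:3d}  {', '.join(winners)}")
```

Ties are kept rather than broken, so a domain such as Miconic, where several entries solve all 90 tasks, lists every one of them.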
If you want us to test another LLM, contact us directly via email.