# Leaderboard

Coverage of LLM-generated heuristics on the IPC 2023 Learning Track domains.

Coverage (number of solved test tasks) of LLM-generated heuristic functions, evaluated with greedy best-first search in Pyperplan. For each model, the heuristic is selected from a pool of 25 candidates based on training-set performance.
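The evaluation procedure above can be sketched as follows. This is a minimal, self-contained version of greedy best-first search, not Pyperplan's actual implementation; the `is_goal`, `successors`, and `h` callables are stand-ins for the planner's task interface.

```python
import heapq

def greedy_best_first_search(start, is_goal, successors, h):
    """Greedy best-first search: always expand the open node with the
    lowest heuristic value h(state). Returns a goal state or None."""
    counter = 0  # tie-breaker so heapq never has to compare states
    open_list = [(h(start), counter, start)]
    closed = {start}
    while open_list:
        _, _, state = heapq.heappop(open_list)
        if is_goal(state):
            return state
        for succ in successors(state):
            if succ not in closed:
                closed.add(succ)
                counter += 1
                heapq.heappush(open_list, (h(succ), counter, succ))
    return None

# Toy usage on integers: the goal is 7, successors are +1/+2,
# and the heuristic is the distance to the goal.
goal = greedy_best_first_search(
    0, lambda s: s == 7, lambda s: [s + 1, s + 2], lambda s: abs(7 - s))
```

Note that, unlike A*, the search ignores accumulated path cost entirely, so solution quality depends only on how well the heuristic guides expansion; this is why coverage (rather than plan cost) is the natural metric here.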

Best values per domain are shown in bold; entries in italics are baseline planners rather than LLM-generated heuristics. Each domain has 90 test tasks (720 in total). Each run is limited to 30 minutes and 8 GiB of memory. Test tasks are out-of-distribution. See our NeurIPS 2025 paper for more details.


| Model | Blocksworld | Childsnack | Floortile | Miconic | Rovers | Sokoban | Spanner | Transport | Total |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| GPT-5 | **79** | **61** | 12 | **90** | 40 | 30 | 68 | **70** | **450** |
| *Fast Downward Stone Soup '23* | 58 | 45 | **23** | **90** | 64 | **40** | 64 | 64 | 448 |
| *LAMA* | 55 | 35 | 12 | **90** | **68** | **40** | 30 | 66 | 396 |
| o1 | 29 | 60 | 9 | **90** | 40 | 32 | 69 | 44 | 373 |
| DeepSeek R1 | 66 | 22 | 4 | **90** | 32 | 30 | **70** | 59 | 373 |
| Gemini 2.5 Pro | 52 | **61** | 4 | 89 | 38 | 30 | 30 | 57 | 361 |
| o3 | 36 | 45 | 11 | 76 | 36 | 27 | 66 | 57 | 354 |
| DeepSeek V3 | 45 | 55 | 3 | 64 | 34 | 31 | 69 | 42 | 343 |
| GPT-4.1 | 54 | 59 | 11 | 38 | 30 | 27 | 63 | 45 | 327 |
| Gemini 2.0 Flash Thinking | 37 | 14 | 8 | 88 | 39 | 32 | 30 | 57 | 305 |
| Gemini 2.0 Flash | 35 | 32 | 4 | **90** | 32 | 31 | 30 | 42 | 296 |
| *FF* | 27 | 25 | 12 | **90** | 34 | 36 | 30 | 41 | 295 |
| GPT-4o | 35 | 24 | 3 | 63 | 32 | 28 | 30 | 43 | 258 |
| DeepSeek R1 Distill Qwen 14B | 34 | 16 | 3 | 30 | 32 | 24 | 30 | 45 | 214 |
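The candidate-selection step (picking one heuristic out of 25 by training-set performance) can be sketched as below. The `solve` callable and the heuristic/task types are hypothetical stand-ins, not Pyperplan's API; in the actual pipeline "performance" is coverage on the training tasks.

```python
def select_best_heuristic(candidates, training_tasks, solve):
    """Return the candidate heuristic that solves the most training tasks.

    `solve(task, heuristic)` is a stand-in predicate that runs the search
    with the given heuristic and reports whether the task was solved.
    """
    def coverage(heuristic):
        return sum(1 for task in training_tasks if solve(task, heuristic))
    return max(candidates, key=coverage)

# Toy usage: heuristics are plain functions, and a task counts as
# "solved" when the heuristic maps it to 0.
h_bad = lambda task: 1
h_good = lambda task: 0
best = select_best_heuristic(
    [h_bad, h_good], range(5), lambda task, h: h(task) == 0)
```

Ties in training coverage would need an explicit tie-breaking rule (here `max` simply keeps the first candidate with the highest score).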


If you want us to test another LLM, contact us directly via email.