Leaderboard

Coverage of LLM-generated heuristics on the IPC 2023 Learning Track domains.

This leaderboard tracks the performance of LLM-generated heuristics on traditional PDDL planning domains. Our original paper covered only models released through mid-2025; we created this leaderboard to evaluate frontier models as they are released.

We report coverage (number of solved tasks) of LLM-generated heuristic functions, evaluated with greedy best-first search in Pyperplan. For each model, the heuristic is selected from a pool of 25 candidates based on its performance on the training set.
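For readers unfamiliar with the search setup, here is a minimal sketch of greedy best-first search: nodes are expanded in order of heuristic value alone, ignoring accumulated path cost. This is an illustrative toy, not Pyperplan's actual implementation; the number-line problem and the `abs(goal - n)` heuristic are made up for the example.

```python
import heapq
from itertools import count

def greedy_best_first_search(initial, goal_test, successors, heuristic):
    """Expand the open node with the lowest heuristic value.

    Unlike A*, the priority is h(n) alone; path cost is ignored.
    Returns the path of states from initial to a goal, or None.
    """
    tie = count()  # tie-breaker so heapq never compares states directly
    frontier = [(heuristic(initial), next(tie), initial, [initial])]
    closed = set()
    while frontier:
        _, _, state, path = heapq.heappop(frontier)
        if goal_test(state):
            return path
        if state in closed:
            continue
        closed.add(state)
        for succ in successors(state):
            if succ not in closed:
                heapq.heappush(
                    frontier,
                    (heuristic(succ), next(tie), succ, path + [succ]),
                )
    return None

# Toy problem: walk from 0 to 7 on the integers with steps of +1 or +2,
# guided by the distance-to-goal heuristic |7 - n|.
path = greedy_best_first_search(
    0,
    lambda n: n == 7,
    lambda n: [n + 1, n + 2],
    lambda n: abs(7 - n),
)
print(path)  # the search greedily takes +2 steps, then a final +1
```

In the evaluated setup, the heuristic function is the LLM-generated code and the states are PDDL planning states rather than integers.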

All domains have 90 test tasks (720 in total), following the easy/medium/hard split used in the IPC 2023 Learning Track. Each run is limited to 30 minutes and 8 GiB of memory. Test tasks are out-of-distribution with respect to the training tasks. See our NeurIPS 2025 paper for more details.

Best values per domain are shown in bold. Entries in italics are baseline planners (not LLM-generated heuristics).

Last Update (2026-02-21)

Added results for GPT-5.2, Gemini 3 Pro (Preview), and Gemini 3.1 Pro (Preview).


| Model | Total | Blocksworld | Childsnack | Floortile | Miconic | Rovers | Sokoban | Spanner | Transport |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Gemini 3.1 Pro (Preview) | **460** | **84** | 58 | **30** | 83 | 53 | 31 | 66 | 55 |
| GPT-5 | 450 | 79 | **61** | 12 | **90** | 40 | 30 | 68 | **70** |
| *Fast Downward Stone Soup '23* | 448 | 58 | 45 | 23 | **90** | 64 | **40** | 64 | 64 |
| Gemini 3 Pro (Preview) | 426 | **84** | 60 | 12 | 88 | 33 | 27 | 66 | 56 |
| GPT-5.2 | 407 | 73 | 58 | 10 | 75 | 35 | 31 | 66 | 59 |
| *LAMA* | 396 | 55 | 35 | 12 | **90** | **68** | **40** | 30 | 66 |
| o1 | 373 | 29 | 60 | 9 | **90** | 40 | 32 | 69 | 44 |
| DeepSeek R1 | 373 | 66 | 22 | 4 | **90** | 32 | 30 | **70** | 59 |
| Gemini 2.5 Pro | 361 | 52 | **61** | 4 | 89 | 38 | 30 | 30 | 57 |
| o3 | 354 | 36 | 45 | 11 | 76 | 36 | 27 | 66 | 57 |
| DeepSeek V3 | 343 | 45 | 55 | 3 | 64 | 34 | 31 | 69 | 42 |
| GPT-4.1 | 327 | 54 | 59 | 11 | 38 | 30 | 27 | 63 | 45 |
| Gemini 2.0 Flash Thinking | 305 | 37 | 14 | 8 | 88 | 39 | 32 | 30 | 57 |
| Gemini 2.0 Flash | 296 | 35 | 32 | 4 | **90** | 32 | 31 | 30 | 42 |
| GPT-4o | 258 | 35 | 24 | 3 | 63 | 32 | 28 | 30 | 43 |
| DeepSeek R1 Distill Qwen 14B | 214 | 34 | 16 | 3 | 30 | 32 | 24 | 30 | 45 |
| Qwen3 4B Instruct | 208 | 28 | 13 | 2 | 48 | 19 | 25 | 30 | 43 |
| Gemma 3 12B | 187 | 30 | 12 | 1 | 30 | 19 | 31 | 30 | 34 |


If you would like us to test another model, please contact us directly via email.