Leaderboard

Coverage of LLM-generated heuristics on the IPC 2023 Learning Track domains.

This leaderboard tracks the performance of LLM-generated heuristics on traditional PDDL planning domains. Our original paper covered only models released through mid-2025; we created this leaderboard to evaluate frontier models as they are released.

We report coverage (number of solved tasks) of LLM-generated heuristic functions, evaluated with greedy best-first search in Pyperplan. For each model, the heuristic is selected from a pool of 25 candidates based on its performance on the training set.
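For readers unfamiliar with the search setup, here is a minimal sketch of greedy best-first search: nodes are expanded in order of heuristic value alone, ignoring accumulated path cost. This is an illustrative toy, not Pyperplan's actual implementation; the number-line problem and the `abs(goal - n)` heuristic are made up for the example.

```python
import heapq
from itertools import count

def greedy_best_first_search(initial, goal_test, successors, heuristic):
    """Expand the open node with the lowest heuristic value.

    Unlike A*, the priority is h(n) alone; path cost is ignored.
    Returns the path of states from initial to a goal, or None.
    """
    tie = count()  # tie-breaker so heapq never compares states directly
    frontier = [(heuristic(initial), next(tie), initial, [initial])]
    closed = set()
    while frontier:
        _, _, state, path = heapq.heappop(frontier)
        if goal_test(state):
            return path
        if state in closed:
            continue
        closed.add(state)
        for succ in successors(state):
            if succ not in closed:
                heapq.heappush(
                    frontier,
                    (heuristic(succ), next(tie), succ, path + [succ]),
                )
    return None

# Toy problem: walk from 0 to 7 on the integers with steps of +1 or +2,
# guided by the distance-to-goal heuristic |7 - n|.
path = greedy_best_first_search(
    0,
    lambda n: n == 7,
    lambda n: [n + 1, n + 2],
    lambda n: abs(7 - n),
)
print(path)  # the search greedily takes +2 steps, then a final +1
```

In the evaluated setup, the heuristic function is the LLM-generated code and the states are PDDL planning states rather than integers.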

All domains have 90 test tasks (720 in total), following the easy/medium/hard split used in the IPC 2023 Learning Track. Each run is limited to 30 minutes and 8 GiB of memory. Test tasks are out-of-distribution with respect to the training tasks. See our NeurIPS 2025 paper for more details.

Best values per domain are shown in bold. Entries in italics are baseline planners (not LLM-generated heuristics).

Last Update (2026-02-21)

Added results for GPT-5.2, Gemini 3 Pro (Preview), and Gemini 3.1 Pro (Preview).


| Model | Total | Blocksworld | Childsnack | Floortile | Miconic | Rovers | Sokoban | Spanner | Transport |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Gemini 3.1 Pro (Preview) | **460** | **84** | 58 | **30** | 83 | 53 | 31 | 66 | 55 |
| GPT-5 | 450 | 79 | **61** | 12 | **90** | 40 | 30 | 68 | **70** |
| *Fast Downward Stone Soup '23* | 448 | 58 | 45 | 23 | **90** | 64 | **40** | 64 | 64 |
| Gemini 3 Pro (Preview) | 426 | **84** | 60 | 12 | 88 | 33 | 27 | 66 | 56 |
| GPT-5.2 | 407 | 73 | 58 | 10 | 75 | 35 | 31 | 66 | 59 |
| *LAMA* | 396 | 55 | 35 | 12 | **90** | **68** | **40** | 30 | 66 |
| o1 | 373 | 29 | 60 | 9 | **90** | 40 | 32 | 69 | 44 |
| DeepSeek R1 | 373 | 66 | 22 | 4 | **90** | 32 | 30 | **70** | 59 |
| Gemini 2.5 Pro | 361 | 52 | **61** | 4 | 89 | 38 | 30 | 30 | 57 |
| o3 | 354 | 36 | 45 | 11 | 76 | 36 | 27 | 66 | 57 |
| DeepSeek V3 | 343 | 45 | 55 | 3 | 64 | 34 | 31 | 69 | 42 |
| GPT-4.1 | 327 | 54 | 59 | 11 | 38 | 30 | 27 | 63 | 45 |
| Gemini 2.0 Flash Thinking | 305 | 37 | 14 | 8 | 88 | 39 | 32 | 30 | 57 |
| Gemini 2.0 Flash | 296 | 35 | 32 | 4 | **90** | 32 | 31 | 30 | 42 |
| GPT-4o | 258 | 35 | 24 | 3 | 63 | 32 | 28 | 30 | 43 |
| DeepSeek R1 Distill Qwen 14B | 214 | 34 | 16 | 3 | 30 | 32 | 24 | 30 | 45 |
| Qwen3 4B Instruct | 208 | 28 | 13 | 2 | 48 | 19 | 25 | 30 | 43 |
| Gemma 3 12B | 187 | 30 | 12 | 1 | 30 | 19 | 31 | 30 | 34 |


If you would like us to test another model, please contact us directly via email.