CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates

Model Rankings on CoSPlan

Models ranked by average Step Completion accuracy (%) across all four CoSPlan tasks, using the Scene Graph method.
Rank	Model	Robo-VQA-E	Shuffle-E	Maze-E	Blocks-World-E	Average Step Completion
1	GPT-4o	52.2	30.1	46.1	54.3	45.7
2	Intern-VLM	25.1	23.4	41.2	18.9	27.1
3	CoG-VLM	21.5	23.7	26.5	26.7	24.6
4	Janus-pro-7B	21.3	23.5	21.7	25.1	22.9
5	Qwen2-VL-8B	18.9	25.1	28.3	18.8	22.8
6	Random	20	20	20	20	20.0
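The last column can be re-derived from the four per-task columns. The following is a minimal sketch that recomputes the unweighted mean per model and sorts by it; the dictionary layout and variable names are illustrative, and the numbers are copied from the table above. Note that the unrounded mean for Intern-VLM is 27.15, which the table lists as 27.1.

```python
# Per-task Step Completion accuracy (%) copied from the ranking table.
# Column order: Robo-VQA-E, Shuffle-E, Maze-E, Blocks-World-E.
scores = {
    "GPT-4o":       [52.2, 30.1, 46.1, 54.3],
    "Intern-VLM":   [25.1, 23.4, 41.2, 18.9],
    "CoG-VLM":      [21.5, 23.7, 26.5, 26.7],
    "Janus-pro-7B": [21.3, 23.5, 21.7, 25.1],
    "Qwen2-VL-8B":  [18.9, 25.1, 28.3, 18.8],
    "Random":       [20.0, 20.0, 20.0, 20.0],
}


def mean(values):
    """Unweighted arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Average over the four tasks, then rank from highest to lowest.
averages = {model: mean(vals) for model, vals in scores.items()}
ranking = sorted(averages, key=averages.get, reverse=True)

for rank, model in enumerate(ranking, start=1):
    print(f"{rank}\t{model}\t{averages[model]:.2f}")
```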

Abstract

Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in 'visual sequential planning', i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose the Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block re-arrangement, image reconstruction, and object re-organization. CoSPlan assesses two key abilities: Error Detection (identifying the non-optimal action) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g., Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. Addressing this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of ~5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as PlanBench and VQA. Code and dataset will be made public.

CoSPlan Benchmark

We introduce CoSPlan (Corrective Sequential Planning), a benchmark designed to study VLMs' planning capabilities in erroneous scenarios. CoSPlan focuses on 2D spatial vision tasks guided by text-based instructions, requiring models to plan a temporal sequence of actions toward a goal while detecting and correcting an erroneous action.

CoSPlan includes four diverse tasks:

Maze-E: Navigation in a 2D maze with obstacles and erroneous moves.
Blocks-World-E: Re-arranging colored blocks into a target configuration.
Shuffle-E: Reconstructing shuffled image tiles to form the original image.
Robo-VQA-E: Re-organizing real-world objects based on instructions.

(a) Maze-E (Navigation): → denotes a movement.

(b) Blocks-World-E (Rearrangement): "X from (a) → (b)" means "move box number X from column (a) to column (b)".

(c) Shuffle-E (Reconstruction): ⇔ indicates a swap of patches.

(d) Robo-VQA-E (Reorganization): real-world tasks.

Overview of CoSPlan Benchmark Datasets: Maze-E, Blocks-World-E, Shuffle-E, and Robo-VQA-E.
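To make the two abilities concrete, the sketch below works through a symbolic Maze-E-style example. It is only an illustration under stated assumptions: the benchmark itself poses these questions to VLMs over images, whereas here the maze is a hypothetical text grid ('.' free, '#' wall), and an "erroneous" move is assumed to be any move that does not bring the agent one step closer to the goal along a shortest path. The function names are invented for this example.

```python
from collections import deque

# Hypothetical action vocabulary: name -> (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def distances_to_goal(grid, goal):
    """Breadth-first search from the goal; returns {cell: shortest distance}."""
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES.values():
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist


def first_non_optimal_step(grid, start, goal, actions):
    """Error Detection (assumed definition): index of the first move that
    fails to reduce the shortest-path distance by one, or None."""
    dist = distances_to_goal(grid, goal)
    pos = start
    for i, action in enumerate(actions):
        dr, dc = MOVES[action]
        nxt = (pos[0] + dr, pos[1] + dc)
        # Moves into walls or off the grid are absent from dist: also errors.
        if nxt not in dist or dist[nxt] != dist[pos] - 1:
            return i
        pos = nxt
    return None


def complete_plan(grid, pos, goal):
    """Step Completion (assumed definition): a shortest action sequence
    from pos to goal, found by always stepping to a closer cell."""
    dist = distances_to_goal(grid, goal)
    plan = []
    while pos != goal:
        for name, (dr, dc) in MOVES.items():
            nxt = (pos[0] + dr, pos[1] + dc)
            if dist.get(nxt) == dist[pos] - 1:
                plan.append(name)
                pos = nxt
                break
    return plan


grid = ["....",
        ".##.",
        "...."]
start, goal = (0, 0), (2, 3)
actions = ["right", "left", "down", "down", "right", "right", "right"]

error_index = first_non_optimal_step(grid, start, goal, actions)
print("first erroneous step:", error_index)
# Re-plan from the cell reached just before the erroneous move.
print("corrected remainder:", complete_plan(grid, (0, 1), goal))
```

Here the second move ("left") undoes the first, so it is flagged; re-planning from the cell reached before that move gives the corrected remainder of the sequence.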

Solution: Scene Graph Incremental Update (SGI)

Addressing the limitations of current VLMs in tracking evolving scenes, we propose Scene Graph Incremental updates (SGI). This novel training-free method refines Scene Graphs step-by-step for each action, generating intermediate states.

SGI consists of three main steps:

Vanilla Scene Graphs (SG): Generate initial and goal Scene Graphs.
Incremental Scene Update: Simulate each action to update the Scene Graph incrementally, creating intermediate representations.
Similarity Comparison: Compare the resulting Scene Graph with the goal Scene Graph to select the correct sequence of actions.

SGI Method: 1) Initial and goal Scene Graphs (SG) are generated. 2) Incremental Scene Update sequentially modifies the SG for each action. 3) Similarity Comparison matches the resulting SG against the goal SG to find the best-aligned action sequence.
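The three steps can be illustrated symbolically on a Blocks-World-style toy problem. This is a sketch under stated assumptions, not the authors' implementation: in SGI the scene graphs are produced and updated by the VLM, whereas here a scene graph is hand-written as a set of hypothetical (subject, "on", object) triples, an action is a (block, source column, target column) tuple, and Jaccard overlap stands in for the similarity measure, which the text above does not specify.

```python
def top_of(graph, column):
    """Follow 'on' edges upward from a column base to its topmost item."""
    support = column
    while True:
        above = [s for (s, rel, o) in graph if rel == "on" and o == support]
        if not above:
            return support
        support = above[0]


def apply_action(graph, action):
    """Incremental update: move the top block of one column onto another."""
    block, src, dst = action
    if top_of(graph, src) != block:
        raise ValueError(f"{block} is not on top of {src}")
    old_edge = next(t for t in graph if t[0] == block and t[1] == "on")
    new_edge = (block, "on", top_of(graph, dst))
    return (graph - {old_edge}) | {new_edge}


def rollout(graph, actions):
    """Return the list of intermediate scene graphs, one per action."""
    states = []
    for action in actions:
        graph = apply_action(graph, action)
        states.append(graph)
    return states


def jaccard(a, b):
    """Assumed similarity measure: shared triples over all triples."""
    return len(a & b) / len(a | b)


# Step 1: initial and goal scene graphs (hand-written for this example).
initial = frozenset({("b1", "on", "col_a"), ("b2", "on", "b1"), ("b3", "on", "col_b")})
goal = frozenset({("b1", "on", "col_a"), ("b3", "on", "col_b"), ("b2", "on", "b3")})

# Hypothetical candidate action sequences to choose between.
candidates = {
    "A": [("b2", "col_a", "col_b")],
    "B": [("b3", "col_b", "col_a")],
    "C": [("b2", "col_a", "col_b"), ("b1", "col_a", "col_b")],
}

# Steps 2 and 3: simulate each candidate, then compare its final graph to the goal.
scores = {name: jaccard(rollout(initial, seq)[-1], goal) for name, seq in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

Keeping every intermediate graph from `rollout` mirrors the idea of introducing intermediate states between the initial and goal scenes; only the final graph is used for the comparison in this sketch.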

Scene Graph Incremental Update (SGI) Results

SGI improvement relative to the vanilla Scene Graph (SG) method. All values are accuracy in percent (↑ higher is better); values in parentheses are the gain of SGI over SG in percentage points.

Step Completion Performance

Model	Robo-VQA-E (SG)	Robo-VQA-E (SGI)	Shuffle-E (SG)	Shuffle-E (SGI)	Maze-E (SG)	Maze-E (SGI)	Blocks-World-E (SG)	Blocks-World-E (SGI)
Intern-VLM	25.1	32.1 (+7.0)	23.4	25.2 (+1.8)	41.2	43.2 (+2.0)	18.9	29.2 (+10.3)
GPT-4o	52.2	56.4 (+4.2)	30.1	37.0 (+6.9)	46.1	56.1 (+10.0)	54.3	55.3 (+1.0)

Error Detection Performance

Model	Robo-VQA-E (SG)	Robo-VQA-E (SGI)	Maze-E (SG)	Maze-E (SGI)	Blocks-World-E (SG)	Blocks-World-E (SGI)
Intern-VLM	26.1	31.5 (+5.4)	33.4	34.8 (+1.4)	37.3	42.9 (+5.6)
GPT-4o	44.2	57.4 (+13.2)	35.3	41.1 (+5.8)	42.1	50.7 (+8.6)

Average improvement with SGI: ~5.2% across all tasks
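The per-cell gains in the two tables above can be checked with a few lines of arithmetic. The sketch below recomputes each SGI-minus-SG difference and averages them; the variable names are illustrative and the numbers are copied from the tables. For the two models listed, the mean gain is 5.40 points for Step Completion and about 6.67 points for Error Detection. The ~5.2% headline figure is the one reported in the abstract and is not recomputed from these rows.

```python
# (SG, SGI) accuracy pairs in percent, copied from the two tables.
step_completion = {
    "Intern-VLM": [(25.1, 32.1), (23.4, 25.2), (41.2, 43.2), (18.9, 29.2)],
    "GPT-4o":     [(52.2, 56.4), (30.1, 37.0), (46.1, 56.1), (54.3, 55.3)],
}
error_detection = {
    "Intern-VLM": [(26.1, 31.5), (33.4, 34.8), (37.3, 42.9)],
    "GPT-4o":     [(44.2, 57.4), (35.3, 41.1), (42.1, 50.7)],
}


def gains(table):
    """Flatten a table into SGI-minus-SG gains, rounded to one decimal."""
    return [round(sgi - sg, 1) for pairs in table.values() for sg, sgi in pairs]


def mean(values):
    """Unweighted arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


sc_gains = gains(step_completion)
ed_gains = gains(error_detection)

print(f"Step Completion mean gain: {mean(sc_gains):.2f}")
print(f"Error Detection mean gain: {mean(ed_gains):.2f}")
print(f"All cells shown:           {mean(sc_gains + ed_gains):.2f}")
```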

Additional SGI Results

SGI also improves performance in most error-free scenarios

SGI on VQA dataset
Model	Spatial Map (CoT)	Spatial Map (SG)	Spatial Map (SGI)	Maze Nav (CoT)	Maze Nav (SG)	Maze Nav (SGI)	Spatial Grid (CoT)	Spatial Grid (SG)	Spatial Grid (SGI)
CoG-VLM	25.1	36.7	35.8	32.3	32.4	31.2	30.1	34.3	38.2
Janus-pro-7B	42.4	47.4	47.8	20.8	27.3	29.3	34.4	35.8	36.3
Intern-VLM	36.3	41.3	44.3	28.6	40.5	42.1	33.3	33.8	35.1

SGI with Qwen2-VL-8B on PlanBench (Task 8)
Method	Score
Vanilla	13.8
CoT	14.1
SG	13.9
SGI (ours)	14.7

BibTeX

@misc{grover2025cosplan,
  title={CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates},
  author={Shresth Grover and Priyank Pathak and Akash Kumar and Vibhav Vineet and Yogesh S Rawat},
  year={2025},
  eprint={2512.10342},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.10342},
}