
CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates

1 University of California San Diego, 2 University of Central Florida, 3 Microsoft Research

Under Review

Model Rankings on CoSPlan

Rank  Model         Robo-VQA-E  Shuffle-E  Maze-E  Blocks-World-E  Average Step Completion
1     GPT-4o        52.2        30.1       46.1    54.3            45.7
2     Intern-VLM    25.1        23.4       41.2    18.9            27.1
3     CoG-VLM       21.5        23.7       26.5    26.7            24.6
4     Janus-pro-7B  21.3        23.5       21.7    25.1            22.9
5     Qwen2-VL-8B   18.9        25.1       28.3    18.8            22.8
6     Random        20.0        20.0       20.0    20.0            20.0
Models ranked by average Step Completion accuracy across all CoSPlan tasks (using Scene Graph method)
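
As a quick sanity check on the ranking, the sketch below recomputes the last column from the per-task scores in the table. It assumes the average is an unweighted mean over the four tasks, which matches the reported values up to rounding; the variable names are illustrative only.

```python
# Per-task Step Completion accuracy (%), copied from the ranking table.
# Column order: Robo-VQA-E, Shuffle-E, Maze-E, Blocks-World-E.
scores = {
    "GPT-4o":       [52.2, 30.1, 46.1, 54.3],
    "Intern-VLM":   [25.1, 23.4, 41.2, 18.9],
    "CoG-VLM":      [21.5, 23.7, 26.5, 26.7],
    "Janus-pro-7B": [21.3, 23.5, 21.7, 25.1],
    "Qwen2-VL-8B":  [18.9, 25.1, 28.3, 18.8],
    "Random":       [20.0, 20.0, 20.0, 20.0],
}

# Assumption: the reported average is an unweighted mean over the four tasks.
averages = {model: sum(vals) / len(vals) for model, vals in scores.items()}

# Sort models from highest to lowest average accuracy.
ranking = sorted(averages, key=averages.get, reverse=True)

for rank, model in enumerate(ranking, start=1):
    print(f"{rank} {model:<13} {averages[model]:.2f}")
```

Under that assumption the recomputed order matches the ranks listed in the table.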


Teaser Figure

Corrective Sequential Planning: Given the initial and final states, plus a sequence of already-performed actions that contains errors (the initial context), the model identifies the errors in that context and picks the optimal action steps to reach the final goal, correcting the errors along the way.

Abstract

Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in 'visual sequential planning', i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose the Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block re-arrangement, image reconstruction, and object re-organization. CoSPlan assesses two key abilities: Error Detection (identifying non-optimal actions) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g., Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. To address this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of ~5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as PlanBench and VQA. Code and dataset will be made public.

CoSPlan Benchmark

We introduce CoSPlan (Corrective Sequential Planning), a benchmark designed to study VLMs' planning capabilities in erroneous scenarios. CoSPlan focuses on 2D spatial vision tasks guided by text-based instructions, requiring models to plan a temporal sequence of actions toward a goal while detecting and correcting an erroneous action.

CoSPlan includes four diverse tasks:

  • Maze-E: Navigation in a 2D maze with obstacles and erroneous moves.
  • Blocks-World-E: Re-arranging colored blocks into a target configuration.
  • Shuffle-E: Reconstructing shuffled image tiles to form the original image.
  • Robo-VQA-E: Re-organizing real-world objects based on instructions.
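
To make the Error Detection setting concrete, here is a minimal toy sketch in the spirit of Maze-E. It is not the benchmark's data format or scoring code: the grid encoding, the `find_erroneous_steps` helper, and the rule that a move counts as erroneous when it fails to reduce the shortest-path distance to the goal are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical action vocabulary: (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def distances_to(goal, grid):
    """BFS distance from every reachable free cell to `goal` ('#' is a wall)."""
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES.values():
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist

def find_erroneous_steps(grid, start, goal, actions):
    """Return indices of non-optimal moves and the position after all moves."""
    dist = distances_to(goal, grid)
    pos, errors = start, []
    for i, action in enumerate(actions):
        dr, dc = MOVES[action]
        nxt = (pos[0] + dr, pos[1] + dc)
        if nxt not in dist:            # wall or out of bounds: agent stays put
            errors.append(i)
            continue
        if dist[nxt] >= dist[pos]:     # move does not bring the agent closer
            errors.append(i)
        pos = nxt
    return errors, pos

grid = ["..#",
        "...",
        "#.."]
context = ["right", "left", "down", "right"]   # "left" undoes the first move
errors, pos = find_erroneous_steps(grid, (0, 0), (2, 2), context)
remaining = distances_to((2, 2), grid)[pos]    # steps still needed to reach the goal
```

In this toy instance the second action is flagged, and Step Completion would then amount to choosing the remaining moves from the resulting position.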

(a) Maze-E (Navigation): → denotes a movement.
(b) Blocks-World-E (Rearrangement): "X from (a) → (b)" means "move box number X from column (a) to column (b)".
(c) Shuffle-E (Reconstruction): ⇔ indicates a swap of patches.
(d) Robo-VQA-E (Reorganization): real-world tasks.

Overview of CoSPlan Benchmark Datasets: Maze-E, Blocks-World-E, Shuffle-E, and Robo-VQA-E.

Proposed Solution

Addressing the limitations of current VLMs in tracking evolving scenes, we propose Scene Graph Incremental updates (SGI). This novel training-free method refines Scene Graphs step-by-step for each action, generating intermediate states.

SGI consists of three main steps:

  1. Vanilla Scene Graphs (SG): Generate initial and goal Scene Graphs.
  2. Incremental Scene Update: Simulate each action to update the Scene Graph incrementally, creating intermediate representations.
  3. Similarity Comparison: Compare the resulting Scene Graph with the goal Scene Graph to select the correct sequence of actions.

SGI Framework

SGI Method: 1) Initial and goal Scene Graphs (SG) are generated. 2) Incremental Scene Update modifies the SG sequentially, once per action. 3) Similarity Comparison matches the resulting SG against the goal SG to find the best-aligned action sequence.
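
The three steps can be sketched symbolically as follows. This is only an illustration under strong assumptions: in SGI the Scene Graphs are produced and updated by the VLM itself, whereas here a scene graph is a hand-written set of (object, relation, location) triples, the update is a deterministic rule, and Jaccard overlap stands in for the similarity comparison. All function names are hypothetical.

```python
def apply_action(graph, action):
    """Incremental update: apply one (object, source, destination) move."""
    obj, src, dst = action
    updated = set(graph)
    updated.discard((obj, "in", src))
    updated.add((obj, "in", dst))
    return frozenset(updated)

def rollout(graph, actions):
    """Apply actions one at a time, keeping every intermediate scene graph."""
    states = [frozenset(graph)]
    for action in actions:
        states.append(apply_action(states[-1], action))
    return states

def jaccard(a, b):
    """Assumed similarity measure: overlap of triples between two graphs."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def select_sequence(initial, goal, context, candidates):
    """Pick the candidate whose final graph is closest to the goal graph."""
    current = rollout(initial, context)[-1]      # state after the given context
    goal = frozenset(goal)
    scored = [(jaccard(rollout(current, seq)[-1], goal), i)
              for i, seq in enumerate(candidates)]
    best_score, best_idx = max(scored, key=lambda t: (t[0], -t[1]))
    return best_idx, best_score

initial = {("cup", "in", "table"), ("book", "in", "shelf")}
goal = {("cup", "in", "sink"), ("book", "in", "shelf")}
context = [("book", "shelf", "table")]           # erroneous move already made
candidates = [
    [("cup", "table", "sink")],                               # ignores the error
    [("book", "table", "shelf"), ("cup", "table", "sink")],   # corrects it first
    [("cup", "table", "shelf")],
]
best_idx, best_score = select_sequence(initial, goal, context, candidates)
```

Here the second candidate is selected because it undoes the erroneous move before completing the task, so its final graph matches the goal graph exactly.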

Analysis


Why Models Give Wrong Results (Generic)

(a) As the number of MCQ options goes up, accuracy falls.
(b) Models have a strong bias towards picking option A.
(c) Models don't reason well about visual problems.

Why Handling Errors Is Difficult (Ours)

(a) Without errors, models achieve higher accuracy.
(b) Errors involving objects in the scene (in-context) are harder to handle.
(c) The longer the initial context (already-performed actions), the better the accuracy.
(d) Number of steps required to reach the goal: models don't seem to take advantage of the projected path towards the goal when picking an option.

Scene Graph Incremental Update (SGI) Results

SGI improvement relative to vanilla Scene Graph (SG) method. All values show percentage accuracy (↑ higher is better).

Step Completion Performance

Method      Robo-VQA-E (SG / SGI)  Shuffle-E (SG / SGI)  Maze-E (SG / SGI)    Blocks-World-E (SG / SGI)
Intern-VLM  25.1 / 32.1 (+7.0)     23.4 / 25.2 (+1.8)    41.2 / 43.2 (+2.0)   18.9 / 29.2 (+10.3)
GPT-4o      52.2 / 56.4 (+4.2)     30.1 / 37.0 (+6.9)    46.1 / 56.1 (+10.0)  54.3 / 55.3 (+1.0)

Error Detection Performance

Method      Robo-VQA-E (SG / SGI)  Maze-E (SG / SGI)   Blocks-World-E (SG / SGI)
Intern-VLM  26.1 / 31.5 (+5.4)     33.4 / 34.8 (+1.4)  37.3 / 42.9 (+5.6)
GPT-4o      44.2 / 57.4 (+13.2)    35.3 / 41.1 (+5.8)  42.1 / 50.7 (+8.6)

Average improvement with SGI: ~5.2% across all tasks
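
The parenthesized gains in the two tables are SGI accuracy minus SG accuracy, in percentage points. The short sketch below recomputes them from the listed (SG, SGI) pairs; the nested dictionary layout is just a convenient assumption, and it covers only the two models shown here.

```python
# (SG, SGI) accuracy pairs copied from the Step Completion and
# Error Detection tables; the layout is illustrative.
results = {
    "Step Completion": {
        "Intern-VLM": {"Robo-VQA-E": (25.1, 32.1), "Shuffle-E": (23.4, 25.2),
                       "Maze-E": (41.2, 43.2), "Blocks-World-E": (18.9, 29.2)},
        "GPT-4o":     {"Robo-VQA-E": (52.2, 56.4), "Shuffle-E": (30.1, 37.0),
                       "Maze-E": (46.1, 56.1), "Blocks-World-E": (54.3, 55.3)},
    },
    "Error Detection": {
        "Intern-VLM": {"Robo-VQA-E": (26.1, 31.5), "Maze-E": (33.4, 34.8),
                       "Blocks-World-E": (37.3, 42.9)},
        "GPT-4o":     {"Robo-VQA-E": (44.2, 57.4), "Maze-E": (35.3, 41.1),
                       "Blocks-World-E": (42.1, 50.7)},
    },
}

# Gain in percentage points, rounded to one decimal to absorb float noise.
gains = {
    metric: {
        model: {task: round(sgi - sg, 1) for task, (sg, sgi) in tasks.items()}
        for model, tasks in models.items()
    }
    for metric, models in results.items()
}

for metric, models in gains.items():
    for model, tasks in models.items():
        print(metric, model, tasks)
```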

Additional SGI Results


SGI improves performance on error-free scenarios


Method        Spatial Map (CoT / SG / SGI)  Maze Nav (CoT / SG / SGI)  Spatial Grid (CoT / SG / SGI)
CoG-VLM       25.1 / 36.7 / 35.8            32.3 / 32.4 / 31.2         30.1 / 34.3 / 38.2
Janus-pro-7B  42.4 / 47.4 / 47.8            20.8 / 27.3 / 29.3         34.4 / 35.8 / 36.3
Intern-VLM    36.3 / 41.3 / 44.3            28.6 / 40.5 / 42.1         33.3 / 33.8 / 35.1
SGI on VQA dataset

SGI (Qwen2-VL-8B) on PlanBench (Task 8)
Method      Score
Vanilla     13.8
CoT         14.1
SG          13.9
SGI (ours)  14.7

BibTeX

@misc{grover2025cosplan,
        title={CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates},
        author={Shresth Grover and Priyank Pathak and Akash Kumar and Vibhav Vineet and Yogesh S Rawat},
        year={2025},
        eprint={2512.10342},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2512.10342},
    }