Beyond Viewpoint Generalization:
What Multi-View Demonstrations Offer and How to Synthesize Them for Robot Manipulation?

Boyang Cai1,2,*,† Qiwei Liang1,2,* Jiawei Li2,* Shihang Weng2,* Zhaoxin Zhang2,*
Tao Lin3 Xiangyu Chen1 Wenjie Zhang1 Jiaqi Mao4 Weisheng Xu1 Bin Yang1 Jiaming Liang1,2 Junhao Cai2 Renjing Xu1,‡
1The Hong Kong University of Science and Technology (Guangzhou)     2Shenzhen University
3Beijing Jiaotong University     4The Chinese University of Hong Kong, Shenzhen
* Equal Contribution    † Project Leader    ‡ Corresponding Author

Abstract

Do multi-view demonstrations truly improve robot manipulation, or do they merely enhance cross-view robustness? We present a systematic study quantifying the performance gains, scaling behavior, and underlying mechanisms of multi-view data for robot manipulation. Controlled experiments show that multi-view demonstrations consistently improve single-view policy success even when evaluation is restricted to the original viewpoint. Notably, multi-view data breaks the scaling limitation of single-view datasets and continues to raise performance ceilings after saturation. To address the scarcity of multi-view data, we propose RoboNVS, a geometry-aware self-supervised framework that synthesizes novel-view videos from monocular inputs. Our generated data consistently improves downstream policies in both simulated and real-world environments.

Understanding Multi-View Gains in Robot Learning

While multi-view demonstrations are known to improve robustness to camera shifts, we investigate whether they fundamentally enhance manipulation capability. By isolating viewpoint diversity and evaluating all policies from a single canonical camera, we ensure that performance gains are not simply due to test-time view availability.

Internal Mechanisms: Multi-view supervision reshapes visual representations to focus on manipulation-relevant regions (like the end-effector and objects) rather than the background. It also improves action head robustness and stabilizes optimization dynamics.

➜ Motivated by these insights, we propose RoboNVS to synthesize high-quality multi-view demonstrations from monocular inputs.

RoboNVS: Synthesizing Multi-View Demonstrations

To overcome the scarcity of multi-view data, we propose RoboNVS, a geometry-aware framework that synthesizes novel-view videos from a single monocular demonstration. Why does geometry matter? Direct video generation often fails in robotics due to geometric inconsistency (e.g., distorted objects or hallucinated structures). We therefore explicitly incorporate 3D geometry as a strong prior.

RoboNVS Pipeline Overview

Overview: RoboNVS reconstructs 3D geometry from monocular video, renders it under new viewpoints, and uses diffusion-based inpainting to generate physically consistent novel-view videos.
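The render-under-new-viewpoints step can be illustrated with a minimal depth-based reprojection sketch. This is an assumption for illustration only: the actual pipeline renders a full 3D reconstruction, whereas the toy function below merely back-projects one frame with its depth map, applies a rigid novel-view transform, and reprojects, which also yields the invalid-pixel mask that inpainting must fill.

```python
import numpy as np

def warp_to_view(depth, K, R, t):
    """Back-project pixels using depth and intrinsics K, transform by a
    novel-view pose (R, t), and reproject with K. Returns per-pixel target
    coordinates plus a mask of pixels with no valid target location
    (these are the regions the diffusion inpainter must complete).
    Illustrative sketch only, not the RoboNVS renderer."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project to 3D points in the source camera frame
    pts = depth[..., None] * (pix @ np.linalg.inv(K).T)
    # Rigid transform into the novel camera frame, then project
    pts_new = pts @ R.T + t
    proj = pts_new @ K.T
    uv = proj[..., :2] / np.clip(proj[..., 2:3], 1e-6, None)
    valid = ((uv[..., 0] >= 0) & (uv[..., 0] < W) &
             (uv[..., 1] >= 0) & (uv[..., 1] < H) &
             (pts_new[..., 2] > 0))
    return uv, ~valid  # mask marks pixels to be inpainted
```

With an identity pose the warp is a no-op, which is a quick sanity check for the projection math.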

Key Innovation 1: Depth Alignment

Problem: Monocular depth estimation faces a trade-off:

  • Relative depth: temporally consistent, but lacks metric scale.
  • Metric depth: geometrically correct, but temporally unstable.

Solution: We align the two using a global scale-shift transformation, combining temporal consistency with geometric accuracy.
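A minimal sketch of this alignment step, assuming a least-squares estimator (the paper does not specify the fitting procedure, so the estimator choice here is ours): fit one global scale and shift that maps the relative-depth video onto the metric-depth video, then apply it to every frame.

```python
import numpy as np

def align_depth(rel_depth: np.ndarray, metric_depth: np.ndarray) -> np.ndarray:
    """Fit a single global (scale, shift) so that scale * rel + shift
    approximates metric in a least-squares sense, then apply it to the
    whole relative-depth video. Both inputs have shape (T, H, W).
    Illustrative sketch; the fitting choice is an assumption."""
    x = rel_depth.reshape(-1)
    y = metric_depth.reshape(-1)
    # Solve [x, 1] @ [scale, shift]^T = y in closed form
    A = np.stack([x, np.ones_like(x)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, y, rcond=None)
    return scale * rel_depth + shift
```

Because one (scale, shift) pair is shared across all frames, the aligned video keeps the temporal consistency of the relative estimate while inheriting the metric estimate's absolute scale.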

Key Innovation 2: Bi-directional Masking

Problem: Standard inpainting suffers from mismatch between training masks and real occlusions caused by viewpoint changes.

Solution: We train with both directions:

  • Original mask: Learns object reconstruction.
  • Inverse mask: Learns background completion.

Together, these designs enable RoboNVS to generate geometrically consistent, temporally stable, and physically plausible novel-view demonstrations.
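The masking scheme above can be sketched as a data-pipeline fragment (hypothetical helper, not the actual training code): each frame/mask pair contributes two inpainting examples, one hiding the masked region and one hiding its complement.

```python
import numpy as np

def bidirectional_samples(frame: np.ndarray, mask: np.ndarray):
    """Yield two inpainting training pairs from one frame and one binary
    occlusion mask (1 = region revealed/hidden by the viewpoint change):
      - original mask: hide the occluded region  -> learn object reconstruction
      - inverse mask:  hide everything else      -> learn background completion
    frame: (H, W, 3) float image, mask: (H, W) binary array.
    Illustrative sketch of the bi-directional masking idea."""
    m = mask[..., None].astype(frame.dtype)
    yield frame * (1.0 - m), frame  # inpaint inside the mask
    yield frame * m, frame          # inpaint outside the mask (inverse)
```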

Extensive Visual Comparison

[Video grid. Source videos at target views of Left/Right 10°–20° Yaw, compared against RoboNVS (Ours), RoboNVS (Ablation), EX-4D, TrajectoryCrafter, CogNVS, ZeroNVS-ft, and ReCamMaster; each result shows the inpainting mask (top) and the RGB output (bottom).]

Experimental Results

We evaluate the efficacy of our method across three challenging manipulation tasks. We compare four data configurations: (1) Baseline (monocular only), which uses no augmentation; (2) EX-4D; (3) EX-4D with Depth Alignment (EX-4D w/ DA); and (4) RoboNVS (Ours). For each augmentation-based method, we generate four synthetic views along specified camera trajectories ({−20°, −10°, 10°, 20°} yaw) to augment the training set, then train a Diffusion Policy on each augmented dataset. We report success rates evaluated from the base view to demonstrate how synthesized multi-view data provides geometric priors for robust manipulation.
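For concreteness, the four augmentation viewpoints can be expressed as rotations about the scene's vertical axis. The sketch below (an illustration; the real rendering extrinsics likely also include a translation and pivot about the workspace center, which we omit) builds one rotation matrix per yaw angle.

```python
import numpy as np

def yaw_extrinsics(yaw_degrees):
    """Return a (N, 3, 3) stack of rotation matrices, one per yaw angle,
    rotating the base camera about the vertical (y) axis. Simplified
    sketch of the augmentation-view geometry; translation omitted."""
    mats = []
    for deg in yaw_degrees:
        a = np.deg2rad(deg)
        mats.append(np.array([
            [np.cos(a),  0.0, np.sin(a)],
            [0.0,        1.0, 0.0],
            [-np.sin(a), 0.0, np.cos(a)],
        ]))
    return np.stack(mats)

# The four augmentation views used in the experiments
views = yaw_extrinsics([-20, -10, 10, 20])
```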

RoboNVS Setup

Figure 1. Real-world Setup. Our evaluation environment and the three manipulation tasks: Click Bell, Pick Fruit, and Pick Lego. Each task requires precise geometric understanding of the target objects.

Real-world Comparison

Figure 2. Qualitative Comparison. Visual results in real-world scenarios. RoboNVS demonstrates superior performance in both geometry reconstruction and mask completion quality compared to existing baselines.

Method                      Click Bell   Pick Fruit   Pick Lego
Baseline (Monocular only)       25%          10%          5%
EX-4D                           40%          35%         30%
EX-4D w/ Depth Alignment        50%          40%         40%
RoboNVS (Ours)                  70%          60%         65%

Table 1. Success Rate Comparison. Mean success rates over multiple trials. RoboNVS significantly outperforms existing methods by providing high-fidelity synthetic data.

BibTeX

If you find our work useful in your research, please consider citing:

@misc{cai2026viewpointgeneralizationmultiviewdemonstrations,
      title={Beyond Viewpoint Generalization: What Multi-View Demonstrations Offer and How to Synthesize Them for Robot Manipulation?}, 
      author={Boyang Cai and Qiwei Liang and Jiawei Li and Shihang Weng and Zhaoxin Zhang and Tao Lin and Xiangyu Chen and Wenjie Zhang and Jiaqi Mao and Weisheng Xu and Bin Yang and Jiaming Liang and Junhao Cai and Renjing Xu},
      year={2026},
      eprint={2603.26757},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.26757}, 
}