arXiv ID: 2603.19228
Summary: Current instruction-guided video editing models struggle to balance precise semantic modifications with faithful motion preservation. SAMA factorizes video editing into (i) Semantic Anchoring, which establishes a visual anchor by jointly predicting semantic tokens and video latents at sparse frames, and (ii) Motion Alignment, which is pre-trained on motion-centric video restoration tasks. This factorization yields strong zero-shot video editing, outperforming existing open-source models.