arXiv ID: 2603.19228
Summary: Current instruction-guided video editing models struggle to balance precise semantic modifications with faithful motion preservation. SAMA factorizes video editing into (i) Semantic Anchoring, which establishes a visual anchor by jointly predicting semantic tokens and video latents at sparse frames, and (ii) Motion Alignment, which is pre-trained on motion-centric video restoration tasks. This factorization yields strong zero-shot video editing, outperforming existing open-source models.