Zip-VGGT: Object-Centric Spatiotemporal KV Compression for Streaming Vision Transformers

Zip-VGGT Teaser

TL;DR: Zip-VGGT is an object-centric spatiotemporal KV compression framework for streaming vision transformers that keeps memory bounded while preserving high-fidelity long-horizon 4D perception.

Motivation

  • 🟩 Unbounded Historical KV Cache: memory and latency grow without bound with sequence length in streaming vision transformers.
  • 🟩 Boundary-Agnostic Spatial Compression: merging tokens across physical entities corrupts fine 3D geometry and object boundaries.
  • 🟩 Motion-Blind Temporal Eviction: naive pruning discards informative dynamic trajectories and degrades long-term 4D consistency.

Zip-VGGT Framework

Zip-VGGT unifies spatial and temporal KV control with object-level memory states. It combines crisp 2D entity boundaries from lightweight SAM masks with auxiliary motion cues, then applies object-aware update, selection, and retrieval under a fixed cache budget.

  • Object-Centric Memory Mechanism: We group historical tokens by objects, compute pooled object-query relevance, protect anchor tokens, and apply score-based compression to preserve semantically structured history under bounded memory.
  • RD-Lite, Dual-Bank Memory, and Pose-query Suppression: RD-Lite allocates per-object budgets with a lightweight utility-cost objective; dual-bank memory separates static canonical content from dynamic residuals; and history-aware pose-query suppression improves pose robustness without harming geometry queries.
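The object-centric selection step above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the function name `compress_kv`, the additive mix of token-level and pooled object-level relevance, and the anchor-protection rule are all assumptions made for clarity.

```python
import numpy as np

def compress_kv(keys, values, object_ids, query, budget, anchors=None):
    """Illustrative score-based KV compression (not the paper's code):
    group tokens by object, score tokens by relevance to the current
    query pooled at the object level, always retain anchor tokens, and
    keep the highest-scoring remaining tokens under a fixed budget."""
    n = keys.shape[0]
    anchors = set() if anchors is None else set(anchors)
    # Per-token relevance to the current query.
    token_scores = keys @ query
    # Pooled object-query relevance: mean token score per object.
    pooled = {}
    for obj in np.unique(object_ids):
        pooled[obj] = token_scores[object_ids == obj].mean()
    # Assumed final score: token relevance plus its object's pooled relevance.
    scores = np.array([token_scores[i] + pooled[object_ids[i]] for i in range(n)])
    # Anchor tokens are protected and always kept.
    keep = sorted(anchors)
    rest = [i for i in np.argsort(scores)[::-1] if i not in anchors]
    keep += rest[: max(0, budget - len(keep))]
    keep = sorted(keep)
    return keys[keep], values[keep], object_ids[keep]
```

Because anchors are taken off the top before score-based selection, the compressed cache is always exactly `budget` tokens (assuming `budget` is at least the anchor count), which is what keeps memory bounded regardless of stream length.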
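The "lightweight utility-cost objective" in RD-Lite can be pictured as a greedy rate-allocation loop. The sketch below is an assumption-heavy illustration: the name `rd_lite_allocate`, the log-shaped diminishing-returns utility, and the gain-per-cost greedy rule are not from the paper, only a plausible instance of per-object budgeting.

```python
import heapq
import math

def rd_lite_allocate(utilities, costs, total_budget):
    """Illustrative greedy utility-cost budget split (assumed, not the
    paper's RD-Lite): repeatedly grant one more token slot to the object
    with the best marginal utility per unit cost, under a concave
    diminishing-returns utility u_i * log(1 + b_i)."""
    n = len(utilities)
    alloc = [0] * n
    # Max-heap keyed on (negated) marginal gain-per-cost of the next slot.
    heap = []
    for i, (u, c) in enumerate(zip(utilities, costs)):
        gain = u * (math.log(2) - math.log(1))  # gain of the first slot
        heapq.heappush(heap, (-gain / c, i))
    spent = 0
    while heap:
        _, i = heapq.heappop(heap)
        if spent + costs[i] > total_budget:
            continue  # this object can no longer afford another slot
        alloc[i] += 1
        spent += costs[i]
        b = alloc[i]
        # Marginal gain of the (b+1)-th slot under the concave utility.
        gain = utilities[i] * (math.log(b + 2) - math.log(b + 1))
        heapq.heappush(heap, (-gain / costs[i], i))
    return alloc
```

The concavity is what makes the greedy loop sensible: each extra slot for the same object is worth less, so high-utility objects get larger but not unbounded shares of the total cache budget.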

Comparison Results

Visualization Comparison 1

Visualization Comparison 2

Conclusion

Zip-VGGT formulates long-horizon streaming 4D perception as bounded-memory inference with coupled objectives: memory efficiency, geometric fidelity, and dynamic continuity. By unifying object-centric memory organization with RD-Lite allocation, dual-bank retention, and history-aware pose-query suppression, the framework maintains stable streaming performance and avoids unbounded cache growth. Ablations show these modules are complementary, supporting object-centric memory management as an effective strategy for robust long-horizon streaming vision.