Zip-VGGT: Object-Centric Spatiotemporal KV Compression for Streaming Vision Transformers

Zip-VGGT Teaser

TL;DR: Zip-VGGT is an object-centric spatiotemporal KV compression framework for streaming vision transformers that keeps memory bounded while preserving high-fidelity long-horizon 4D perception.

Motivation

  • 🟩 Unbounded Historical KV Cache: memory and latency grow without bound with sequence length in streaming vision transformers.
  • 🟩 Boundary-Agnostic Spatial Compression: merging tokens across physical entities corrupts fine 3D geometry and object boundaries.
  • 🟩 Motion-Blind Temporal Eviction: naive pruning discards informative dynamic trajectories and degrades long-term 4D consistency.

Zip-VGGT Framework

Zip-VGGT unifies spatial and temporal KV control with object-level memory states. It combines crisp 2D entity boundaries from lightweight SAM masks with auxiliary motion cues, then applies object-aware update, selection, and retrieval under a fixed cache budget.

  • Object-Centric Memory Mechanism: We group historical tokens by objects, compute pooled object-query relevance, protect anchor tokens, and apply score-based compression to preserve semantically structured history under bounded memory.
  • RD-Lite, Dual-Bank Memory, and Pose-query Suppression: RD-Lite allocates per-object budgets with a lightweight utility-cost objective; dual-bank memory separates static canonical content from dynamic residuals; and history-aware pose-query suppression improves pose robustness without harming geometry queries.
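The object-centric selection step above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the function name `compress_kv`, the additive mix of token-level and pooled object-level relevance, and the anchor-protection rule are all assumptions made for clarity.

```python
import numpy as np

def compress_kv(keys, values, object_ids, query, budget, anchors=None):
    """Illustrative score-based KV compression (not the paper's code):
    group tokens by object, score tokens by relevance to the current
    query pooled at the object level, always retain anchor tokens, and
    keep the highest-scoring remaining tokens under a fixed budget."""
    n = keys.shape[0]
    anchors = set() if anchors is None else set(anchors)
    # Per-token relevance to the current query.
    token_scores = keys @ query
    # Pooled object-query relevance: mean token score per object.
    pooled = {}
    for obj in np.unique(object_ids):
        pooled[obj] = token_scores[object_ids == obj].mean()
    # Assumed final score: token relevance plus its object's pooled relevance.
    scores = np.array([token_scores[i] + pooled[object_ids[i]] for i in range(n)])
    # Anchor tokens are protected and always kept.
    keep = sorted(anchors)
    rest = [i for i in np.argsort(scores)[::-1] if i not in anchors]
    keep += rest[: max(0, budget - len(keep))]
    keep = sorted(keep)
    return keys[keep], values[keep], object_ids[keep]
```

Because anchors are taken off the top before score-based selection, the compressed cache is always exactly `budget` tokens (assuming `budget` is at least the anchor count), which is what keeps memory bounded regardless of stream length.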
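The "lightweight utility-cost objective" in RD-Lite can be pictured as a greedy rate-allocation loop. The sketch below is an assumption-heavy illustration: the name `rd_lite_allocate`, the log-shaped diminishing-returns utility, and the gain-per-cost greedy rule are not from the paper, only a plausible instance of per-object budgeting.

```python
import heapq
import math

def rd_lite_allocate(utilities, costs, total_budget):
    """Illustrative greedy utility-cost budget split (assumed, not the
    paper's RD-Lite): repeatedly grant one more token slot to the object
    with the best marginal utility per unit cost, under a concave
    diminishing-returns utility u_i * log(1 + b_i)."""
    n = len(utilities)
    alloc = [0] * n
    # Max-heap keyed on (negated) marginal gain-per-cost of the next slot.
    heap = []
    for i, (u, c) in enumerate(zip(utilities, costs)):
        gain = u * (math.log(2) - math.log(1))  # gain of the first slot
        heapq.heappush(heap, (-gain / c, i))
    spent = 0
    while heap:
        _, i = heapq.heappop(heap)
        if spent + costs[i] > total_budget:
            continue  # this object can no longer afford another slot
        alloc[i] += 1
        spent += costs[i]
        b = alloc[i]
        # Marginal gain of the (b+1)-th slot under the concave utility.
        gain = utilities[i] * (math.log(b + 2) - math.log(b + 1))
        heapq.heappush(heap, (-gain / costs[i], i))
    return alloc
```

The concavity is what makes the greedy loop sensible: each extra slot for the same object is worth less, so high-utility objects get larger but not unbounded shares of the total cache budget.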

Comparison Results

Visualization Comparison 1

Visualization Comparison 2

Conclusion

Zip-VGGT formulates long-horizon streaming 4D perception as bounded-memory inference with coupled objectives: memory efficiency, geometric fidelity, and dynamic continuity. By unifying object-centric memory organization with RD-Lite allocation, dual-bank retention, and history-aware pose-query suppression, the framework maintains stable streaming performance and avoids unbounded cache growth. Ablations show these modules are complementary, supporting object-centric memory management as an effective strategy for robust long-horizon streaming vision.