APUT: Large Language Models as Cross-Modal Consistency Reasoning Engines for Egocentric State Estimation
- Authors
Haochen Huang
- Keywords:
- cross-modal consistency, egocentric state estimation, active perception, object permanence, constraint satisfaction
- Abstract
Current multimodal large language models (LLMs) excel at passive representation alignment. In highly dynamic, partially observable environments, however, such as egocentric vision with severe occlusion and the "cocktail party" effect, standard models suffer from representational collapse: unobserved targets are frequently dropped from working memory, or their states are hallucinated. To address this, we propose the Active-Perception Unified Transformer (APUT). APUT is designed not as an improved multimodal tracker but as a reasoning-based state estimator, in which identity tracking is treated as a dynamic constraint satisfaction problem (CSP). A dedicated encoder models a structured state matrix, so that the LLM operates over explicit, tokenized world states. The underlying argument is that implicit latent tracking collapses under contradictory cross-modal evidence because its state representations are entangled, whereas explicit constraint graphs allow discrete hypothesis elimination. We further introduce a partial delta-update mechanism and a multiphase training pipeline that ranges from teacher-forced alignment to critic-guided hypothesis ranking. With this architecture, we demonstrate how robust object permanence emerges from internal logical coherence under supervised structured learning, offering a new paradigm for embodied state estimation.
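The abstract does not give implementation details, so the following is only a minimal illustrative sketch of two of the ideas it names: discrete hypothesis elimination under cross-modal constraints, and a partial delta update of an explicit state. All names here (`enumerate_hypotheses`, `eliminate`, `apply_delta`, the track and identity labels, and the dictionary-based state) are hypothetical and are not taken from APUT itself, which operates on tokenized states inside an LLM rather than on Python dictionaries.

```python
from itertools import permutations


def enumerate_hypotheses(tracks, identities):
    """Every one-to-one assignment of identities to tracks (assumed hypothesis space)."""
    return [dict(zip(tracks, perm)) for perm in permutations(identities, len(tracks))]


def eliminate(hypotheses, constraints):
    """Discard any hypothesis that violates at least one constraint."""
    return [h for h in hypotheses if all(check(h) for check in constraints)]


def apply_delta(state, delta):
    """Partial update: overwrite only the fields named in delta, keep the rest."""
    new_state = {entity: dict(fields) for entity, fields in state.items()}
    for entity, fields in delta.items():
        new_state.setdefault(entity, {}).update(fields)
    return new_state


# Hypothetical scene: three tracks, three candidate identities.
tracks = ["t1", "t2", "t3"]
identities = ["alice", "bob", "carol"]
hypotheses = enumerate_hypotheses(tracks, identities)

# Hypothetical cross-modal evidence expressed as hard constraints.
constraints = [
    lambda h: h["t1"] != "bob",    # vision: the face on t1 does not match bob
    lambda h: h["t2"] == "carol",  # audio: the voice localized at t2 matches carol
]
surviving = eliminate(hypotheses, constraints)

# Explicit state with a partial update when bob becomes occluded.
state = {
    "alice": {"visible": True, "location": "table"},
    "bob": {"visible": True, "location": "door"},
}
updated = apply_delta(state, {"bob": {"visible": False}})

print(surviving)
print(updated["bob"])
```

In this toy setting the two constraints leave a single consistent assignment, and the delta update marks the occluded target as not visible while its last known location is retained rather than dropped, which is the behavior the abstract associates with object permanence.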
- Published
- 2026-06-03
- Section
- Articles