APUT: Large Language Models as Cross-Modal Consistency Reasoning Engines for Egocentric State Estimation

Haochen Huang

doi:10.66393/t13z9817

DOI: https://doi.org/10.66393/t13z9817
Keywords: cross-modal consistency, egocentric state estimation, active perception, object permanence, constraint satisfaction
Abstract: Current multimodal large language models (LLMs) excel at passive representation alignment. However, in highly dynamic, partially observable environments (e.g., egocentric vision with severe occlusion and the “cocktail party” effect), standard models suffer from representational collapse: unobserved targets are frequently dropped from working memory, or their states are hallucinated. To address this, the Active-Perception Unified Transformer (APUT) is proposed. Crucially, APUT is designed not as an improved multimodal tracker but as a reasoning-based state estimator, and identity tracking is treated as a dynamic constraint satisfaction problem (CSP). A dedicated encoder models a structured state matrix, so the LLM operates over explicit, tokenized world states. Implicit latent tracking collapses under contradictory cross-modal evidence because its state representations are entangled, whereas explicit constraint graphs allow discrete hypothesis elimination. Furthermore, a partial delta-update mechanism and a multiphase training pipeline, ranging from teacher-forced alignment to critic-guided hypothesis ranking, are introduced. This architecture demonstrates how robust object permanence emerges from internal logical coherence under supervised structured learning, providing a novel paradigm for embodied state estimation.
Published: 2026-06-03
Issue: Vol. 2 No. 4 (2026): Issue 3
Section: Articles