Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This formulation produces modality-specific and unified latent action representations that serve as pseudo-labels for the depth-aware pretraining of UniLACT. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness of unified latent action representations. UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks.
We introduce UniLACT, a transformer-based vision-language-action model that injects geometric structure into latent action representations. Our method involves a three-stage training pipeline. (1) Unified latent action learning (UniLARN): we use UniLARN to learn modality-specific latent actions from RGB and depth, along with a unified latent space that captures cross-modal interactions. (2) Unified latent pretraining: UniLACT is pretrained to predict these modality-specific and unified latent actions, providing supervision without explicit robot action annotations. (3) Action fine-tuning: UniLACT is then fine-tuned on robot demonstrations to map predicted unified latent actions to executable control policies.
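To make the first stage concrete, the following is a minimal numpy sketch of inverse/forward dynamics latent action learning with a unified cross-modal latent. All dimensions, function names, and the mean-based fusion are illustrative assumptions, not the paper's actual architecture: in practice the inverse dynamics model (IDM), forward dynamics model (FDM), and fusion module would be learned neural networks trained end to end.

```python
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_LAT = 64, 8  # assumed feature and latent-action dimensions

# Assumed toy parameterization: random linear maps stand in for the
# learned IDM/FDM networks of each modality (RGB and depth).
W_idm = {m: rng.normal(scale=0.1, size=(2 * D_OBS, D_LAT)) for m in ("rgb", "depth")}
W_fdm = {m: rng.normal(scale=0.1, size=(D_OBS + D_LAT, D_OBS)) for m in ("rgb", "depth")}

def idm(m, o_t, o_next):
    # Inverse dynamics: infer a latent action from consecutive observations.
    return np.tanh(np.concatenate([o_t, o_next]) @ W_idm[m])

def fdm(m, o_t, z):
    # Forward dynamics: predict the next observation from (obs, latent action).
    return np.concatenate([o_t, z]) @ W_fdm[m]

# Toy observation features for one transition, per modality.
obs = {m: (rng.normal(size=D_OBS), rng.normal(size=D_OBS)) for m in ("rgb", "depth")}

# Modality-specific latent actions from each IDM.
z = {m: idm(m, *obs[m]) for m in ("rgb", "depth")}

# Unified latent action: here a simple average of the two modality
# latents (a placeholder for a learned cross-modal fusion module).
z_uni = 0.5 * (z["rgb"] + z["depth"])

# Self-supervised objective: each FDM must reconstruct its modality's
# next observation from both its own latent and the unified latent,
# tying the shared latent space to both RGB appearance and 3D geometry.
loss = sum(
    np.mean((fdm(m, obs[m][0], a) - obs[m][1]) ** 2)
    for m in ("rgb", "depth")
    for a in (z[m], z_uni)
)
print(float(loss))
```

In the actual framework, the resulting modality-specific and unified latents would then serve as pseudo-labels for the depth-aware pretraining stage.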
We evaluate UniLACT on the CALVIN simulation benchmark in the unseen D environment and compare it against various language-conditioned action policies. UniLACT consistently outperforms the RGB latent-action method Moto under both in-domain and out-of-domain pretraining settings. These results highlight the benefit of depth-aware unified latent action representations for precise grasping and manipulation.
To better understand where depth actually helps, we further analyze CALVIN tasks. Moto (RGB latents) and UniLACT (RGB+depth latents) are comparable on appearance-driven tasks (e.g., open drawer, rotate blue block). In contrast, UniLACT significantly improves on tasks requiring spatial precision (e.g., move the slider, turn on the LED), suggesting that depth injects stronger 3D geometric priors into the latent action space.