We present AutoFocus-IL, a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Saliency regularization has emerged as a promising way to achieve this, but existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Our findings highlight that VLM-driven saliency provides a scalable, annotation-free path toward robust imitation learning in robotics. In particular, our experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data.
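As an illustration of the regularization step, the following is a minimal PyTorch sketch of a saliency-regularized behavior cloning objective. It is a hypothetical example, not the paper's exact loss: the input-gradient saliency, the MSE action loss, the penalty form, and the weight `lam` are assumptions made for illustration; the sketch only shows how a VLM-generated map can be used to suppress attention to task-irrelevant regions.

```python
# Minimal sketch of a saliency-regularized behavior cloning objective
# (hypothetical loss form; the exact regularizer in AutoFocus-IL may differ).
import torch
import torch.nn.functional as F


def saliency_regularized_bc_loss(policy, obs, expert_action, vlm_saliency, lam=0.1):
    """BC loss plus a penalty on policy saliency falling outside VLM-highlighted regions.

    policy:        maps (B, C, H, W) images to (B, A) actions
    obs:           (B, C, H, W) image observations
    expert_action: (B, A) demonstrated actions
    vlm_saliency:  (B, H, W) VLM-generated saliency maps in [0, 1]
    lam:           regularization weight (assumed hyperparameter)
    """
    obs = obs.clone().requires_grad_(True)
    pred_action = policy(obs)
    bc_loss = F.mse_loss(pred_action, expert_action)

    # Input-gradient saliency of the policy, normalized to sum to 1 per image.
    grads = torch.autograd.grad(bc_loss, obs, create_graph=True)[0]
    policy_saliency = grads.abs().sum(dim=1)  # (B, H, W)
    policy_saliency = policy_saliency / (
        policy_saliency.sum(dim=(1, 2), keepdim=True) + 1e-8
    )

    # Fraction of the policy's attention mass placed outside task-relevant regions.
    reg = (policy_saliency * (1.0 - vlm_saliency)).sum(dim=(1, 2)).mean()
    return bc_loss + lam * reg
```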
In this section, we provide example prompts used to generate saliency maps with the VLM. We used the Qwen2.5-VL-72B-Instruct model to generate the saliency maps. The process of building a saliency map consists of several consecutive steps. In the following, we first show the prompts used in the CARLA environment, followed by those used in the real-world robot setup; a sketch of how the prompting stages can be wired together is given after the prompt list.
1. Trajectory Description Prompt for the CARLA Environment:
2. Filtering Stage Prompt for the CARLA Environment:
1. Trajectory Description Prompt for the WidowX Environment:
2. Filtering Stage Prompt for the WidowX Environment:
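To show how these prompts can fit together, the snippet below is a minimal sketch of the two prompting stages, assuming Qwen2.5-VL-72B-Instruct is served behind an OpenAI-compatible endpoint (e.g., via vLLM). The endpoint URL, frame file names, and the `TRAJECTORY_DESCRIPTION_PROMPT` / `FILTERING_STAGE_PROMPT` placeholders are illustrative; the actual prompt texts are the ones listed above.

```python
# Hypothetical wiring of the two prompting stages; assumes an
# OpenAI-compatible server (e.g., vLLM) hosting Qwen2.5-VL-72B-Instruct.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Qwen/Qwen2.5-VL-72B-Instruct"

TRAJECTORY_DESCRIPTION_PROMPT = "..."  # prompt text shown above (per environment)
FILTERING_STAGE_PROMPT = "..."         # prompt text shown above (per environment)


def encode_frame(path: str) -> str:
    """Return a base64 data URL for one trajectory frame."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def query_vlm(prompt: str, frame_paths: list[str]) -> str:
    """Send the prompt together with trajectory frames to the VLM."""
    content = [{"type": "text", "text": prompt}]
    content += [
        {"type": "image_url", "image_url": {"url": encode_frame(p)}}
        for p in frame_paths
    ]
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content


# Stage 1: describe the trajectory and enumerate candidate objects.
description = query_vlm(TRAJECTORY_DESCRIPTION_PROMPT, frame_paths=["frame_000.jpg"])

# Stage 2: filter the candidates down to task-relevant objects, which are then
# tracked across frames to build the temporal saliency maps.
relevant_objects = query_vlm(
    FILTERING_STAGE_PROMPT + "\n\nTrajectory description:\n" + description,
    frame_paths=["frame_000.jpg"],
)
```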