AutoFocus-IL: VLM-based Saliency Maps for
Data-Efficient Visual Imitation Learning
without Extra Human Annotations

Anonymous Authors
Anonymous Institution
IEEE International Conference on Robotics and Automation (ICRA) 2026.

Can VLM-generated saliency maps improve imitation learning?

[Overview figure]

An overview of three approaches: traditional imitation learning suffers from causal confusion; gaze-based IL mitigates this by collecting human eye-gaze data, an expensive form of supervision; AutoFocus-IL instead obtains saliency maps annotated by a VLM, retaining the benefits of gaze-based IL without the extra data-collection cost.

Abstract

We present AutoFocus-IL, a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Saliency regularization has emerged as a promising way to achieve this, but existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Our findings highlight that VLM-driven saliency provides a scalable, annotation-free path toward robust imitation learning in robotics. In particular, our experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data.

[Method figure]
Overview of the AutoFocus-IL pipeline.
We first use a VLM to identify and track task-relevant objects. Then, we generate temporal saliency maps that highlight these objects in each frame of the demonstration. Finally, we use these saliency maps to regularize the behavior cloning policy, encouraging it to focus on the highlighted regions while suppressing distractors.
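To make the regularization step concrete, below is a minimal sketch of a saliency-regularized behavior cloning loss, assuming a policy that exposes a spatial attention map and using a KL-style alignment term with weight lambda_sal. These choices are illustrative assumptions, not the exact formulation used in AutoFocus-IL.

```python
# Minimal sketch of saliency-regularized behavior cloning (PyTorch).
# NOTE: the policy interface, the KL-based alignment term, and lambda_sal
# are illustrative assumptions, not the paper's exact loss.
import torch
import torch.nn.functional as F

def saliency_regularized_bc_loss(policy, obs, expert_action, vlm_saliency, lambda_sal=0.1):
    """obs: (B, C, H, W) images; vlm_saliency: (B, 1, h, w) maps in [0, 1]."""
    # Assumed policy interface: returns the action and a spatial attention map (B, 1, h, w).
    pred_action, attn_map = policy(obs)
    bc_loss = F.mse_loss(pred_action, expert_action)

    # Normalize both maps into spatial probability distributions.
    B = attn_map.shape[0]
    p = F.softmax(attn_map.view(B, -1), dim=-1)
    q = vlm_saliency.view(B, -1)
    q = q / (q.sum(dim=-1, keepdim=True) + 1e-8)

    # KL(q || p): penalize the policy for ignoring VLM-salient regions.
    sal_loss = (q * (torch.log(q + 1e-8) - torch.log(p + 1e-8))).sum(dim=-1).mean()
    return bc_loss + lambda_sal * sal_loss
```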

Visualization of Tasks

This section contains visualizations of our experimental setups and data-processing steps. Both setups include original and confounded environments. In the confounded CARLA environment, action-conditioned icons are overlaid on top of each frame, as indicated in the videos below. These overlays do not affect the dynamics or the expert actions, but they introduce spurious correlations intended to test robustness against causal confusion.
In the real-world robot setup, we design a confounded variant by placing task-irrelevant distractor objects in the background to induce visual causal confusion. Although the scene contains many distractors, AutoFocus-IL attends only to the most task-relevant objects, as shown in the videos below.
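For illustration, the sketch below shows how an action-conditioned overlay could be rendered for the confounded CARLA variant. The icon patches, the steering threshold, and the helper names are hypothetical, but the key property matches the description above: the overlay depends only on the expert action and leaves dynamics and action labels untouched.

```python
# Hypothetical sketch of the action-conditioned confounder: a small icon,
# chosen from the expert steering command, is pasted onto each frame.
# Dynamics and the recorded action labels are left unchanged.
import numpy as np

ICONS = {  # action id -> small RGB patch (solid colors as stand-ins for real icons)
    "left":     np.full((24, 24, 3), (255, 0, 0), dtype=np.uint8),
    "straight": np.full((24, 24, 3), (0, 255, 0), dtype=np.uint8),
    "right":    np.full((24, 24, 3), (0, 0, 255), dtype=np.uint8),
}

def discretize_steering(steer: float) -> str:
    if steer < -0.1:
        return "left"
    if steer > 0.1:
        return "right"
    return "straight"

def add_confounder(frame: np.ndarray, steer: float) -> np.ndarray:
    """Return a copy of `frame` (H, W, 3) with an action-conditioned icon in the top-left corner."""
    icon = ICONS[discretize_steering(steer)]
    out = frame.copy()
    out[:icon.shape[0], :icon.shape[1]] = icon
    return out
```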

Filtering via VLM

In this part, we visualize how AutoFocus-IL uses the VLM to mimic human eye gaze by filtering out task-irrelevant factors in the input, step by step. The leftmost column shows the raw observation; the second column shows the VLM's output when queried for object detection; the third column shows the task-relevant objects retained after VLM filtering; and the rightmost column shows the overlaid saliency map.
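The final step, turning the filtered detections into an overlaid saliency map, can be approximated as in the sketch below. The Gaussian-blob rendering and the (x1, y1, x2, y2) box format are assumptions made for illustration; the exact rendering used in AutoFocus-IL may differ.

```python
# Illustrative sketch: convert VLM-filtered bounding boxes into a saliency map
# by placing a Gaussian blob on each task-relevant object. The box format
# (x1, y1, x2, y2) and the Gaussian rendering are assumptions, not the exact procedure.
import numpy as np

def boxes_to_saliency(boxes, height, width, sigma_scale=0.5):
    ys, xs = np.mgrid[0:height, 0:width]
    saliency = np.zeros((height, width), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        sx = max((x2 - x1) * sigma_scale, 1.0)
        sy = max((y2 - y1) * sigma_scale, 1.0)
        blob = np.exp(-(((xs - cx) ** 2) / (2 * sx ** 2) + ((ys - cy) ** 2) / (2 * sy ** 2)))
        saliency = np.maximum(saliency, blob)  # keep the strongest blob at each pixel
    return saliency / (saliency.max() + 1e-8)  # normalize to [0, 1]
```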


Task: Lift Carrot (Confounded)

Raw Observation

Key Objects Detected

Task-Relevant Objects

Saliency Map


Task: Pull Pot (Confounded)

Raw Observation

Key Objects Detected

Task-Relevant Objects

Saliency Map


Task: Turn Left (Confounded)

Raw Observation

Key Objects Detected

Task-Relevant Objects

Saliency Map


Task: Change Lane (Confounded)

Raw Observation

Key Objects Detected

Task-Relevant Objects

Saliency Map


Advantage of AutoFocus-IL over BC

In this part, we visualize how AutoFocus-IL completes each task successfully while BC fails due to causal confusion. For each task, the left column shows the expert demonstration, the middle column shows the BC policy's rollout, and the right column shows AutoFocus-IL's rollout.


Task: Lift the Carrot (Confounded)

Expert Demonstration

BC

AutoFocus-IL


Task: Lift the Carrot (Original)

Expert Demonstration

BC

AutoFocus-IL


Task: Pull the Pot (Confounded)

Expert Demonstration

BC

AutoFocus-IL


Task: Pull the Pot (Original)

Expert Demonstration

BC

AutoFocus-IL


VLM Prompts

In this section, we provide example prompts used to generate saliency maps with the VLM. We use the Qwen2.5-VL-72B-Instruct model to generate the saliency maps. Building a saliency map consists of several consecutive steps. Below, we first show the prompts used in the CARLA environment, followed by the prompts used in the real-world robot setup.
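For reference, the sketch below shows one way these queries could be issued, assuming the model is served behind an OpenAI-compatible endpoint (e.g., via vLLM). The endpoint URL, prompt text, and helper name are placeholders rather than the exact setup used in our experiments.

```python
# Hedged sketch: query Qwen2.5-VL-72B-Instruct through an OpenAI-compatible
# server (e.g., vLLM). The endpoint, prompt text, and frame path are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def describe_frame(image_path: str, prompt: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-72B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```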

CARLA

1. Trajectory Description Prompt of CARLA Environment:

2. Filtering Stage Prompt

Real-World Robot

1. Trajectory Description Prompt of WidowX Environment:

2. Filtering Stage Prompt



Results

CARLA

Evaluation is conducted on two disjoint sets: (1) seen routes, which are the same 10 routes used for training but with 2 different random seeds (20 evaluations per method), and (2) unseen routes, a disjoint set of 10 held-out routes, also with 2 seeds each (20 evaluations per method).
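A minimal sketch of this evaluation protocol is shown below; the route indices, seed values, and the run_episode helper are hypothetical placeholders used only to illustrate the 10-route, 2-seed structure.

```python
# Sketch of the CARLA evaluation protocol: 10 seen + 10 unseen routes,
# 2 random seeds each, i.e., 20 evaluations per method per split.
# Route IDs, seeds, and `run_episode` are hypothetical placeholders.
SEEN_ROUTES = list(range(10))        # same routes used for training
UNSEEN_ROUTES = list(range(10, 20))  # disjoint held-out routes
SEEDS = [0, 1]

def evaluate(policy, run_episode):
    results = {"seen": [], "unseen": []}
    for split, routes in [("seen", SEEN_ROUTES), ("unseen", UNSEEN_ROUTES)]:
        for route in routes:
            for seed in SEEDS:
                results[split].append(run_episode(policy, route=route, seed=seed))
    return results  # 20 outcomes per split
```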


Original Environment


[Results table]

Confounded Environment


[Results table]

Real-World Robot

For each of the four settings, we perform 10 rollouts and report the number of successful episodes (task completions). We compare AutoFocus-IL against Behavior Cloning (BC).
Results are summarized in the table below. AutoFocus-IL consistently improves success on both tasks, with the largest gains under the confounded setting. These trends mirror our simulation findings: VLM-driven, object-centric saliency helps the policy focus on causal scene elements and markedly improves robustness in the presence of unrelated visual clutter, without any additional human attention labels. Note that prior baselines such as GABRIL are not applicable to this task, as they require human gaze data, which necessitates either specialized equipment for gaze collection or complex projection methods to map gaze onto the robot's point-of-view camera.


Table of Real-Robot Results

[Results table]