AutoFocus-IL: VLM-based Saliency Maps for
Data-Efficient Visual Imitation Learning
without Extra Human Annotations

Anonymous Authors
Anonymous Institution
IEEE International Conference on Robotics and Automation (ICRA) 2026.

Can VLM-generated saliency maps improve imitation learning?

[Overview figure]

An overview of three approaches: traditional imitation learning suffers from causal confusion; gaze-based IL mitigates this by collecting human eye-gaze data, an expensive form of supervision; AutoFocus-IL instead obtains saliency maps annotated by a VLM, retaining the benefits of gaze-based IL without the extra data-collection cost.

Abstract

We present AutoFocus-IL, a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Saliency regularization has emerged as a promising way to achieve this, but existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Our findings highlight that VLM-driven saliency provides a scalable, annotation-free path toward robust imitation learning in robotics. In particular, our experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data.

[Method figure]
Overview of the AutoFocus-IL pipeline.
We first use a VLM to identify and track task-relevant objects. Then, we generate temporal saliency maps that highlight these objects in each frame of the demonstration. Finally, we use these saliency maps to regularize the behavior cloning policy, encouraging it to focus on the highlighted regions while suppressing distractors.
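To make the regularization step concrete, below is a minimal sketch of a saliency-regularized behavior cloning loss, assuming a policy that exposes a spatial attention map and using a KL-style alignment term with weight lambda_sal. These choices are illustrative assumptions, not the exact formulation used in AutoFocus-IL.

```python
# Minimal sketch of saliency-regularized behavior cloning (PyTorch).
# NOTE: the policy interface, the KL-based alignment term, and lambda_sal
# are illustrative assumptions, not the paper's exact loss.
import torch
import torch.nn.functional as F

def saliency_regularized_bc_loss(policy, obs, expert_action, vlm_saliency, lambda_sal=0.1):
    """obs: (B, C, H, W) images; vlm_saliency: (B, 1, h, w) maps in [0, 1]."""
    # Assumed policy interface: returns the action and a spatial attention map (B, 1, h, w).
    pred_action, attn_map = policy(obs)
    bc_loss = F.mse_loss(pred_action, expert_action)

    # Normalize both maps into spatial probability distributions.
    B = attn_map.shape[0]
    p = F.softmax(attn_map.view(B, -1), dim=-1)
    q = vlm_saliency.view(B, -1)
    q = q / (q.sum(dim=-1, keepdim=True) + 1e-8)

    # KL(q || p): penalize the policy for ignoring VLM-salient regions.
    sal_loss = (q * (torch.log(q + 1e-8) - torch.log(p + 1e-8))).sum(dim=-1).mean()
    return bc_loss + lambda_sal * sal_loss
```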

Visualization of Tasks

This section contains visualizations of our experimental setups and data-processing steps. Both setups include original and confounded environments. In the confounded CARLA environment, action-conditioned icons are overlaid on top of each frame, as indicated in the videos below. These overlays do not affect the dynamics or the expert actions, but they introduce spurious correlations intended to test robustness against causal confusion.
In the real-world robot setup, we design a confounded variant by placing task-irrelevant distractor objects in the background to induce visual causal confusion. Although the scene contains many distractors, AutoFocus-IL attends only to the most task-relevant objects, as shown in the videos below.
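For illustration, the sketch below shows how an action-conditioned overlay could be rendered for the confounded CARLA variant. The icon patches, the steering threshold, and the helper names are hypothetical, but the key property matches the description above: the overlay depends only on the expert action and leaves dynamics and action labels untouched.

```python
# Hypothetical sketch of the action-conditioned confounder: a small icon,
# chosen from the expert steering command, is pasted onto each frame.
# Dynamics and the recorded action labels are left unchanged.
import numpy as np

ICONS = {  # action id -> small RGB patch (solid colors as stand-ins for real icons)
    "left":     np.full((24, 24, 3), (255, 0, 0), dtype=np.uint8),
    "straight": np.full((24, 24, 3), (0, 255, 0), dtype=np.uint8),
    "right":    np.full((24, 24, 3), (0, 0, 255), dtype=np.uint8),
}

def discretize_steering(steer: float) -> str:
    if steer < -0.1:
        return "left"
    if steer > 0.1:
        return "right"
    return "straight"

def add_confounder(frame: np.ndarray, steer: float) -> np.ndarray:
    """Return a copy of `frame` (H, W, 3) with an action-conditioned icon in the top-left corner."""
    icon = ICONS[discretize_steering(steer)]
    out = frame.copy()
    out[:icon.shape[0], :icon.shape[1]] = icon
    return out
```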

Filtering via VLM

In this part, we visualize how AutoFocus-IL uses the VLM to mimic human eye gaze by filtering out task-irrelevant factors in the input, step by step. The leftmost column shows the raw observation; the second column shows the VLM's output when queried for object detection; the third column shows the task-relevant objects retained after VLM filtering; and the rightmost column shows the overlaid saliency map.
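The final step, turning the filtered detections into an overlaid saliency map, can be approximated as in the sketch below. The Gaussian-blob rendering and the (x1, y1, x2, y2) box format are assumptions made for illustration; the exact rendering used in AutoFocus-IL may differ.

```python
# Illustrative sketch: convert VLM-filtered bounding boxes into a saliency map
# by placing a Gaussian blob on each task-relevant object. The box format
# (x1, y1, x2, y2) and the Gaussian rendering are assumptions, not the exact procedure.
import numpy as np

def boxes_to_saliency(boxes, height, width, sigma_scale=0.5):
    ys, xs = np.mgrid[0:height, 0:width]
    saliency = np.zeros((height, width), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        sx = max((x2 - x1) * sigma_scale, 1.0)
        sy = max((y2 - y1) * sigma_scale, 1.0)
        blob = np.exp(-(((xs - cx) ** 2) / (2 * sx ** 2) + ((ys - cy) ** 2) / (2 * sy ** 2)))
        saliency = np.maximum(saliency, blob)  # keep the strongest blob at each pixel
    return saliency / (saliency.max() + 1e-8)  # normalize to [0, 1]
```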


Task: Lift Carrot (Confounded)

Raw Observation

Key Objects Detected

Task-Relevant Objects

Saliency Map


Task: Pull Pot (Confounded)

Raw Observation

Key Objects Detected

Task-Relevant Objects

Saliency Map


Task: Turn Left (Confounded)

Raw Observation

Key Objects Detected

Task-Relevant Objects

Saliency Map


Task: Change Lane (Confounded)

Raw Observation

Key Objects Detected

Task-Relevant Objects

Saliency Map


Advantage of AutoFocus-IL over BC

In this part, we visualize how AutoFocus-IL completes each task successfully while BC fails due to causal confusion. For each task, the left column shows the expert demonstration, the middle column shows the BC policy's rollout, and the right column shows AutoFocus-IL's rollout.


Task: Lift the Carrot (Confounded)

Expert Demonstration

BC

AutoFocus-IL


Task: Lift the Carrot (Original)

Expert Demonstration

BC

AutoFocus-IL


Task: Pull the Pot (Confounded)

Expert Demonstration

BC

AutoFocus-IL


Task: Pull the Pot (Original)

Expert Demonstration

BC

AutoFocus-IL


VLM Prompts

In this section, we provide example prompts used to generate saliency maps with the VLM. We use the Qwen2.5-VL-72B-Instruct model to generate the saliency maps. Building a saliency map consists of several consecutive steps. Below, we first show the prompts used in the CARLA environment, followed by the prompts used in the real-world robot setup.
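For reference, the sketch below shows one way these queries could be issued, assuming the model is served behind an OpenAI-compatible endpoint (e.g., via vLLM). The endpoint URL, prompt text, and helper name are placeholders rather than the exact setup used in our experiments.

```python
# Hedged sketch: query Qwen2.5-VL-72B-Instruct through an OpenAI-compatible
# server (e.g., vLLM). The endpoint, prompt text, and frame path are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def describe_frame(image_path: str, prompt: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-72B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```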

CARLA

1. Trajectory Description Prompt of CARLA Environment:

2. Filtering Stage Prompt

Real-World Robot

1. Trajectory Description Prompt of WidowX Environment:

2. Filtering Stage Prompt



Results

CARLA

Evaluation is conducted on two disjoint sets: (1) seen routes, which are the same 10 routes used for training but with 2 different random seeds (20 evaluations per method), and (2) unseen routes, a disjoint set of 10 held-out routes, also with 2 seeds each (20 evaluations per method).
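A minimal sketch of this evaluation protocol is shown below; the route indices, seed values, and the run_episode helper are hypothetical placeholders used only to illustrate the 10-route, 2-seed structure.

```python
# Sketch of the CARLA evaluation protocol: 10 seen + 10 unseen routes,
# 2 random seeds each, i.e., 20 evaluations per method per split.
# Route IDs, seeds, and `run_episode` are hypothetical placeholders.
SEEN_ROUTES = list(range(10))        # same routes used for training
UNSEEN_ROUTES = list(range(10, 20))  # disjoint held-out routes
SEEDS = [0, 1]

def evaluate(policy, run_episode):
    results = {"seen": [], "unseen": []}
    for split, routes in [("seen", SEEN_ROUTES), ("unseen", UNSEEN_ROUTES)]:
        for route in routes:
            for seed in SEEDS:
                results[split].append(run_episode(policy, route=route, seed=seed))
    return results  # 20 outcomes per split
```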


Original Environment


[Results table]

Confounded Environment


[Results table]

Real-World Robot

For each of the four settings, we perform 10 rollouts and report the number of successful episodes (task completions). We compare AutoFocus-IL against Behavior Cloning (BC).
Results are summarized in the table below. AutoFocus-IL consistently improves success on both tasks, with the largest gains under the confounded setting. These trends mirror our simulation findings: VLM-driven, object-centric saliency helps the policy focus on causal scene elements and markedly improves robustness in the presence of unrelated visual clutter, without any additional human attention labels. Note that prior baselines such as GABRIL are not applicable to this task, as they require human gaze data, which necessitates either specialized equipment for gaze collection or complex projection methods to map gaze onto the robot's point-of-view camera.


Table of Real-Robot Results

[Results table]