Set-Of-Marks Prompting

Definition

A prompting technique that converts the structural information of a UI (the DOM tree for web pages, visually detected bounding boxes for native OS interfaces) into visual markers, each labeled with an ID, so that an LLM can refer to an element by its label instead of by pixel coordinates. It leverages the model's strengths in vision and perception while avoiding the localization problem of having to output exact coordinates.
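
As a rough illustration of the labeling step, the sketch below assigns numeric IDs to a list of UI elements and builds a text legend that could accompany the annotated screenshot. The names `Mark`, `assign_marks`, and `legend`, the `(left, top, right, bottom)` box format, and the reading-order sort are all assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class Mark:
    mark_id: int  # the label drawn on the screenshot
    role: str     # e.g. "button" or "input" (hypothetical element description)
    box: Box


def assign_marks(elements: List[Tuple[str, Box]]) -> List[Mark]:
    """Give each (role, box) element a sequential ID, starting at 1."""
    # Sort top-to-bottom, then left-to-right, so IDs follow reading order.
    ordered = sorted(elements, key=lambda e: (e[1][1], e[1][0]))
    return [Mark(i, role, box) for i, (role, box) in enumerate(ordered, start=1)]


def legend(marks: List[Mark]) -> str:
    """Build a text listing of the marks to send alongside the image."""
    return "\n".join(f"[{m.mark_id}] {m.role}" for m in marks)
```

Drawing the labels onto the screenshot is left out here; any image library could render each `mark_id` at its box.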

Examples in the Wild

  • Example 1: Browser automation: the DOM structure is converted to labeled bounding boxes, so the model outputs 'click 4' instead of 'click 443 213'
  • Example 2: OS automation: UI elements detected by YOLO are labeled with IDs, and each ID is mapped to the element's center coordinates
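
Both examples end with the same step: turning a label-based action from the model back into screen coordinates. A minimal sketch of that step follows, assuming actions arrive as a verb followed by a mark ID (such as 'click 4') and that boxes are stored as `(left, top, right, bottom)` pixel tuples. The function names and the sample box are hypothetical.

```python
import re
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (left, top, right, bottom) in pixels


def center(box: Box) -> Tuple[int, int]:
    """Return the integer center point of a bounding box."""
    left, top, right, bottom = box
    return ((left + right) // 2, (top + bottom) // 2)


def resolve_action(action: str, boxes_by_id: Dict[int, Box]) -> Tuple[str, Tuple[int, int]]:
    """Translate an action like 'click 4' into (verb, (x, y))."""
    match = re.fullmatch(r"\s*(\w+)\s+(\d+)\s*", action)
    if match is None:
        raise ValueError(f"unrecognized action: {action!r}")
    verb, mark_id = match.group(1), int(match.group(2))
    if mark_id not in boxes_by_id:
        # The model referenced a label that was never drawn.
        raise KeyError(f"unknown mark id: {mark_id}")
    return verb, center(boxes_by_id[mark_id])


# Hypothetical box for mark 4, chosen so its center is (443, 213).
boxes = {4: (400, 200, 486, 226)}
print(resolve_action("click 4", boxes))  # ('click', (443, 213))
```

Rejecting unknown IDs explicitly is a design choice in this sketch: it surfaces a hallucinated label as an error instead of clicking an arbitrary location.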