Vision-Based OS Automation with SoMatic
Enable AI agents to autonomously control any OS interface using vision and Set-Of-Marks prompting
About this automation
SoMatic uses a fine-tuned YOLO model, running locally on the CPU through ONNX, to identify text and interactable elements in any UI. It draws labeled bounding boxes and maps each label ID to the element's coordinates, which enables Set-Of-Marks prompting for native OS automation. It achieves ~20% higher accuracy than the raw model on GPT-4.5.
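To make the ID-to-coordinate mapping concrete, here is a minimal sketch of the general Set-Of-Marks idea in Python. It is not SoMatic's actual code: the Box type, the build_mark_map and center helpers, and the sample detections are all hypothetical, and the reading-order ID assignment is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical detection: pixel corners plus an element kind."""
    x1: int
    y1: int
    x2: int
    y2: int
    kind: str  # e.g. "text" or "interactable"


def build_mark_map(boxes):
    """Assign sequential mark IDs in reading order (top to bottom, then left to right)."""
    ordered = sorted(boxes, key=lambda b: (b.y1, b.x1))
    return {mark_id: box for mark_id, box in enumerate(ordered, start=1)}


def center(box):
    """Return the integer pixel center of a box, a natural click target."""
    return ((box.x1 + box.x2) // 2, (box.y1 + box.y2) // 2)


# Made-up detections standing in for the detector's output.
detections = [
    Box(200, 40, 300, 80, "interactable"),
    Box(20, 40, 120, 80, "interactable"),
    Box(20, 120, 400, 150, "text"),
]

marks = build_mark_map(detections)
for mark_id, box in marks.items():
    print(mark_id, box.kind, center(box))
```

With a map like this, the LLM only has to name a mark ID; the pixel coordinates are looked up locally rather than guessed by the model.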
How to implement
Install SoMatic CLI: npm install -g somatic-cli/cli
Add SoMatic skill: npx skills add Smyan1909/SoMatic
Configure your LLM API (GPT-4.5 or compatible)
Use the stdio MCP server to parse base64-encoded screenshots directly
Define agent tasks that reference bounding box IDs instead of pixel coordinates
Run the agent with full OS access on Windows, macOS, or Linux
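The screenshot and bounding-box steps above can be sketched roughly as follows. This is an illustration only: encode_screenshot, resolve_action, the action dictionary shape, and the mark_centers table are hypothetical names and formats, not SoMatic's MCP schema or CLI output. Only the base64 handling uses a real API (Python's standard library).

```python
import base64


def encode_screenshot(png_bytes: bytes) -> str:
    """Base64-encode raw screenshot bytes so they can travel in a text payload."""
    return base64.b64encode(png_bytes).decode("ascii")


def resolve_action(action: dict, mark_centers: dict) -> dict:
    """Translate an agent action that names a mark ID into pixel coordinates.

    The action format here is an assumption made for this sketch.
    """
    mark_id = action["mark_id"]
    if mark_id not in mark_centers:
        raise KeyError(f"unknown mark id: {mark_id}")
    x, y = mark_centers[mark_id]
    return {"type": action["type"], "x": x, "y": y}


# Placeholder bytes standing in for a real PNG capture.
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
payload = encode_screenshot(fake_png)

# Hypothetical ID-to-center table, as a parser might return it.
mark_centers = {1: (70, 60), 2: (250, 60)}

# The agent refers to element 2 by ID rather than by pixel position.
click = resolve_action({"type": "click", "mark_id": 2}, mark_centers)
print(click)
```

Keeping the task definition in terms of mark IDs means the same instruction can survive layout or resolution changes, as long as the screenshot is parsed again before each action.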