- Mervin Praison Newsletter
Microsoft Drops GUI-Actor on Hugging Face: Visual Grounding Will Never Be the Same
No Coordinates. No Limits. Just Smarter Screen Interaction
A new era of GUI interaction just arrived. Microsoft has released GUI-Actor, a groundbreaking, coordinate-free visual grounding model, on Hugging Face. If you've been working with coordinate-based GUI agents, it's time to rethink everything.
Instead of blindly guessing where to click (x=0.25 anyone?), GUI-Actor "looks" at the screen like a human, using attention, not coordinates.
What's New?
Coordinate-Free Visual Grounding for GUI Agents
No more fragile (x, y) clicks
Attention-based actions with screen-level understanding
Let's unpack what makes GUI-Actor a major leap forward:

1️⃣ Coordinate Prediction Is Dead:
Traditional GUI agents output screen coordinates to act, which is clumsy, brittle, and unnatural.
GUI-Actor changes the game with:
An <ACTOR> token that attends over screen patches
Attention-based grounding to visually align actions
No more guessing pixel positions, just informed decisions
Benefits:
✅ Higher accuracy
✅ Stronger generalisation
✅ Lower data requirement
2️⃣ Performance vs Data Size:
GUI-Actor-7B outperforms UI-TARS-72B on ScreenSpot-Pro, despite having roughly 10x fewer parameters (7B vs 72B). Thanks to coordinate-free grounding, it's:
More data-efficient
More robust
More scalable

3️⃣ How It Works:
Instead of predicting coordinates:
GUI-Actor generates an <ACTOR> token within its output
That token attends to interactive regions (e.g., buttons, icons)
Multi-patch supervision treats every patch that overlaps the target element as correct, which resolves spatial ambiguity
It's a simple switch in modelling, with massive downstream benefits.
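To make the idea concrete, here is a minimal toy sketch of attention-based grounding in plain Python. It is not GUI-Actor's actual action head: the function names, the 28-pixel patch size, and the plain scaled dot-product scoring are all assumptions chosen for illustration. It shows one <ACTOR> vector scoring a grid of patch vectors, the winning patch being turned back into a click point, and a multi-patch target that spreads credit over every patch overlapping the element's bounding box.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def actor_attention(actor_vec, patch_vecs):
    """Scaled dot-product attention of one <ACTOR> vector over patch vectors.
    (Assumed scoring rule; the real model's head may differ.)"""
    scale = math.sqrt(len(actor_vec))
    scores = [sum(a * p for a, p in zip(actor_vec, patch)) / scale
              for patch in patch_vecs]
    return softmax(scores)

def patch_center(index, grid_w, patch_px):
    """Pixel centre (x, y) of a patch in a row-major grid."""
    row, col = divmod(index, grid_w)
    return ((col + 0.5) * patch_px, (row + 0.5) * patch_px)

def multi_patch_target(box, grid_w, grid_h, patch_px):
    """Uniform target distribution over every patch overlapping box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    mask = []
    for row in range(grid_h):
        for col in range(grid_w):
            px0, py0 = col * patch_px, row * patch_px
            overlaps = (px0 < x1 and px0 + patch_px > x0 and
                        py0 < y1 and py0 + patch_px > y0)
            mask.append(1.0 if overlaps else 0.0)
    total = sum(mask)
    if total == 0:
        raise ValueError("box does not overlap any patch")
    return [m / total for m in mask]

# Toy 2x2 screen: four patch vectors and one <ACTOR> vector (made-up numbers).
patches = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0], [-1.0, 0.0]]
weights = actor_attention([1.0, 1.0], patches)
best = max(range(len(weights)), key=lambda i: weights[i])
click_xy = patch_center(best, grid_w=2, patch_px=28)
```

In this toy setup the third patch (index 2) gets the most attention, so the click lands at its centre. A training loss would then compare `weights` against `multi_patch_target(...)`, for example with cross-entropy, so that any patch inside the button counts as a correct answer.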

4️⃣ Generalizes Like a Human:
Unlike coordinate-based models that overfit quickly, GUI-Actor:
Handles new UIs and screen layouts gracefully
Works across different resolutions and platforms
Continues to improve out-of-distribution

5️⃣ Smarter Multi-Click Predictions:
GUI-Actor predicts multiple valid regions in a single pass: think Hit@K improvements without extra compute. Compare that to coordinate-based models jittering around one spot. GUI-Actor offers:
More options
More precision
More reliability
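As a rough illustration of why one attention map gives you several candidates for free, the sketch below ranks patches by attention weight and checks a Hit@K-style condition. The function names, the 28-pixel patch size, and the example weights are hypothetical, and "hit" is defined here simply as a top-K patch centre falling inside the target box, which may differ from how the benchmark scores it.

```python
def top_k_patches(weights, k):
    """Indices of the k highest-weight patches (ties keep grid order)."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]

def hit_at_k(weights, box, grid_w, patch_px, k):
    """True if any of the top-k patch centres lies inside box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    for idx in top_k_patches(weights, k):
        row, col = divmod(idx, grid_w)
        cx, cy = (col + 0.5) * patch_px, (row + 0.5) * patch_px
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            return True
    return False

# Made-up attention weights over a 3-wide, 2-tall patch grid (row-major).
attn = [0.05, 0.40, 0.10, 0.30, 0.10, 0.05]
target_box = (0, 28, 28, 56)  # covers the bottom-left patch

hit1 = hit_at_k(attn, target_box, grid_w=3, patch_px=28, k=1)
hit2 = hit_at_k(attn, target_box, grid_w=3, patch_px=28, k=2)
```

Here the single best guess misses, but the second candidate from the same attention map lands on the target, so no extra forward pass is needed to recover.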

Why This Matters
GUI-Actor:
✅ Ends reliance on coordinates
✅ Thinks visually, like users do
✅ Scales smarter, not harder
✅ Brings robustness to real-world apps
Visual grounding just leveled up. Try GUI-Actor now on Hugging Face and reimagine how your agents interact with the screen.