November 2025

Y: An experiment on myself

Background

"I mostly got what I wanted by ignoring advice." - Sam Atlman

As an engineer, I deeply believe that context is everything, and nobody has more context on my life than I do. So I tend to be extremely cautious about taking advice.

I think of my life as the dependent variable (Y), a product of my environment, conversations, information diet, etc.

My Hypothesis: tracking all these independent variables will push me towards greater intentionality.

I recently met Andrea, an incredible founder working on an app that listens to him 24/7, turning his day into a story. I decided to join &

Scope

I decomposed the system into two specialized models:

  1. Planning: Vision-language model generates semantic actions from screenshots
  2. Grounding: Specialized model maps semantic actions to pixel coordinates

Given a screenshot $s_t$ and history $h_t = \{(s_i, a_i)\}_{i=0}^{t-1}$, the planner outputs action $a_t$:

$ a_t = \pi_{\text{plan}}(s_t, h_t) $

The grounding model converts semantic actions to coordinates:

$ (x, y) = \pi_{\text{ground}}(s_t, a_t) $

Example output:

{
  "action": "click",
  "target": "Submit button",
  "coordinates": [834, 672],
  "confidence": 0.94
}
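Putting the two stages together, a minimal sketch of the loop (the planner/grounder interfaces here are placeholders for illustration, not the actual stack):

def run_step(screenshot: bytes, history: list, planner, grounder) -> dict:
    """One step: plan a semantic action from the screenshot + history, then ground it."""
    # Planner: the vision-language model emits a semantic action such as
    # {"action": "click", "target": "Submit button"}.
    semantic = planner.plan(screenshot=screenshot, history=history)

    # Grounder: the specialized model maps the semantic target to pixel coordinates.
    x, y = grounder.ground(screenshot=screenshot, action=semantic)

    step = {**semantic, "coordinates": [x, y]}
    history.append((screenshot, step))
    return step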

Cost Analysis

Vision-language models tile screenshots for processing. Taking $w$ and $h$ as the dimensions after the provider's typical downscale (shorter side to ~768px), a 1920×1080 screen comes out to 6 tiles at 170 tokens each:

$ T_{\text{image}} = \lceil \frac{w}{512} \rceil \times \lceil \frac{h}{512} \rceil \times 170 $
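To check the arithmetic, a minimal sketch; the downscale to a 768px short side is an assumption about typical provider preprocessing, which is what reconciles the formula with the 6-tile figure:

import math

def image_tokens(width: int, height: int, tile: int = 512,
                 tokens_per_tile: int = 170, short_side: int = 768) -> int:
    """Estimate image tokens: downscale so the shorter side fits, then tile."""
    scale = min(1.0, short_side / min(width, height))
    w, h = int(width * scale), int(height * scale)
    tiles = math.ceil(w / tile) * math.ceil(h / tile)
    return tiles * tokens_per_tile

print(image_tokens(1920, 1080))  # 1365x768 -> 3x2 = 6 tiles -> 1020 tokens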

For a 100-step workflow:

Metric               Without Caching   With Caching
Cost                 $3.20–$9.40       $0.32–$0.94
Latency per action   2–5 seconds       10–50 ms
Total time           3–8 minutes       1–5 seconds

The optimization strategy is to cache UI elements between actions and route simple interactions directly to the grounding model. UI state changes slowly—most elements remain at fixed coordinates across consecutive actions.
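A minimal sketch of that cache (the hashing scheme and interface are illustrative, not the actual implementation):

import hashlib

class ElementCache:
    """Maps (screen content, semantic target) to previously grounded coordinates."""

    def __init__(self):
        self._coords: dict[tuple[str, str], tuple[int, int]] = {}

    @staticmethod
    def _screen_key(screenshot: bytes) -> str:
        # A perceptual hash would tolerate minor pixel churn between steps;
        # a plain content hash is the simplest stand-in.
        return hashlib.sha256(screenshot).hexdigest()

    def lookup(self, screenshot: bytes, target: str) -> tuple[int, int] | None:
        return self._coords.get((self._screen_key(screenshot), target))

    def store(self, screenshot: bytes, target: str, coords: tuple[int, int]) -> None:
        self._coords[(self._screen_key(screenshot), target)] = coords

On a cache hit the action executes with no model call at all; on a miss, the grounding model runs once and the result is stored for subsequent steps.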

Training

I trained the grounding model via GRPO (Group Relative Policy Optimization) with binary rewards:

$ R(s, a, \hat{c}) = \begin{cases} 1 & \text{if } \|\hat{c} - c_{\text{true}}\|_2 < \tau \\ 0 & \text{otherwise} \end{cases} $

where $\hat{c}$ is the predicted coordinate, $c_{\text{true}}$ is the ground truth, and $\tau$ is the hit threshold.
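In code, the reward and GRPO's group-relative advantage look roughly like this (the threshold value is illustrative):

import numpy as np

def hit_reward(pred_xy: np.ndarray, true_xy: np.ndarray, tau: float = 10.0) -> float:
    """Binary reward: 1 if the predicted click lands within tau pixels of ground truth."""
    return 1.0 if np.linalg.norm(pred_xy - true_xy) < tau else 0.0

def group_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO standardizes rewards within each sampled group of rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)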

Training uses trajectory augmentation: one recorded workflow generates multiple training samples by varying UI states and timing. For a trajectory of length $n$, I extract $O(n^2)$ sub-trajectories as training data.
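A sketch of the sub-trajectory extraction; the UI-state and timing perturbations would be applied on top of each sample:

def sub_trajectories(trajectory: list) -> list:
    """Every contiguous slice of a recorded workflow: n(n+1)/2, i.e. O(n^2), samples."""
    n = len(trajectory)
    return [trajectory[i:j] for i in range(n) for j in range(i + 1, n + 1)]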

Adaptive Routing

For simple UIs, the grounding model executes directly without planning. I route to the planner only when confidence is low:

$ \text{use\_planner} = \begin{cases} \text{true} & \text{if } H(p_{\text{ground}}) > \theta \text{ or } \max(p_{\text{ground}}) < \gamma \\ \text{false} & \text{otherwise} \end{cases} $

where $H(p)$ is the entropy of the grounding model's coordinate distribution, $\theta$ is the entropy threshold, and $\gamma$ is the confidence threshold.

This achieves ~50ms latency on simple actions (A100), escalating to 2–5s only for ambiguous cases.
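A sketch of that check, assuming the grounding model exposes a probability distribution over candidate targets (the threshold values are placeholders):

import numpy as np

def should_use_planner(p_ground: np.ndarray, theta: float = 1.0, gamma: float = 0.8) -> bool:
    """Escalate to the planner when the grounding distribution is uncertain."""
    p = np.clip(p_ground, 1e-12, 1.0)
    entropy = -np.sum(p * np.log(p))        # H(p_ground)
    return entropy > theta or p.max() < gamma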

Future Work

The current implementation processes discrete states. Moving to streaming would enable continuous perception-action loops at 5-10 Hz, handling dynamic interactions (drag, hover, scroll) more naturally.

For repeated workflows, policy distillation could compile trajectories into specialized models:

$ \pi_{\text{task}}(s) = \arg\min_{\pi} \mathbb{E}_{s \sim \mathcal{D}_{\text{task}}} \left[ \text{KL}(\pi(s) \| \pi_{\text{plan}}(s)) \right] $

This converts the planner from runtime dependency to training-time teacher, enabling local execution of routine tasks.
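A sketch of that objective in PyTorch, assuming for simplicity a discrete action head and a frozen planner as teacher:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(pi_task || pi_plan) over actions, averaged across states in the batch."""
    log_p_student = F.log_softmax(student_logits, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits.detach(), dim=-1)  # teacher stays frozen
    kl = (log_p_student.exp() * (log_p_student - log_p_teacher)).sum(dim=-1)
    return kl.mean()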

Related Work

He, Y., Jin, J., & Liu, P. (2025). Efficient Agent Training for Computer Use. arXiv preprint arXiv:2505.13909.

Yang, Y., Li, D., Dai, Y., Yang, Y., Luo, Z., Zhao, Z., Hu, Z., Huang, J., Saha, A., Chen, Z., Xu, R., Pan, L., Xiong, C., & Li, J. (2025). GTA1: GUI Test-time Scaling Agent. arXiv preprint arXiv:2507.05791.

Written by Kanishk Kundu