One thing I thought we’d see more of in 2025 was Gemini controlling your Android phone. There was the May demo and other underlying work, but we don’t have Google’s complete vision yet.
At I/O 2025 in May, Google demoed the latest research prototype of Project Astra that could retrieve content from the web/Chrome, search and play YouTube videos, search through your emails, make calls on your behalf, and place orders.
The nearly 2-minute demo showed Gemini scrolling a PDF in Chrome for Android, as well as opening the YouTube app to the search results page, scrolling, and then selecting/tapping a video. Google is working to bring these capabilities to Gemini Live.
In October, Google made a Computer Use model available to developers in preview that lets Gemini interact with — by scrolling, clicking, and typing — user interfaces like humans do. What’s currently available is “optimized for web browsers,” but Google noted “strong promise for mobile UI control tasks.”
Google described these capabilities as a “crucial next step in building powerful, general-purpose agents” since “many digital tasks still require direct interaction with graphical user interfaces.”
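To make the “scrolling, clicking, and typing” framing concrete, here is a minimal sketch of the observe-act loop this kind of agent runs: send the goal and a screenshot to the model, get back one UI action, execute it, and repeat. The helper names (capture_screenshot, propose_next_action, execute_action) are hypothetical stand-ins for the actual Gemini Computer Use API and whatever automation layer drives the browser or phone; this is an illustration of the pattern, not Google’s implementation.

```python
# Minimal sketch of the observe-act loop behind "computer use" style agents.
# All helper names below are hypothetical stand-ins, not the real API.
from dataclasses import dataclass


@dataclass
class UIAction:
    kind: str       # "click", "scroll", "type", or "done"
    x: int = 0      # screen coordinates for click/scroll actions
    y: int = 0
    text: str = ""  # text to enter for "type" actions


def capture_screenshot() -> bytes:
    """Hypothetical: grab the current screen as an image the model can read."""
    return b"<screenshot bytes>"


def propose_next_action(goal: str, screenshot: bytes) -> UIAction:
    """Hypothetical stand-in for the model call: given the goal and the current
    screen, return the single next UI action. A real agent would send the
    screenshot to the model and parse a structured action from its response."""
    return UIAction(kind="done")  # dummy model that immediately finishes


def execute_action(action: UIAction) -> None:
    """Hypothetical: perform the action through an automation layer
    (a browser driver on the web, or an accessibility/ADB bridge on a phone)."""
    print(f"executing {action.kind} at ({action.x}, {action.y}) {action.text!r}")


def run_agent(goal: str, max_steps: int = 20) -> None:
    """Observe the screen, ask the model for one action, execute it,
    and repeat until the model signals the task is complete."""
    for _ in range(max_steps):
        action = propose_next_action(goal, capture_screenshot())
        if action.kind == "done":
            break
        execute_action(action)


if __name__ == "__main__":
    run_agent("Open YouTube, search for a video, and play the first result")
```

The key design point is that the model never needs a per-app integration: it only sees pixels and emits generic actions, which is what separates this approach from Apple’s developer-opt-in model discussed below.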

A future version of Siri will let you “take action in and across apps” using your voice. The vision Apple pitched in 2024 is that tasks that would have required you to jump through multiple apps “could be addressed in a matter of seconds” through a series of voice prompts. Apple has detailed what app developers must do to support this. So far, we’ve heard nothing from Google, specifically the Android team, about whether a similar system or approach is coming.

…Siri can take actions across apps, so after you ask Siri to enhance a photo for you by saying “Make this photo pop,” you can ask Siri to drop it in a specific note in the Notes app — without lifting a finger.
Instead, what Google has shown is very generalized and seems not to require any prior integrations. In many ways, it’s the pragmatic approach, especially if Android developers don’t rush to add that kind of support to their apps.
This is not the first time Google has worked towards this. The premise of the new Google Assistant in 2019 was that on-device voice processing — a breakthrough at the time — would make “tapping to use your phone… seem slow.”
This next-generation Assistant will let you instantly operate your phone with your voice, multitask across apps, and complete complex actions, all with nearly zero latency.
This did not really take off in 2019, never dropped its Pixel exclusivity, and suffered from the same issues as the previous era of assistants, like regimented voice commands.
LLMs should let you phrase your command conversationally. Hopefully, this new approach also addresses capability limitations by being able to take action in any app or website without having previously been exposed to it, which appears to be the limitation of Apple’s system.
Generative AI seems to tackle all the complaints about Google’s past approach, but I do wonder how people will take to it this time.
Some scenarios where this would be useful are quite obvious, like hands-free usage, as Google wanted to show in the Astra demo. Conservatively, I’d expect this to be the extent of mainstream adoption next year.
The implications for smart glasses (or even watches) are profound. After all, you won’t be running phone-sized apps on glasses with displays in the near term. Imagine if your phone could be controlled from those secondary devices, including headphones, with information relayed back to them, all while the screen stays off in your pocket.
Beyond that, my big question is whether this voice control — assuming perfect accuracy — one day overtakes touch as the primary way you interact with your phone, if not your laptop.