Apple’s new AI model learns to understand your apps and screen: Could it unlock Siri's full potential?
Artificial intelligence is quickly becoming a part of our mobile experience, with Google and Samsung leading the charge. Apple, however, is also making significant strides in AI within its ecosystem. Recently, the Cupertino tech giant introduced a project known as MM1, a multimodal large language model (MLLM) capable of processing both text and images. Now, a new study has been released, unveiling a novel MLLM designed to grasp the nuances of mobile display interfaces.
The paper, published on Cornell University's arXiv preprint server and highlighted by Apple Insider, introduces "Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs."
Ferret-UI is a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities.
When reading between the lines, it suggests that Ferret-UI could enable Siri to better understand the appearance and functionality of apps and the iOS interface itself.
The study highlights that, despite progress in MLLMs, many models struggle to understand and interact with mobile user interfaces (UIs). Mobile screens, typically used in portrait mode, present unique challenges with their dense arrangement of icons and text, making them difficult for AI to interpret.
Ferret-UI in action, analyzing the display of an iPhone (Image Credit–Apple)
To address this, Ferret-UI introduces a magnification feature that enhances the readability of screen elements by upscaling images to any desired resolution. This capability is a game-changer for AI's interaction with mobile interfaces.
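The paper frames this as an "any resolution" approach. As a rough illustration of the idea, and not Apple's actual pipeline, here is a minimal Python sketch (using Pillow) of how a screenshot might be split into sub-images along its longer axis and upscaled before being handed to an image encoder; the function name, encoder input size, and file name are assumptions made for the example.

```python
from PIL import Image

# Illustrative sketch only: an "any resolution"-style preprocessing step in
# which a screenshot is split into sub-images along its longer axis and each
# piece is upscaled before encoding. Names and sizes are assumptions, not
# Apple's actual pipeline.
ENCODER_SIZE = (336, 336)  # hypothetical input size of the image encoder

def split_and_upscale(screenshot: Image.Image) -> list[Image.Image]:
    w, h = screenshot.size
    if h >= w:  # portrait screen: split into top and bottom halves
        halves = [screenshot.crop((0, 0, w, h // 2)),
                  screenshot.crop((0, h // 2, w, h))]
    else:       # landscape screen: split into left and right halves
        halves = [screenshot.crop((0, 0, w // 2, h)),
                  screenshot.crop((w // 2, 0, w, h))]
    # Upscale the full screen and each half so small icons and text stay legible.
    return [img.resize(ENCODER_SIZE, Image.LANCZOS)
            for img in (screenshot, *halves)]

if __name__ == "__main__":
    shot = Image.open("iphone_screenshot.png")  # hypothetical screenshot file
    sub_images = split_and_upscale(shot)
    print(f"{len(sub_images)} images ready for the encoder")
```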
As per the paper, Ferret-UI stands out in recognizing and categorizing widgets, icons, and text on mobile screens. It supports various input methods, such as pointing, boxing, or scribbling. By performing these tasks, the model builds a solid grasp of visual and spatial data, which helps it distinguish different UI elements with precision.
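The paper does not describe a public API, but as a hedged sketch of what these "referring" inputs could look like in practice, the snippet below shows how a point, a box, or a scribble might be attached to a question about a screen region. The class name, prompt format, and coordinates are illustrative assumptions, not Ferret-UI's actual interface.

```python
from dataclasses import dataclass

# Illustrative sketch only: one way point, box, and scribble references to
# screen regions could be represented and folded into a text prompt. The
# format is an assumption for the example, not Ferret-UI's actual interface.
@dataclass
class ScreenReference:
    kind: str                       # "point", "box", or "scribble"
    coords: list[tuple[int, int]]   # pixel coordinates on the screenshot

    def to_prompt(self, question: str) -> str:
        points = " ".join(f"({x},{y})" for x, y in self.coords)
        return f"{question} <{self.kind}: {points}>"

# A tap-like point, a bounding box, and a freehand scribble
refs = [
    ScreenReference("point", [(180, 940)]),
    ScreenReference("box", [(24, 300), (360, 390)]),
    ScreenReference("scribble", [(40, 120), (90, 135), (150, 128)]),
]

for ref in refs:
    print(ref.to_prompt("What is this UI element?"))
```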
What sets Ferret-UI apart is its ability to work directly with raw screen pixel data, eliminating the need for external detection tools or screen view files. This approach significantly enhances single-screen interactions and opens up possibilities for new applications, such as improving device accessibility.
The research paper touts Ferret-UI's proficiency in executing tasks related to identification, location, and reasoning. This breakthrough suggests that advanced AI models like Ferret-UI could revolutionize UI interaction, offering more intuitive and efficient user experiences.
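To make those task families more concrete, here is a small hypothetical sketch of the kinds of queries and answers they cover; the wording and output format are assumptions for illustration rather than examples taken from the paper's benchmarks.

```python
# Illustrative sketch only: the kinds of queries that fall under
# identification, location (grounding), and reasoning, paired with
# hypothetical answers. Neither wording nor output format is taken from
# Ferret-UI's benchmarks.
examples = [
    ("identification",
     "What kind of element sits inside <box: (24,300) (360,390)>?",
     "A toggle switch labelled 'Wi-Fi'."),
    ("location",
     "Where is the 'Settings' icon on this screen?",
     "box=(210, 880, 290, 960)"),
    ("reasoning",
     "Looking at this screen, how would I enable dark mode?",
     "Open Display & Brightness, then choose the 'Dark' appearance."),
]

for family, query, answer in examples:
    print(f"[{family}]\n  Q: {query}\n  A: {answer}")
```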
What if Ferret-UI gets integrated into Siri?
While it is not confirmed whether Ferret-UI will be integrated into Siri or other Apple services, the potential benefits are intriguing. By enhancing the understanding of mobile UIs through a multimodal approach, Ferret-UI could significantly improve voice assistants like Siri in several ways.
This could mean Siri gets better at understanding what users want to do within apps, perhaps even tackling more complicated tasks. It could also help Siri grasp the context of queries by considering what is currently on the screen. Ultimately, this could make using Siri a smoother experience, letting it handle actions like navigating through apps or interpreting what is happening visually.