
Apple researchers are developing a local AI agent that interacts with apps


Despite having just 3 billion parameters, Ferret-UI Lite matches or exceeds the benchmark performance of models up to 24 times its size. Here are the details.

A bit of Ferret background

In December 2023, a team of nine researchers published a study called “Ferret: Refer and Ground Anything Anywhere at Any Granularity”. In it, they presented a multimodal large language model (MLLM) that can understand natural language references to specific regions of an image:


Since then, Apple has published a series of follow-up papers that expand the Ferret family of models, including Ferret-v2, Ferret-UI, and Ferret-UI 2.

Specifically, the Ferret-UI variants extend Ferret’s original capabilities and are trained to overcome what the researchers describe as the shortcomings of general-domain MLLMs.

From the original Ferret-UI paper:

Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate “any resolution” on top of Ferret to magnify details and leverage enhanced visual features.

The original Ferret-UI research included an interesting use of the technology, in which the user can converse with the model to better understand how to interact with an interface.

A few days ago, Apple expanded the Ferret-UI family of models even more, with a study called Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents.

The original Ferret-UI was built on a 13B-parameter model, with a focus on mobile UI understanding and fixed-resolution screenshots. Ferret-UI 2 then extended the system to support multiple platforms and higher-resolution perception.

In contrast, Ferret-UI Lite is a much lighter model, designed to run on-device while still being competitive with the largest GUI agents.

Ferret-UI Lite

According to the researchers of the new paper, “many of the existing methods for GUI agents […] focus on large base models.” That’s because “the strong reasoning and programming capabilities of large server-side models allow these agent systems to achieve incredible capabilities in a variety of GUI navigation tasks.”

They note that although there has been great progress in both multi-agent GUI systems and end-to-end GUI models, which take different approaches to the many tasks involved in agentic interaction with GUIs (“GUI grounding, screen understanding, multi-step planning, and self-reflection”), these systems are fundamentally too large and resource-hungry to run on-device.

Therefore, they set out to develop Ferret-UI Lite, a 3-billion-parameter variant of Ferret-UI, “built with several key components, guided by lessons from training small language models.”

Features of Ferret-UI Lite:

  • Real and synthetic training data from multiple GUI domains;
  • Inference-time (on-the-fly) cropping and zooming techniques to better perceive specific parts of the GUI;
  • Supervised fine-tuning and reinforcement learning strategies.

The result is a model that closely matches, or even outperforms, competing GUI agent models with up to 24 times its parameter count.


While the overall architecture (which is detailed in the study) is interesting, the inference-time cropping and zooming technique is particularly noteworthy.

The model makes an initial prediction, crops the surrounding region, and then re-predicts within that cropped region. This helps such a small model compensate for its limited capacity to process large numbers of image tokens.
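As a rough illustration, the two-pass idea can be sketched like this. This is a minimal sketch with a hypothetical `predict_box` model interface and PIL-style images; the paper’s actual cropping heuristics and coordinate handling may differ:

```python
# Sketch of inference-time crop-and-zoom grounding.
# `model.predict_box(image, instruction)` is a hypothetical interface
# returning a bounding box (x0, y0, x1, y1) in image coordinates.

def crop_and_zoom_predict(model, image, instruction, zoom_margin=0.5):
    """Two-pass grounding: predict a box on the full screenshot, then
    crop an enlarged region around it and re-predict at higher detail."""
    # Pass 1: coarse prediction on the full screenshot.
    x0, y0, x1, y1 = model.predict_box(image, instruction)

    # Expand the box by a margin so the crop keeps surrounding context,
    # clamped to the screenshot bounds.
    w, h = x1 - x0, y1 - y0
    cx0 = max(0, x0 - zoom_margin * w)
    cy0 = max(0, y0 - zoom_margin * h)
    cx1 = min(image.width, x1 + zoom_margin * w)
    cy1 = min(image.height, y1 + zoom_margin * h)

    # Pass 2: re-predict inside the crop, where small icons and text
    # take up more image tokens, then map coordinates back to the
    # full screen.
    crop = image.crop((cx0, cy0, cx1, cy1))
    rx0, ry0, rx1, ry1 = model.predict_box(crop, instruction)
    return (cx0 + rx0, cy0 + ry0, cx0 + rx1, cy0 + ry1)
```

The point of the second pass is that the crop occupies the model’s full visual budget, so small targets are effectively seen at higher resolution.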


Another notable contribution of the paper is how Ferret-UI Lite generates its own training data. The researchers developed a multi-agent system that interacts directly with live GUI platforms to generate artificial training examples at scale.

A curriculum task generator proposes goals of increasing difficulty, a planning agent breaks each goal into steps, an execution agent carries them out on screen, and a critic model evaluates the results.
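In skeleton form, that loop might look roughly like this. Every name and interface below (the `task_gen`, `planner`, `executor`, and `critic` callables) is an illustrative assumption, not the paper’s actual components:

```python
# Illustrative sketch of a multi-agent synthetic-data loop.
# Each argument is a hypothetical callable standing in for one of the
# agents described above.

def generate_episodes(task_gen, planner, executor, critic, n_tasks=100):
    """Collect synthetic GUI trajectories; the critic decides which
    episodes are kept as training data."""
    episodes = []
    for difficulty in range(n_tasks):
        goal = task_gen(difficulty)     # curriculum: harder goals over time
        steps = planner(goal)           # break the goal into steps
        trajectory = [executor(step) for step in steps]  # act on the live GUI
        if critic(goal, trajectory):    # keep only episodes the critic accepts
            episodes.append({"goal": goal, "trajectory": trajectory})
    return episodes
```

Because the executor acts on a live GUI, failed steps and the recoveries that follow them land in the trajectory itself, which is exactly the kind of signal clean human annotations tend to miss.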


Through this pipeline, the training system captures the vagaries of real-world interactions (such as errors, unexpected situations, and recovery strategies), which can be very challenging to do while relying on clean, human-annotated data.

Interestingly, while Ferret-UI and Ferret-UI 2 used iPhone screenshots and other Apple interfaces in their analyses, Ferret-UI Lite was trained and tested on Android, web, and desktop GUI environments, using benchmarks such as AndroidWorld and OSWorld.

The paper doesn’t say exactly why the researchers chose this route for Ferret-UI Lite, but it likely reflects the reproducible, large-scale GUI agent testbeds available today.

Regardless, the researchers found that while Ferret-UI Lite performed well on short, low-level tasks, it did not fare as well on complex, multi-step interactions, a trade-off that is perhaps to be expected given the constraints of a small, on-device model.

On the other hand, Ferret-UI Lite points toward a local, and by extension private, agent (since no data needs to go to the cloud for processing on remote servers) that interacts with user interfaces on the user’s behalf, which, by all accounts, is great.

To learn more about the study, including a breakdown of the benchmark and results, follow this link.



