
Apple researchers are developing a local AI agent that interacts with apps


Despite having just 3 billion parameters, Ferret-UI Lite matches or exceeds the benchmark performance of models up to 24 times its size. Here are the details.

A bit of Ferret background

In December 2023, a team of nine researchers published a study called “Ferret: Refer and Ground Anything Anywhere at Any Granularity”. In it, they presented a multimodal large language model (MLLM) that can understand natural language references to specific regions of an image:


Since then, Apple has published a series of follow-up papers that expand the Ferret family of models, including Ferret-v2, Ferret-UI, and Ferret-UI 2.

Specifically, the Ferret-UI variants extend Ferret’s original capabilities and are trained to overcome what the researchers describe as the shortcomings of general-domain MLLMs.

From the original Ferret-UI paper:

Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate “any resolution” on top of Ferret to magnify details and leverage enhanced visual features.

The original Ferret-UI research included an interesting use of the technology, in which the user can converse with the model to better understand how to interact with an interface.

A few days ago, Apple expanded the Ferret-UI family of models even more, with a study called Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents.

The original Ferret-UI was built on a 13B-parameter model, with a focus on mobile UI understanding and fixed-resolution screenshots. Ferret-UI 2 then extended the system to support multiple platforms and higher-resolution perception.

In contrast, Ferret-UI Lite is a much lighter model, designed to run on-device while still being competitive with the largest GUI agents.

Ferret-UI Lite

According to the researchers of the new paper, “many of the existing methods for GUI agents […] focus on large base models.” That’s because “the strong reasoning and programming capabilities of large server-side models allow these agent systems to achieve incredible capabilities in a variety of GUI navigation tasks.”

They note that although there has been great progress in both multi-agent GUI systems and end-to-end GUI models, which take different approaches to the many tasks involved in agentic interaction with GUIs (“GUI grounding, screen understanding, multi-step planning, and self-reflection”), these systems are fundamentally too large and resource-hungry to run on-device.

Therefore, they set out to develop Ferret-UI Lite, a 3-billion-parameter variant of Ferret-UI, “built with several key components, guided by lessons from training small language models.”

Features of Ferret-UI Lite:

  • Real and synthetic training data from multiple GUI domains;
  • Inference-time (on-the-fly) cropping and zooming techniques to better perceive specific parts of the GUI;
  • Supervised fine-tuning and reinforcement learning strategies.

The result is a model that closely matches, or even outperforms, competing GUI agent models with up to 24 times its parameter count.


While the overall architecture (which is detailed in the study) is interesting, the inference-time cropping and zooming technique is particularly noteworthy.

The model makes an initial prediction, crops the surrounding region, and then re-predicts within that cropped region. This helps such a small model compensate for its limited capacity to process large numbers of image tokens.
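As a rough illustration, the two-pass idea can be sketched like this. This is a minimal sketch with a hypothetical `predict_box` model interface and PIL-style images; the paper’s actual cropping heuristics and coordinate handling may differ:

```python
# Sketch of inference-time crop-and-zoom grounding.
# `model.predict_box(image, instruction)` is a hypothetical interface
# returning a bounding box (x0, y0, x1, y1) in image coordinates.

def crop_and_zoom_predict(model, image, instruction, zoom_margin=0.5):
    """Two-pass grounding: predict a box on the full screenshot, then
    crop an enlarged region around it and re-predict at higher detail."""
    # Pass 1: coarse prediction on the full screenshot.
    x0, y0, x1, y1 = model.predict_box(image, instruction)

    # Expand the box by a margin so the crop keeps surrounding context,
    # clamped to the screenshot bounds.
    w, h = x1 - x0, y1 - y0
    cx0 = max(0, x0 - zoom_margin * w)
    cy0 = max(0, y0 - zoom_margin * h)
    cx1 = min(image.width, x1 + zoom_margin * w)
    cy1 = min(image.height, y1 + zoom_margin * h)

    # Pass 2: re-predict inside the crop, where small icons and text
    # take up more image tokens, then map coordinates back to the
    # full screen.
    crop = image.crop((cx0, cy0, cx1, cy1))
    rx0, ry0, rx1, ry1 = model.predict_box(crop, instruction)
    return (cx0 + rx0, cy0 + ry0, cx0 + rx1, cy0 + ry1)
```

The point of the second pass is that the crop occupies the model’s full visual budget, so small targets are effectively seen at higher resolution.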


Another notable contribution of the paper is how Ferret-UI Lite generates its own training data. The researchers developed a multi-agent system that interacts directly with live GUI platforms to generate artificial training examples at scale.

A curriculum task generator proposes goals of increasing difficulty, a planning agent breaks each goal into steps, an execution agent carries them out on screen, and a critic model evaluates the results.
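In skeleton form, that loop might look roughly like this. Every name and interface below (the `task_gen`, `planner`, `executor`, and `critic` callables) is an illustrative assumption, not the paper’s actual components:

```python
# Illustrative sketch of a multi-agent synthetic-data loop.
# Each argument is a hypothetical callable standing in for one of the
# agents described above.

def generate_episodes(task_gen, planner, executor, critic, n_tasks=100):
    """Collect synthetic GUI trajectories; the critic decides which
    episodes are kept as training data."""
    episodes = []
    for difficulty in range(n_tasks):
        goal = task_gen(difficulty)     # curriculum: harder goals over time
        steps = planner(goal)           # break the goal into steps
        trajectory = [executor(step) for step in steps]  # act on the live GUI
        if critic(goal, trajectory):    # keep only episodes the critic accepts
            episodes.append({"goal": goal, "trajectory": trajectory})
    return episodes
```

Because the executor acts on a live GUI, failed steps and the recoveries that follow them land in the trajectory itself, which is exactly the kind of signal clean human annotations tend to miss.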


Through this pipeline, the training system captures the vagaries of real-world interactions (such as errors, unexpected situations, and recovery strategies), which can be very challenging to do while relying on clean, human-annotated data.

Interestingly, while Ferret-UI and Ferret-UI 2 used iPhone screenshots and other Apple interfaces in their analyses, Ferret-UI Lite was trained and tested on Android, web, and desktop GUI environments, using benchmarks such as AndroidWorld and OSWorld.

The paper doesn’t say exactly why the researchers chose this route for Ferret-UI Lite, but it likely reflects the reproducible, large-scale GUI agent testbeds available today.

Regardless, the researchers found that while Ferret-UI Lite performed well on short, low-level tasks, it did not fare as well on complex, multi-step interactions, a trade-off that is perhaps to be expected given the constraints of a small, on-device model.

On the other hand, Ferret-UI Lite points toward a local, and by extension private, agent (since no data needs to go to the cloud for processing on remote servers) that interacts with user interfaces on the user’s behalf, which, by all accounts, is great.

To learn more about the study, including a breakdown of the benchmark and results, follow this link.



