Apple tested whether AI could improve App Store search results

2 days ago

Apple researchers conducted an A/B test to measure how AI-generated labels would affect App Store search ranking and app downloads. Here’s what they found.

AI-generated relevance labels slightly improved App Store search conversions

In a new study titled Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments, a group of Apple researchers tested whether LLMs could help improve App Store search results by generating relevance labels used to train the ranking system.

As the study explains, relevance is key to helping users find the apps they want. And while many factors can influence search rankings, the researchers focus on two important ones:

Behavioral relevance, which reflects how users interact with results, such as when they tap or download an app.
Textual relevance, which measures how well an app’s metadata (such as its name, description, and keywords) matches a user’s search query.

In the study, the researchers say that while plenty of data is available on behavioral relevance (since it can be easily measured), the same is not true for textual relevance:

Although behavioral relevance labels are abundant, textual relevance labels produced by human judges are very rare. This creates a key problem: high-quality textual relevance labels are rare and expensive to produce, making them difficult to scale and leaving the textual relevance objective underpowered in multi-objective training.

To address this problem, the researchers fine-tuned a 3-billion-parameter LLM on existing human judgments so it could learn to assign relevance labels to apps based on the user’s search query and the app’s metadata.

Next, they used that model to generate millions of new relevance labels, and retrained the App Store ranking system using both the original data and the LLM-generated labels.
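To make the data-augmentation step concrete, here is a minimal sketch of how a small set of human judgments might be combined with a much larger set of model-generated ones. Everything in it is an illustrative assumption rather than Apple's actual pipeline: the `LabeledPair` type, the `augment_labels` function, the integer relevance scale, and the rule that a human label wins when both sources cover the same query-app pair are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPair:
    """One (query, app) pair with a textual relevance label (hypothetical schema)."""
    query: str
    app_id: str
    relevance: int  # assumed graded scale; the study's real scale is not given here
    source: str     # "human" or "llm"


def augment_labels(human, llm):
    """Merge human and LLM labels into one training set.

    Assumption: when both sources label the same (query, app) pair,
    the human judgment is kept and the LLM label is dropped.
    """
    merged = {(p.query, p.app_id): p for p in llm}
    # Human labels overwrite LLM labels for overlapping pairs.
    merged.update({(p.query, p.app_id): p for p in human})
    return list(merged.values())


human = [LabeledPair("photo editor", "app1", 3, "human")]
llm = [
    LabeledPair("photo editor", "app1", 1, "llm"),
    LabeledPair("photo editor", "app2", 2, "llm"),
]
training_set = augment_labels(human, llm)
print(len(training_set))  # number of distinct (query, app) pairs
```

The merged set is what a ranker would then be retrained on, alongside the behavioral data it already uses.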

Once that was done, they ran an offline test, followed by a global A/B test on live App Store traffic:

“(…) the LLM-augmented model showed a statistically significant +0.24% increase in our key metric, conversion rate, defined as the proportion of search sessions with at least one app download. Although this number may seem small, it is considered a significant improvement for a mature industrial ranker. This benefit was seen in 89% of storefronts.”

In other words, users who saw search results ranked by the LLM-augmented model downloaded at least one app 0.24% more often than users who saw search results ranked by the standard model.
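The metric itself is simple to compute. The sketch below follows the quoted definition (the share of search sessions with at least one download) and compares two groups. The function names and the toy session data are hypothetical, and treating the lift as a relative change is an assumption, since the quoted passage does not say whether +0.24% is relative or absolute.

```python
def conversion_rate(downloads_per_session):
    """Share of search sessions that ended with at least one app download."""
    if not downloads_per_session:
        return 0.0
    converted = sum(1 for d in downloads_per_session if d >= 1)
    return converted / len(downloads_per_session)


def relative_lift(control_rate, treatment_rate):
    """Relative change of the treatment rate over the control rate."""
    return (treatment_rate - control_rate) / control_rate


# Toy data: number of downloads in each search session (made up for illustration).
control = [0, 1, 0, 2]    # 2 of 4 sessions converted
treatment = [1, 1, 0, 2]  # 3 of 4 sessions converted

lift = relative_lift(conversion_rate(control), conversion_rate(treatment))
print(f"{lift:+.2%}")
```

A session with two downloads counts once, the same as a session with one, which is why the metric tracks sessions that convert rather than total downloads.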

And while 0.24% is obviously a very small increase, it adds up quickly when we consider that App Store downloads in 2025 totaled around 38 billion. In practice, that can translate into millions of additional downloads from App Store searches, which developers will surely appreciate.

To read the full study, follow this link.
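As a rough back-of-the-envelope check on that claim, the sketch below scales the lift against the annual download total. It rests on two loud assumptions that do not come from the study: the share of downloads that originate from search (`search_share`, set to a purely hypothetical 50%), and the simplification that a 0.24% lift in converting sessions means about 0.24% more search-driven downloads.

```python
def extra_downloads(total_downloads, search_share, lift):
    """Estimate additional downloads per year from a relative lift.

    total_downloads: all App Store downloads in the period
    search_share:    assumed fraction of downloads that come from search
    lift:            relative lift, e.g. 0.0024 for +0.24%
    """
    return total_downloads * search_share * lift


TOTAL_2025 = 38e9    # figure cited in the article
SEARCH_SHARE = 0.5   # hypothetical placeholder, not from the study
LIFT = 0.0024        # +0.24%, treated here as a relative lift

estimate = extra_downloads(TOTAL_2025, SEARCH_SHARE, LIFT)
print(f"{estimate / 1e6:.1f} million")
```

Under these assumptions the estimate lands in the tens of millions. A smaller search share would lower it proportionally, but it stays in the millions for any share above a few percent.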