Apple unveils new AI model boosting Siri, rivals GPT-4 in performance

Apple has recently announced a new AI model designed to understand ambiguous references on screens and in conversation contexts.

The model, dubbed ReALM (Reference Resolution As Language Modeling) by the company's AI researchers, was unveiled in a paper published at the end of March.

The model comes in four sizes, with 80 million, 250 million, 1 billion, and 3 billion parameters, and is reportedly expected to significantly improve user interactions with the voice assistant Siri.

According to reports from VentureBeat and AppleInsider, the model is designed to identify the specific objects or concepts referred to by pronouns or other ambiguous expressions in text. Framing this task as language modeling means the model must comprehend the context and the semantic relationships between words.

A more understanding Siri

When users interact with Siri, they often mention various contextual entities or information, using terms like "this" or "that." In the paper, researchers categorized these entities into three types: on-screen entities, conversational entities, and background entities.

On-screen entities are content currently displayed on the screen, while conversational entities are those mentioned in the conversation, originating either from the user's previous statements or from options provided by the voice assistant. For example, when a user says, "Call mom," the word "mom" becomes a conversational entity.

Background entities come from the user's current activities or from processes not directly related to the content on the screen, such as music that is playing or an upcoming alarm.
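
The three categories can be pictured as a small data structure. The sketch below is purely illustrative: the names `EntityKind`, `Entity`, and `candidates_by_kind` are hypothetical and do not come from Apple's paper.

```python
from dataclasses import dataclass
from enum import Enum


class EntityKind(Enum):
    """The three entity categories described in the paper."""
    ON_SCREEN = "on-screen"
    CONVERSATIONAL = "conversational"
    BACKGROUND = "background"


@dataclass(frozen=True)
class Entity:
    kind: EntityKind
    description: str  # free-text label; hypothetical field for illustration


def candidates_by_kind(entities):
    """Group candidate entities by category, preserving input order."""
    grouped = {kind: [] for kind in EntityKind}
    for entity in entities:
        grouped[entity.kind].append(entity)
    return grouped


# Example candidates, loosely following the examples in this article.
EXAMPLE_ENTITIES = [
    Entity(EntityKind.ON_SCREEN, "store phone number shown on a web page"),
    Entity(EntityKind.CONVERSATIONAL, "contact 'mom' from a previous request"),
    Entity(EntityKind.BACKGROUND, "music currently playing"),
    Entity(EntityKind.BACKGROUND, "upcoming alarm"),
]
```

Grouping candidates this way simply makes explicit that a request such as "turn that off" may point at something the user never sees on the screen.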

Traditional approaches often rely on large models and additional inputs such as images. Apple's approach instead converts all of the information on the screen into text, eliminating the need for complex image recognition algorithms.

This makes the AI model lighter, more efficient, and less resource-intensive, allowing it to be deployed easily on users' devices.
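
To make the idea of turning a screen into text concrete, here is a minimal sketch. It assumes each on-screen element arrives with its text and the coordinates of its top-left corner. The `ScreenElement` type, the `screen_to_text` function, and the `line_tolerance` parameter are hypothetical, and the layout rule (rows top to bottom, items left to right) is an assumption made for illustration, not a description of Apple's implementation.

```python
from dataclasses import dataclass


@dataclass
class ScreenElement:
    text: str
    x: float  # left edge, in arbitrary screen units
    y: float  # top edge, in arbitrary screen units


def screen_to_text(elements, line_tolerance=10.0):
    """Render elements as text: rows top to bottom, items left to right.

    An element joins the current row when its y value is within
    `line_tolerance` of the first element placed in that row.
    """
    rows = []
    current = []
    for element in sorted(elements, key=lambda e: (e.y, e.x)):
        if current and abs(element.y - current[0].y) > line_tolerance:
            rows.append(current)
            current = []
        current.append(element)
    if current:
        rows.append(current)
    # Within a row, order by x and separate items with a tab.
    return "\n".join(
        "\t".join(e.text for e in sorted(row, key=lambda e: e.x))
        for row in rows
    )
```

The output is ordinary text that preserves a rough sense of layout, which is what lets a text-only model reason about what is on the screen.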

For instance, when a user is browsing a website and decides to call a store, they only need to say "Call this store" to Siri. In turn, the assistant will read the content on the screen, identifying the store's phone number before making the call.
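
A request like this could be handed to a text-only model as a single prompt. The sketch below shows one hypothetical way to assemble such a prompt. The `build_prompt` function, the numbered-candidate format, and the wording of the final question are all assumptions for illustration and are not taken from the paper.

```python
def build_prompt(screen_text, candidates, request):
    """Compose a text-only prompt listing numbered candidate entities."""
    numbered = "\n".join(
        f"{index}. {candidate}"
        for index, candidate in enumerate(candidates, start=1)
    )
    return (
        "Screen:\n"
        f"{screen_text}\n\n"
        "Candidate entities:\n"
        f"{numbered}\n\n"
        f"User request: {request}\n"
        "Which candidate does the request refer to?"
    )
```

Calling `build_prompt("Joe's Pizza\n555-0100", ["Joe's Pizza", "555-0100"], "Call this store")` yields a prompt in which the model only has to pick a numbered candidate, rather than interpret a screenshot.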

In the paper, Apple's AI researchers emphasized how important it is for voice assistants to understand the current context and the objects or concepts mentioned, so that users can issue voice commands instead of performing tasks manually.

Benchmark performances

The researchers compared the model's performance with GPT-3.5 and GPT-4. Since GPT-3.5 exclusively handles text input, the researchers solely utilized text prompts in their evaluations. Conversely, for GPT-4, which boasts image and text comprehension capabilities, researchers supplemented the text prompts with screenshots.

Even the smallest ReALM model showed an accuracy improvement of over 5% in recognizing different types of entities compared to the existing systems. It also performs similarly to GPT-4, while the larger versions significantly outperform GPT-4.

Apple's annual Worldwide Developers Conference (WWDC) is scheduled for June 10. Greg Joswiak, Apple's senior vice president of global marketing, hinted on social media that AI would be the conference's focus.