Arm unveils next-generation GPU architecture, propelling visual experiences of mobile computing to new heights

As smartphones become the primary devices for streaming media and with the rise of augmented reality (AR), 3D gaming, and increasingly sophisticated generative AI technologies, consumers' pursuit of immersive experiences has led to higher and more complex computational demands on mobile computing platforms. To meet these diverse application needs, Arm recently launched their 2023 Total Compute Solutions (TCS23), which further enhances performance and efficiency, once again pushing the boundaries of visual computing.

Arm Total Compute Solutions is Arm's system-level solution for SoC designs of mobile devices. It encompasses a broad range of technology components, including hardware IP (CPU, GPU), interconnect and system IP technologies, as well as software and development tools. The recently announced TCS23 brings several key features. These include the flagship Immortalis-G720 GPU, which is based on the new 5th Generation GPU architecture, the most powerful Armv9 Cortex compute cluster to date, and enhanced system optimization technology. These advancements will keep driving the innovative applications for mobile devices.

GPU Technology is Crucial to Drive Digital Experiences

Arm has been dedicated to GPU development for many years, from the widely used Mali GPU to the groundbreaking flagship Immortalis-G715 GPU, which last year introduced hardware-based ray tracing technology for the first time, pushing the performance of mobile GPUs to new heights.

From left, Andy Craigen, Chris Bergey, and Stefan Rosinger

Chris Bergey, Senior Vice President and General Manager of the Client Line of Business at Arm, emphasized the growing importance of GPUs in mobile phone design. He highlighted that GPUs play an increasingly crucial role in delivering exceptional visual experiences and enabling machine learning (ML). There is also a growing need for GPUs that offer both high performance and efficiency.

Andy Craigen, Director of Product Management of Arm's Client Line of Business, added that improving graphics performance is very important to mobile phone design, and Arm has invested a lot of resources in building graphics platforms. "We know that graphics are very power-hungry on desktop and game consoles. Therefore, Arm needs to convince the developer community that a complex visual experience similar to that on a PC can be achieved on a mobile phone, so that they will be interested in porting their games to the Android mobile graphics platform."

Bringing Hardware-based Ray Tracing Technology to Mobile Devices

Chris Bergey highlighted that since the launch of the Immortalis-G715 last year, it has garnered positive feedback from the industry in terms of performance, power, and area (PPA). The developer community has also shown a keen interest in the application of ray tracing technology on mobile phones.

Achieving true 3D imagery on mobile phones poses a significant challenge in balancing performance and power consumption. Andy Craigen said, "Directly transplanting ray tracing technology from PCs to the mobile platform is not feasible. Therefore, Arm has dedicated substantial time to analyze ray tracing technology and determine which functions can yield optimal results while meeting the power consumption and die area requirements of mobile phones. We embarked on this journey with the introduction of the Immortalis-G715 last year, and we will continue to push forward."

At the Game Developers Conference 2023 (GDC 2023) held earlier this year, Arm, MediaTek, and Tencent Games jointly demonstrated the application of ray tracing technology. Furthermore, Arm is actively promoting the understanding of this technology within the ecosystem, assisting developers in utilizing various resources for game development, including popular game engines like Unity and the free Arm Mobile Studio software development tools.

In order to demonstrate how to implement ray tracing technology, the Arm Taiwan team also tried to develop their own games for demonstration. "Our aim is to demonstrate the feasibility and exceptional visual effects of the Immortalis platform in supporting 3D graphics, while ensuring it remains within the power budget of mobile phones," said Chris Bergey.

The New 5th Gen Arm GPU Architecture

To further boost GPU performance and achieve a more immersive visual experience, Arm recently unveiled its 5th Generation GPU architecture, which it simply calls 5th Gen, and the new Immortalis-G720 based on this architecture. This is Arm's most powerful and efficient GPU ever, delivering a 15% increase in performance and efficiency over its predecessor with only a 2% increase in area and a 40% reduction in memory bandwidth usage.

The 5th Gen GPU architecture introduces a key feature called Deferred Vertex Shading (DVS), which revolutionizes data flow within the GPU and expands the number of GPU cores, reaching up to 16 cores for enhanced performance.

According to Chris Bergey, memory access and data movement are the primary factors contributing to GPU power consumption. The efficiency of bandwidth usage distinguishes a mobile graphics platform from a desktop computer. By incorporating DVS technology, bandwidth usage and access to external DRAM can be significantly reduced, resulting in higher frame rates and enabling mobile phones to handle more complex graphics workloads.

"The 5th Gen GPU architecture extends beyond gaming and finds applications in various markets. In addition to gaming, 3D vision presents new opportunities for mobile devices, including augmented reality (AR) and computer-aided design (CAD)."

Enabling AI/ML Applications on Mobile Phones

The improvement of GPU performance is also crucial for enhancing the AI processing capabilities of mobile devices. Chris Bergey stated that Arm provides a powerful and essential foundational compute platform through TCS23 for mobile devices, and customers can also make differentiated designs for NPUs in their SoCs. Arm will continue to provide relevant support through close cooperation with partners.

Chris emphasized the significance of heterogeneous computing in enhancing AI capabilities, which involves various computing requirements, including machine learning tasks, inference tasks, and power-sensitive tasks. The challenge lies in enabling developers to effectively program AI to utilize the most appropriate processor for each task. For Arm, this involves not only advancing and optimizing the hardware architecture but also providing comprehensive software and application support. So, Arm's objective is to assist customers in efficiently building AI capabilities where required, in order to effectively tackle these challenges.

With the rapid emergence of new intelligent applications like generative AI, Arm is proactively advancing the AI processing capabilities of mobile phones, doubling them every two years. Additionally, Arm is leading the way in supporting developers to take advantage of AI and machine learning (ML) workloads by enabling its hardware with increased ML capabilities via open-source software libraries. Arm NN and Arm Compute Library are being used by Google apps on Android with over 100 million active users already, enabling developers to optimize the execution of their ML workloads on Armv9 Cortex-A CPUs and Arm GPUs.
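To put the doubling claim in perspective, a doubling every two years compounds to roughly 41% per year. The short Python sketch below works that out under the assumption of a steady, uninterrupted doubling rate; the function name and the sample time horizons are illustrative and do not come from Arm.

```python
def growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Capability multiple after `years`, assuming a steady doubling
    every `doubling_period_years` (an idealized model, not Arm data)."""
    return 2.0 ** (years / doubling_period_years)


# Implied year-over-year growth when capability doubles every two years.
annual_rate = growth_factor(1.0) - 1.0

for years in (1, 2, 4, 6):
    print(f"after {years} years: {growth_factor(years):.2f}x")
print(f"implied annual growth: {annual_rate:.1%}")
```

Under this idealized model, three doubling periods (six years) correspond to an eightfold increase.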

Cortex-X4: the Highest-Performing and Most Efficient Design

In terms of CPU, Arm has introduced the fourth-generation Cortex-X core, the Cortex-X4, which is Arm's fastest CPU to date. Compared to the Cortex-X3, it offers a 15% increase in performance. Additionally, on the same process technology, the new power-efficient microarchitecture of the Cortex-X4 reduces power consumption by 40% compared to the Cortex-X3, with only a 10% increase in area, giving it the highest performance per square millimeter in Cortex-X history.
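As a rough check on these figures, the quoted percentages can be combined into relative ratios. The sketch below is only back-of-the-envelope arithmetic: it assumes the 15% performance gain and the 10% area increase describe the same core configuration, and that the 40% power reduction is measured at equal performance, neither of which the article states in detail. All names are illustrative.

```python
# Figures quoted in the article for Cortex-X4 relative to Cortex-X3.
PERF_GAIN = 0.15        # performance increase
AREA_INCREASE = 0.10    # area increase
POWER_REDUCTION = 0.40  # power saving (assumed here: at equal performance)


def relative_ratio(numerator_change: float, denominator_change: float) -> float:
    """Ratio of two quantities after each changes by the given fraction."""
    return (1.0 + numerator_change) / (1.0 + denominator_change)


# Performance per unit area: 1.15 / 1.10
perf_per_area = relative_ratio(PERF_GAIN, AREA_INCREASE)

# Performance per watt at equal performance: 1.00 / 0.60
perf_per_watt_iso = relative_ratio(0.0, -POWER_REDUCTION)

print(f"relative performance per area: {perf_per_area:.3f}x")
print(f"relative performance per watt (iso-performance): {perf_per_watt_iso:.2f}x")
```

Under these assumptions, performance per area improves by about 4.5%, and performance per watt at the same performance level improves by about two thirds.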

Stefan Rosinger, Senior Director of CPU Product Management, Client Line of Business at Arm, stated that the power-efficiency curve of the Cortex-X4 has shifted noticeably to the right, as shown in the figure. The Cortex-X4 therefore achieves significant power savings compared to the Cortex-X3 at the same level of performance or, equivalently, delivers higher performance at the same power consumption.

Compute Cluster: Upgrades

The Cortex-X4 achieves significant power savings compared to Cortex-X3 while delivering the same level of performance.

"Although Cortex-X series cores are designed with the 'performance-first' concept, it is essential to consider efficiency within the limited power budget of mobile phones to deliver tangible value to customers. Notably, Cortex-X4 not only improves power consumption but also enhances area efficiency, enabling it to offer higher performance within the same area. This was an important consideration in the Cortex-X4 design."

Chris Bergey added, "The power budget of mobile phones is limited, so improving performance must consider power consumption. By utilizing the efficient Cortex-X4, customers can apply the performance gains achieved at the same power consumption to various computations such as AI. Additionally, this curve represents the results under iso-process conditions. If the N4 or N3 process is adopted, the efficiency gains will be even more significant."

Furthermore, the Cortex-X4, alongside its companion cores Cortex-A720 and Cortex-A520, can be scaled up to a 14-core cluster using the DSU-120, featuring L2 caches of up to 2MB as well as a 32MB L3 cache. This scalability ensures excellent performance and enables flexible configurations to meet the specific requirements of customers across various application markets. In addition to flagship smartphones, the enhanced performance and efficiency offered by the Cortex-X4 will also help expand the Windows-on-Arm laptop market.

Powerful Compute Cluster Enabled by System Optimization

In addition to the introduction of new CPUs and GPUs, TCS23 also places significant emphasis on enhancing system optimization technology, which is aimed at improving overall performance.

According to Chris Bergey, when developing GPUs, Arm also takes CPU and system performance into consideration. Taking the newly introduced Immortalis-G720 as an example, it can utilize a shared system-level cache of up to 32MB with the CPU, optimizing its configuration based on the workload. The aim is to access data locally, minimizing the use of external DRAM and reducing GPU power consumption.

In terms of CPU clusters, Arm has upgraded its DSU (DynamIQ Shared Unit) to DSU-120. Alongside the previously mentioned 14-core scalability and 32MB system cache, another notable feature is the provision of different power modes for different workloads.

Stefan Rosinger stated that mobile devices utilize different cores such as Cortex-X and Cortex-A. With this functionality, specific cores can be powered on or off based on the workload. As shown in the figure, a comparison across a wide range of CPU use cases shows how power savings can be achieved in various usage scenarios. The newly introduced power modes of DSU-120 can effectively reduce chip leakage current.

"The scaling of the on-chip memory does not keep pace with logic scaling, so we need to carefully consider power consumption when increasing L3 cache capacity to improve performance. As the larger cache introduces new power consumption requirements, the reduction of leakage power is increasingly critical."

Different Modes for Different Workloads

With different power modes for different workloads, DSU-120 can effectively reduce chip leakage current.

Embracing the Era of Heterogeneous Integration

In the upcoming years, Arm will continue to develop its next-generation key IP, including the Krake GPU and Blackhawk CPU, to fulfill the increasing demands of its partners for improved computing and graphics performance.

However, as chips approach the scaling limit of the 2nm process, it has become an inevitable trend to move towards 3D stacking and advanced packaging technologies. As Arm provides intellectual property (IP) for system-on-a-chip (SoC) designers, what impact and changes will this transition bring?

Chris Bergey emphasized the importance of considering the trade-off between performance, power, and area as the semiconductor process reaches its scaling limits. This is crucial to adapt to process advancements and enable customers to maximize the benefits. As the semiconductor industry enters the era of 3D stacking and advanced packaging technology, it becomes imperative to approach designs from a system perspective and partition them to offer customers optimal solutions. For instance, it may involve keeping memory in a mature process while utilizing advanced processes for computing cores. Therefore, close collaboration with foundries such as TSMC is essential, as a deep understanding of process technology will enable Arm and partners to develop the best solutions.

He highlighted that advanced packaging components utilizing heterogeneous integration will become increasingly prevalent in the mobile market in the future. Arm is committed to embracing this trend and maintaining its leadership in the "More than Moore" era. "The ever-growing demand for computing is insatiable, and I believe the future is full of hope and possibilities, a future built on Arm!"