The future of hardware and software development for artificial intelligence

Reflecting on the history of the computer industry, it all began with a small transistor, which evolved into integrated circuits and eventually into powerful CPUs capable of processing billions of instructions per second. Intel dedicated decades to optimizing CPUs for complex instructions and refining branch predictors to enhance computational speed. However, this focus on intricate sequential processing was overshadowed by the rise of Nvidia's GPUs. Instead of relying on complicated branch predictors, GPUs use thousands of small cores to achieve high computation throughput, which is particularly useful in large-scale computer graphics processing. This shift has empowered us to recreate real-world environments in the digital realm, unlocking immense potential.

Taking this evolution even further, a prominent and fast-moving new wave has arrived: artificial intelligence (AI) computing. The advent of ChatGPT has demonstrated how AI can revolutionize our world. Self-driving cars, which seemed impossible in the past, are now becoming tangible. All of these are propelled by advancements in the hardware used to train AI models. While CPUs excel at handling conditional paths in programs with state-of-the-art branch predictors, AI models operate on a completely different paradigm. Instead of following branching instruction paths, AI models compute probabilities over possible outputs (for a language model, the next token), which relies heavily on raw computational power. This underscores the significance of GPUs in the domain of AI applications.

The most capable GPU-based AI accelerator currently is Nvidia's H100. Compared to its predecessor, the A100, the H100 offers about 1.5 times the bandwidth and about 3.5 times the computing throughput (FP32-based), at about 2 times the power consumption. In addition to Nvidia, AMD is actively developing AI chips as well. Its MI300X, which uses chiplet technology to integrate multiple GPU dies in one package, surpasses the H100 in terms of performance per watt and performance per dollar. Despite having 147 billion transistors, nearly double that of the H100, the MI300X increases chip size by only about 25%, resulting in a lower manufacturing cost compared to the H100.

Metric	MI300X*	H100
Power (W)	750	700
FP32 performance (TFLOPS)	140	67
TF32 performance (TFLOPS)	N/A	989
Cost (US$)	30,000	33,000
FP32 performance per watt (TFLOPS/W)	0.19	0.10
FP32 performance per dollar (GFLOPS/US$)	4.67	2.03
TF32 performance per dollar (GFLOPS/US$)	N/A	29.97

*The MI300X figures are not official data; they are inferred from previous AMD chip generations. Table 1 indicates that, in terms of both FP32 performance per watt and FP32 performance per dollar, the MI300X outperforms the H100 by approximately a factor of two. Table created by the author.

Nonetheless, Nvidia's significant edge lies in CUDA, a crucial factor reflected in the TF32 performance rows of Table 1. TF32 (TensorFloat-32) is a Tensor Core math format introduced with the Nvidia A100 and accessed through the CUDA software stack. When performance per dollar is computed using TF32, the H100's figure (29.97) is about 6.4 times the MI300X's FP32 figure (4.67). This stark contrast shows the importance of Nvidia's software stack, CUDA, compared to AMD's counterpart, ROCm, and emphasizes how critical the right software is for AI computing acceleration.
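
For readers who want to check the arithmetic, the derived rows of Table 1 can be reproduced with a few lines of Python. This is only a sketch: the dictionary layout and the names `per_watt`, `per_dollar`, and `derived` are illustrative choices, the MI300X inputs are the author's estimates rather than official data, and performance per dollar is assumed to be expressed in GFLOPS per US dollar, which is the unit that matches the table's values.

```python
# Figures copied from Table 1. The MI300X values are estimates, not official data.
chips = {
    "MI300X": {"watts": 750, "fp32_tflops": 140, "tf32_tflops": None, "cost_usd": 30_000},
    "H100": {"watts": 700, "fp32_tflops": 67, "tf32_tflops": 989, "cost_usd": 33_000},
}


def per_watt(tflops, watts):
    """Return TFLOPS delivered per watt."""
    return tflops / watts


def per_dollar(tflops, cost_usd):
    """Return GFLOPS delivered per US dollar (1 TFLOPS = 1,000 GFLOPS)."""
    return tflops * 1_000 / cost_usd


# Compute the derived rows for each chip; TF32 stays None where no figure exists.
derived = {}
for name, c in chips.items():
    tf32 = c["tf32_tflops"]
    derived[name] = {
        "fp32_per_watt": per_watt(c["fp32_tflops"], c["watts"]),
        "fp32_per_dollar": per_dollar(c["fp32_tflops"], c["cost_usd"]),
        "tf32_per_dollar": None if tf32 is None else per_dollar(tf32, c["cost_usd"]),
    }

# Ratios discussed in the text.
fp32_watt_ratio = derived["MI300X"]["fp32_per_watt"] / derived["H100"]["fp32_per_watt"]
fp32_dollar_ratio = derived["MI300X"]["fp32_per_dollar"] / derived["H100"]["fp32_per_dollar"]
# Cross-format comparison: H100 in TF32 against MI300X in FP32.
cross_format_ratio = derived["H100"]["tf32_per_dollar"] / derived["MI300X"]["fp32_per_dollar"]

for name, rows in derived.items():
    rounded = {k: (None if v is None else round(v, 2)) for k, v in rows.items()}
    print(name, rounded)
print(round(fp32_watt_ratio, 2), round(fp32_dollar_ratio, 2), round(cross_format_ratio, 2))
```

With the table's inputs, the MI300X leads the H100 by about 1.95 times in FP32 performance per watt and about 2.30 times in FP32 performance per dollar, while the H100's TF32 performance per dollar is about 6.4 times the MI300X's FP32 figure. Note that the last ratio compares two different number formats, since no TF32 figure is available for the MI300X.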

AI accelerators built on GPU architecture are still inherently designed as general-purpose systems for computer graphics tasks, such as rendering images and processing graphics-related algorithms. For AI applications, however, these capabilities are often unnecessary. Instead, the chip needs to focus on computation throughput, memory bandwidth, cost, and power consumption. As a result, the industry is increasingly optimizing accelerators specifically for AI applications, offering Application-Specific Integrated Circuit (ASIC) solutions.

Several tech giants, recognizing the need for specialized AI chips, have developed their own solutions. Google's Tensor Processing Unit (TPU), evolving to TPU v5, demonstrates a 2.7 times improvement in Performance per Dollar compared to its predecessor. Meta has introduced its MTIA v1 (Meta Training and Inference Accelerator) chip for efficient AI recommendation system calculations, rivaling the efficiency of the H100. In pursuit of fully autonomous vehicles, Tesla is developing its customized chip, Dojo, with lower costs than the H100 GPUs.

General-purpose AI chips have more comprehensive software support and a higher degree of programmability, so they can run almost any AI model. While GPU hardware architecture may not be the optimal solution for AI, its programmability lets many researchers accelerate AI training in their own work. However, I am particularly excited about the development of AI-specific accelerators, which I like to call AIPUs (AI Processing Units). They signify a shift towards specialized chips tailored for the diverse landscape of AI applications, including large language models (LLMs), self-driving vehicles, AR/VR, and climate science. With this entirely new paradigm, the software stack for the AIPU is as crucial as the hardware.

The advent of specialized AI chips holds the promise of delivering superior performance and cost-effectiveness customized to the unique requirements of each application. I anticipate continuous innovation in hardware architecture, shaping the trajectory of computing beyond the 2020s. More importantly, as the cost of AI computing decreases, a future where AI robots seamlessly integrate into urban environments and coexist with humans becomes an exciting and plausible prospect. The future is indeed exciting!

About the Author:

Eric Wu is a software engineer at Tesla, contributing to the software development of chips for Tesla's Full Self-Driving (FSD) computer. He played a vital role in the successful launch of the Tesla Semi in 2022 and continues to contribute to the development of various new products at Tesla. Eric holds a Master of Engineering degree from UC Berkeley, specializing in integrated circuit design, where he helped with the tape-out of a wireless IC using the RISC-V architecture. In addition, in 2017 and 2018, Eric and his team designed a sign-language translation system and a self-driving robot car, winning second place and first place in prestigious contests hosted by Arm and Synopsys. He is contributing this article as a member of the North America Taiwanese Engineering and Science Association (NATEA).