AI Inference Revolution: Wallace Kou on Memory Shifts

The global semiconductor landscape is undergoing a fundamental shift, moving from a focus on raw training power to the practical complexities of large-scale deployment. In an in-depth interview, Wallace Kou, President and CEO of Silicon Motion, detailed how generative AI has evolved beyond its initial stages. While the market's early gaze was fixed almost exclusively on NVIDIA's GPUs, High Bandwidth Memory (HBM), and CoWoS advanced packaging, Kou argues that the industry is now entering the "Inference" era, which is overturning the earlier underestimation of storage's importance.

The Shift from Training to Inference

The turning point for this realization came at NVIDIA's GTC conference in March 2026, where CEO Jensen Huang unveiled the Vera Rubin architecture, a move that signaled a massive spike in demand for NAND flash memory. During the initial AI boom, the industry was preoccupied with training massive models, a process that relies heavily on the lightning-fast throughput of HBM. However, as these models move into the inference phase, where end users actually put them to work generating content or solving problems, access to context, historical data, and large stored datasets becomes the primary bottleneck.

Kou notes a dramatic shift in market sentiment. Only two years ago, storage was often an afterthought in the AI conversation; today, it is a critical scarcity. "There is currently not a single global cloud service provider or major smartphone manufacturer whose demand for DRAM and NAND is being fully satisfied," Kou observed. This supply-demand gap has triggered a financial windfall for storage module manufacturers and memory giants, with some stock prices skyrocketing up to tenfold as the market reacts to persistent shortages and rising prices.

Technical Paradigm Shift: CMX and the Infrastructure of Thought

At the heart of this transition is a new architecture introduced by NVIDIA: the CMX Context Memory Storage platform. It is designed specifically to handle the "KV Cache" (Key-Value Cache), which stores the attention keys and values already computed for earlier tokens so that a model can retain the context of a conversation or complex task during inference without recomputing it.
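A rough sizing calculation shows why the KV cache can outgrow GPU memory and spill into storage. The sketch below uses the standard per-token accounting for a transformer's cached keys and values: two tensors (K and V) per layer, each with one vector per KV head. The model dimensions (80 layers, 8 KV heads, head dimension 128, 16-bit values) are hypothetical and are not taken from any NVIDIA or Silicon Motion material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions; not any specific product."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_elem: int  # 2 for FP16/BF16


def kv_cache_bytes(shape: ModelShape, seq_len: int, batch: int = 1) -> int:
    """Bytes needed to cache keys and values for `seq_len` tokens.

    The leading factor of 2 accounts for storing both K and V in every layer.
    """
    per_token = (2 * shape.num_layers * shape.num_kv_heads
                 * shape.head_dim * shape.bytes_per_elem)
    return per_token * seq_len * batch


if __name__ == "__main__":
    model = ModelShape(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_elem=2)
    for tokens in (4_096, 32_768, 131_072):
        gib = kv_cache_bytes(model, tokens) / 2**30
        print(f"{tokens:>7} tokens -> {gib:6.2f} GiB per sequence")
```

Under these assumptions a single 128K-token context needs about 40 GiB of cache, and the total grows linearly with both context length and the number of concurrent sessions, which is why a dedicated storage tier for context becomes attractive at scale.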

The hardware requirements for the CMX architecture are staggering in their scale and technical demands. Each individual Rubin GPU requires 16TB of dedicated storage to function effectively within this framework. At the system level, a single Vera Rubin NVL72 setup can demand more than 1 petabyte (1,000 terabytes) of total storage capacity. Beyond capacity, the CMX architecture lets GPUs access storage directly, bypassing traditional latency bottlenecks and keeping AI inference fluid and responsive.
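The system-level figure follows from the per-GPU one. Assuming the rack holds 72 Rubin GPUs, as the system's name implies, and using decimal units as storage vendors do (1 PB = 1,000 TB):

```latex
\[
72 \times 16\,\text{TB} = 1{,}152\,\text{TB} \approx 1.15\,\text{PB} > 1\,\text{PB}
\]
```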

While this creates a massive commercial opportunity for the storage industry, it also places an unprecedented strain on NAND production. Kou emphasizes that this is not just a cloud-based phenomenon. The explosion of Edge AI - AI processed locally on devices - is further complicating the supply chain. For instance, driven by major players like Meta, the market for smart glasses is expected to reach 60 million units this year. These wearable devices require high-performance embedded storage, creating a secondary front in the war for NAND capacity.

Silicon Motion's Role: Solving the QoS Bottleneck

As the world's leading NAND controller maker, Silicon Motion sits at the intersection of these competing demands. The primary technical challenge in modern AI environments is maintaining Quality of Service (QoS). In a multi-tenant cloud environment, where multiple GPUs are accessing shared storage simultaneously for different inference tasks, data transfer speeds can often fluctuate or drop.

To solve this, Silicon Motion has deployed its proprietary PerformaShape technology. This technology ensures that even under heavy, concurrent workloads, the transmission speed remains stable. By stabilizing these data flows, Silicon Motion has positioned itself as an "indispensable stabilizer" in the AI ecosystem.
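The interview does not describe how PerformaShape works internally. As a generic illustration of one common way storage systems keep per-tenant throughput predictable, the sketch below gives each tenant its own token bucket, so a bursty "noisy neighbor" cannot eat into another tenant's bandwidth. This is not Silicon Motion's algorithm; the tenant names, rates, and the abstract "tick" time unit are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Classic token bucket: tokens (MB) accrue at `rate` per tick up to `capacity`."""
    rate: float
    capacity: float
    tokens: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start full so a tenant may burst once

    def refill(self, ticks: float = 1.0) -> None:
        self.tokens = min(self.capacity, self.tokens + self.rate * ticks)

    def take(self, amount: float) -> float:
        """Grant as much of `amount` as the current budget allows."""
        grant = min(amount, self.tokens)
        self.tokens -= grant
        return grant


def simulate(demands: dict[str, float], rate: float, burst: float,
             ticks: int) -> dict[str, float]:
    """Serve each tenant's per-tick demand (MB) through its own bucket.

    Assumption: per-tenant rates are sized so their sum fits the device's
    total bandwidth. Unserved demand is simply dropped for that tick here;
    a real controller would queue it instead.
    """
    buckets = {name: TokenBucket(rate, burst) for name in demands}
    served = {name: 0.0 for name in demands}
    for _ in range(ticks):
        for name, want in demands.items():
            served[name] += buckets[name].take(want)
            buckets[name].refill()
    return served


if __name__ == "__main__":
    result = simulate({"noisy": 500.0, "steady": 100.0},
                      rate=200.0, burst=200.0, ticks=10)
    # The noisy tenant is capped at its 200 MB/tick budget, while the
    # steady tenant still receives its full 100 MB every tick.
    print(result)
```

The design choice is isolation: because each tenant draws from a separate budget, one tenant's burst changes only its own throughput, which is the kind of stability the QoS discussion above is concerned with.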

Beyond data path optimization, Silicon Motion is also extending its role into system-level infrastructure by providing enterprise-grade boot drives for leading AI GPU, TPU, and DPU platforms, ensuring system reliability and fast initialization at scale.

The Crisis of Imbalance: Kou's "Capacity Persuasion" Efforts

Despite the record-breaking revenues, Kou is deeply concerned about the "shadows" lurking behind this prosperity. The current memory market is suffering from a dangerous imbalance. To maximize profits and satisfy the insatiable hunger of AI cloud giants, major manufacturers like Samsung, SK Hynix, and Micron are funneling the majority of their capital expenditure (CAPEX) into HBM and DDR5 production.

This strategic pivot has effectively "squeezed" the production capacity available for standard NAND flash. Kou warns that this "AI squeezing effect" could lead to a collapse in traditional sectors. Over the past eight months, Kou has embarked on a global mission, meeting with leaders at Samsung, SK Hynix, Kioxia, SanDisk, YMTC, and Micron. His message is one of "capacity persuasion": he is urging these giants to reserve a portion of their production lines for the automotive, PC, and smartphone industries.

"If these foundational industries break because they cannot find parts, Edge AI will have no 'soil' to grow in," Kou warned. He believes that a total focus on the high-margin AI server market could eventually backfire, destroying the broader technology ecosystem that supports AI development.

A Stabilizing Strategy: From Cloud to Edge

Silicon Motion is positioning itself as the "transition enabler" for an industry in flux amid an expected 2–3 year supply shortage. As NAND manufacturers concentrate their internal resources on AI-driven initiatives, they are increasingly outsourcing non-core and mainstream projects, such as PCIe Gen5 controllers and embedded solutions. In this shift, Silicon Motion has emerged as a preferred partner to fill the resulting gap.

At the same time, as rising prices weigh on demand in the PC and smartphone markets, the company is helping customers pivot toward automotive and AIoT applications, including rapidly growing segments such as smart glasses, which are seeing a surge in shipments this year.

One of the most critical areas is the automotive sector, where Silicon Motion has spent a decade building a presence. While memory giants might see automotive requirements as "niche" or low volume compared to AI servers, Kou views them as essential to global stability. When major memory suppliers consider abandoning these specialized requirements due to capacity constraints, Silicon Motion steps in to ensure the global automotive supply chain does not grind to a halt.

"We are not just looking for a surge in revenue; we want to fulfill our responsibility to the industry," Kou said. By providing stable controllers and storage solutions for AIoT and automotive applications, Silicon Motion is effectively repairing the cracks in a fractured global supply chain.

Future Outlook: 2027 and Beyond

The current supply-demand imbalance is not a temporary glitch but a structural reality that Kou expects to persist until at least late 2027 or 2028. Several factors make it nearly impossible to add capacity quickly: land acquisition is increasingly difficult, and the lead time for building specialized cleanrooms and procuring critical equipment now exceeds one year.

Kou predicts that while the DRAM shortage might begin to ease by the end of 2027, the relief for NAND will likely come even later. In this high-pressure environment, Silicon Motion's role as a key stabilizing force becomes increasingly important.

Particularly in emerging sectors such as smart IoT and automotive applications, Silicon Motion delivers reliable controller and storage solutions, filling the vacuum left by production shifts at major manufacturers or by projects lacking sufficient engineering support.

By helping global clients navigate the complexities of geopolitics and capacity wars, Silicon Motion aims to ensure that the AI revolution leads to a steady, sustainable future rather than a chaotic collapse of the broader tech industry.

AI inference boom fuels supply-demand imbalance until 2027-2028, says Wallace Kou. Credit: Silicon Motion