DeepSeek closed the announcement of its V4 model with a line from the ancient Chinese thinker Xunzi: ignore applause and criticism, and focus on doing things the right way. In today's AI context, the message reads less like philosophy and more like positioning.
For the past two years, the generative AI race has been largely defined by scale. Model size, compute intensity, and benchmark rankings became shorthand for capability, driving companies such as OpenAI, Google, and Anthropic to push toward ever larger architectures and longer context windows. The result has been a steady escalation in computing demand and cost.
DeepSeek's approach diverges at a critical point. Rather than maximising scale, it focuses on whether comparable or better performance can be achieved with less computing power. Techniques such as mixture-of-experts (MoE) architectures and attention optimisation aim to reduce redundant calculations and concentrate resources where they matter most. In effect, the competitive question shifts from who spends more compute to who uses it more efficiently.
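As a rough illustration of how an MoE layer concentrates compute, the sketch below shows generic top-k gating: a router scores every expert, but only the k highest-scoring experts are actually run for a given input. This is a minimal sketch of the general technique, not DeepSeek's router. The function names, the choice of k=2, and the toy experts are all assumptions made for illustration.

```python
import math


def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalise their weights.

    Hypothetical helper: returns a list of (expert_index, weight) pairs.
    """
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest router scores (ties keep the lower index first).
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected scores only, shifted by the max for stability.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))


# Toy usage: four "experts" that simply scale their input by a fixed factor.
experts = [lambda x, s=s: x * s for s in (1, 10, 100, 1000)]
output = moe_forward(1.0, experts, gate_logits=[0.1, 2.0, -1.0, 2.0], k=2)
```

With four experts and k=2, half of the expert computation is skipped for each input. That is the sense in which routing reduces redundant calculation.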
This design philosophy is no longer abstract. It is beginning to align with how the market itself is evolving.
Inference economics reshape AI stack
A structural shift is underway from training-centric to inference-driven demand. According to Omdia, inference already accounts for 90% to 95% of enterprise AI workloads. As AI agents and real-time applications scale, the cost of generating outputs, rather than training models, has become the dominant economic factor.
This transition is also changing how computing is priced. A cloud infrastructure expert from Nebius noted in an AlphaSense interview that the industry is moving away from GPU-hour pricing toward a more intuitive metric: cost per million tokens. Under this framework, peak performance matters less than cost efficiency per unit of output.
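The relationship between the two pricing metrics is simple arithmetic, sketched below. The function name and the prices and throughput figures in the usage lines are hypothetical, chosen only to show how a slower accelerator can still win on cost per token.

```python
def cost_per_million_tokens(gpu_hour_price_usd, tokens_per_second):
    """Convert an hourly accelerator price into a cost per million generated tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_price_usd / tokens_per_hour * 1_000_000


# Hypothetical figures, not vendor data: a fast, expensive chip versus a
# slower, cheaper one serving the same model.
fast_chip = cost_per_million_tokens(gpu_hour_price_usd=4.0, tokens_per_second=1500)
cheap_chip = cost_per_million_tokens(gpu_hour_price_usd=1.5, tokens_per_second=800)
```

Under these assumed numbers, the cheaper chip produces tokens at a lower unit cost despite roughly half the throughput, which is why peak performance alone stops being the deciding metric.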
That shift is beginning to erode the long-standing dominance of Nvidia, whose GPUs have powered more than 90% of AI workloads and delivered gross margins of around 75%. While high-end chips such as H100, H200, and Blackwell B200 remain central to training, their cost structures are less optimised for large-scale inference.
In response, major technology firms, including Amazon, Meta, and Google, are investing heavily in inference-focused AI ASICs to reduce token-level costs and limit dependence on Nvidia's ecosystem. Goldman Sachs estimates that Google's TPU development, in partnership with Broadcom, has cut inference costs by around 70% from one generation to the next.
Startups are also entering the field. Cerebras has filed for an IPO backed by a multi-billion-dollar agreement with OpenAI to develop custom inference chips, while architectures such as Groq's LPU are demonstrating significantly higher token throughput at lower cost.
Against this backdrop, DeepSeek's design choices begin to look less contrarian and more aligned with where the market is heading.

Credit: DIGITIMES
Cost collapse signals a different competitive axis
DeepSeek V4 embodies this shift in practical terms. The model family, including V4-Pro and V4-Flash, combines large-scale architectures with aggressive optimisation for inference efficiency, supporting context lengths of up to one million tokens while focusing on latency and cost control.
The results are reflected most clearly in pricing. DeepSeek's API output cost is roughly one-hundredth that of comparable offerings from OpenAI, according to disclosed figures. This is not simply a pricing strategy but the outcome of architectural decisions, including KV cache optimisation, lower-precision computation such as FP4, and more efficient expert routing.
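To see why cache size and numeric precision matter at long context lengths, the sketch below estimates the memory a key-value cache would need. It assumes a standard attention layout in which every layer stores one key and one value vector per token per KV head. DeepSeek's actual cache design is not described here, and the model dimensions used are hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value):
    """Estimate KV cache size for one sequence under a standard attention layout.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, evaluated at a one-million-token context.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=1_000_000)
fp16_gb = kv_cache_bytes(**shape, bytes_per_value=2) / 1e9    # 16-bit values
fp4_gb = kv_cache_bytes(**shape, bytes_per_value=0.5) / 1e9   # 4-bit values
```

Because the cache grows linearly with both context length and bytes per value, moving from 16-bit to 4-bit storage would cut the footprint to a quarter in this simplified model. Compressing or sharing cached entries shrinks it further.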
A key enabler is hardware independence. DeepSeek V4 is deeply adapted to Huawei Ascend chips, reducing reliance on Nvidia GPUs and aligning with a broader push toward alternative AI compute ecosystems. Performance gains of up to 1.96 times in latency-sensitive scenarios have been reported under this setup, with multiple domestic chipmakers completing compatibility validation.
This integration also hints at a wider structural shift. As inference becomes the dominant workload, the competitive advantage may increasingly lie not in proprietary chips alone, but in tightly integrated hardware-software stacks optimised for cost per token.
Even Nvidia has acknowledged the direction of travel. The company has moved to strengthen its position in inference through deals such as its acquisition of Groq's LPU architecture, recognising that future demand will not be driven solely by training-scale compute.
From scale to efficiency, and from capability to usability
DeepSeek's framing of its strategy, echoing Xunzi's emphasis on disciplined execution, maps onto a broader recalibration in the AI industry. The focus is shifting from theoretical performance ceilings to practical deployment metrics: cost, efficiency, and reliability.
This includes not only computational efficiency but also model behaviour. As large models move into decision-making and reasoning tasks, stability and trustworthiness, achieved through alignment and reduced hallucination, are becoming as critical as raw capability.
In this context, DeepSeek V4's significance may lie less in any single benchmark and more in what it represents. It points to an alternative development path where engineering efficiency, economic viability, and system integration define competitiveness.
The implication is clear. The next phase of the AI race will not be decided solely by who builds the largest models, but by who can make them usable at scale.
Article edited by Jack Wu