How productization elevated GPT to ChatGPT: the evolution of language models

Yenkai (YK) Huang, special to DIGITIMES Asia, Taipei


ChatGPT, the chatbot built on the GPT language model developed by OpenAI, has become the fastest-growing consumer application in history, according to a UBS study. Since its release in 2020, GPT-3 has been praised for its ability to generate human-like text. These factors, along with ChatGPT's impressive two-month surge to 100 million users, make it an excellent case study in technology productization.

Facing competition from Google's Bard and Meta's LLaMA, the GPT-4-based ChatGPT still tops user rankings, which can be largely credited to its productization process. In this article, we'll dive deep into the key differences among large language models and offer insight into how OpenAI's GPT model creates a moat in the highly competitive market for large language model (LLM) technology.

Human Intervention: Tailoring Language Models to Preferences

As LLMs continue to proliferate, most of these models are based on the transformer architecture first published by Google Brain in 2017. In the five years between that publication and the launch of ChatGPT at the end of 2022 (and the GPT-3 model before it), the largest improvement, aside from the sleek and minimalist chatbot interface, is the model's honed ability to tailor results to human preferences, guided by human feedback.

The technique is called "reinforcement learning from human feedback" (RLHF), which teams at OpenAI describe as working remarkably well and which has become essential to developing state-of-the-art models. Given that every language model developer collects as much data as possible, it is the human feedback that sets the models apart.

Establishing Dominance: RLHF as a Strategic Moat

For InstructGPT, which preceded GPT-3.5, OpenAI employed a team of 40 labelers. Rather than subject-matter experts in particular verticals, they played a role closer to "communication coaches," setting the tone and exemplifying the narration of the language model. Their feedback was significantly amplified during the reward modeling (RM) and reinforcement learning (RL) phases, the mechanisms that set targets for the learning algorithm so the model can improve itself toward the goal on the benchmark.
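As a rough illustration of the reward-modeling step (not OpenAI's actual implementation), the idea can be sketched with a toy linear reward model trained on labeler preferences: given pairs of responses where a labeler marked one as better, the model learns to score the preferred response higher. The features and comparison data below are entirely made up for illustration.

```python
import math

def reward(features, weights):
    """Linear reward model: score = w . x (a stand-in for a neural net head)."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, preferred, rejected):
    """-log sigmoid(r_preferred - r_rejected): low when the ranking is correct."""
    margin = reward(preferred, weights) - reward(rejected, weights)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def train_reward_model(comparisons, dim, lr=0.1, epochs=200):
    """Gradient descent on labeler comparisons, given as (preferred, rejected) pairs."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(preferred, weights) - reward(rejected, weights)
            # Gradient of -log sigmoid(margin) with respect to the margin
            g = -1.0 / (1.0 + math.exp(margin))
            for i in range(dim):
                weights[i] -= lr * g * (preferred[i] - rejected[i])
    return weights

# Hypothetical features: [helpfulness cue, verbosity]. The labelers here
# prefer helpful, concise answers; training should reflect that preference.
comparisons = [
    ([1.0, 0.2], [0.1, 0.9]),
    ([0.9, 0.1], [0.2, 0.8]),
]
w = train_reward_model(comparisons, dim=2)
```

In the subsequent RL phase, the language model itself is updated to produce outputs that this learned reward model scores highly, which is how a handful of labelers' preferences get amplified across the whole model.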

RLHF has emerged as the linchpin in transforming cutting-edge language model technology from a theoretical construct into a practical tool ready for mainstream adoption. In contrast, earlier trailblazers such as Google did not devote sufficient attention to this crucial process of productization, leaving a notable gap between invention and application.

Identifying this untapped potential, OpenAI strategically focused on the conversational aspect of GPT. This pioneering step effectively catapulted OpenAI to the forefront of the industry, granting the company and its backer, Microsoft, a coveted first-mover advantage.

The Power of Metrics: Steering the Direction of Language Models

In the process of building a language model, metric-setting is a crucial product decision rather than a merely technical one. The productization of language models is shaped by their vision, goals, and metrics, which differentiate them from one another. Comparing GPT with Google's LaMDA model, which powers Bard and is also based on the transformer architecture, makes it evident that the two took different directions.

In particular, the quality metrics for Google's LaMDA were sensibleness, specificity, and interestingness (SSI), according to its release note. That means the text generated by the model needs to be coherent, detailed, and "interesting." OpenAI, on the other hand, charted a different course for its GPT models. Instead of aiming for interestingness, the focus is on helpfulness, ensuring that the model provides useful responses; this makes the look and feel of GPT's narration a far cry from LaMDA's.

In hindsight, helpfulness seems to far outweigh interestingness, which even backfired in practical use of the language model. The sense of humor and emotion derived from the interestingness metric led one of Google's engineers to claim that LaMDA had actual, human-like feelings. Once the model was shown to be, at its core, a probability distribution over the next word, the claim proved groundless.

The distinction in quality-metric setting is just one nuance in the productization process of language models, and many fine details set language model products apart. In real-world use cases, RLHF is more art than science. According to Sam Altman, there are no plans to increase the number of parameters or the computation power of the upcoming model. At the current scale of 175 billion parameters in GPT-3.5, scientists have learned how to "do more with less." Blindly scaling up training data will not provide a significant edge over the existing GPT model. This suggests that future advancements in language models will rely more on improving the productization process than on simply increasing the scale of training data.

In the context of AI language model products, given that everything is under the hood of deep learning, the tone and narration of the model become a critical component of the user interface that makes a first impression on everyday users. In this highly competitive LLM arena, the user experience is a key differentiator. Thus, it is essential to focus on the practical application and user experience of these models, rather than simply on model performance, to establish a competitive edge and a lock-in effect.

Small Margin of Error Determines Winners and Losers

Still, it is too early to claim that OpenAI's GPT holds the one and only secret sauce for creating the "right" model for large-scale deployment. As global awareness of large language models grows, it is predictable that the performance of models developed by the top players will become fairly comparable, or at least indiscernible from end users' perspectives. That said, OpenAI has achieved product-market fit better and faster than anyone else.

Even though the gap in performance benchmarks between models might be negligible, the first round of the AI war might be a game of "winner takes all," for the following reasons:

The Habit Loop: From Familiarity to the Lock-in Effect in AI

To productize a language model, we want users to form a habit of using it as often as possible. In deploying AI models, this tendency can play a significant role in creating a lock-in effect, where users become accustomed to a particular model or solution and are reluctant to switch to alternatives, even ones that score higher on some benchmark datasets.

That's the flywheel of the AI era: more data drives higher-quality output, which in turn attracts more users and more data. The GPT-3.5 breakthrough itself was driven by a massive amount of computation.

The Economic Burden of Ethics: The Cost of Moderating Toxic Content in AI

The requirement for AI to be responsible has become the main bottleneck in the development of language models. As a formerly small startup, OpenAI faced far less reputational risk than the incumbents did when its model produced so-called "AI hallucinations," or untruthful information.

Besides hallucinations, one of the key topics in AI ethics and governance is language toxicity. Because models are trained on a human-generated text corpus, part of the training data from the internet is replete with toxicity and bias. In its quest to make ChatGPT less toxic, OpenAI used outsourced Kenyan labor. Even a team of hundreds of offshore contractors would have needed a significant amount of time to trawl through the enormous dataset manually. While the mechanics of building language models have been largely unveiled, how to rightfully govern the safety of models in production remains undisclosed to the public. All that is known is that even for OpenAI, inarguably one of the most cutting-edge AI companies, there was no easy way of purging those toxic sections other than outsourcing to offshore contractors in Eastern Africa. That cost essentially builds a moat against new market entrants.

The Safety Net: How Decision-Makers Mitigate Risk in AI Deployment

"Nobody ever got fired for buying IBM." This commonly used expression illustrates the risk-averse mentality in IT decision-making.

It's especially true for large-scale IT deployment, such as the use cases of AI. Decision-makers often opt for well-established and trusted options to minimize risk and potential negative consequences. By choosing a reputable and reliable model, the risks of deploying AI can be mitigated.

With the growing, dominant popularity of OpenAI's GPT model backed by Microsoft, the margin of error left for competitors' solutions to win over the first mover becomes ever narrower.

In conclusion, the meteoric rise of OpenAI's ChatGPT offers invaluable insights into the productization of large language models, which has fundamentally reshaped the AI industry. Its success can be traced back to strategic decisions in RLHF, the choice of quality metrics, and the expert crafting of the user interface, underlining the critical role of a human-centric approach in differentiating ChatGPT in a highly competitive market. Additionally, we shed light on the significant influence of a lock-in effect in fostering user loyalty and the importance of addressing AI ethics to maintain reputation and ensure safe deployment.

While the performance benchmarks of various models might appear similar, ChatGPT's ability to swiftly attain product-market fit sets it apart. This examination of its evolution prompts us to realize that the battle for dominance in the AI field may not be solely about technological prowess but rather about creating superior user experiences, mastering the productization process, and ensuring responsible deployment.

As we navigate this exciting era of AI, the journey from GPT to ChatGPT serves as a beacon guiding the way. As the field continues to evolve, our understanding of the essence of productization will define the winners in this high-stakes game of AI.

Author's Bio

Yenkai (YK) Huang, a board member of the North America Taiwanese Engineering & Science Association (NATEA), serves as an Engineering Product Manager at Cisco Systems, focusing on product management, data analytics, and AI and ML technologies. His responsibilities include forming product roadmap strategies, managing cloud-native product development, and utilizing data analytics for decision-making. He has consulting experience from Accenture and holds an MBA from the Haas School of Business, UC Berkeley, along with an M.S. degree from National Taiwan University. His diverse experiences contribute to an in-depth understanding and insights into technology trends and product management.