BLOG POST
by Poomjai Nacaskul, Ph.D.
Chulalongkorn School of Integrated Innovation (ScII),
Poomjai.N@chula.ac.th
— 2023.06.07 —
1. First of all, let’s get the nomenclature out of the way.
GPT (Generative Pre-trained Transformer) was what we now refer to generically as a Large Language Model (LLM), first introduced to the world in 2018 by OpenAI, who then followed up with GPT-2 in February 2019, GPT-3 in June 2020, and GPT-4 in March 2023. Hence, retroactively, one may refer to the original 2018 GPT as GPT-1, and henceforth “GPT” is reserved for the family/series of OpenAI’s LLMs. ChatGPT, introduced in November 2022 and thus belonging to the same generation as GPT-3, was pre-trained on a smaller task than that of GPT-3, and intended by OpenAI as a tour-de-force demonstration of conversationalist AI technology. Since then we have seen a myriad of other GPT-based LLMs, pre-trained on a whole variety of specialised tasks, whence “ThisGPT” and “ThatGPT” have come on the scene.
Funnily enough (or infuriatingly confusingly, depending on your sense of humour), “generative pre-trained transformer” also best describes just about every LLM out there competing for AI supremacy, even those that do not go by the “GPT” moniker, in particular Google’s Bard (introduced in March 2023 as a direct competitor to ChatGPT), itself based on/around LaMDA/LaMDA 2 (released in June 2021/May 2022). My guess is that the term LLM was coined precisely in case OpenAI succeeded in trademarking “GPT”!
2. What’s in the name?
There are countless online blogs and YouTube explainer videos on the very topic of how GPT works. Here I’ll try to approach it from a different angle, teasing out the secret of GPT from its very name. And so, in that spirit, let’s rhetorically ask 3 questions: What is “generative” about GPT? Where does the “pre-trained” part come in? What exactly does “transformer” refer to?
A correct, albeit not particularly useful, answer would be thus: “transformer” is the name given to a class of deep learning methodology that GPT utilizes; “pre-trained” highlights the fact that GPT is not just an empty deep learning architecture, but one that had already been trained on a ginormous language corpus; “generative” refers to the manner in which GPT outputs responses to our (human) prompts. So let’s delve/dig a bit deeper.
a. What is “generative” about GPT?
Let’s start with garden-variety predictive analytics. Suppose we have a Machine Learning (ML) engine, say, a generic Artificial Neural Network (ANN), that, once trained on past data, can fairly accurately predict whether or not it is going to rain tomorrow. That would certainly be a useful thing to have. But a more useful model would be one capable of predicting rainfall over the next, say, two weeks. We could formulate this as a Supervised Learning problem in one of two ways.
One way is to create an ANN that must always output 14 numbers, one corresponding to the likelihood of rain tomorrow, another corresponding to the likelihood the day after, and so on. Historically this is the preferred method, as it allows the data scientists to tune the accuracy tolerance over the span of 14 days into the future. Let’s call this a vector prediction formulation (vector being an ordered set of numbers, 14 in this case).
Another way is to create an ANN that outputs just 1 number, the likelihood of rain tomorrow, then use said output as input to output the likelihood of rain the day after, and so on, until 14 outputs are produced. Historically this is a terrible method, as inaccuracies quickly pile on from tomorrow’s prediction to the day after, etc. Let’s call this a scalar prediction formulation. In essence, this formulation is considered generative to the extent that the input used to predict the day after is not a real piece of data, but an output that the model itself generated in the prior step.
But while historically unsuccessful, the “scalar prediction” formulation holds a greater appeal, if only its intrinsic tendency to degenerate very quickly in terms of predictive performance can be overcome. For one thing, we no longer need to pre-specify a fixed horizon. In principle, the same predictive engine could be used to predict 1 day ahead, 2 weeks ahead, or months ahead.
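To make the contrast concrete, here is a minimal sketch in Python/NumPy of the two formulations side by side. The weights, feature counts, and function names (vector_model, scalar_model) are made-up stand-ins for an actual trained ANN, not a real weather model.

import numpy as np

# Toy sketch of the two formulations above, with made-up weights standing in
# for a trained network (not a real weather model).
rng = np.random.default_rng(0)
n_features = 5                                   # e.g. 5 meteorological readings
W_vec = rng.normal(size=(14, n_features))        # stand-in weights, vector model
w_scl = rng.normal(size=n_features)              # stand-in weights, scalar model

def vector_model(x):
    """Vector formulation: one pass emits all 14 rain likelihoods."""
    return 1.0 / (1.0 + np.exp(-W_vec @ x))

def scalar_model(x):
    """Scalar formulation: emits a single next-day rain likelihood."""
    return 1.0 / (1.0 + np.exp(-w_scl @ x))

today = rng.normal(size=n_features)

# Vector prediction: a fixed 14-day horizon produced in one shot.
two_weeks_ahead = vector_model(today)

# Scalar (generative) prediction: each output is fed back in as an input,
# so the same engine can roll forward 1 day, 14 days, or months.
window, horizon = today.copy(), []
for _ in range(14):
    p = scalar_model(window)
    horizon.append(p)
    window = np.append(window[1:], p)            # model's own output becomes input

The point of the second loop is precisely the point made above: nothing in it hard-wires “14”, so the same engine can in principle be rolled forward over any horizon.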
Now let’s turn our attention to the task of creating a conversational AI, which we can immediately cast as a prediction problem thus: in lieu of input from past/present meteorological conditions, here input comes in the form of our (human) prompts; in lieu of output signifying future likelihood of rainfall, here the output sought after is a word/sequence of words; in lieu of forecast accuracy being the overriding criterion, here the linguistic sensibility of the conversational response is the overriding criterion.
Analogous to a vector prediction formulation, our conversational AI would always respond in the form of a sentence with exactly the same pre-specified number of words, immediately arousing the suspicion that we are in fact chatting with a bot. But formulated and trained as a scalar prediction engine, our conversational AI can reply initially with one word, which is then used as input to generate a follow-up word, and so on until a special word signifying “end of sentence” is predicted as the appropriate output.
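A minimal sketch of that word-by-word loop might look like the following, where next_word is a purely hypothetical stand-in for the trained language model and the canned vocabulary exists only so the snippet runs end to end.

END = "<end-of-sentence>"

def next_word(context):
    """Hypothetical stand-in for the trained model: given everything said so
    far (prompt + words it has already generated), return one more word."""
    canned = ["It", "will", "probably", "rain", "tomorrow.", END]
    generated_so_far = len(context) - 4          # 4 = length of the prompt below
    return canned[min(generated_so_far, len(canned) - 1)]

prompt = ["Will", "it", "rain", "tomorrow?"]
reply = []
while True:
    word = next_word(prompt + reply)             # model's own output fed back in
    if word == END:                              # special "end of sentence" word
        break
    reply.append(word)

print(" ".join(reply))                           # "It will probably rain tomorrow."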
In this way, the GPT breakthrough can be rephrased as the breakthrough in terms of getting the generative, scalar prediction formulation to work within the context of predicting appropriate verbal response, all without allowing the produced string of words to quickly degenerate to nonsense.
It is in this sense that all LLMs are “generative”, i.e. regardless of whether there is a capital “G” in the name of the algorithm.
b. Where does the “pre-trained” part come in?
The short version of the answer to this question (given above) is correct. But the implication is deep and as a matter of fact takes us to the very definition of “intelligence” itself.
You see, it would be only a slight exaggeration to say that GPT has been pre-trained on THE Internet, all of it (in reality, as much of the internet as could possibly be mustered by the folks at OpenAI, given the fast-expanding computational capacity at their disposal). It is a wonder, but at the same time, it is no wonder that ChatGPT speaks perfect English and knows everything there is to know (or just about everything that could possibly be googled).
Back in the ’50s/’60s, the word “artificial” in “artificial intelligence” meant convincing mimicry, and the word “intelligence” there meant human-like interactive response. So an AI, one that successfully passed the Turing Test, needed only convince an experimenter that he/she is conversing with a human (technically, that he/she has no expressible means of distinguishing whether it is a human or some artificial algorithm generating the responses). If an AI can mimic the response of the most unintelligent person on earth, that’ll do. To wit, a chatbot that convincingly gives the dumbest investment advice imaginable may still qualify as a successful conversational AI.
More appropriate to what the 2020s’ LLMs have shown to be possible, “artificial” now essentially refers to the fact that it is a human creation, whence an “artifice” or “artifact”: a family/series of algorithmic-computational engines manufactured and created by a crack team of human (at least for now) computer scientists, data scientists, linguistic experts, and so on; and “intelligence” must now qualify the quality of the knowledge base expressed in the response. The very fact that a few of us take gleeful delight in making fun of ChatGPT on the rare occasions it produces incorrect answers is a testament to the new reality, one in which we now expect a computational-algorithmic artifact to give us genuinely intelligent solutions, ones equivalent not just to any human, but to, say, a young adult human with immediate access to Google and the internet!
c. What exactly does “transformer” refer to?
This is perhaps the only unavoidably technical bit of the article. But I will nonetheless try to cobble together a non-technical explanation thereof.
Let’s detour first to the world of time-series econometrics. There, a majority of methods are known as “autoregressive” models. The “regressive” part has to do with the fact that prediction is formulated as a form of regression equation, and the “auto” part indicates that the regression equation only takes on past values of the time series itself as inputs. So predicting the occurrence of rain tomorrow from today’s barometric/hygrometric readings would not count as autoregressive, nor would predicting future stock returns from the company’s various financial ratios. But predicting future occurrences of rain from daily records of rain over the past month would, as would predicting future stock returns from past stock returns.
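As a toy illustration (with made-up rainfall numbers, and an ordinary least-squares fit standing in for proper econometric estimation), an autoregressive rain model uses nothing but the series’ own past values:

import numpy as np

# Toy illustration of "autoregressive": predict today's rainfall purely from
# the previous p days of rainfall (no barometric or other outside inputs).
rng = np.random.default_rng(1)
rain = rng.gamma(shape=2.0, scale=1.5, size=120)   # made-up daily rainfall (mm)

p = 7                                              # look back one week
# Build the regression: y_t explained only by y_{t-1}, ..., y_{t-p}.
X = np.column_stack([rain[i : len(rain) - p + i] for i in range(p)])
y = rain[p:]
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(X)), X]), y, rcond=None)

# One-step-ahead forecast from the last p observed days.
last_week = rain[-p:]
forecast = coef[0] + last_week @ coef[1:]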
Long story short, “transformer” is the name given to a deep learning methodology representing an amalgam of many technical innovations, tools, and tricks that came out of deep learning research, especially over the past 5-10 years or so, the most crucial element of which is the so-called “self-attention mechanism” [1], which for our purpose can be thought of as an “autoregressive” means of generating the very next word from any/all of the previous words, with “self” being analogous to “auto” and “attention” analogous to “regressive”.
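For the technically curious, here is a bare-bones sketch of the scaled dot-product self-attention step from [1], in NumPy, with arbitrary stand-in weights and dimensions; a real transformer adds multiple attention heads, positional encodings, feed-forward layers, and much more.

import numpy as np

# Bare-bones self-attention: each word's vector is rebuilt as a weighted mix
# of the words before it, with the weights ("attention") computed from the
# words themselves. Dimensions and weight matrices are arbitrary stand-ins.
rng = np.random.default_rng(2)
n_words, d = 5, 8                       # 5 words so far, 8-dim vector per word
x = rng.normal(size=(n_words, d))       # one row per word already in the sentence

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv        # queries, keys, values

scores = Q @ K.T / np.sqrt(d)           # how strongly each word attends to each other word
mask = np.triu(np.ones((n_words, n_words)), k=1).astype(bool)
scores[mask] = -np.inf                  # a word may only look at earlier words ("auto")

weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # softmax, row by row
attended = weights @ V                  # each word re-expressed via the words before it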
3. We haven’t even got to the good part yet!
In terms of technical foundation, that’s pretty much the story. But of course, the tremendous advancement came about not only from one or two conceptual breakthroughs, but from incremental accumulation of knowledge and knowhow built up by the scientific community over many decades. There are multiple layers of deep dives that mere pages will not do justice. So I won’t.
Instead, my main message is that we are only at the beginning chapters of the GPT/LLM saga. There’s a lot more to come. And no, I don’t mean how many mid-level “white-collar” professional jobs (accounting, legal, management, marketing, consulting, journalism, coding, data analytics, etc.) will be taken over, or, for the more optimistic amongst you, enhanced through man-machine cooperation. I can see a whole new future of arts and science, akin to how I think Wolfram Physics (which I roughly paraphrase as translating physical laws and causality to graph-computation on fundamentally discrete space-time reality) represents a step change in how the physical universe is studied and modelled. The entirety of this ideation is still sketchy at this time, even to me. But I’ll try to give a glimpse of what we could be seeing over our very lifetime.
For the longest time, humanity has progressed artistically and scientifically through a process of abstraction. I won’t justify the statement here, but let’s push on. In particular, for science and engineering pursuits, we scientists and engineers try to abstract physical realities onto a set of mathematical descriptions, apply the mathematical analytics to obtain some insights therefrom, and map the mathematical/abstract solutions back onto the real systems we wish to understand, improve, and/or optimise. In the past century or so, social sciences also adopted this scientific paradigm, turning financial-economic, socio-political, psycho-anthropological entities and processes into mathematically quantifiable objects, subject to similarly rigorous logical/mathematical analysis, with varying degrees of success.
In a sense, mathematics has been the “go to” computational substrate for all of life’s pressing questions.
But with GPT in particular, and LLMs in general, there is the possibility that digitally-captured, neurocomputationally-embedded representations of real-world entities and processes will serve as the computational substrate, essentially one less level of abstraction than what is required for working within the mathematical realm. With sufficient safeguards in terms of alignment and calibration (big words for saying that we don’t want AI to hallucinate), it is hoped that we can actually “do science” on the so-called digital twins and get accurate, useful insights vis-à-vis real-world phenomena.
That, not getting ChatGPT to do your essay homework, is the genuine promise of this technology. That, not just the man-machine conflict in the job markets, is the future we can look forward to.
[1] Vaswani et al. (2017), “Attention Is All You Need”, [https://paperswithcode.com/paper/attention-is-all-you-need].
ALSO: Read the previous blogpost titled Before we get to Design Thinking, let’s think of what we mean by Design