GPT Model: How Does it Work?

By Dmitrii Eliuseev, Feb 2024


Let’s look together under the hood with Python and PyTorch

Image by Hal Gatewood, Unsplash

During the last few years, the buzz around AI has been enormous, and the main trigger was obviously the advent of GPT-based large language models. Interestingly, the approach itself is not new: LSTM (long short-term memory) neural networks were created in 1997, and the famous paper "Attention Is All You Need" was published in 2017; both were cornerstones of modern natural language processing. But only in 2020 did the results of GPT-3 become good enough not only for academic papers but also for the real world.

Nowadays, everyone can chat with GPT in a web browser, but probably less than 1% of users actually know how it works. The model's smart and witty answers may lead people to think they are talking with an intelligent being, but is that really so? Well, the best way to find out is to see how it works. In this article, we will take a real GPT model from OpenAI, run it locally, and see step by step what is going on under the hood.

This article is intended for beginners and people who are interested in programming and data science. I will illustrate my steps with Python, but a deep understanding of Python is not required.

Let’s get into it!

Loading The Model

For our test, I will be using the GPT-2 "Large" model, released by OpenAI in 2019. This model was state-of-the-art at the time; nowadays it no longer has any commercial value, and it can be downloaded for free from HuggingFace. Even more importantly for us, the GPT-2 model has the same architecture as the newer ones (though the number of parameters is obviously different):

  • The GPT-2 “large” model has 0.7B parameters (GPT-3 has 175B, and GPT-4, according to web rumors, has 1.7T parameters).
  • GPT-2 has a stack of 36 layers with 20 attention heads (GPT-3 has 96, and GPT-4, according to rumors, has 120 layers).
  • GPT-2 has a 1024-token context length (GPT-3 has 2048, and GPT-4 has a 128K context length).
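As a quick sanity check, the numbers above can be verified directly from the downloaded checkpoint. Here is a minimal sketch using the HuggingFace `transformers` library, where `gpt2-large` is the published model ID for this checkpoint (the first run downloads roughly 3 GB of weights):

```python
from transformers import GPT2LMHeadModel

# Download the GPT-2 "Large" checkpoint from the HuggingFace Hub
model = GPT2LMHeadModel.from_pretrained("gpt2-large")

config = model.config
print(config.n_layer)      # transformer blocks: 36
print(config.n_head)       # attention heads per block: 20
print(config.n_positions)  # context length in tokens: 1024

# Count all trainable parameters (~0.77B for the "Large" model)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")
```

The same `from_pretrained` call works for the smaller `gpt2` and `gpt2-medium` checkpoints, which only differ in these configuration values.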

Naturally, the GPT-3 and GPT-4 models provide better results on all benchmarks compared to GPT-2. But first, they are not available for download (and…
