You have probably used ChatGPT, Claude, or Gemini at some point. You type a question, and a few seconds later you get back a thoughtful, well-written response. It feels almost like talking to a person. But what is actually happening under the hood?
Large Language Models, or LLMs, are not magic. They are the result of decades of research, massive datasets, and some very clever mathematics. This guide breaks down exactly how they work, starting from the simplest ideas and building up to the technical machinery that powers them. No PhD required.
What Is a Large Language Model?
An LLM is a type of artificial intelligence trained to understand and generate human language. The word “large” refers to the scale of the model, meaning the number of parameters it has learned. Modern LLMs have billions or even trillions of parameters. Think of parameters as tiny dials inside the model, each one tuned during training to help the model make better predictions.
The core job of an LLM is surprisingly simple: given a sequence of words, predict what word comes next. That is it. Everything else, answering questions, writing code, summarizing documents, translating languages, emerges from doing that one thing extremely well at enormous scale.
Step One: Turning Words Into Numbers
Computers do not understand words. They understand numbers. So the first challenge is converting text into something a computer can process.
This is done through a process called tokenization. A tokenizer breaks text into small chunks called tokens. A token might be a full word, part of a word, or a single character, depending on the model.
For example, the word “unhappiness” might be split into three tokens: un, happi, and ness. The sentence “I love coding” might become three tokens: I, love, coding.
Each token is then converted into a number. That number acts as an ID for looking up a mathematical object called an embedding.
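To make this concrete, here is a minimal sketch of tokenization in Python. The vocabulary, the IDs, and the `tokenize` helper are all invented for illustration, and the greedy longest-match rule is a simplification. Real tokenizers learn their vocabulary from data and may split the same words differently.

```python
# Toy vocabulary mapping token text to token ID. Pieces and IDs are invented.
VOCAB = {"un": 0, "happi": 1, "ness": 2, "I": 3, "love": 4, "coding": 5}

def tokenize(text):
    """Split on whitespace, then greedily match the longest known piece."""
    ids = []
    for word in text.split():
        pos = 0
        while pos < len(word):
            # Try the longest remaining piece first, then shrink it.
            for end in range(len(word), pos, -1):
                piece = word[pos:end]
                if piece in VOCAB:
                    ids.append(VOCAB[piece])
                    pos = end
                    break
            else:
                raise ValueError(f"no token covers {word[pos:]!r}")
    return ids

print(tokenize("unhappiness"))    # [0, 1, 2]
print(tokenize("I love coding"))  # [3, 4, 5]
```

The IDs that come out are just positions in a lookup table. They carry no meaning on their own until they are mapped to embeddings.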
What Is an Embedding?
An embedding is a list of numbers, called a vector, that represents the meaning of a token in a multi-dimensional space. Words with similar meanings end up with similar vectors.
For example, the vectors for “king” and “queen” would be mathematically close to each other. The vector for “dog” would be far from the vector for “planet.” This is how the model captures meaning without understanding language the way humans do.
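One common way to measure how close two vectors are is cosine similarity. The sketch below uses invented three-dimensional vectors purely for illustration. Real embeddings are learned during training and usually have hundreds or thousands of dimensions.

```python
import math

# Invented vectors for illustration only; real embeddings are learned.
EMBEDDINGS = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.85, 0.90, 0.15],
    "dog":    [0.10, 0.20, 0.90],
    "planet": [0.70, -0.60, -0.20],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Close to 1: the two vectors point in nearly the same direction.
print(cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"]))
# Negative: the two vectors point in largely different directions.
print(cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["planet"]))
```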
Step Two: The Transformer Architecture
The real engine behind modern LLMs is a neural network architecture called the Transformer. It was introduced in a 2017 research paper titled “Attention Is All You Need,” and it changed everything in AI.
Before Transformers, most language models were recurrent networks that read text one token at a time, from left to right. This made it hard for them to connect ideas that appeared far apart in a sentence, and it made training slow because the steps could not run in parallel. Transformers solved this by processing all tokens at the same time and using a mechanism called attention to figure out which words are most relevant to each other.
Understanding Attention
Attention lets the model ask: for each word in this sentence, which other words should I pay the most attention to when predicting the next word?
Take this sentence: “The trophy did not fit in the suitcase because it was too big.”
What does “it” refer to? The trophy or the suitcase? A human knows it is the trophy. An attention mechanism lets the model figure this out by looking at the relationships between all the words in the sentence at once.
Every token computes three things: a Query, a Key, and a Value. Without getting too deep into the math, the Query asks “what am I looking for?”, the Key says “here is what I contain,” and the Value says “here is what I will contribute if I am selected.” The model uses these to calculate a score that determines how much attention each token should pay to every other token.
This happens in parallel across many attention heads at the same time, which is why it is called multi-head attention. Each head learns to focus on different types of relationships: one might track grammar, another might track factual references, and another might track tone.
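For readers who want to see the scoring step, here is a minimal single-head sketch of scaled dot-product attention, the formulation used in the original Transformer paper. The query, key, and value vectors below are invented. In a real model they come from learned projections of the token embeddings, and GPT-style models also mask out future tokens, which this sketch leaves out.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is a weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Invented 2-dimensional queries, keys, and values for three tokens.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(queries, keys, values)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how one token spreads its attention across all three tokens. Multi-head attention runs several copies of this computation side by side, each with its own projections.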
Layers of Transformation
A Transformer model stacks many of these attention blocks on top of each other. GPT-3, for example, has 96 layers. Each layer refines the representation of the input a little more. By the time the text passes through all the layers, the model has built up a rich, context-aware representation of what it is reading.
Inside each Transformer block, the attention step is followed by a feedforward neural network that is applied to every token separately. This network applies learned transformations that help the model combine and refine information before passing it to the next layer.
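As a rough sketch, the feedforward step for a single token can be written as two linear maps with a non-linearity in between. The ReLU activation matches the original Transformer paper, but the tiny sizes and the weights are invented. They are hand-picked so that the output equals the input, which makes the result easy to check by hand. A trained model learns its own weights.

```python
def relu(x):
    return max(0.0, x)

def linear(vector, weights, bias):
    """Multiply a vector by a weight matrix (one row per output) and add a bias."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def feed_forward(token_vector, w1, b1, w2, b2):
    """Expand to a wider hidden layer, apply ReLU, then project back down."""
    hidden = [relu(h) for h in linear(token_vector, w1, b1)]
    return linear(hidden, w2, b2)

# Invented weights: 2 inputs -> 4 hidden units -> 2 outputs.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
B2 = [0.0, 0.0]

print(feed_forward([0.5, -2.0], W1, B1, W2, B2))  # [0.5, -2.0]
```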
Step Three: Training on Massive Data
Architecture alone does not make an LLM smart. What makes it powerful is training.
LLMs are trained on enormous amounts of text: books, websites, academic papers, code repositories, forums, news articles, and much more. For GPT-3, the filtered web-crawl portion of the training data alone was roughly 570 gigabytes of text. That is hundreds of billions of words.
During training, the model is shown a piece of text with the next token hidden. Its job is to predict that hidden token. The model makes a guess, measures how wrong it was, and then adjusts its parameters slightly to do better next time. The size and direction of each adjustment are computed with an algorithm called backpropagation and applied with gradient descent, and this cycle repeats billions of times across the entire dataset.
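The guess, check, adjust cycle can be sketched in a few lines. This is a deliberately tiny stand-in: the "model" is just three adjustable scores, one per entry in an invented three-token vocabulary, and the learning rate is an arbitrary choice. A real LLM has billions of parameters and needs backpropagation to work out the adjustment for each one.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(logits, target, learning_rate=0.5):
    """One gradient-descent update. Returns (new_logits, loss before the update)."""
    probs = softmax(logits)              # the model's guess
    loss = -math.log(probs[target])      # how wrong the guess was
    # For softmax with cross-entropy, the gradient for each score is
    # its probability, minus 1 for the correct token.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    new_logits = [x - learning_rate * g for x, g in zip(logits, grads)]
    return new_logits, loss

logits = [0.0, 0.0, 0.0]  # start with no preference between the three tokens
target = 2                # index of the token that actually came next
losses = []
for step in range(20):
    logits, loss = train_step(logits, target)
    losses.append(loss)

print(round(losses[0], 3), round(losses[-1], 3))
```

The loss printed for the last step is lower than the loss for the first one, because each update nudges the scores toward the correct token.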
After enough iterations, the model learns an enormous amount about language, facts, reasoning patterns, and even coding conventions, not because someone programmed these things in, but because they emerged naturally from predicting the next word across enough examples.
The Loss Function
The measure of how wrong the model is at each step is called the loss. Specifically, LLMs use a type of loss called cross-entropy loss, which measures the difference between the model’s predicted probability distribution over the vocabulary and the actual correct token. The goal of training is to minimize this loss over the entire dataset.
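When there is a single correct token, cross-entropy loss reduces to the negative log of the probability the model gave that token. The sketch below uses invented probabilities over a three-token vocabulary to show how the loss behaves.

```python
import math

def cross_entropy(predicted_probs, correct_index):
    """Negative log of the probability assigned to the correct token."""
    return -math.log(predicted_probs[correct_index])

correct = 1  # index of the token that actually came next

# Invented predictions over a three-token vocabulary.
confident_right = [0.05, 0.90, 0.05]
unsure = [0.40, 0.30, 0.30]
confident_wrong = [0.90, 0.05, 0.05]

for name, probs in [("confident and right", confident_right),
                    ("unsure", unsure),
                    ("confident and wrong", confident_wrong)]:
    print(name, round(cross_entropy(probs, correct), 3))
```

The loss is small when the model puts most of its probability on the correct token and grows quickly when the model is confidently wrong.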
Step Four: Fine-Tuning and RLHF
A raw LLM trained only on next-token prediction can be strange to talk to. It might continue a question with another question instead of answering it. It might generate harmful content without realizing it. It needs to be shaped into a helpful assistant.
This is done through a process called fine-tuning, most notably with a technique called Reinforcement Learning from Human Feedback, or RLHF.
Here is how it works:
- Human trainers write examples of ideal conversations: a user asks something, and an ideal assistant responds well.
- The model is fine-tuned on these examples to start behaving more like a helpful assistant.
- Human raters then rank different model responses from best to worst.
- A separate model called a reward model is trained to predict which responses humans prefer.
- The LLM is then trained using reinforcement learning to generate responses that score highly on the reward model.
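Together, these steps fall into three stages: supervised fine-tuning on example conversations, training a reward model from human rankings, and reinforcement learning against that reward model. This pipeline is what transforms a raw text predictor into something like Claude or ChatGPT. It teaches the model to be helpful, honest, and to avoid harmful outputs.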
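To give a feel for how a reward model can learn from rankings, here is a sketch of a pairwise preference loss of the kind commonly used for this purpose. The reward scores are invented numbers, and `preference_loss` is an illustrative helper, not the exact training code of any particular system.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in an equivalent form."""
    diff = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-diff))

# Invented reward scores for a response humans preferred and one they rejected.
print(round(preference_loss(2.0, 0.0), 3))  # reward model agrees with the humans
print(round(preference_loss(0.0, 2.0), 3))  # reward model disagrees
```

The loss is lower when the reward model scores the human-preferred response higher, so minimizing it pushes the reward model toward agreeing with the human rankings.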
Step Five: Generating a Response
When you type a message and hit send, here is what happens:
- Your text is tokenized into a sequence of token IDs.
- Each token is converted into an embedding vector.
- The embeddings pass through all the Transformer layers, one by one.
- At the final layer, the model produces a probability distribution over its entire vocabulary (which might be 50,000 or more tokens).
- The model samples from this distribution to pick the next token.
- That token is added to the sequence, and the whole process repeats.
- This continues until the model generates a special end-of-sequence token, or until it hits a maximum length limit.
This is why LLMs generate text one token at a time. They are not looking up pre-written answers. They are constructing each word on the fly based on everything that came before it.
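The loop described above can be sketched with a toy stand-in for the model. Everything here is invented for illustration: the lookup table plays the role of the Transformer, and it only looks at the last token, whereas a real LLM conditions on the whole sequence.

```python
import random

# Invented table standing in for the model: last token -> next-token probabilities.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "dog": {"sat": 0.5, "<eos>": 0.5},
    "sat": {"<eos>": 1.0},
}

def generate(prompt, max_new_tokens=10, seed=0):
    """Sample one token at a time until <eos> or the length limit."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        candidates = list(dist.keys())
        weights = list(dist.values())
        # Sample the next token according to its probability.
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

print(generate(["<start>"]))
```

Changing the `seed` argument can produce a different sentence from the same prompt, because each token is sampled rather than looked up.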
Temperature and Sampling
You may have seen settings like “temperature” in AI tools. Temperature controls how random the model’s choices are. A low temperature makes the model pick the highest-probability token almost every time, producing more predictable but sometimes repetitive outputs. A high temperature makes the model sample more randomly, producing more creative but occasionally less coherent responses.
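A common way to apply temperature is to divide the model's raw scores (logits) by the temperature before turning them into probabilities. The sketch below assumes that approach and uses invented logits for a three-token vocabulary. The temperature must be greater than zero here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale the logits by 1/temperature, then convert them to probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At the low temperature almost all of the probability lands on the top token. At the high temperature the three probabilities move closer together, so sampling picks the less likely tokens more often.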
Why LLMs Can Still Get Things Wrong
LLMs are impressive, but they have real limitations worth understanding.
They do not have access to real-time information unless given a tool to search the web. They can confidently state incorrect facts, a behavior often called hallucination. This happens because the model is optimizing for plausible-sounding text, not verified truth.
They also have a context window, a maximum number of tokens they can process at once. Text beyond this limit cannot be fed to the model, so in a long conversation the earliest parts typically get cut off and the model loses track of them.
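One simple way an application might deal with the limit is to keep only the most recent tokens. The helper below is a hypothetical sketch of that strategy, not how any particular product works. Other approaches, such as summarizing older messages, are also possible.

```python
def fit_to_context(token_ids, max_tokens):
    """Keep only the most recent max_tokens token IDs."""
    if max_tokens <= 0:
        return []
    return token_ids[-max_tokens:]

conversation = list(range(10))          # stand-in for 10 token IDs
print(fit_to_context(conversation, 4))  # [6, 7, 8, 9]
```

Anything cut off this way is simply invisible to the model on the next turn, which is why details from early in a long conversation can be forgotten.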
These are active areas of research, and every new generation of LLMs pushes these limits further.
The Bigger Picture
LLMs are the result of three things coming together at the right time: the Transformer architecture, the availability of massive datasets, and the computing power to train on them.
What makes them so remarkable is that no one explicitly programmed them to know facts, to write code, or to tell jokes. These abilities emerged from doing one simple thing repeatedly at unimaginable scale: predicting the next word.
The technology is still evolving fast. But understanding how it works puts you in a much better position to use it well, build on top of it, and think critically about its limitations.
If you want to go deeper, the original “Attention Is All You Need” paper is freely available online. It is surprisingly readable and worth the time.
Written by Manish Bhurtel | CodeWithBhurtel
