
How LLMs Work

LLM training pipeline, tokenization, sampling strategies, RAG introduction, and AI agents overview

~50 min

How Large Language Models Work

Large Language Models (LLMs) are the foundation of modern generative AI. From ChatGPT to Claude to open-source models like LLaMA, these systems have fundamentally changed how we interact with computers. In this lesson we will peel back the layers of how LLMs are built, trained, and used, then explore two powerful application patterns: Retrieval-Augmented Generation (RAG) and AI Agents.

What Is an LLM?

A Large Language Model is a neural network (typically a Transformer) trained on massive text corpora to predict the next token in a sequence. By learning statistical patterns across trillions of tokens, the model acquires broad language understanding, reasoning ability, and even world knowledge, all encoded in its billions of parameters.

The Three-Stage Training Pipeline

Building a modern LLM is a multi-stage process, with each stage shaping the model's capabilities in a different way.

Stage 1: Pre-Training

The model is trained on trillions of tokens scraped from the internet, books, code repositories, and curated datasets. The objective is simple: next-token prediction. Given a sequence of tokens, predict the most likely next token.
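The objective can be illustrated with a toy counting model (a bigram model over whitespace-separated tokens; the corpus below is made up for illustration). Real LLMs learn the same kind of conditional distribution, but with a neural network over subword tokens:

```python
from collections import Counter, defaultdict

# Toy corpus, "tokenized" by whitespace for simplicity
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each token follows each context token
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Return the most likely next token and its probability."""
    counts = following[token]
    best, n = counts.most_common(1)[0]
    return best, n / sum(counts.values())

print(predict_next("the"))  # ('cat', 0.5): "cat" follows "the" 2 of 4 times
```

A real model conditions on a long window of previous tokens rather than just one, and outputs a full probability distribution over a vocabulary of tens of thousands of subwords.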

  • Data scale: GPT-4-class models train on 10+ trillion tokens
  • Compute: Thousands of GPUs running for weeks or months
  • Result: A "base model" that can complete text fluently but does not follow instructions well

Stage 2: Supervised Fine-Tuning (SFT)

The base model is further trained on instruction-response pairs: curated examples of a user asking a question and a high-quality answer. This teaches the model to follow instructions rather than simply complete text.

    Instruction: "Explain photosynthesis in simple terms."
    Response:    "Photosynthesis is how plants turn sunlight, water,
                  and CO2 into food (glucose) and oxygen..."
    

Stage 3: RLHF (Reinforcement Learning from Human Feedback)

Human evaluators rank model responses from best to worst. A reward model learns these preferences, and the LLM is then optimized via reinforcement learning to produce responses that score highly.

  • PPO (Proximal Policy Optimization): The classic RL algorithm used in early ChatGPT training
  • DPO (Direct Preference Optimization): A newer, simpler approach that skips the reward model and directly optimizes on preference pairs
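The DPO objective can be computed directly for a single preference pair. The log-probabilities below are made-up numbers, purely to show the shape of the loss (the beta scaling and sigmoid follow the standard DPO formulation):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given total log-probs of the chosen
    and rejected responses under the policy (pi_*) and a frozen reference (ref_*)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log(sigmoid(margin))

# Policy identical to the reference: margin is 0, loss is -log(0.5)
base = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy now prefers the chosen response more strongly than the reference does
better = dpo_loss(-8.0, -13.0, -10.0, -12.0)
print(base, better)  # the loss decreases as the preference margin grows
```

Minimizing this loss pushes the policy to assign relatively more probability to the chosen response than to the rejected one, with no separate reward model needed.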
Emergent Abilities

As models scale up, surprising capabilities "emerge" that were not explicitly trained for, including multi-step reasoning, in-context learning, and even basic math. These emergent abilities are a major reason larger models feel qualitatively different from smaller ones.

Tokenization

LLMs do not process raw text; they work with tokens. Modern models use subword tokenization algorithms (like BPE, Byte Pair Encoding) that break text into frequent subword units.

Text             Tokens
"Hello world"    ["Hello", " world"]
"unhappiness"    ["un", "happiness"]
"ChatGPT"        ["Chat", "G", "PT"]

Why subword tokenization?
  • Handles any word (including novel ones) by composing known subwords
  • Keeps common words as single tokens for efficiency
  • Typical vocabulary size: 32,000 to 100,000 tokens
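Applying BPE at inference time amounts to greedily replaying the merges learned during training. A minimal sketch, with a hypothetical merge table chosen so the result matches the "unhappiness" example above (real tokenizers learn thousands of merges from corpus statistics):

```python
def bpe_tokenize(word, merges):
    """Split a word into characters, then apply learned merges in order."""
    tokens = list(word)
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]   # merge the adjacent pair
            else:
                i += 1
    return tokens

# Hypothetical merge rules, in the order they were "learned"
merges = [("u", "n"), ("h", "a"), ("ha", "p"), ("hap", "p"), ("happ", "i"),
          ("happi", "n"), ("happin", "e"), ("happine", "s"), ("happines", "s")]

print(bpe_tokenize("unhappiness", merges))  # ['un', 'happiness']
```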
Context Window

The context window is the maximum number of tokens the model can process in a single forward pass. Everything the model "sees" (system prompt, conversation history, user message, and its own response) must fit within this window.

Model        Context Window
GPT-3.5      4,096 tokens
GPT-4        8,192 / 128K tokens
Claude 3     200K tokens
LLaMA 3.1    128K tokens
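In practice you budget the window across prompt parts and the reply. A rough sketch using the common rule of thumb of roughly 4 characters per token for English text (an approximation only; exact counts require the model's own tokenizer):

```python
def estimate_tokens(text):
    """Very rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_context(system, history, user, context_window, reserve_for_reply=1000):
    """Check whether prompt parts plus a reserved reply budget fit the window."""
    used = sum(estimate_tokens(t) for t in (system, history, user))
    return used + reserve_for_reply <= context_window

system = "You are a helpful assistant."
history = "User: Hi\nAssistant: Hello! How can I help?"
user = "Summarize the three-stage LLM training pipeline."
print(fits_in_context(system, history, user, context_window=4096))  # True
```

When the budget is exceeded, applications typically truncate or summarize older conversation history rather than the latest user message.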

Sampling and Generation

When generating text, the model outputs a probability distribution over all tokens at each step. Sampling strategies control which token is chosen:

  • Temperature: Scales the logits before softmax. Lower (0.0–0.3) = more deterministic; higher (0.7–1.5) = more creative and diverse.
  • Top-p (Nucleus Sampling): Only considers the smallest set of tokens whose cumulative probability exceeds *p*. For example, top-p=0.9 considers tokens that together account for 90% of the probability mass.
  • Top-k: Only considers the *k* most probable tokens.
  • Greedy decoding: Always picks the highest-probability token (temperature=0).
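These strategies are simple to implement on top of the raw logits. A minimal pure-Python sketch (the logits are made-up values over a 4-token vocabulary):

```python
import math
import random

def sample_next(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample a token index with temperature scaling and nucleus (top-p) filtering."""
    if temperature == 0:                       # greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]   # numerically stable softmax
    probs = [e / sum(exps) for e in exps]
    # Keep the smallest set of tokens whose cumulative probability >= top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample from the renormalized nucleus
    r = rng.random() * sum(probs[i] for i in kept)
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

logits = [2.0, 1.0, 0.1, -1.0]
print(sample_next(logits, temperature=0))  # 0: greedy always picks the argmax
```

With a very small top_p, the nucleus collapses to the single most probable token, so sampling becomes effectively greedy; raising temperature flattens the distribution and spreads probability to lower-ranked tokens.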
Temperature Rules of Thumb

Use low temperature (0.0–0.3) for factual tasks like code generation, data extraction, and classification. Use higher temperature (0.7–1.0) for creative tasks like brainstorming, story writing, and poetry. Avoid temperature above 1.2 as output becomes incoherent.

Retrieval-Augmented Generation (RAG)

LLMs are limited by their training data: they can hallucinate facts or lack knowledge about your private documents. RAG solves this by retrieving relevant context from an external knowledge base and injecting it into the prompt.

The RAG Pipeline

1. Chunk your documents into passages
2. Embed each chunk into a vector using an embedding model
3. Store vectors in a vector database
4. At query time, embed the user's question
5. Retrieve the most similar chunks via cosine similarity
6. Augment the prompt with retrieved context
7. Generate the answer with the LLM

python
from sentence_transformers import SentenceTransformer
import numpy as np

# --- Step 1: Prepare documents ---
documents = [
    "The Library of Congress is the largest library in the world, "
    "with more than 170 million items in its collections.",
    "Founded in 1800, the Library of Congress is the oldest federal "
    "cultural institution in the United States.",
    "The Library of Congress classification system is used by most "
    "research and academic libraries in the US.",
    "Machine learning is a subset of artificial intelligence that "
    "enables systems to learn from data.",
]

# --- Step 2: Embed documents (unit-normalized, so dot product = cosine similarity) ---
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# --- Step 3: Query ---
query = "How many items does the Library of Congress have?"
query_embedding = model.encode([query], normalize_embeddings=True)

# --- Step 4: Retrieve via cosine similarity ---
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
top_indices = np.argsort(similarities)[::-1][:2]  # top 2 chunks

retrieved_context = "\n".join(documents[i] for i in top_indices)

# --- Step 5: Augment prompt ---
prompt = f"""Answer the question based on the context below.

Context:
{retrieved_context}

Question: {query}
Answer:"""

print(prompt)
# This prompt would be sent to an LLM for answer generation

RAG for the Library of Congress

The Library of Congress holds over 170 million items, far too much for any LLM context window. RAG lets you search this vast collection, retrieve the most relevant passages, and feed just those to the model. This pattern works for any large knowledge base: legal documents, medical records, codebases, or enterprise wikis.

AI Agents

An AI Agent goes beyond simple prompt-response interactions. It can plan, use tools, and remember previous steps to accomplish complex multi-step tasks.

The ReAct Pattern (Reason + Act)

The dominant agent architecture follows the ReAct loop:

1. Thought: The model reasons about what to do next
2. Action: The model selects and calls a tool (search, calculator, code execution, API call)
3. Observation: The tool returns a result
4. Repeat until the task is complete

     Thought: I need to find today's weather in Tokyo.
     Action: search("current weather Tokyo")
     Observation: Tokyo - 22°C, partly cloudy, humidity 65%
     Thought: I now have the answer.
     Action: respond("It is currently 22°C and partly cloudy in Tokyo.")
    
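The loop above can be sketched in code. Everything here is a stand-in: the "model" is scripted and the search tool returns a canned string, purely to show the thought -> action -> observation control flow:

```python
def search(query):
    """Stand-in for a real search API."""
    return "Tokyo: 22 degrees C, partly cloudy, humidity 65%"

TOOLS = {"search": search}

def scripted_model(observations):
    """Stand-in for the LLM: picks the next (action, argument) from what it has seen."""
    if not observations:
        return ("search", "current weather Tokyo")
    return ("respond", "It is currently 22 degrees C and partly cloudy in Tokyo.")

def react_loop(max_steps=5):
    observations = []
    for _ in range(max_steps):
        action, arg = scripted_model(observations)
        if action == "respond":              # terminal action: return the answer
            return arg
        observations.append(TOOLS[action](arg))  # run the tool, record the observation
    return "Gave up after max_steps."

print(react_loop())
```

In a real agent, scripted_model is an LLM call whose prompt contains the tool descriptions and the accumulated observations, and the loop's step cap guards against the model cycling forever.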

Key Agent Components

Component     Purpose
Planning      Breaking complex tasks into sub-tasks
Tool Use      Calling APIs, running code, searching the web
Memory        Short-term (conversation) and long-term (vector store)
Reflection    Self-evaluating output quality and retrying

Agents are the frontier of LLM applications, enabling autonomous coding assistants, research agents, and multi-step data analysis workflows.