How Large Language Models Work
Large Language Models (LLMs) are the foundation of modern generative AI. From ChatGPT to Claude to open-source models like LLaMA, these systems have fundamentally changed how we interact with computers. In this lesson we will peel back the layers of how LLMs are built, trained, and used, then explore two powerful application patterns: Retrieval-Augmented Generation (RAG) and AI Agents.
What Is an LLM?
The Three-Stage Training Pipeline
Building a modern LLM is a multi-stage process, with each stage shaping the model's capabilities in a different way.
Stage 1: Pre-Training
The model is trained on trillions of tokens scraped from the internet, books, code repositories, and curated datasets. The objective is simple: next-token prediction. Given a sequence of tokens, predict the most likely next token.
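Concretely, the model assigns a score (logit) to every token in its vocabulary, and a softmax turns those scores into probabilities. A toy sketch with an invented four-token vocabulary:

```python
import math

# Hypothetical logits the model might assign to candidate next tokens
# after the prefix "The cat sat on the"
logits = {"mat": 4.0, "sofa": 2.5, "moon": 0.5, "the": -1.0}

# Softmax turns logits into a probability distribution
exp_vals = {tok: math.exp(v) for tok, v in logits.items()}
total = sum(exp_vals.values())
probs = {tok: v / total for tok, v in exp_vals.items()}

# Greedy decoding picks the most likely next token
next_token = max(probs, key=probs.get)
print(next_token)  # prints "mat"
```

A real vocabulary has tens of thousands of tokens, but the mechanics are the same at every generation step.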
Stage 2: Supervised Fine-Tuning (SFT)
The base model is further trained on instruction-response pairs: curated examples of a user asking a question and a high-quality answer. This teaches the model to follow instructions rather than simply complete text.
Instruction: "Explain photosynthesis in simple terms."
Response: "Photosynthesis is how plants turn sunlight, water,
and CO2 into food (glucose) and oxygen..."
Stage 3: RLHF (Reinforcement Learning from Human Feedback)
Human evaluators rank model responses from best to worst. A reward model learns these preferences, and the LLM is then optimized via reinforcement learning to produce responses that score highly.
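Under the commonly used Bradley-Terry formulation, the reward model is trained so that the probability one response is preferred over another depends on the difference of their reward scores. A minimal sketch (the scores below are invented):

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

# Illustrative reward scores for two responses to the same prompt
p = preference_probability(reward_chosen=2.1, reward_rejected=0.4)
print(round(p, 3))  # 0.846
```

The RL step then nudges the LLM toward responses the reward model scores highly, typically with a penalty that keeps it from drifting too far from the SFT model.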
Emergent Abilities
Tokenization
LLMs do not process raw text; they work with tokens. Modern models use subword tokenization algorithms (such as BPE, Byte Pair Encoding) that break text into frequent subword units.
| Text | Tokens |
|---|---|
| "Hello world" | ["Hello", " world"] |
| "unhappiness" | ["un", "happiness"] |
| "ChatGPT" | ["Chat", "G", "PT"] |
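A toy sketch of the BPE idea on a three-word corpus (real tokenizers operate on bytes and learn tens of thousands of merges): repeatedly find the most frequent adjacent pair and merge it into a new symbol.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, return the most common."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Start from individual characters and run three merge steps
corpus = [list("unhappy"), list("happiness"), list("happen")]
for _ in range(3):
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(corpus[0])  # ['u', 'n', 'happ', 'y']
```

After a few merges, frequent fragments like "happ" become single tokens, which is why common words tokenize into few pieces while rare strings split into many.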
Context Window
The context window is the maximum number of tokens the model can process in a single forward pass. Everything the model "sees" (system prompt, conversation history, user message, and its own response) must fit within this window.
| Model | Context Window |
|---|---|
| GPT-3.5 | 4,096 tokens |
| GPT-4 | 8,192 / 128K tokens |
| Claude 3 | 200K tokens |
| LLaMA 3.1 | 128K tokens |
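Because everything shares one window, applications must budget tokens for the response. A minimal sketch with made-up token counts (real counts come from the model's tokenizer):

```python
CONTEXT_WINDOW = 8_192  # e.g. GPT-4's smaller variant

# Illustrative token counts for each part of the request
system_prompt_tokens = 250
history_tokens = 3_400
user_message_tokens = 150

used = system_prompt_tokens + history_tokens + user_message_tokens
max_response_tokens = CONTEXT_WINDOW - used
print(max_response_tokens)  # 4392
```

When `used` approaches the window size, applications typically truncate or summarize the oldest conversation history to make room for the response.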
Sampling and Generation
When generating text, the model outputs a probability distribution over all tokens at each step. Sampling strategies control which token is chosen:
Temperature Rules of Thumb
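The rules of thumb follow from what temperature does mathematically: logits are divided by the temperature before the softmax, so low values sharpen the distribution toward the top token and high values flatten it. A minimal sketch with invented logits:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Scale logits by 1/temperature, softmax, then sample one token."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    max_l = max(scaled.values())  # subtract max for numerical stability
    exps = {tok: math.exp(v - max_l) for tok, v in scaled.items()}
    total = sum(exps.values())
    tokens = list(exps)
    weights = [exps[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

logits = {"mat": 4.0, "sofa": 2.5, "moon": 0.5}
rng = random.Random(0)

# Low temperature: almost always the top token
low = [sample_with_temperature(logits, 0.2, rng) for _ in range(100)]
# High temperature: noticeably more varied
high = [sample_with_temperature(logits, 2.0, rng) for _ in range(100)]
print(low.count("mat"), high.count("mat"))
```

In practice, low temperatures (around 0 to 0.3) suit factual or code tasks, while higher values trade accuracy for variety in creative tasks.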
Retrieval-Augmented Generation (RAG)
LLMs are limited by their training data; they can hallucinate facts or lack knowledge about your private documents. RAG solves this by retrieving relevant context from an external knowledge base and injecting it into the prompt.
The RAG Pipeline
1. Chunk your documents into passages
2. Embed each chunk into a vector using an embedding model
3. Store vectors in a vector database
4. At query time, embed the user's question
5. Retrieve the most similar chunks via cosine similarity
6. Augment the prompt with retrieved context
7. Generate the answer with the LLM
RAG for the Library of Congress

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# --- Step 1: Prepare documents ---
documents = [
    "The Library of Congress is the largest library in the world, "
    "with more than 170 million items in its collections.",
    "Founded in 1800, the Library of Congress is the oldest federal "
    "cultural institution in the United States.",
    "The Library of Congress classification system is used by most "
    "research and academic libraries in the US.",
    "Machine learning is a subset of artificial intelligence that "
    "enables systems to learn from data.",
]

# --- Step 2: Embed documents (normalized so dot product = cosine similarity) ---
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# --- Step 3: Query ---
query = "How many items does the Library of Congress have?"
query_embedding = model.encode([query], normalize_embeddings=True)

# --- Step 4: Retrieve via cosine similarity ---
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
top_indices = np.argsort(similarities)[::-1][:2]  # top 2

retrieved_context = "\n".join(documents[i] for i in top_indices)

# --- Step 5: Augment prompt ---
prompt = f"""Answer the question based on the context below.

Context:
{retrieved_context}

Question: {query}
Answer:"""

print(prompt)
# This prompt would be sent to an LLM for answer generation
```
AI Agents
An AI Agent goes beyond simple prompt-response interactions. It can plan, use tools, and remember previous steps to accomplish complex multi-step tasks.
The ReAct Pattern (Reason + Act)
The dominant agent architecture follows the ReAct loop:
1. Thought: The model reasons about what to do next
2. Action: The model selects and calls a tool (search, calculator, code execution, API call)
3. Observation: The tool returns a result
4. Repeat until the task is complete
Thought: I need to find today's weather in Tokyo.
Action: search("current weather Tokyo")
Observation: Tokyo: 22°C, partly cloudy, humidity 65%
Thought: I now have the answer.
Action: respond("It is currently 22Β°C and partly cloudy in Tokyo.")
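The trace above can be sketched as a toy loop. The "model" here is a hard-coded script and the search tool returns a canned result; both stand in for a real LLM call and a real tool:

```python
def fake_search(query: str) -> str:
    """Stand-in for a real web-search tool."""
    return "Tokyo: 22°C, partly cloudy, humidity 65%"

TOOLS = {"search": fake_search}

# Scripted (thought, action, argument) steps standing in for model output
SCRIPT = [
    ("I need to find today's weather in Tokyo.", "search", "current weather Tokyo"),
    ("I now have the answer.", "respond",
     "It is currently 22°C and partly cloudy in Tokyo."),
]

def react_loop(script, tools):
    for thought, action, argument in script:
        print(f"Thought: {thought}")
        if action == "respond":  # terminal action: return the final answer
            print(f"Action: respond({argument!r})")
            return argument
        observation = tools[action](argument)  # call the selected tool
        print(f"Action: {action}({argument!r})")
        print(f"Observation: {observation}")

answer = react_loop(SCRIPT, TOOLS)
```

In a real agent, each iteration sends the accumulated thoughts, actions, and observations back to the LLM, which decides the next step itself.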
Key Agent Components
| Component | Purpose |
|---|---|
| Planning | Breaking complex tasks into sub-tasks |
| Tool Use | Calling APIs, running code, searching the web |
| Memory | Short-term (conversation) and long-term (vector store) |
| Reflection | Self-evaluating output quality and retrying |