
How LLMs Work

LLM training pipeline, tokenization, sampling strategies, RAG introduction, and AI agents overview

~50 min

How Large Language Models Work

Large Language Models (LLMs) are the foundation of modern generative AI. From ChatGPT to Claude to open-source models like LLaMA, these systems have fundamentally changed how we interact with computers. In this lesson we will peel back the layers of how LLMs are built, trained, and used, then explore two powerful application patterns: Retrieval-Augmented Generation (RAG) and AI Agents.

What Is an LLM?

A Large Language Model is a neural network (typically a Transformer) trained on massive text corpora to predict the next token in a sequence. By learning statistical patterns across trillions of tokens, the model acquires broad language understanding, reasoning ability, and even world knowledge, all encoded in its billions of parameters.

The Three-Stage Training Pipeline

Building a modern LLM is a multi-stage process, with each stage shaping the model's capabilities in a different way.

Stage 1: Pre-Training

The model is trained on trillions of tokens scraped from the internet, books, code repositories, and curated datasets. The objective is simple: next-token prediction. Given a sequence of tokens, predict the most likely next token.
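The objective can be illustrated with a toy counting model (a bigram model over whitespace-separated tokens; the corpus below is made up for illustration). Real LLMs learn the same kind of conditional distribution, but with a neural network over subword tokens:

```python
from collections import Counter, defaultdict

# Toy corpus, "tokenized" by whitespace for simplicity
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each token follows each context token
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Return the most likely next token and its probability."""
    counts = following[token]
    best, n = counts.most_common(1)[0]
    return best, n / sum(counts.values())

print(predict_next("the"))  # ('cat', 0.5): "cat" follows "the" 2 of 4 times
```

A real model conditions on a long window of previous tokens rather than just one, and outputs a full probability distribution over a vocabulary of tens of thousands of subwords.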

  • Data scale: GPT-4-class models train on 10+ trillion tokens
  • Compute: Thousands of GPUs running for weeks or months
  • Result: A "base model" that can complete text fluently but does not follow instructions well

Stage 2: Supervised Fine-Tuning (SFT)

The base model is further trained on instruction-response pairs: curated examples of a user asking a question and a high-quality answer. This teaches the model to follow instructions rather than simply complete text.

    Instruction: "Explain photosynthesis in simple terms."
    Response:    "Photosynthesis is how plants turn sunlight, water,
                  and CO2 into food (glucose) and oxygen..."
    

Stage 3: RLHF (Reinforcement Learning from Human Feedback)

Human evaluators rank model responses from best to worst. A reward model learns these preferences, and the LLM is then optimized via reinforcement learning to produce responses that score highly.

  • PPO (Proximal Policy Optimization): The classic RL algorithm used in early ChatGPT training
  • DPO (Direct Preference Optimization): A newer, simpler approach that skips the reward model and directly optimizes on preference pairs
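The DPO objective can be computed directly for a single preference pair. The log-probabilities below are made-up numbers, purely to show the shape of the loss (the beta scaling and sigmoid follow the standard DPO formulation):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given total log-probs of the chosen
    and rejected responses under the policy (pi_*) and a frozen reference (ref_*)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log(sigmoid(margin))

# Policy identical to the reference: margin is 0, loss is -log(0.5)
base = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy now prefers the chosen response more strongly than the reference does
better = dpo_loss(-8.0, -13.0, -10.0, -12.0)
print(base, better)  # the loss decreases as the preference margin grows
```

Minimizing this loss pushes the policy to assign relatively more probability to the chosen response than to the rejected one, with no separate reward model needed.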
Emergent Abilities

As models scale up, surprising capabilities "emerge" that were not explicitly trained for, including multi-step reasoning, in-context learning, and even basic math. These emergent abilities are a major reason larger models feel qualitatively different from smaller ones.

Tokenization

LLMs do not process raw text; they work with tokens. Modern models use subword tokenization algorithms (like BPE, Byte Pair Encoding) that break text into frequent subword units.

Text             Tokens
"Hello world"    ["Hello", " world"]
"unhappiness"    ["un", "happiness"]
"ChatGPT"        ["Chat", "G", "PT"]

Why subword tokenization?
  • Handles any word (including novel ones) by composing known subwords
  • Keeps common words as single tokens for efficiency
  • Typical vocabulary size: 32,000 to 100,000 tokens
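Applying BPE at inference time amounts to greedily replaying the merges learned during training. A minimal sketch, with a hypothetical merge table chosen so the result matches the "unhappiness" example above (real tokenizers learn thousands of merges from corpus statistics):

```python
def bpe_tokenize(word, merges):
    """Split a word into characters, then apply learned merges in order."""
    tokens = list(word)
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]   # merge the adjacent pair
            else:
                i += 1
    return tokens

# Hypothetical merge rules, in the order they were "learned"
merges = [("u", "n"), ("h", "a"), ("ha", "p"), ("hap", "p"), ("happ", "i"),
          ("happi", "n"), ("happin", "e"), ("happine", "s"), ("happines", "s")]

print(bpe_tokenize("unhappiness", merges))  # ['un', 'happiness']
```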
Context Window

The context window is the maximum number of tokens the model can process in a single forward pass. Everything the model "sees" (system prompt, conversation history, user message, and its own response) must fit within this window.

Model        Context Window
GPT-3.5      4,096 tokens
GPT-4        8,192 / 128K tokens
Claude 3     200K tokens
LLaMA 3.1    128K tokens
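In practice you budget the window across prompt parts and the reply. A rough sketch using the common rule of thumb of roughly 4 characters per token for English text (an approximation only; exact counts require the model's own tokenizer):

```python
def estimate_tokens(text):
    """Very rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_context(system, history, user, context_window, reserve_for_reply=1000):
    """Check whether prompt parts plus a reserved reply budget fit the window."""
    used = sum(estimate_tokens(t) for t in (system, history, user))
    return used + reserve_for_reply <= context_window

system = "You are a helpful assistant."
history = "User: Hi\nAssistant: Hello! How can I help?"
user = "Summarize the three-stage LLM training pipeline."
print(fits_in_context(system, history, user, context_window=4096))  # True
```

When the budget is exceeded, applications typically truncate or summarize older conversation history rather than the latest user message.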

Sampling and Generation

When generating text, the model outputs a probability distribution over all tokens at each step. Sampling strategies control which token is chosen:

  • Temperature: Scales the logits before softmax. Lower (0.0–0.3) = more deterministic; higher (0.7–1.5) = more creative and diverse.
  • Top-p (Nucleus Sampling): Only considers the smallest set of tokens whose cumulative probability exceeds *p*. For example, top-p=0.9 considers tokens that together account for 90% of the probability mass.
  • Top-k: Only considers the *k* most probable tokens.
  • Greedy decoding: Always picks the highest-probability token (temperature=0).
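These strategies are simple to implement on top of the raw logits. A minimal pure-Python sketch (the logits are made-up values over a 4-token vocabulary):

```python
import math
import random

def sample_next(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample a token index with temperature scaling and nucleus (top-p) filtering."""
    if temperature == 0:                       # greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]   # numerically stable softmax
    probs = [e / sum(exps) for e in exps]
    # Keep the smallest set of tokens whose cumulative probability >= top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample from the renormalized nucleus
    r = rng.random() * sum(probs[i] for i in kept)
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

logits = [2.0, 1.0, 0.1, -1.0]
print(sample_next(logits, temperature=0))  # 0: greedy always picks the argmax
```

With a very small top_p, the nucleus collapses to the single most probable token, so sampling becomes effectively greedy; raising temperature flattens the distribution and spreads probability to lower-ranked tokens.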
Temperature Rules of Thumb

Use low temperature (0.0–0.3) for factual tasks like code generation, data extraction, and classification. Use higher temperature (0.7–1.0) for creative tasks like brainstorming, story writing, and poetry. Avoid temperature above 1.2 as output becomes incoherent.

Retrieval-Augmented Generation (RAG)

LLMs are limited by their training data: they can hallucinate facts or lack knowledge about your private documents. RAG solves this by retrieving relevant context from an external knowledge base and injecting it into the prompt.

The RAG Pipeline

1. Chunk your documents into passages
2. Embed each chunk into a vector using an embedding model
3. Store vectors in a vector database
4. At query time, embed the user's question
5. Retrieve the most similar chunks via cosine similarity
6. Augment the prompt with retrieved context
7. Generate the answer with the LLM

python
from sentence_transformers import SentenceTransformer
import numpy as np

# --- Step 1: Prepare documents ---
documents = [
    "The Library of Congress is the largest library in the world, "
    "with more than 170 million items in its collections.",
    "Founded in 1800, the Library of Congress is the oldest federal "
    "cultural institution in the United States.",
    "The Library of Congress classification system is used by most "
    "research and academic libraries in the US.",
    "Machine learning is a subset of artificial intelligence that "
    "enables systems to learn from data.",
]

# --- Step 2: Embed documents (unit-normalized, so dot product = cosine similarity) ---
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# --- Step 3: Query ---
query = "How many items does the Library of Congress have?"
query_embedding = model.encode([query], normalize_embeddings=True)

# --- Step 4: Retrieve via cosine similarity ---
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
top_indices = np.argsort(similarities)[::-1][:2]  # top 2 chunks

retrieved_context = "\n".join(documents[i] for i in top_indices)

# --- Step 5: Augment prompt ---
prompt = f"""Answer the question based on the context below.

Context:
{retrieved_context}

Question: {query}
Answer:"""

print(prompt)
# This prompt would be sent to an LLM for answer generation

RAG for the Library of Congress

The Library of Congress holds over 170 million items, far too much for any LLM context window. RAG lets you search this vast collection, retrieve the most relevant passages, and feed just those to the model. This pattern works for any large knowledge base: legal documents, medical records, codebases, or enterprise wikis.

AI Agents

An AI Agent goes beyond simple prompt-response interactions. It can plan, use tools, and remember previous steps to accomplish complex multi-step tasks.

The ReAct Pattern (Reason + Act)

The dominant agent architecture follows the ReAct loop:

1. Thought: The model reasons about what to do next
2. Action: The model selects and calls a tool (search, calculator, code execution, API call)
3. Observation: The tool returns a result
4. Repeat until the task is complete

     Thought: I need to find today's weather in Tokyo.
     Action: search("current weather Tokyo")
     Observation: Tokyo - 22°C, partly cloudy, humidity 65%
     Thought: I now have the answer.
     Action: respond("It is currently 22°C and partly cloudy in Tokyo.")
    
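The loop above can be sketched in code. Everything here is a stand-in: the "model" is scripted and the search tool returns a canned string, purely to show the thought -> action -> observation control flow:

```python
def search(query):
    """Stand-in for a real search API."""
    return "Tokyo: 22 degrees C, partly cloudy, humidity 65%"

TOOLS = {"search": search}

def scripted_model(observations):
    """Stand-in for the LLM: picks the next (action, argument) from what it has seen."""
    if not observations:
        return ("search", "current weather Tokyo")
    return ("respond", "It is currently 22 degrees C and partly cloudy in Tokyo.")

def react_loop(max_steps=5):
    observations = []
    for _ in range(max_steps):
        action, arg = scripted_model(observations)
        if action == "respond":              # terminal action: return the answer
            return arg
        observations.append(TOOLS[action](arg))  # run the tool, record the observation
    return "Gave up after max_steps."

print(react_loop())
```

In a real agent, scripted_model is an LLM call whose prompt contains the tool descriptions and the accumulated observations, and the loop's step cap guards against the model cycling forever.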

Key Agent Components

Component     Purpose
Planning      Breaking complex tasks into sub-tasks
Tool Use      Calling APIs, running code, searching the web
Memory        Short-term (conversation) and long-term (vector store)
Reflection    Self-evaluating output quality and retrying

Agents are the frontier of LLM applications, enabling autonomous coding assistants, research agents, and multi-step data analysis workflows.