
RL Foundations

Markov Decision Processes, value functions, and the exploration-exploitation trade-off


Reinforcement Learning Foundations

Reinforcement Learning (RL) is fundamentally different from supervised and unsupervised learning. Instead of learning from a fixed dataset, an agent learns by interacting with an environment, taking actions, and receiving rewards.

The RL Loop

Agent  --action-->  Environment
Agent  <--state--   Environment
Agent  <--reward--  Environment

At each time step *t*:

1. The agent observes state s_t
2. The agent chooses action a_t
3. The environment transitions to state s_{t+1}
4. The agent receives reward r_{t+1}

The goal: learn a policy (strategy) that maximizes the total cumulative reward over time.
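The loop above can be sketched directly in code. The environment below is a made-up one-dimensional walk, invented purely for illustration; real environments expose the same reset/step shape:

```python
import random

random.seed(0)  # for reproducibility

class RandomWalkEnv:
    """Toy environment: walk a line from position 0 to 5; reaching 5 ends the episode."""

    def reset(self):
        self.position = 0
        return self.position  # initial state s_0

    def step(self, action):
        # action: +1 (step right) or -1 (step left), clamped at 0
        self.position = max(0, self.position + action)
        done = self.position >= 5
        reward = 10 if done else -1  # -1 per step encourages finishing quickly
        return self.position, reward, done  # s_{t+1}, r_{t+1}, episode over?

env = RandomWalkEnv()
state = env.reset()
total_reward = 0
done = False
while not done:
    action = random.choice([-1, 1])        # a purely random policy, for now
    state, reward, done = env.step(action)
    total_reward += reward
print(f"Episode finished in state {state} with return {total_reward}")
```

A random policy eventually stumbles to the goal; the point of RL is to learn a policy that collects a higher return than this baseline.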

Markov Decision Process (MDP)

An MDP is defined by the tuple (S, A, P, R, gamma) where S is the set of states, A is the set of actions, P(s'|s,a) is the transition probability, R(s,a,s') is the reward function, and gamma is the discount factor. The Markov property means the future depends only on the current state, not the history.
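To make the tuple concrete, here is a hypothetical two-state MDP written out as plain dictionaries. The states, actions, probabilities, and rewards are all invented for illustration:

```python
states = ["sunny", "rainy"]
actions = ["walk", "drive"]
gamma = 0.9  # discount factor

# P[(s, a)] maps next state s' -> transition probability P(s'|s, a)
P = {
    ("sunny", "walk"):  {"sunny": 0.8, "rainy": 0.2},
    ("sunny", "drive"): {"sunny": 0.9, "rainy": 0.1},
    ("rainy", "walk"):  {"sunny": 0.3, "rainy": 0.7},
    ("rainy", "drive"): {"sunny": 0.5, "rainy": 0.5},
}

# R[(s, a)] is the expected immediate reward for taking action a in state s
R = {
    ("sunny", "walk"): 2.0, ("sunny", "drive"): 1.0,
    ("rainy", "walk"): -1.0, ("rainy", "drive"): 0.5,
}

# Sanity check: each transition distribution must sum to 1
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Note the Markov property in the data layout: P is keyed only by the current state and action, never by the history of earlier states.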

States, Actions, and Rewards

States describe the current situation. In a chess game, the state is the board configuration. In a self-driving car, it's the sensor readings.

Actions are what the agent can do. In chess, moving a piece. In a car, steering angle and acceleration.

Rewards are scalar feedback signals. Positive rewards encourage behavior, negative rewards discourage it. Designing good reward functions is one of the hardest parts of RL.

| Component | Chess Example | CartPole Example |
| --- | --- | --- |
| State | Board position | Cart position, pole angle, velocities |
| Action | Move a piece | Push left or right |
| Reward | +1 win, -1 lose, 0 otherwise | +1 for each step the pole stays up |

Value Functions

The state-value function V(s) tells us how good it is to be in a state: the expected discounted return starting from s:

V(s) = E[R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ... | S_t = s]

The action-value function Q(s, a) tells us how good it is to take a particular action in a state: the expected discounted return after taking a in s:

Q(s, a) = E[R_t + gamma * R_{t+1} + ... | S_t = s, A_t = a]
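Both definitions are expectations over discounted returns. As a sanity check, here is a hypothetical single-state example with two actions and known reward sequences, showing that acting greedily gives V(s) = max_a Q(s, a). The action names and numbers are invented for illustration:

```python
import numpy as np

gamma = 0.9

# Hypothetical reward sequences observed after each action in state s
rewards_after = {
    "a1": np.array([1.0, 1.0, 1.0, 1.0]),   # steady small rewards
    "a2": np.array([0.0, 0.0, 0.0, 10.0]),  # one delayed large reward
}

# Q(s, a) = sum_k gamma^k * r_k over the sequence that followed action a
discounts = gamma ** np.arange(4)
Q = {a: float(discounts @ r) for a, r in rewards_after.items()}

# Acting greedily, the value of the state is that of the best action
V = max(Q.values())  # V(s) = max_a Q(s, a)
print(Q, "V(s) =", V)
```

With gamma = 0.9, the delayed reward of 10 is worth 10 * 0.9^3 = 7.29, so "a2" beats the steady stream from "a1" (3.439), and V(s) = 7.29.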

The Bellman Equation

The Bellman equation expresses the relationship between the value of a state and the values of successor states:

V(s) = max_a [R(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s')]

This recursive relationship is the foundation of most RL algorithms. It says: the value of a state equals the best immediate reward plus the discounted value of where you end up.

```python
import numpy as np

# Simple GridWorld value iteration example
# Grid: 4x4, goal at (3, 3), reward = -1 per step, +10 on reaching the goal

def value_iteration(grid_size=4, gamma=0.9, theta=1e-6):
    """Compute optimal values using the Bellman equation."""
    V = np.zeros((grid_size, grid_size))
    actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up
    goal = (grid_size - 1, grid_size - 1)

    while True:
        delta = 0
        for i in range(grid_size):
            for j in range(grid_size):
                if (i, j) == goal:
                    continue  # The goal is terminal; its value stays 0

                v_old = V[i, j]
                values = []
                for di, dj in actions:
                    # Clamp moves at the grid edges
                    ni = max(0, min(i + di, grid_size - 1))
                    nj = max(0, min(j + dj, grid_size - 1))
                    reward = 10 if (ni, nj) == goal else -1
                    values.append(reward + gamma * V[ni, nj])

                V[i, j] = max(values)
                delta = max(delta, abs(v_old - V[i, j]))

        if delta < theta:
            break

    return V

V = value_iteration()
print("Optimal Value Function:")
print(np.round(V, 1))
```
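Once the values converge, the greedy policy can be read off with the same one-step lookahead used in the Bellman backup. The sketch below is only an illustration; it repeats value_iteration so the snippet runs on its own:

```python
import numpy as np

def value_iteration(grid_size=4, gamma=0.9, theta=1e-6):
    """Same algorithm as the listing above, repeated here so this snippet runs alone."""
    V = np.zeros((grid_size, grid_size))
    actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    goal = (grid_size - 1, grid_size - 1)
    while True:
        delta = 0
        for i in range(grid_size):
            for j in range(grid_size):
                if (i, j) == goal:
                    continue
                v_old = V[i, j]
                vals = []
                for di, dj in actions:
                    ni = max(0, min(i + di, grid_size - 1))
                    nj = max(0, min(j + dj, grid_size - 1))
                    reward = 10 if (ni, nj) == goal else -1
                    vals.append(reward + gamma * V[ni, nj])
                V[i, j] = max(vals)
                delta = max(delta, abs(v_old - V[i, j]))
        if delta < theta:
            return V

def greedy_policy(V, grid_size=4, gamma=0.9):
    """Read off the greedy action in each cell via a one-step lookahead."""
    actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    arrows = [">", "<", "v", "^"]  # printable symbols for each action
    goal = (grid_size - 1, grid_size - 1)
    policy = np.full((grid_size, grid_size), "G", dtype=object)  # G marks the goal
    for i in range(grid_size):
        for j in range(grid_size):
            if (i, j) == goal:
                continue
            best_val, best_arrow = -np.inf, "?"
            for (di, dj), arrow in zip(actions, arrows):
                ni = max(0, min(i + di, grid_size - 1))
                nj = max(0, min(j + dj, grid_size - 1))
                reward = 10 if (ni, nj) == goal else -1
                val = reward + gamma * V[ni, nj]
                if val > best_val:
                    best_val, best_arrow = val, arrow
            policy[i, j] = best_arrow
    return policy

print(greedy_policy(value_iteration()))
```

As expected, cells adjacent to the goal point straight at it, since their one-step lookahead sees the +10 reward directly.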

Discount Factor (gamma)

The discount factor gamma (between 0 and 1) controls how much the agent values future rewards vs immediate rewards:

  • gamma = 0: The agent is completely myopic, only caring about the next reward
  • gamma = 0.99: The agent is far-sighted, valuing future rewards almost as much as immediate ones
  • gamma = 1: No discounting (can be unstable in infinite-horizon problems)
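The effect of gamma is easiest to see on a fixed reward stream. Below, the same sequence of rewards is discounted at different rates; the numbers are illustrative:

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Ten small rewards, then one large delayed payoff
rewards = [1] * 10 + [100]

for gamma in (0.0, 0.5, 0.9, 0.99):
    print(f"gamma={gamma}: return = {discounted_return(rewards, gamma):.2f}")
```

With gamma = 0, only the first reward counts (return 1.0); as gamma approaches 1, the delayed payoff of 100 dominates the return. A myopic agent would never work toward it.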
Exploration vs Exploitation

This is the fundamental dilemma in RL:

  • Exploitation: Choose the action you currently believe is best
  • Exploration: Try new actions to discover potentially better rewards
If you only exploit, you might miss the best strategy. If you only explore, you never benefit from what you've learned.

Epsilon-Greedy Strategy

The simplest exploration strategy: with probability epsilon, take a random action (explore); with probability 1 - epsilon, take the greedy action (exploit). Start with a high epsilon (e.g., 1.0) and decay it over time so the agent explores more early on and exploits more as it learns.
```python
import numpy as np

class EpsilonGreedy:
    """Epsilon-greedy action selection with decay."""

    def __init__(self, n_actions, epsilon=1.0, epsilon_min=0.01, decay=0.995):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.decay = decay

    def select_action(self, q_values):
        """Select an action using the epsilon-greedy policy."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)  # Explore
        return int(np.argmax(q_values))  # Exploit

    def decay_epsilon(self):
        """Decay epsilon after each episode, down to epsilon_min."""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.decay)

# Demo
strategy = EpsilonGreedy(n_actions=4)
fake_q_values = np.array([1.2, 3.5, 0.8, 2.1])

for episode in range(5):
    action = strategy.select_action(fake_q_values)
    print(f"Episode {episode}: epsilon={strategy.epsilon:.3f}, action={action}")
    strategy.decay_epsilon()
```