
RL Foundations

Markov Decision Processes, value functions, and the exploration-exploitation trade-off


Reinforcement Learning Foundations

Reinforcement Learning (RL) is fundamentally different from supervised and unsupervised learning. Instead of learning from a fixed dataset, an agent learns by interacting with an environment, taking actions, and receiving rewards.

The RL Loop

Agent  --action-->  Environment
Agent  <--state--   Environment
Agent  <--reward--  Environment

At each time step *t*:

1. The agent observes state s_t
2. The agent chooses action a_t
3. The environment transitions to state s_{t+1}
4. The agent receives reward r_{t+1}

The goal: learn a policy (strategy) that maximizes the total cumulative reward over time.
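The loop above can be sketched directly in code. The environment below is a made-up one-dimensional walk, invented purely for illustration; real environments expose the same reset/step shape:

```python
import random

random.seed(0)  # for reproducibility

class RandomWalkEnv:
    """Toy environment: walk a line from position 0 to 5; reaching 5 ends the episode."""

    def reset(self):
        self.position = 0
        return self.position  # initial state s_0

    def step(self, action):
        # action: +1 (step right) or -1 (step left), clamped at 0
        self.position = max(0, self.position + action)
        done = self.position >= 5
        reward = 10 if done else -1  # -1 per step encourages finishing quickly
        return self.position, reward, done  # s_{t+1}, r_{t+1}, episode over?

env = RandomWalkEnv()
state = env.reset()
total_reward = 0
done = False
while not done:
    action = random.choice([-1, 1])        # a purely random policy, for now
    state, reward, done = env.step(action)
    total_reward += reward
print(f"Episode finished in state {state} with return {total_reward}")
```

A random policy eventually stumbles to the goal; the point of RL is to learn a policy that collects a higher return than this baseline.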

Markov Decision Process (MDP)

An MDP is defined by the tuple (S, A, P, R, gamma) where S is the set of states, A is the set of actions, P(s'|s,a) is the transition probability, R(s,a,s') is the reward function, and gamma is the discount factor. The Markov property means the future depends only on the current state, not the history.
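To make the tuple concrete, here is a hypothetical two-state MDP written out as plain dictionaries. The states, actions, probabilities, and rewards are all invented for illustration:

```python
states = ["sunny", "rainy"]
actions = ["walk", "drive"]
gamma = 0.9  # discount factor

# P[(s, a)] maps next state s' -> transition probability P(s'|s, a)
P = {
    ("sunny", "walk"):  {"sunny": 0.8, "rainy": 0.2},
    ("sunny", "drive"): {"sunny": 0.9, "rainy": 0.1},
    ("rainy", "walk"):  {"sunny": 0.3, "rainy": 0.7},
    ("rainy", "drive"): {"sunny": 0.5, "rainy": 0.5},
}

# R[(s, a)] is the expected immediate reward for taking action a in state s
R = {
    ("sunny", "walk"): 2.0, ("sunny", "drive"): 1.0,
    ("rainy", "walk"): -1.0, ("rainy", "drive"): 0.5,
}

# Sanity check: each transition distribution must sum to 1
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Note the Markov property in the data layout: P is keyed only by the current state and action, never by the history of earlier states.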

States, Actions, and Rewards

States describe the current situation. In a chess game, the state is the board configuration. In a self-driving car, it's the sensor readings.

Actions are what the agent can do. In chess, moving a piece. In a car, steering angle and acceleration.

Rewards are scalar feedback signals. Positive rewards encourage behavior, negative rewards discourage it. Designing good reward functions is one of the hardest parts of RL.

| Component | Chess Example | CartPole Example |
| --- | --- | --- |
| State | Board position | Cart position, pole angle, velocities |
| Action | Move a piece | Push left or right |
| Reward | +1 win, -1 lose, 0 otherwise | +1 for each step the pole stays up |

Value Functions

The state-value function V(s) tells us how good it is to be in a state: the expected discounted return starting from s:

V(s) = E[R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ... | S_t = s]

The action-value function Q(s, a) tells us how good it is to take a particular action in a state: the expected discounted return after taking a in s:

Q(s, a) = E[R_t + gamma * R_{t+1} + ... | S_t = s, A_t = a]
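Both definitions are expectations over discounted returns. As a sanity check, here is a hypothetical single-state example with two actions and known reward sequences, showing that acting greedily gives V(s) = max_a Q(s, a). The action names and numbers are invented for illustration:

```python
import numpy as np

gamma = 0.9

# Hypothetical reward sequences observed after each action in state s
rewards_after = {
    "a1": np.array([1.0, 1.0, 1.0, 1.0]),   # steady small rewards
    "a2": np.array([0.0, 0.0, 0.0, 10.0]),  # one delayed large reward
}

# Q(s, a) = sum_k gamma^k * r_k over the sequence that followed action a
discounts = gamma ** np.arange(4)
Q = {a: float(discounts @ r) for a, r in rewards_after.items()}

# Acting greedily, the value of the state is that of the best action
V = max(Q.values())  # V(s) = max_a Q(s, a)
print(Q, "V(s) =", V)
```

With gamma = 0.9, the delayed reward of 10 is worth 10 * 0.9^3 = 7.29, so "a2" beats the steady stream from "a1" (3.439), and V(s) = 7.29.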

The Bellman Equation

The Bellman equation expresses the relationship between the value of a state and the values of successor states:

V(s) = max_a [R(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s')]

This recursive relationship is the foundation of most RL algorithms. It says: the value of a state equals the best immediate reward plus the discounted value of where you end up.

```python
import numpy as np

# Simple GridWorld value iteration example
# Grid: 4x4, goal at (3, 3), reward = -1 per step, +10 on reaching the goal

def value_iteration(grid_size=4, gamma=0.9, theta=1e-6):
    """Compute optimal values using the Bellman equation."""
    V = np.zeros((grid_size, grid_size))
    actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up
    goal = (grid_size - 1, grid_size - 1)

    while True:
        delta = 0
        for i in range(grid_size):
            for j in range(grid_size):
                if (i, j) == goal:
                    continue  # The goal is terminal; its value stays 0

                v_old = V[i, j]
                values = []
                for di, dj in actions:
                    # Clamp moves at the grid edges
                    ni = max(0, min(i + di, grid_size - 1))
                    nj = max(0, min(j + dj, grid_size - 1))
                    reward = 10 if (ni, nj) == goal else -1
                    values.append(reward + gamma * V[ni, nj])

                V[i, j] = max(values)
                delta = max(delta, abs(v_old - V[i, j]))

        if delta < theta:
            break

    return V

V = value_iteration()
print("Optimal Value Function:")
print(np.round(V, 1))
```
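Once the values converge, the greedy policy can be read off with the same one-step lookahead used in the Bellman backup. The sketch below is only an illustration; it repeats value_iteration so the snippet runs on its own:

```python
import numpy as np

def value_iteration(grid_size=4, gamma=0.9, theta=1e-6):
    """Same algorithm as the listing above, repeated here so this snippet runs alone."""
    V = np.zeros((grid_size, grid_size))
    actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    goal = (grid_size - 1, grid_size - 1)
    while True:
        delta = 0
        for i in range(grid_size):
            for j in range(grid_size):
                if (i, j) == goal:
                    continue
                v_old = V[i, j]
                vals = []
                for di, dj in actions:
                    ni = max(0, min(i + di, grid_size - 1))
                    nj = max(0, min(j + dj, grid_size - 1))
                    reward = 10 if (ni, nj) == goal else -1
                    vals.append(reward + gamma * V[ni, nj])
                V[i, j] = max(vals)
                delta = max(delta, abs(v_old - V[i, j]))
        if delta < theta:
            return V

def greedy_policy(V, grid_size=4, gamma=0.9):
    """Read off the greedy action in each cell via a one-step lookahead."""
    actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    arrows = [">", "<", "v", "^"]  # printable symbols for each action
    goal = (grid_size - 1, grid_size - 1)
    policy = np.full((grid_size, grid_size), "G", dtype=object)  # G marks the goal
    for i in range(grid_size):
        for j in range(grid_size):
            if (i, j) == goal:
                continue
            best_val, best_arrow = -np.inf, "?"
            for (di, dj), arrow in zip(actions, arrows):
                ni = max(0, min(i + di, grid_size - 1))
                nj = max(0, min(j + dj, grid_size - 1))
                reward = 10 if (ni, nj) == goal else -1
                val = reward + gamma * V[ni, nj]
                if val > best_val:
                    best_val, best_arrow = val, arrow
            policy[i, j] = best_arrow
    return policy

print(greedy_policy(value_iteration()))
```

As expected, cells adjacent to the goal point straight at it, since their one-step lookahead sees the +10 reward directly.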

Discount Factor (gamma)

The discount factor gamma (between 0 and 1) controls how much the agent values future rewards vs immediate rewards:

  • gamma = 0: The agent is completely myopic, only caring about the next reward
  • gamma = 0.99: The agent is far-sighted, valuing future rewards almost as much as immediate ones
  • gamma = 1: No discounting (can be unstable in infinite-horizon problems)
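The effect of gamma is easiest to see on a fixed reward stream. Below, the same sequence of rewards is discounted at different rates; the numbers are illustrative:

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Ten small rewards, then one large delayed payoff
rewards = [1] * 10 + [100]

for gamma in (0.0, 0.5, 0.9, 0.99):
    print(f"gamma={gamma}: return = {discounted_return(rewards, gamma):.2f}")
```

With gamma = 0, only the first reward counts (return 1.0); as gamma approaches 1, the delayed payoff of 100 dominates the return. A myopic agent would never work toward it.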
Exploration vs Exploitation

This is the fundamental dilemma in RL:

  • Exploitation: Choose the action you currently believe is best
  • Exploration: Try new actions to discover potentially better rewards
If you only exploit, you might miss the best strategy. If you only explore, you never benefit from what you've learned.

Epsilon-Greedy Strategy

The simplest exploration strategy: with probability epsilon, take a random action (explore); with probability 1 - epsilon, take the greedy action (exploit). Start with a high epsilon (e.g., 1.0) and decay it over time so the agent explores more early on and exploits more as it learns.
```python
import numpy as np

class EpsilonGreedy:
    """Epsilon-greedy action selection with decay."""

    def __init__(self, n_actions, epsilon=1.0, epsilon_min=0.01, decay=0.995):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.decay = decay

    def select_action(self, q_values):
        """Select an action using the epsilon-greedy policy."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)  # Explore
        return int(np.argmax(q_values))  # Exploit

    def decay_epsilon(self):
        """Decay epsilon after each episode, down to epsilon_min."""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.decay)

# Demo
strategy = EpsilonGreedy(n_actions=4)
fake_q_values = np.array([1.2, 3.5, 0.8, 2.1])

for episode in range(5):
    action = strategy.select_action(fake_q_values)
    print(f"Episode {episode}: epsilon={strategy.epsilon:.3f}, action={action}")
    strategy.decay_epsilon()
```