Neural Networks from Scratch
Neural networks are the foundation of deep learning. Before we use any frameworks, let's build one from scratch so you truly understand what's happening under the hood.
What Is a Neural Network?
A neural network is a series of layers, each of which transforms its input using a simple formula:
$$\text{output} = \text{activation}(\text{input} \times \text{weights} + \text{bias})$$
That's it. Every layer in every neural network — from a tiny classifier to GPT — follows this pattern. The magic comes from stacking many layers and learning the right weights through training.
Anatomy of a Single Neuron
A single neuron takes a vector of inputs, multiplies each by a learned weight, adds a bias term, and passes the result through an activation function:
| Component | Role |
|---|---|
| Inputs (x) | The data flowing in (features or outputs from a previous layer) |
| Weights (W) | Learned parameters that scale each input |
| Bias (b) | A learned offset that shifts the output |
| Activation (f) | A non-linear function applied to the weighted sum |
The Core Equation

Using the components from the table, a single neuron computes:

$$y = f(x \cdot W + b)$$
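As a concrete instance of this equation, here is one neuron with three inputs worked by hand in NumPy (the specific numbers are illustrative, not from the text):

```python
import numpy as np

# A single neuron: three inputs, one output
x = np.array([1.0, -2.0, 0.5])   # inputs
W = np.array([0.4, 0.3, -0.2])   # learned weights
b = 0.1                          # learned bias

weighted_sum = x @ W + b                 # 0.4 - 0.6 - 0.1 + 0.1 ≈ -0.2
output = np.maximum(0.0, weighted_sum)   # ReLU activation: negative -> 0.0

print(weighted_sum)  # ≈ -0.2
print(output)        # 0.0
```

The dot product scales each input by its weight, the bias shifts the result, and the activation decides what gets passed on.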
Building Blocks in NumPy
Let's implement each piece from scratch. We'll start with the individual components and then assemble them into a working network.
Dense (Fully Connected) Layer
A dense layer connects every input to every output. It stores weights and biases, and computes the linear transformation.
```python
import numpy as np

class DenseLayer:
    """A fully connected layer: output = input @ weights + bias"""

    def __init__(self, input_size: int, output_size: int):
        # He initialization — good default for ReLU networks
        # Scale by sqrt(2/fan_in) to keep variance stable across layers
        self.weights = np.random.randn(input_size, output_size) * np.sqrt(
            2.0 / input_size
        )
        self.bias = np.zeros((1, output_size))

    def forward(self, inputs: np.ndarray) -> np.ndarray:
        """
        inputs shape: (batch_size, input_size)
        output shape: (batch_size, output_size)
        """
        self.inputs = inputs  # Cache for backprop
        return inputs @ self.weights + self.bias
```

Activation Functions
Activation functions introduce non-linearity. Without them, stacking layers would be useless — a chain of linear transformations is just one big linear transformation. Let's implement the two most important ones.
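A quick sanity check of that claim first: two stacked linear layers with arbitrary weights are exactly reproduced by a single merged linear layer (the matrices here are random, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))

# Two linear layers with no activation in between
W1, b1 = rng.standard_normal((3, 5)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((5, 2)), rng.standard_normal(2)
two_layers = (X @ W1 + b1) @ W2 + b2

# One merged linear layer computes the same function:
# (X @ W1 + b1) @ W2 + b2 = X @ (W1 @ W2) + (b1 @ W2 + b2)
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = X @ W + b

print(np.allclose(two_layers, one_layer))  # True
```

No matter how many linear layers you stack, the composition collapses like this; the non-linear activation is what breaks the collapse.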
```python
class ReLU:
    """Rectified Linear Unit: max(0, x)

    The most popular activation for hidden layers.
    Simple, fast, and works well in practice.
    """

    def forward(self, inputs: np.ndarray) -> np.ndarray:
        self.inputs = inputs  # Cache for backprop
        return np.maximum(0, inputs)


class Softmax:
    """Converts raw scores (logits) into probabilities.

    Used as the final activation for classification tasks.
    Output values are in [0, 1] and sum to 1.
    """

    def forward(self, inputs: np.ndarray) -> np.ndarray:
        # Subtract max for numerical stability (prevents overflow in exp)
        shifted = inputs - np.max(inputs, axis=1, keepdims=True)
        exp_values = np.exp(shifted)
        return exp_values / np.sum(exp_values, axis=1, keepdims=True)
```

Putting It Together: A Simple Neural Network
Now let's combine these pieces into a complete network. This is a 3-layer classifier that takes in feature vectors and outputs class probabilities.
```python
class SimpleNN:
    """A simple feedforward neural network.

    Architecture: Input -> Dense -> ReLU -> Dense -> ReLU -> Dense -> Softmax
    """

    def __init__(self, input_size: int, hidden_size: int, output_size: int):
        self.layer1 = DenseLayer(input_size, hidden_size)
        self.activation1 = ReLU()
        self.layer2 = DenseLayer(hidden_size, hidden_size)
        self.activation2 = ReLU()
        self.layer3 = DenseLayer(hidden_size, output_size)
        self.softmax = Softmax()

    def forward(self, X: np.ndarray) -> np.ndarray:
        """Forward pass: push data through all layers."""
        out = self.layer1.forward(X)
        out = self.activation1.forward(out)
        out = self.layer2.forward(out)
        out = self.activation2.forward(out)
        out = self.layer3.forward(out)
        out = self.softmax.forward(out)
        return out


# --- Demo: classify random data ---
np.random.seed(42)
X = np.random.randn(4, 3)  # 4 samples, 3 features each
network = SimpleNN(input_size=3, hidden_size=8, output_size=3)

probabilities = network.forward(X)
print("Input shape: ", X.shape)              # (4, 3)
print("Output shape:", probabilities.shape)  # (4, 3)
print("\nPredicted probabilities:")
print(probabilities)
print("\nRow sums (should be ~1.0):", probabilities.sum(axis=1))
print("Predicted classes:", np.argmax(probabilities, axis=1))
```

What About Backpropagation?
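A full treatment of backpropagation deserves its own section, but note that the `inputs` cached in each `forward` method above exist precisely for it. As a sketch (one common formulation, written here as standalone functions rather than methods on the classes above), the backward passes for a dense layer and ReLU look like this, verified against a finite-difference gradient:

```python
import numpy as np

def dense_backward(grad_out, inputs, weights):
    """Gradients for output = inputs @ weights + bias."""
    grad_weights = inputs.T @ grad_out               # dL/dW
    grad_bias = grad_out.sum(axis=0, keepdims=True)  # dL/db
    grad_inputs = grad_out @ weights.T               # dL/dx, sent to earlier layer
    return grad_inputs, grad_weights, grad_bias

def relu_backward(grad_out, inputs):
    """ReLU passes gradient through only where the input was positive."""
    return grad_out * (inputs > 0)

# Check one weight gradient numerically (toy shapes, illustrative only)
rng = np.random.default_rng(1)
x = rng.standard_normal((2, 3))
W = rng.standard_normal((3, 4))
b = np.zeros((1, 4))

out = x @ W + b
grad_out = np.ones_like(out)  # pretend dL/dout = 1 everywhere, so L = out.sum()
gx, gW, gb = dense_backward(grad_out, x, W)

eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
numeric = ((x @ W_pert + b).sum() - out.sum()) / eps
print(np.isclose(gW[0, 0], numeric))  # True
```

Each backward function receives the gradient flowing in from the layer after it and returns the gradient for the layer before it, mirroring the forward chain in reverse.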
The Forward Pass Step by Step
Let's trace data through our network to make the flow concrete:
1. Input: A batch of feature vectors, shape (batch_size, input_size)
2. Layer 1: Linear transform X @ W1 + b1 — projects input to hidden dimension
3. ReLU: Zeros out negatives — introduces non-linearity
4. Layer 2: Another linear transform — learns more complex combinations
5. ReLU: More non-linearity
6. Layer 3: Final linear transform — projects to number of classes
7. Softmax: Converts raw scores to probabilities
Each layer's output becomes the next layer's input. The whole thing is just a chain of simple operations.
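The seven steps above can be traced with plain matrices, printing the shape at each stage (dimensions chosen to match the demo: batch of 4, 3 features, hidden size 8, 3 classes):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((4, 3))   # 1. input: (batch_size, input_size)

W1, b1 = rng.standard_normal((3, 8)), np.zeros((1, 8))
W2, b2 = rng.standard_normal((8, 8)), np.zeros((1, 8))
W3, b3 = rng.standard_normal((8, 3)), np.zeros((1, 3))

h1 = np.maximum(0, X @ W1 + b1)   # 2-3. dense + ReLU: (4, 8)
h2 = np.maximum(0, h1 @ W2 + b2)  # 4-5. dense + ReLU: (4, 8)
logits = h2 @ W3 + b3             # 6.   dense: (4, 3)

# 7. softmax: rows become probabilities
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

for name, arr in [("X", X), ("h1", h1), ("h2", h2), ("probs", probs)]:
    print(name, arr.shape)
```

Only the first and last dimensions are fixed by the data and the number of classes; the hidden width in between is a free design choice.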
Why Depth Matters
A single layer (linear + activation) can only learn simple decision boundaries. By stacking layers:

- Early layers learn simple, low-level patterns in the raw features
- Middle layers combine those patterns into more abstract intermediate features
- Later layers use those abstractions to carve out complex decision boundaries

This hierarchical feature learning is what makes deep networks so powerful.
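Even one hidden layer already buys real expressive power. As a tiny constructed illustration (not from the text above): two ReLU units can represent the absolute-value function, which no single linear layer can, because |x| is not linear:

```python
import numpy as np

x = np.linspace(-3, 3, 7).reshape(-1, 1)  # column of inputs

# Hidden layer: two ReLU units computing relu(x) and relu(-x)
W1 = np.array([[1.0, -1.0]])
h = np.maximum(0, x @ W1)

# Output layer: summing the two units yields |x|
W2 = np.array([[1.0], [1.0]])
y = h @ W2

print(np.allclose(y, np.abs(x)))  # True
```

The hidden units each handle one "piece" of the function, and the output layer stitches the pieces together; deeper networks repeat this trick at scale.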