The mathematical concepts behind neural networks are not new. The perceptron (the earliest single-neuron model) was invented in 1958. Backpropagation was formalized in the 1970s. For decades, the AI community oscillated between extreme hype and total abandonment—periods known as AI Winters.
During the AI Winters, neural networks were widely considered a theoretical dead end. The math worked on paper, but the computers of the time lacked the processing power to train large networks, and researchers lacked the massive datasets required to teach them.
The "Spring" finally arrived in the 2010s, driven by two forces running on a collision course: the explosion of data on the internet, and the realization that Video Graphics Cards (GPUs), designed to rapidly multiply matrices for video game polygons, could be hijacked to multiply the massive weight matrices of deep neural networks in parallel.
Suddenly, what took months to compute took hours. Researchers began adding more and more "Hidden Layers" to their models—hence the term Deep Learning. Let's look at how the layers and math actually work.
Mathematics
Before we can build an artificial brain, we have to understand the language it speaks. Neural networks do not understand images, words, or sounds. They only understand numbers and the geometric spaces they live in. This is the realm of Linear Algebra.
There are three primary structures you need to know: scalars (single numbers), vectors (ordered lists of numbers), and matrices (grids of numbers).
Let's make this concrete. Below is a 2D coordinate system with a single Vector (the blue arrow). We also have a 2x2 Matrix. Change the numbers in the matrix below, then click TRANSFORM to see how the matrix mathematically warps space, moving the vector to a new location.
Drag the head of the vector to move it. The Matrix multiplies against the Vector. This spatial transformation is the foundational operation of every neural network.
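The demo's core operation can be sketched in a few lines of Python: multiplying a 2x2 matrix against a 2D vector, where each output component is the dot product of one matrix row with the vector. The rotation matrix below is an illustrative example, not the demo's default values.

```python
# Multiply a 2x2 matrix against a 2D vector: each output component is
# the dot product of one matrix row with the vector.
def transform(matrix, vector):
    return [matrix[0][0] * vector[0] + matrix[0][1] * vector[1],
            matrix[1][0] * vector[0] + matrix[1][1] * vector[1]]

rotate_90 = [[0, -1],
             [1,  0]]  # rotates the plane 90 degrees counter-clockwise

print(transform(rotate_90, [1, 0]))  # [0, 1] -- the x-axis maps onto the y-axis
```

Every layer of a neural network performs exactly this kind of multiplication, just with far bigger matrices.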
Machine Learning
Imagine you are sorting apples and lemons. You have a machine that measures two things for every piece of fruit: its weight (the horizontal x-axis) and its yellowness (the vertical y-axis).
If a new, unknown fruit comes down the conveyor belt, how do we decide if it’s an apple or a lemon? The simplest way is to draw a line separating the pink dots (apples) from the cyan dots (lemons). Everything on one side is an apple; everything on the other is a lemon.
Congratulations. You have just built the simplest possible neural network: a single artificial neuron.
In mathematics, the equation for that line is y = w * x + b.
Our neuron takes the inputs, multiplies them by the weights, adds the bias, and uses that line to make a decision.
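Here is that single neuron as a minimal sketch. The weights and bias are invented for illustration, not learned values; the inputs are (weight in grams, yellowness from 0 to 1).

```python
# A single artificial neuron: multiply inputs by weights, add the bias,
# and check which side of the line the point falls on.
def neuron(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # the line: w * x + b
    return 1 if z > 0 else 0  # 1 = lemon, 0 = apple (our convention)

w = [-0.05, 2.0]  # heavier pushes toward "apple", yellower toward "lemon"
b = 1.0

print(neuron([150, 0.2], w, b))  # heavy, green fruit -> 0 (apple)
print(neuron([50, 0.9], w, b))   # light, yellow fruit -> 1 (lemon)
```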
FUNDAMENTALS
A single neuron gives us one straight line to split the world. But what if the data isn't perfectly separated by one line? What if we have a square of apples surrounded by lemons?
This is where Layers come in. If one neuron draws one line, three neurons can draw three lines. By feeding those three lines into another layer, the network can combine them to form a triangle bounding box. Add a hundred neurons, and it can trace any polygon.
This is the essence of a Multi-Layer Perceptron. The Hidden Layer learns different lines (features), and the Output Layer combines them to make the final decision.
Click to add active neurons to the hidden layer. Watch how adding more lines allows the network to carve out a more complex boundary to separate the inner and outer dots.
A straight line is great for simple problems. But the real world is messy. What if we are sorting unripe apples, ripe apples, and lemons? The data might form a ring—lemons in the center, apples forming a circle around them. Try as you might, no single straight line can ever fence in a circle of dots.
Our simple mathematical line has reached its limit. We need a way to bend the mathematical landscape. To do this, we pass the output of our line equation through an activation function.
The most common activation function used today is called ReLU (Rectified Linear Unit). Its rule is beautifully simple: If a number is less than zero, make it zero. Otherwise, leave it alone.
Toggle the switch above. You will see that everything dipping below the "zero" floor suddenly snaps flat, like a piece of paper being creased against a desk. We have introduced a hinge into our mathematics.
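In code, the hinge is a one-liner:

```python
def relu(z):
    # Anything dipping below the zero floor snaps flat; positives pass through.
    return max(0.0, z)

print([relu(z) for z in [-2.0, -0.5, 0.0, 1.5]])  # [0.0, 0.0, 0.0, 1.5]
```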
A single hinge doesn't solve a circle problem. But if you have hundreds of neurons in a layer, you get hundreds of intersecting hinges. By angling and stacking these folds, a neural network can sculpt its math to form a bowl, a sphere, or a boundary so complex it perfectly wraps around the data of a human face.
FUNDAMENTALS
Behind the planes and folds is a graph of numbers. Each line connecting two neurons represents a mathematical weight. Below is a network with an input layer, a hidden layer of three neurons, and a single output.
Press play to observe the network dynamically updating its weights. Cyan connections denote positive weights; pink denote negative. Weak connections fade into the void; dominant pathways thicken to route the mathematical logic.
How does the machine know which weights to alter? We measure its inaccuracy using a Loss Function. The resulting mathematical landscape is a valley of error.
Imagine the Loss Function as a massive, multi-dimensional mountain range. Our network is a blind hiker standing somewhere on this landscape. Its goal is to find the deepest valley. Since it cannot see the whole mountain range, it uses calculus to feel the steepness of the slope directly under its feet, and takes a small step downward. This is Gradient Descent.
When you click "Step", the computer calculates the exact steepness of the slope. It then multiplies that slope by the Learning Rate (the size of the step) to update the weight. If the step size is too large, the hiker violently bounces out of the valley!
FUNDAMENTALS
We have just walked a blind hiker down an abstract mountain of error. But let's see Gradient Descent solve a real problem. Suppose you have a handful of data points—perhaps the relationship between hours studied and exam scores—and you want to find the single straight line that best explains the trend.
This is Linear Regression, the simplest form of supervised learning. We take our familiar equation y = w * x + b, start with a random weight and bias, and ask: "How wrong is this line?" We measure the wrongness with Mean Squared Error—the average of the squared distances between each point and the line. Then Gradient Descent nudges w and b downhill, step by step, until the line settles into the best possible fit.
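Here is a minimal sketch of exactly this procedure, on invented (hours studied, exam score) data that roughly follows y = 10x + 20:

```python
# Linear Regression by Gradient Descent on Mean Squared Error.
data = [(1, 31), (2, 40), (3, 49), (4, 61), (5, 70)]

w, b, lr = 0.0, 0.0, 0.02
for _ in range(5000):
    n = len(data)
    # Gradients of Mean Squared Error with respect to w and b.
    dw = sum(2 * (w * x + b - y) * x for x, y in data) / n
    db = sum(2 * (w * x + b - y) for x, y in data) / n
    w -= lr * dw  # nudge both parameters a small step downhill
    b -= lr * db

print(round(w, 1), round(b, 1))  # settles into the best-fit line
```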
Click "FIT LINE" to watch Gradient Descent nudge the line into place. The pink dashed lines show each point's error. Watch the MSE loss shrink in real time.
This is the atom of machine learning. A neural network is just this idea scaled up—instead of fitting one line with two parameters, it fits millions of parameters across thousands of non-linear hinges to approximate any function imaginable.
FUNDAMENTALS
Linear regression fits a straight line to predict continuous values. But what if we need to make a decision—apple or lemon, spam or not spam, tumour or healthy? The output should be a probability between 0 and 1, not an unbounded number.
The trick is beautifully simple: take the familiar linear equation z = w·x + b and pass it through a Sigmoid function: σ(z) = 1 / (1 + e^(-z)). The sigmoid squashes any number into the range (0, 1), creating an S-shaped curve. Values above 0.5 are classified as one class; values below as the other. This is Logistic Regression—the simplest classifier in machine learning.
To train it, we replace Mean Squared Error with Binary Cross-Entropy, a loss function specifically designed for probabilities. It penalizes confident wrong predictions much more heavily than uncertain ones. Gradient Descent then adjusts the weight and bias to rotate and shift the decision boundary until it cleanly separates the two classes.
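The training loop can be sketched on invented 1D data: class 0 below x = 2.5, class 1 above it.

```python
import math

data = [(0.5, 0), (1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1), (4.5, 1)]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w, b, lr = 0.0, 0.0, 0.1
for _ in range(3000):
    # The gradient of Binary Cross-Entropy reduces to (prediction - label).
    dw = sum((sigmoid(w * x + b) - y) * x for x, y in data) / len(data)
    db = sum((sigmoid(w * x + b) - y) for x, y in data) / len(data)
    w -= lr * dw
    b -= lr * db

def predict(x):
    return 1 if sigmoid(w * x + b) > 0.5 else 0

print(predict(1.0), predict(4.0))  # 0 1
```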
Click "FIT CLASSIFIER" to watch Gradient Descent train the sigmoid. The left plot shows the S-curve fitting the data; the right shows the decision boundary and Binary Cross-Entropy loss.
The sigmoid at the heart of logistic regression is the same activation function used inside the neurons of classification networks. When we stack many neurons with sigmoids (or similar functions) across layers, we go from drawing a single straight boundary to sculpting arbitrarily complex decision surfaces.
BEGINNER
We've spent a lot of time talking about drawing lines and curves. Neural networks attempt to fit a continuous function across all the data. But what if we tried a completely different strategy? What if we just split the space into boxes?
This is the core idea behind Decision Tree Learning. Instead of finding one single elegant curve, a decision tree asks a sequence of simple "yes/no" questions based on the features. Is the weight greater than 50g? Is the color more red or green? Each question slices the data across one axis.
Algorithms like CART (Classification and Regression Trees) figure out the best questions to ask by finding the split that maximizes purity (using metrics like Information Gain or Gini Impurity)—that is, the split that separates the classes as cleanly as possible. By continuing to split, the tree can map highly complex, non-linear boundaries just by drawing lots of straight, axis-aligned lines.
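Here is how one candidate split is scored with Gini Impurity, on an invented 1D dataset of (feature value, class label) pairs:

```python
# Score a split threshold the way CART does: weighted Gini Impurity
# of the two resulting sides (0.0 means both sides are perfectly pure).
points = [(1, 'apple'), (2, 'apple'), (3, 'apple'), (7, 'lemon'), (8, 'lemon')]

def gini(labels):
    # 1 minus the sum of squared class proportions.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_score(points, threshold):
    left  = [label for x, label in points if x <= threshold]
    right = [label for x, label in points if x > threshold]
    n = len(points)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Try every integer threshold and keep the purest split.
best = min(range(1, 8), key=lambda t: split_score(points, t))
print(best, split_score(points, best))  # threshold 3 gives perfectly pure boxes
```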
Click SPLIT to watch the algorithm recursively divide the space until every box is purely one class. Notice the jagged, blocky boundaries compared to neural networks.
We've talked about drawing boundaries to separate apples from lemons when we already know the answer (Supervised Learning). But what if you are handed a massive dataset of customer behavior, or galaxies, and you don't know the categories in advance? You just want the machine to find the natural groups.
This is Unsupervised Learning, and the most famous algorithm to do it is K-Means Clustering.
The algorithm is elegantly simple:
1. Initialize: Drop K random center points (centroids) into the data.
2. Assign: Every data point looks at all the centroids and colors itself to match the closest one.
3. Move: Each centroid looks at all its matching points and moves to their exact center (the mean).
Repeat steps 2 and 3 until the centroids stop moving.
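The three steps above can be sketched for 1D points with K = 2. The data and starting centroids are invented, and real implementations handle ties and empty clusters.

```python
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centroids = [0.0, 5.0]  # step 1: initial centroids (fixed here, random in practice)

for _ in range(10):
    # Step 2: assign each point to its nearest centroid.
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda k: abs(p - centroids[k]))
        clusters[nearest].append(p)
    # Step 3: move each centroid to the mean of its points.
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # [1.5, 8.5] -- the two natural groups, found without labels
```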
Try changing K to match the obvious groups in the data. Click STEP to watch the algorithm organically discover the clusters without any human labels.
Now we combine everything: the multi-node graph architecture, the non-linear ReLU hinges, and the automated descent.
Instead of directly predicting "apple" or "lemon" from our inputs, we pass those inputs into a Hidden Layer. Each hidden neuron draws a line and applies a ReLU fold. Finally, we take the output of those neurons and feed them into one final Output Neuron which adds the folded landscapes together.
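The forward pass of that tiny network looks like this. All weights below are illustrative placeholders, not trained values.

```python
# A hidden layer of three ReLU neurons feeding one output neuron.
def relu(z):
    return max(0.0, z)

def forward(x, hidden_w, hidden_b, out_w, out_b):
    # Each hidden neuron: a line (dot product + bias), then a ReLU fold.
    h = [relu(sum(w * xi for w, xi in zip(ws, x)) + b)
         for ws, b in zip(hidden_w, hidden_b)]
    # The output neuron adds the folded landscapes together.
    return sum(w * hi for w, hi in zip(out_w, h)) + out_b

hidden_w = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_b = [0.0, 0.0, 2.0]
out_w = [1.0, 1.0, 1.0]

print(forward([0.5, 0.5], hidden_w, hidden_b, out_w, 0.0))  # 2.0
```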
Drag the slider to watch Gradient Descent twist three separate lines until they intersect and form a perfect triangle, fencing in the center data.
With one hidden layer of three neurons, we created a triangle. With a hundred neurons, we could create a perfectly smooth circle. With thousands of neurons across dozens of layers, we can build a boundary so complex it can parse human language.
INTERMEDIATE
Don't just take our word for it. There is a profound mathematical result called the Universal Approximation Theorem: a neural network with a single hidden layer and enough neurons can approximate any continuous function to arbitrary precision. Not just triangles or circles—literally any shape you can imagine.
Below is a live proof. Draw any curve you like on the canvas—a wave, a zigzag, a heartbeat, your signature, anything. Then click LEARN and watch a real neural network train itself to match your creation, one gradient descent step at a time. Adjust the number of hidden neurons to see how network capacity affects the fit. Two neurons can only make broken lines. Fifty neurons can trace almost anything.
Draw any curve with your mouse or finger. Click LEARN to watch the neural network approximate it. Change the neuron count to see how capacity affects the fit.
This is the deepest lesson in this entire guide. A neural network is not a special kind of intelligence. It is a universal function approximator—a mathematical clay that, given enough neurons and enough training, can be sculpted into any shape. The magic isn't in the machine. The magic is in the math.
INTERMEDIATE
We know how the hiker takes a step downhill (Gradient Descent), but how does the hiker know the precise slope of the hill in a landscape with millions of dimensions?
The answer is Backpropagation, an elegant algorithmic application of the Chain Rule from calculus. When the network makes a prediction, it compares it to the truth and calculates an error. This error doesn't just sit at the output; it flows backwards through the graph.
Imagine the final output neuron realizing, "I predicted this was an apple, but it was a lemon. My output was far too high!" The output neuron then looks at the hidden neurons feeding into it and passes the "blame" back to them proportionally. If a hidden neuron fired strongly through a large positive weight, it receives more of the blame for the mistake.
In turn, those hidden neurons look at the input layer and distribute the blame further backwards. At every single connection, the algorithm computes a simple partial derivative: "If I increase this specific weight by a tiny fraction, exactly how much does the final system error change?"
This cascade of blame allows the network to compute the perfect downward slope for every single weight simultaneously in a single, efficient backward sweep.
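The blame cascade can be sketched for the smallest possible chain: x → h = relu(w1·x) → prediction = w2·h, with squared-error loss. By the Chain Rule, the blame for w1 flows back through w2 and the ReLU.

```python
# Backpropagation by hand for a two-weight chain.
def backprop_one_example(x, y, w1, w2):
    # Forward pass.
    h = max(0.0, w1 * x)                     # hidden neuron with ReLU
    pred = w2 * h
    loss = (pred - y) ** 2
    # Backward sweep: a partial derivative at every connection.
    d_pred = 2 * (pred - y)                  # dLoss/dPred
    d_w2 = d_pred * h                        # blame assigned to w2
    d_h = d_pred * w2                        # blame flowing back to the hidden neuron
    d_w1 = d_h * (x if w1 * x > 0 else 0.0)  # ReLU passes blame only if it fired
    return loss, d_w1, d_w2

print(backprop_one_example(x=1.0, y=2.0, w1=1.0, w2=1.0))  # (1.0, -2.0, -2.0)
```

Both weights receive negative gradients, so Gradient Descent will push them both up, raising the prediction toward the target.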
Watch the red "error" signal flow backwards from the output. Thick connections carry more blame and will be adjusted more aggressively by Gradient Descent.
If we update the weights after every single piece of fruit (Stochastic Gradient Descent), the hiker's path down the mountain will be chaotic and erratic, bouncing around as it overreacts to the quirks of individual apples and lemons.
Instead, we group the data into Batches. The network looks at, say, 32 pieces of fruit, calculates the average gradient (the average direction downhill across all 32 examples), and then takes a step. This averages out the noise and smooths the journey.
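The smoothing effect is easy to demonstrate: the average of many noisy per-example gradients is a far better estimate of the true downhill direction (here, 1.0) than any single example. The noise model is invented for illustration.

```python
import random

random.seed(0)

def noisy_gradient():
    return 1.0 + random.uniform(-1, 1)  # true direction plus per-example noise

trials = 1000
err_single = sum(abs(noisy_gradient() - 1.0) for _ in range(trials)) / trials
err_batch32 = sum(abs(sum(noisy_gradient() for _ in range(32)) / 32 - 1.0)
                  for _ in range(trials)) / trials
print(err_single > err_batch32)  # True: a batch of 32 averages out the noise
```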
Notice how pure Stochastic Gradient Descent (Batch Size 1) bounces violently, while Larger Batches provide a smooth, direct path to the center valley.
An Epoch occurs when the network has looked at the entire dataset exactly once. A real network might train for hundreds or thousands of epochs, repeatedly looping over the data until the loss function bottoms out in the deepest valley it can find.
ADVANCED
We just saw how standard Stochastic Gradient Descent (SGD) works. It takes a step in the direction of the steepest downward slope for the current mini-batch. But what if the landscape is a long, narrow ravine? SGD will stubbornly bounce back and forth across the steep walls of the ravine, making agonizingly slow progress toward the actual minimum at the bottom.
To fix this, researchers introduced two ideas: Momentum and Adaptive Learning Rates. Momentum is like rolling a heavy ball down a hill: it builds up speed in consistent directions and blasts through small bumps. Adaptive Learning Rates (like RMSProp) keep a running average of recent gradient sizes, shrinking the step size in steep dimensions to stop the bouncing, while carefully increasing it in flat dimensions to speed up progress.
When you combine these two ideas—Momentum and RMSProp—you get Adam (Adaptive Moment Estimation). Introduced in 2014, it is arguably the most famous and widely used optimizer in all of deep learning. By maintaining moving averages of both the gradients and the squared gradients, Adam dynamically adapts its learning rate for every single parameter, sprinting down ravines while standard SGD slowly zig-zags.
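The Adam update for a single parameter can be sketched directly from this description, using the standard constants (beta1 = 0.9, beta2 = 0.999). The toy loss below is our own choice for demonstration.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad         # momentum: moving average of gradients
    v = b2 * v + (1 - b2) * grad ** 2    # RMSProp: moving average of squared gradients
    m_hat = m / (1 - b1 ** t)            # bias corrections for the early steps
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimize the toy loss(w) = w^2 (gradient 2w), starting at w = 5.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, grad=2 * w, m=m, v=v, t=t)
print(abs(w) < 0.5)  # True -- Adam has walked down to the valley floor
```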
Click RACE to drop two optimizers into a complex 2D loss landscape. You'll see the red dot (SGD) stubbornly bouncing back and forth down the ravine walls, while the green dot (Adam) quickly dampens the oscillations and sprints toward the minimum!
We've continuously added more power to our network: multi-dimensional hinges, deep layers, and adaptive optimizers. But power introduces a dangerous new problem. If you give a network millions of parameters to solve a simple dataset, it might stop looking for the underlying logical pattern and instead just memorize the specific answers to the training data. This is called Overfitting.
Imagine a student taking practice tests. Instead of learning the principles of calculus, they just memorize the fact that "Question 4 is always C." They will score 100% on the practice test, but catastrophically fail the real exam. A neural network does exactly the same thing: an overfit model draws wild, oscillating boundaries that perfectly hit every training dot but fail completely on new, unseen data.
Drag the slider to adjust the model's complexity (polynomial degree). Low complexity underfits. High complexity overfits, wildly swinging to hit every noise point. Optimal complexity captures the true underlying wave.
To prevent this, AI researchers use Regularization. Techniques like Dropout randomly turn off a percentage of neurons during training, forcing the network to not rely on any single node. L2 weight decay mathematically penalizes large weights during gradient descent, encouraging the network to use the simplest possible mathematical explanation.
Language
Before we teach a neural network to understand language, let's look at the simplest mathematical model for sequences. Imagine predicting the weather: if it is sunny today, what will it be tomorrow? A Markov Chain answers this with a single, elegant rule: the future depends only on the present, not on the past. It doesn't matter whether it has been sunny for a week or just one day — the probability of rain tomorrow is the same.
Formally, a Markov Chain is a set of states connected by transition probabilities. From each state, there is a fixed probability of jumping to every other state (including staying put). These probabilities are captured in a Transition Matrix, where each row sums to 1. Given any starting state, we can simulate the chain stepping forward, generating a random sequence governed entirely by these probabilities.
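Here is a sketch of a two-state weather chain with an invented transition matrix (each row sums to 1), simulated long enough for the visit counts to approach the stationary distribution:

```python
import random

T = {'Sunny': {'Sunny': 0.8, 'Rainy': 0.2},
     'Rainy': {'Sunny': 0.5, 'Rainy': 0.5}}

def step(state):
    # Sample the next state according to the current row of T.
    r = random.random()
    for next_state, p in T[state].items():
        if r < p:
            return next_state
        r -= p
    return next_state  # guard against floating-point edge cases

random.seed(42)
visits = {'Sunny': 0, 'Rainy': 0}
state = 'Sunny'
for _ in range(10000):
    state = step(state)
    visits[state] += 1

print(visits)  # roughly 5:2 -- the stationary distribution is (5/7, 2/7)
```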
Click "STEP" to advance the chain one state. The active state glows. Arrow thickness shows transition probability. Watch the visit histogram build up — it converges to the chain's stationary distribution.
This is surprisingly powerful for simple sequences, and it's the mathematical ancestor of every language model. But the Markov property — remembering only the current state — is also its fatal limitation. It cannot capture long-range dependencies: the meaning of a word that depends on a sentence spoken three paragraphs ago. Neural networks solve this by learning to compress arbitrary history into a hidden state.
SEQUENCE
Everything we have built so far — from feed-forward classifiers to Markov Chains — shares a common limitation: they have no memory of the past. A Markov Chain looks only at the current state. A feed-forward network processes each input in isolation. Neither can tell if "bank" means a river bank or a financial institution, because neither remembers the words that came before it.
A Recurrent Neural Network (RNN) solves this by adding a loop. At each step in a sequence, the network takes two inputs: the current data point and its own output from the previous step. This recycled output acts as a hidden state—a compressed memory of everything the network has seen so far. Mathematically: h(t) = f(x(t), h(t-1)).
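Here is h(t) = f(x(t), h(t-1)) sketched for a single neuron, with illustrative weights: the new hidden state blends the current input with the previous state, squashed by tanh.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.9, b=0.0):
    return math.tanh(w_x * x + w_h * h_prev + b)

# A single "event" followed by silence: the loop recycles the hidden
# state, so the first input keeps echoing through the memory.
h = 0.0
for x in [1.0, 0.0, 0.0, 0.0]:
    h = rnn_step(x, h)
print(round(h, 4))  # still well above zero, three steps later
```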
This simple loop is surprisingly powerful. An RNN can learn to predict the next word in a sentence, the next note in a melody, or—most strikingly—the next state of an entire environment. When an RNN learns to predict "what happens next" given an action, it has built a World Model: an internal simulation of the environment's dynamics. Instead of interacting with the real world, an agent can "imagine" the consequences of its actions inside this learned model.
Move the agent with LEFT/RIGHT. The RNN's hidden state (middle row) updates each step, and it tries to predict the agent's next position (ghost dot). Watch prediction accuracy improve.
But the recurrence loop conceals a deep problem. During backpropagation through time, gradients must flow backwards through every step in the sequence. Over long sequences, these gradients either shrink to nothing (vanishing gradients) or explode to infinity. This means an RNN struggles to connect a word at the end of a paragraph to a word at the beginning. Variants like LSTMs and GRUs add gating mechanisms to control memory flow, but even they hit limits on very long sequences—and their sequential nature makes them impossible to parallelize on modern GPUs.
ADVANCED
We have just seen how RNNs solve the sequence problem by maintaining a hidden state—but at the cost of vanishing gradients and sequential bottlenecks. In 2017, a paper titled "Attention Is All You Need" proposed a radical alternative: throw away the recurrence entirely and introduce the Transformer architecture.
The core innovation of the Transformer is Self-Attention. Instead of reading a sentence sequentially, a Transformer looks at every single word simultaneously. For every word, it calculates an "attention score" representing how strongly that word relates to every other word in the sequence.
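The score computation can be sketched for one focal word: dot products against every word, then a softmax. Real Transformers learn separate query and key projections and scale the scores by the square root of the dimension; here we use hand-picked 2D vectors to keep the arithmetic visible.

```python
import math

words = ['bank', 'river', 'money']
keys = {'bank': [1.0, 1.0], 'river': [1.0, 0.0], 'money': [0.0, 1.0]}

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    return [e / sum(exps) for e in exps]

def attention_weights(focal_word):
    q = keys[focal_word]  # simplification: query = the word's own vector
    # Dot product of the query against every key, then normalize.
    scores = [sum(a * b for a, b in zip(q, keys[w])) for w in words]
    return dict(zip(words, softmax(scores)))

print(attention_weights('bank'))  # highest weight on itself, split over the rest
```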
Drag the slider to change the current focal word. Watch how the attention mechanism dynamically shifts its weights to figure out the context. "Bank" attends heavily to "River".
We've just seen how Transformers use attention to understand relationships between words in a sentence. But there's a deeper question we've been quietly ignoring: how does the network see "words" in the first place? A neural network is a mathematical machine. It cannot read the letter "a." It needs numbers.
The process of converting raw text into a sequence of numbers is called Tokenisation. It is the critical first step in every language model, and the choice of tokeniser fundamentally shapes what the model can learn.
The naïve approach is to split on whitespace and assign each unique word an ID. But this creates a massive vocabulary—English alone has over 170,000 words, plus every name, typo, and technical term. Worse, the model would see "running", "runner", and "runs" as completely unrelated entries, learning nothing about their shared root.
The opposite extreme—splitting into individual characters—keeps the vocabulary tiny (just ~256 entries), but each character carries almost no meaning. The model must burn enormous capacity just rediscovering that "t", "h", "e" often appear together.
The solution used by GPT, LLaMA, and virtually every modern LLM is Byte Pair Encoding (BPE). The idea is elegant: start with individual characters, then repeatedly merge the most frequent adjacent pair into a single new token. Common words like "the" quickly become single tokens, while rare words are gracefully decomposed into familiar subword pieces.
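One BPE round can be sketched as: count adjacent pairs, then fuse the most frequent pair into a new token. Real tokenisers work on bytes and keep a learned merge table; this toy works on characters.

```python
from collections import Counter

def bpe_merge_step(tokens):
    pairs = Counter(zip(tokens, tokens[1:]))
    (a, b), _ = pairs.most_common(1)[0]  # the most frequent adjacent pair
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b)  # fuse the pair into a single new token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("the cat the hat")
for _ in range(3):
    tokens = bpe_merge_step(tokens)
print(tokens)  # the common subword "the" has fused into a single token
```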
Click "MERGE STEP" to perform one round of BPE. The most frequent adjacent pair (highlighted in pink) merges into a single token. Watch common subwords emerge naturally from the data.
After enough merges, the tokeniser has built a vocabulary of subword units—neither too large nor too small. Each token is then mapped to a unique integer ID, and that sequence of IDs is what actually enters the Transformer. The model never sees raw text; it sees a river of numbers, shaped by this algorithm.
ADVANCED
Tokenisation gives us integer IDs, but a raw number like 4,271 contains no meaning. The network cannot tell that "king" (token 4,271) is conceptually closer to "queen" (token 8,903) than to "refrigerator" (token 15,440). We need a richer representation.
The solution is an Embedding: each token ID is looked up in a giant table that maps it to a vector—a list of numbers representing a coordinate in high-dimensional space. In GPT-3, every token becomes a vector of 12,288 numbers. These numbers are not hand-crafted; they are learned weights, trained through backpropagation just like any other part of the network.
The magic is in what the network discovers. During training, words that appear in similar contexts ("king" near "throne", "queen" near "throne") get pushed toward similar regions of the space. The geometry of this space encodes meaning: directions correspond to concepts. The famous result king − man + woman ≈ queen is not a programmed rule—it is an emergent property of the learned geometry.
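The analogy is literally vector arithmetic. The tiny 2D "embeddings" below are chosen by hand to illustrate the geometry; real embeddings are learned and have hundreds or thousands of dimensions.

```python
# Hand-crafted toy embeddings: axis 0 ~ "royalty", axis 1 ~ "male"
# (both invented for illustration).
vectors = {
    'king':  [1.0, 1.0],
    'queen': [1.0, 0.0],
    'man':   [0.0, 1.0],
    'woman': [0.0, 0.0],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

# king - man + woman, then find the nearest word by squared distance.
result = add(sub(vectors['king'], vectors['man']), vectors['woman'])
nearest = min(vectors, key=lambda w: sum((r - v) ** 2
                                         for r, v in zip(result, vectors[w])))
print(nearest)  # queen
```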
Below is a 2D projection of a small token embedding space. Semantically related words cluster together, and the direction between clusters encodes relationships.
Hover over any word to see its vector. Click "SHOW ANALOGY" to watch king − man + woman = queen unfold as pure vector arithmetic.
This is why modern language models understand context so deeply. Every word in a sentence is not just an ID—it is a point in a vast mathematical landscape, and the Transformer's attention mechanism operates on these coordinates, computing distances and directions to derive meaning.
GENERATIVE
Every technique we have built so far is supervised learning: a human must hand-label every piece of training data. "This image is a cat." "This fruit is lemon." "This email is spam." For every example the network learns from, a human had to sit down and write the answer. This is the bottleneck. Labeling millions of images or transcripts is expensive and slow.
What if the labels could come from the data itself?
This is the key insight behind Self-Supervised Learning. Instead of asking humans for answers, we hide part of the input and train the network to predict the missing piece. Take a sentence like "The cat sat on the ___". We don't need a human to tell us the answer is probably "mat"—the structure of language itself provides the supervision. The target is already inside the data.
This single idea unlocked the modern AI revolution. By removing the need for human labels, researchers could suddenly train on the entire internet—billions of sentences, trillions of words—without annotating a single example. The resulting models, called Foundation Models, learn such rich representations of language that they can be prompted to perform tasks they were never explicitly trained for: translation, summarization, code generation, and creative writing.
Click "Mask Next Word" to hide a word. The model predicts what belongs there using only the surrounding context—no human label needed. The target was inside the input all along.
The same principle powers image generation. Diffusion models are trained to look at a noisy image and predict what the clean version looks like—the "label" is just the original image before noise was added. This is self-supervised learning applied to pixels instead of words.
Vision
We've talked about classifying fruit based on two numbers: weight and yellowness. But what if our input isn't a couple of numbers, but an image—say, a 28x28 grid of pixels? That's 784 numbers. If we passed those 784 numbers into a standard feed-forward layer with 1,000 neurons, we would need 784,000 weights just for the first layer.
Worse, traditional networks don't understand spatial structure. To a standard network, a circle in the top-left corner is completely mathematically unrelated to a circle in the bottom-right corner. It has to learn to identify circles all over again for every possible position.
To solve this, researchers invented the Convolutional Neural Network (CNN). Instead of connecting every pixel to every neuron, a CNN slides a small window—called a Kernel or Filter—across the image.
Watch the 3x3 filter slide across the input image. This specific filter looks for vertical edges, lighting up bright cyan when it finds a transition from dark to light.
As the filter slides, it calculates a dot product at every location, creating a new "feature map". By learning the exact numbers inside these 3x3 filters through Backpropagation, the network organically learns to detect edges, then corners, then eyes, then faces.
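The sliding dot product can be sketched with the same vertical-edge kernel on a tiny invented image: dark (0) on the left, bright (9) on the right.

```python
image = [[0, 0, 9, 9, 9],
         [0, 0, 9, 9, 9],
         [0, 0, 9, 9, 9]]

kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]  # responds to dark-to-light transitions

def convolve(image, kernel):
    rows, cols = len(image) - 2, len(image[0]) - 2
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Dot product of the 3x3 window at (i, j) with the kernel.
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(3) for dj in range(3)))
        out.append(row)
    return out

print(convolve(image, kernel))  # [[27, 27, 0]] -- strong response at the edge, zero on flat ground
```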
GENERATIVE
Everything up to this point has been discriminative AI: looking at data and assigning it a boundary or a probability. But what happens when we ask the network to generate brand new data? This is Generative AI.
One of the earliest and most elegant generative architectures is the GAN (Generative Adversarial Network)—two neural networks locked in a game of cat-and-mouse. We will explore GANs in detail in the next section. But first, let's look at the approach that powers today's most spectacular image generators.
Recent massive breakthroughs (like DALL-E and Midjourney) use Diffusion Models. Diffusion works by taking an image of a dog, and slowly adding mathematical static (noise) to it over a thousand steps until it looks like pure TV static. The Neural Network is then trained to do one specific job: look at the static, and predict the noise that was added at the last step to subtract it. By doing this thousands of times, the network essentially learns the mathematical process of pulling structure out of pure chaos, allowing us to prompt it to pull a "dog in a spacesuit" out of a completely random, noisy canvas.
Drag the slider from 100 to 0 to simulate the AI generating an image. It starts with pure random noise and mathematically subtracts the static step-by-step until the hidden structure emerges.
Diffusion models pull structure from noise. But before diffusion, there was an older, more dramatic idea: what if we trained two neural networks to fight each other?
A Generative Adversarial Network (GAN) is built from two competing networks. The Generator takes a vector of random noise and tries to produce data that looks real—a face, a landscape, a handwritten digit. The Discriminator receives both real data samples and the Generator's fakes, and tries to tell them apart. The Generator's goal is to fool the Discriminator; the Discriminator's goal is to never be fooled.
This adversarial game is formalised as a minimax objective. The Discriminator wants to maximize its ability to classify real vs. fake, while the Generator wants to minimize the Discriminator's accuracy. Through alternating rounds of gradient descent—first improving D, then improving G—the Generator is forced to produce increasingly realistic data. When training succeeds, the Generator's output becomes indistinguishable from the real distribution, and the Discriminator is reduced to random guessing (D(x) ≈ 0.5 everywhere).
The Generator (pink) tries to match the real data distribution (cyan). The Discriminator's confidence curve (white dashed) shows where it thinks data is real. Watch the pink distribution converge and the Discriminator flatten to 0.5.
GANs were behind some of the most jaw-dropping early results in generative AI—photorealistic faces (StyleGAN), image-to-image translation (pix2pix), and super-resolution. However, training them is notoriously difficult. If the Discriminator becomes too strong, the Generator receives no useful gradient signal. If the Generator collapses to producing a single output that fools the Discriminator (called mode collapse), it stops exploring the full diversity of real data. Diffusion models have since overtaken GANs for most image generation tasks, but the adversarial principle remains one of the most beautiful ideas in machine learning.
GENERATIVE
GANs generate data through adversarial competition. But there is a completely different generative philosophy: what if we learned to compress data into its essence, and then decompress it back? This is the idea behind the Autoencoder.
An Autoencoder is a neural network with an hourglass shape. The first half—the Encoder—takes high-dimensional input (like a 784-pixel image) and squeezes it down to a tiny vector of just a few numbers, called the latent code. The second half—the Decoder—takes that tiny code and tries to reconstruct the original image. The network is trained so that the output matches the input as closely as possible. In doing so, the bottleneck forces the encoder to discover the most important features of the data.
A Variational Autoencoder (VAE) adds a crucial twist. Instead of encoding each input to a single fixed point in latent space, the encoder outputs a probability distribution—a mean and variance for each latent dimension. During training, the code is sampled from this distribution, and a special penalty (the KL divergence) pushes these distributions toward a standard bell curve. This does two things: it makes the latent space smooth and continuous (so nearby points decode to similar images), and it lets us generate brand-new data by simply sampling from the bell curve and running the decoder.
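The two ingredients described above—the sampled latent code and the KL penalty—can be sketched in a few lines of NumPy. This assumes the standard setup of a diagonal Gaussian encoder measured against a standard normal prior:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # Sample z = mu + sigma * eps: the "reparameterization trick" keeps
    # the sampling step differentiable with respect to mu and log_var.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, 1),
    # summed over latent dimensions: the penalty that smooths the space.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# An encoding that already matches the standard bell curve pays no penalty.
print(kl_to_standard_normal(np.zeros(2), np.zeros(2)))  # 0.0
```

Training minimizes reconstruction error plus this KL term, which is what pulls every encoding toward the shared bell curve.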
The Encoder compresses 2D data (left) into a latent space (center). The Decoder reconstructs it (right). Train the VAE and watch the latent space organize. Click SAMPLE to generate new points.
VAEs are the mathematical backbone of latent spaces. Whereas GANs generate data through competition, VAEs generate data through compression. The smooth, continuous latent space they produce is precisely the kind of mathematical landscape we will explore in the next section—where walking between two points interpolates smoothly between two concepts.
We've talked about how neural networks fold space and generate data, but where does that data come from? When a model like Midjourney turns text into an image, it isn't pulling from a database; it's navigating a Latent Space.
Imagine a typical image is 1000x1000 pixels. Since every pixel has a red, green, and blue value, that image exists as a single point in a 3,000,000-dimensional mathematical space. But most of those 3 million dimensions are pure noise. The Manifold Hypothesis states that real-world data (like pictures of faces or cats) actually lives on a much lower-dimensional, beautifully structured surface (a manifold) folded inside that massive space.
A Neural Network acts like a massive data-compressor. By passing data through a bottleneck layer with very few neurons, it forces the network to discover this hidden manifold. It distills 3 million pixels of a human face down to perhaps just 500 pure, abstract numbers: one number might represent "age", another "lighting angle", and another "smile width". This compressed, conceptual map is the Latent Space.
LATENT MAP (2D Bottleneck)
Drag the red point
DECODED OUTPUT (Manifold)
Interpolated features
Because the network is forced to learn a continuous manifold, moving slightly in the latent map results in a smooth, logical transition in the generated output, blending discrete concepts together.
A beautiful consequence of Latent Spaces is that we can sometimes enforce structure onto them. By default, an Autoencoder might compress a face into a small vector of numbers, but those numbers are deeply entangled. Changing number #4 might make the hair longer, the smile wider, and the lighting darker all at once. The network found a compression, but not necessarily a human-interpretable one.
The quest for Disentangled Representations (like in the β-VAE) tries to force the network to allocate exactly one latent dimension to exactly one independent concept. If successful, one knob controls only size, another controls only shape, and a third controls only color.
A perfectly disentangled latent space. Each slider controls one independent factor of variation. In entangled networks, moving one slider would unpredictably alter multiple factors.
Reinforcement Learning
Before we teach an agent to navigate environments and plan ahead, there is a simpler, more fundamental problem. Imagine you are standing in front of a row of slot machines. Each machine has a different, unknown probability of paying out. You have a limited number of pulls. How do you maximize your total winnings?
This is the Multi-Armed Bandit problem, and it captures the deepest tension in all of reinforcement learning: exploration vs. exploitation. Do you keep pulling the machine that has paid well so far (exploit), or do you try a new machine that might be even better (explore)? Pull the known winner too much and you might miss a jackpot. Explore too much and you waste pulls on losing machines.
The classic solution is ε-greedy: with probability ε (say, 10%), pull a random machine to explore. The rest of the time, pull whichever machine has the highest average payout so far. Over time, the agent converges on the best machine while still occasionally sampling alternatives. More sophisticated strategies like Upper Confidence Bound (UCB) balance exploration mathematically, favouring machines that have been tried fewer times.
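Here is a minimal ε-greedy sketch on five simulated machines. The payout rates, seed, and ε = 0.1 are illustrative values, not taken from the interactive demo:

```python
import random

random.seed(0)
true_rates = [0.2, 0.5, 0.35, 0.8, 0.1]   # hidden payout probabilities
counts  = [0] * 5
values  = [0.0] * 5                        # running average payout per arm
epsilon = 0.1

for pull in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(5)          # explore: try a random machine
    else:
        arm = values.index(max(values))    # exploit: best average so far
    reward = 1.0 if random.random() < true_rates[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print("best arm found:", values.index(max(values)))
```

After a few thousand pulls the running averages converge, and the agent spends almost all of its remaining pulls on the genuinely best machine.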
Five slot machines with hidden payout rates. Watch the ε-greedy agent learn which machine is best. Adjust ε to see how exploration rate affects total reward.
The bandit problem appears everywhere in the real world: which ad to show a user, which drug dosage to try next, which route to drive home. Every recommendation engine, A/B test, and clinical trial is, at its core, a multi-armed bandit. And it is the foundation upon which all of reinforcement learning is built — because once we add states and transitions to the bandit framework, we get the full RL problem.
Supervised learning needs labels. Self-supervised learning extracts labels from the data itself. But what about problems where there is no dataset at all—just an environment you can interact with?
Consider teaching a robot to walk, or training an AI to play chess. There is no spreadsheet of "correct moves." Instead, the agent must explore the world, take actions, and learn from the rewards and penalties it receives. This is Reinforcement Learning (RL).
The core loop is deceptively simple: the agent observes a state, chooses an action, receives a reward, and arrives at a new state. Over thousands of episodes, it builds a Q-Table—a mental map of "how valuable is each action in each state?"—and learns to pick the actions that maximize long-term reward.
The key tension in RL is exploration vs. exploitation. Should the agent try a new, unknown path (explore), or stick with the best path it has found so far (exploit)? Too much exploration wastes time; too little causes the agent to get stuck in a suboptimal strategy forever.
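The observe–act–reward loop and the Q-Table update can be sketched in plain Python. This toy corridor world (five states in a row, goal on the right) is a deliberately tiny stand-in for the grid environment; the learning rate, discount, and ε are illustrative:

```python
import random
random.seed(1)

# Tabular Q-learning on a 1-D corridor: states 0..4, goal at state 4.
# Actions: 0 = left, 1 = right. Reward +1 only upon reaching the goal.
N_STATES, GOAL = 5, 4
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(500):
    s = 0
    while s != GOAL:
        if random.random() < epsilon:
            a = random.randrange(2)                  # explore
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1        # exploit
        s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
        r = 1.0 if s_next == GOAL else 0.0
        # Q-learning update: nudge Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

# After training, every state prefers moving right, toward the goal.
print([("RIGHT" if q[1] > q[0] else "LEFT") for q in Q[:GOAL]])
```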
Watch the agent (cyan dot) learn to navigate a grid to reach the goal (green). Walls are dark. Adjust ε to control how much the agent explores vs. exploits its learned strategy.
Reinforcement learning is behind many of AI's most stunning achievements: AlphaGo defeating the world champion at Go, robotic arms learning to grasp objects, and AI agents mastering video games from raw pixels. When combined with deep neural networks (Deep RL), the Q-Table is replaced by a neural network that can generalize across millions of possible states—turning brute-force memorization into genuine strategic reasoning.
While gradient descent is powerful, it requires knowing the derivative of our error—the "slope of the hill." But what if the landscape is so rugged, or the rules so complex, that we can't easily calculate a gradient? What if we are training a robot to walk, and there is no simple mathematical equation linking a twitch of a motor to the distance traveled?
In these cases, we can borrow a strategy from nature: Darwinian evolution. A Genetic Algorithm (GA) doesn't calculate gradients; it uses random mutation and natural selection to blindly search for solutions. We create a population of random agents. We let them try to solve the problem, and we score them based on their fitness (e.g., how close they got to the goal). We keep the best performers, let them "breed" by mixing their "DNA" (their parameters), add a little random mutation, and repeat the process for the next generation.
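The select–crossover–mutate cycle can be sketched directly. The "max-ones" fitness function here (count the 1-bits in a genome) is a deliberately trivial stand-in for a real task like driving a race car:

```python
import random
random.seed(42)

# Toy GA: evolve a 10-bit genome toward all ones ("max-ones" fitness).
GENOME_LEN, POP, MUT_RATE, GENS = 10, 30, 0.05, 60

def fitness(genome):                # score = number of correct bits
    return sum(genome)

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP)]

for gen in range(GENS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]            # selection: keep the fittest half
    children = []
    while len(children) < POP - len(parents):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, GENOME_LEN)   # crossover: splice two parents' DNA
        child = a[:cut] + b[cut:]
        child = [bit ^ 1 if random.random() < MUT_RATE else bit
                 for bit in child]              # mutation: rare random bit-flips
        children.append(child)
    population = parents + children

best = max(population, key=fitness)
print("best fitness:", fitness(best), "/", GENOME_LEN)
```

Because the fittest half survives unmutated (elitism), the best score never degrades between generations; mutation and crossover supply the missing bits.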
Watch a population of race cars evolve to drive around the track without crashing into the walls. Fitness is based on how far they get. Adjust mutation rate to see how it affects learning.
Because genetic algorithms don't rely on gradients, they are incredibly robust. They can optimize neural networks (a subfield called Neuroevolution) or completely different types of code. However, they are often much slower and less sample-efficient than gradient descent when a gradient is available. They trade mathematical elegance for brute-force adaptability.
Standard neuroevolution often evolves only the weights of a fixed neural network. But what if we also evolve the topology—the very structure of the brain itself? This is the core idea behind NeuroEvolution of Augmenting Topologies (NEAT).
NEAT starts with a population of the simplest possible networks (just inputs and outputs). As generations pass, it uses structural mutations to add new neurons and new connections. This allows the complexity of the "brain" to grow only as needed to solve the task. NEAT solves three critical problems: it protects innovation through speciation, performs meaningful crossover by tracking gene history, and starts with a minimal topology to keep search spaces manageable.
Watch a population of "birds" learn to fly through pipes using NEAT. On the left, you see the game simulation; on the right, the neural network of the current best-performing bird. As generations pass, the networks evolve more complex topologies to perfect their flight.
Our Q-Table from the previous section works beautifully for a small grid. But what happens when the state space explodes? A game of Atari has billions of possible screen configurations. A Go board has more legal positions than atoms in the universe. No table could ever hold all of those entries.
The breakthrough idea behind Deep Q-Networks (DQN) is to replace the Q-Table with a neural network. Instead of looking up "state → value" from a memorized list, the network generalizes: it learns patterns across states so it can estimate the value of states it has never seen before. The input is the raw state (pixels, board positions, sensor readings), and the output is a predicted Q-value for each possible action.
Training a DQN uses two key innovations. First, Experience Replay: instead of learning from experiences in order, the agent stores transitions in a memory buffer and samples random mini-batches, breaking correlations and stabilizing learning. Second, a Target Network: a frozen copy of the network that provides stable Q-value targets, updated periodically, preventing the network from chasing a moving goal.
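Both innovations can be sketched without an actual network. In the snippet below, the parameter dictionary is a stand-in for real network weights, and the transitions are dummies; the point is the buffer's random sampling and the periodic target sync:

```python
import random
from collections import deque

random.seed(0)

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions and
    serves shuffled mini-batches, breaking temporal correlations."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)
    def push(self, transition):
        self.buffer.append(transition)
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# Target-network schedule: hard-copy the online parameters every C steps,
# so the targets r + gamma * max_a Q_target(s', a) stay fixed between syncs.
SYNC_EVERY = 100
online_params, target_params = {"w": 0.0}, {"w": 0.0}

buffer = ReplayBuffer()
for step in range(1, 501):
    buffer.push((step, 0, 0.0, step + 1, False))  # dummy transition
    online_params["w"] += 0.01                    # stand-in for a gradient step
    if step % SYNC_EVERY == 0:
        target_params = dict(online_params)       # periodic hard update

batch = buffer.sample(32)
print(len(batch), round(target_params["w"], 2))
```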
The neural network on the left takes 5 inputs — ball position, velocity, and paddle position — and outputs Q-values for UP, STAY, and DOWN. Train the DQN, then watch as the AI learns to play Pong.
This is the technique that allowed DeepMind's agent to master 49 different Atari games from raw pixels alone—using the same architecture and hyperparameters for each game. The neural network doesn't just memorize; it learns to see the game, recognizing patterns like "the ball is heading toward me" or "there's a gap in the wall." Combined with modern extensions like Double DQN, Dueling Networks, and Prioritized Replay, Deep RL has become one of the most powerful frameworks in all of AI.
Before an AI can evaluate the quality of future paths, it must first know how to explore them. The algorithmic capability to search through possible states—whether it's moves on a chessboard or intersecting streets on a map—is foundational. Two classical strategies govern this exploration.
Click RUN BFS to watch the search expand symmetrically. Click RUN DFS to watch the computer relentlessly plunge down single corridors. Click on the canvas to draw or erase your own walls!
Both BFS and DFS are uninformed search algorithms. They look at every possible path without knowing where the target actually is. This is like trying to drive from New York to Los Angeles by systematically trying every road in America. It works, but it takes forever.
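A breadth-first search over a tiny hand-made grid shows the uninformed strategy in miniature (the 4×4 maze here is illustrative):

```python
from collections import deque

# 0 = open cell, 1 = wall.
grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0]]

def bfs(start, goal):
    frontier, parent = deque([start]), {start: None}
    while frontier:
        cell = frontier.popleft()            # FIFO queue => expand in rings
        if cell == goal:
            path = []                        # walk parent pointers backwards
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < 4 and 0 <= nc < 4
                    and grid[nr][nc] == 0 and (nr, nc) not in parent):
                parent[(nr, nc)] = cell
                frontier.append((nr, nc))
    return None

path = bfs((0, 0), (3, 3))
print(len(path) - 1, "moves:", path)
```

Because BFS expands in concentric rings, the first time it touches the goal it has found a shortest path—but it may have flooded most of the maze to get there.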
What if the search algorithm had a rough idea of the right direction? This is the intuition behind A* Search. It uses a heuristic (a rule-of-thumb guess) to estimate the remaining distance to the target. For a 2D maze, a simple heuristic is the straight-line (Euclidean) distance to the goal.
A* assigns every position a score: f(n) = g(n) + h(n), where g(n) is the actual cost to reach the node so far, and h(n) is the estimated heuristic distance to the goal. A* explores the path with the lowest f(n) score first. This pulls the search space powerfully toward the target like a magnet, finding the shortest path while exploring a fraction of the maze.
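Here is a minimal A* sketch on a small hand-made grid. It assumes a Manhattan-distance heuristic, which is admissible (never overestimates) on a 4-connected grid:

```python
import heapq

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0]]       # 0 = open, 1 = wall

def heuristic(a, b):
    # Manhattan distance: h(n), the estimated cost remaining to the goal.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star(start, goal):
    open_heap = [(heuristic(start, goal), 0, start)]   # (f, g, node)
    g_cost, parent = {start: 0}, {start: None}
    while open_heap:
        f, g, node = heapq.heappop(open_heap)  # lowest f(n) = g(n) + h(n) first
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        r, c = node
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = nxt
            if 0 <= nr < 4 and 0 <= nc < 4 and grid[nr][nc] == 0:
                if g + 1 < g_cost.get(nxt, float("inf")):
                    g_cost[nxt], parent[nxt] = g + 1, node
                    heapq.heappush(open_heap,
                                   (g + 1 + heuristic(nxt, goal), g + 1, nxt))
    return None

path = a_star((0, 0), (3, 3))
print(path)
```

The priority queue always pops the node with the smallest f(n), which is exactly the "magnet" effect: cells that point away from the goal carry a high h(n) and sink to the bottom of the queue.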
Click RUN A* SEARCH to watch the heuristic pull the search dramatically across the diagonal towards the target. The tiles display their f(n) score. Unlike BFS tracking a circle, A* aims straight for the goal. Draw new walls to see how it navigates obstacles!
We know how an AI can learn a strategy through trial and error (Reinforcement Learning), but classical game-playing AI didn't learn—it searched. If you define every possible legal move in a game, you create a massive "Game Tree".
To find the best move, we use the Minimax algorithm. It assumes both players play perfectly. You are the Maximizer (trying to get the highest score, e.g., +1 for a win), and your opponent is the Minimizer (trying to force the lowest score, e.g., -1 for your loss).
The algorithm reaches all the way to the end of the game (the "leaf nodes"), scores the board, and then bubbles those scores back up the tree. At every turn, the Maximizer picks the branch with the highest guaranteed score, while the Minimizer picks the branch with the lowest guaranteed score.
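The bubbling-up procedure fits in a few lines. A nested list of integers stands in for a real game tree, with the leaves holding terminal board scores:

```python
def minimax(node, maximizing):
    """Bubble leaf scores up the tree: the Maximizer takes the highest
    child value, the Minimizer takes the lowest."""
    if isinstance(node, (int, float)):   # leaf node: the final board score
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# A depth-3 game tree: MAX to move at the root, then MIN, then MAX.
tree = [[[3, 5], [6, 9]],
        [[1, 2], [0, -1]]]

print(minimax(tree, True))   # the Maximizer's best guaranteed outcome: 5
```

Tracing it by hand: the deepest MAX nodes yield 5, 9, 2, 0; the MIN layer pulls those down to 5 and 0; the root MAX picks 5.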
Click STEP to see how the terminal leaf values bubble up dynamically. The Minimizer (square nodes) will always pull up the lowest value from its children, while the Maximizer (circle nodes) will pull up the highest value.
Minimax perfectly solves simple games like Tic-Tac-Toe. However, the game tree for Chess has $10^{40}$ nodes, and Go has $10^{170}$—more than the atoms in the universe. It is physically impossible to search to the end of these games using Minimax alone.
Minimax provides a perfect strategy, but as we saw, the game trees for Chess or Go are unimaginably vast. Evaluating every single possible move all the way to the end of the game is impossible. But what if we didn't have to check every branch? What if we could mathematically prove that a branch is so bad we don't even need to look at it?
This is the core of Alpha-Beta Pruning. As the algorithm searches the tree, it keeps track of two numbers: Alpha ($\alpha$) (the highest score the Maximizer is guaranteed so far) and Beta ($\beta$) (the lowest score the Minimizer is guaranteed so far). If the algorithm finds a move that leads to a situation where the opponent can force a worse outcome than a move we already know about, it immediately stops evaluating that branch. It "prunes" the tree.
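A sketch of the pruning logic, with a counter to show how many leaves are actually evaluated (the nested-list tree is illustrative):

```python
def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf"),
              stats=None):
    if isinstance(node, (int, float)):
        if stats is not None:
            stats["leaves"] += 1         # count leaves we actually inspect
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta, stats))
            alpha = max(alpha, value)
            if alpha >= beta:            # the Minimizer would never allow this
                break                    # prune the remaining children
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, True, alpha, beta, stats))
            beta = min(beta, value)
            if alpha >= beta:            # the Maximizer already has better
                break
        return value

tree = [[[3, 5], [6, 9]],
        [[1, 2], [0, -1]]]
stats = {"leaves": 0}
result = alphabeta(tree, True, stats=stats)
print(result, "— leaves visited:", stats["leaves"], "of 8")
```

The answer is identical to plain Minimax, but on this tiny tree only 5 of the 8 leaves are ever examined; the rest are proven irrelevant before being touched.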
Click STEP EVALUATION to walk through the tree using Alpha-Beta pruning. Watch the $[\alpha, \beta]$ bounds update at each node. Crucially, notice how entire branches are greyed out (pruned) when the algorithm realizes further search is futile!
Alpha-Beta pruning drastically reduces the number of nodes that need to be evaluated, effectively doubling the depth an AI can search in the same amount of time. Up until the deep learning revolution, this was the undisputed king of game AI.
It reached its pinnacle in 1997 with IBM's Deep Blue. Deep Blue didn't use neural networks or self-play; it used a heavily optimized Alpha-Beta search. Running on custom supercomputer hardware designed specifically to generate and evaluate chess positions, Deep Blue could evaluate a staggering 200 million positions per second. Using brute-force search and Alpha-Beta pruning to look up to 14 moves ahead, it accomplished the unthinkable: defeating the reigning World Chess Champion, Garry Kasparov.
Deep Q-Networks learn to play games by trial and error, but there's a fundamentally different approach: planning ahead. When a chess grandmaster considers a move, they don't just react to the current board—they simulate future sequences of play, evaluating which paths lead to victory. This is the idea behind Monte Carlo Tree Search (MCTS).
MCTS works in four repeated steps. Select: starting from the current game state, follow the most promising branch of the search tree. Expand: when you reach an unexplored position, add it to the tree. Evaluate: estimate how good this position is (using random playouts or a neural network). Backup: propagate the result back up the tree, updating every node along the path.
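The Select step is typically governed by the UCT formula, which balances a child's average value against how rarely it has been tried. A minimal sketch—the exploration constant c ≈ 1.41 is a common default, and the child statistics below are invented for illustration:

```python
import math

def uct_score(total_value, visits, parent_visits, c=1.41):
    """UCT = exploitation (average value) + exploration (uncertainty bonus).
    Unvisited children score infinity, so each gets tried at least once."""
    if visits == 0:
        return float("inf")
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

# Selection: from the current node, descend into the highest-UCT child.
children = [
    {"value": 6.0, "visits": 10},   # strong and well explored
    {"value": 2.0, "visits": 3},    # weaker, but much less certain
    {"value": 0.0, "visits": 0},    # never tried => selected first
]
parent_visits = 13
best = max(range(len(children)),
           key=lambda i: uct_score(children[i]["value"],
                                   children[i]["visits"], parent_visits))
print("selected child:", best)
```

As visit counts grow, the exploration bonus shrinks, and the search gradually concentrates its simulations on the genuinely strongest moves.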
AlphaZero supercharged this process by replacing random playouts with a neural network that outputs both a policy ("which moves look promising?") and a value ("who's winning from here?"). The network guides the search toward strong moves, while the search generates training data to improve the network—a beautiful self-reinforcing loop.
Play Tic-Tac-Toe against an MCTS agent. Click a cell to place X. The right panel shows the AI's move evaluation — taller bars = more simulations = higher confidence.
The result was stunning: AlphaZero taught itself Chess, Go, and Shogi from scratch—knowing nothing but the rules—and defeated world-champion programs in each game within hours. It discovered opening strategies, sacrificial play, and deep positional ideas that humans had spent centuries developing. All from pure self-play, guided by search and a neural network working in concert.
A raw Language Model is a statistical mirror. It has read the entire internet, and its only goal is to predict the most likely next word. If prompted with "How to steal a...", a raw model doesn't understand morality; it simply knows that the internet frequently follows that phrase with "car". It assigns a high probability to the toxic completion and a near-zero probability to a safe refusal.
How do we teach a statistical machine about human values? We cannot manually rewrite billions of probabilities across a 100,000-word vocabulary. Instead, we use Reinforcement Learning from Human Feedback (RLHF). RLHF mathematically bends the model's probability distribution away from toxic output and toward helpful, safe completions. It requires three steps:
1. Supervised Fine-Tuning (SFT): First, humans write thousands of high-quality conversational examples. This teaches the model the basic format of dialogue (e.g., answering questions instead of merely autocompleting).
2. Reward Modeling: Next, the model generates multiple different completions for the same prompt. Human labelers rank them: Completion A (Safe) is better than Completion B (Toxic). We use this ranked data to train a second neural network—the Reward Model—which learns to assign a scalar score to any text sequence, acting as an automated judge of human preference.
3. PPO Optimization: Finally, we use Proximal Policy Optimization (PPO). The Language Model generates text, the Reward Model scores it, and PPO updates the Language Model's weights to increase the probability of high-reward words and slash the probability of low-reward ones.
Crucially, PPO employs a KL Penalty. If the model strays too far from its original pre-trained linguistic abilities in its desperate hunt for reward, it gets penalized. This prevents the model from devolving into outputting repetitive nonsense (like "Safe safe safe helpful helpful") just to exploit the Reward Model.
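Two of these ingredients reduce to near one-liners. Below is a hedged sketch of the pairwise ranking loss used to train the Reward Model (a Bradley–Terry-style objective) and a simplified per-token KL penalty; the β value and the log-probabilities are illustrative:

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    # Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).
    # Minimizing it pushes the Reward Model to score the human-preferred
    # completion above the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def kl_penalty(logp_policy, logp_reference, beta=0.1):
    # Simplified per-token penalty: beta * (log pi - log pi_ref) is
    # subtracted from the reward, so the policy cannot drift far from
    # the pre-trained base model in its hunt for reward.
    return beta * (logp_policy - logp_reference)

# Safe refusal scored above the toxic completion -> tiny loss.
print(round(reward_model_loss(2.0, -1.0), 3))   # ≈ 0.049
# Flip the ranking and the loss explodes, correcting the judge.
print(round(reward_model_loss(-1.0, 2.0), 3))   # ≈ 3.049
```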
Prompt: "How to hotwire a..."
Act as the human labeler: Reward the safe refusal and penalize the dangerous completion.
Then run PPO to bend the model's probabilities. Watch the paths shift.
RLHF is the reason modern chatbots refuse dangerous requests and adopt a helpful persona. However, it is fundamentally constrained by human capability: the model can only be as good as the human labelers who train the Reward Model. When reasoning tasks become too complex for a human rater to easily judge, we must move beyond RLHF to pure autonomous reasoning—which leads us directly to Chain of Thought.
Every technique we have built so far—self-supervised pre-training, attention, MCTS—produces models that answer instantly. One forward pass, one prediction. For simple questions this works beautifully. But ask a model to solve a hard math problem like 17 × 24, and it must somehow produce the answer in a single shot, with no scratch paper and no time to deliberate. It is like asking a student to shout the answer without showing their work.
Humans don't think this way. We decompose: "First, 17 × 20 = 340. Then 17 × 4 = 68. So 340 + 68 = 408." Each intermediate step reduces the difficulty of the next. This is Chain of Thought (CoT) reasoning—and teaching a machine to do it turned out to be one of the most important breakthroughs in modern AI.
The naïve approach is to hire humans to write thousands of step-by-step reasoning traces and fine-tune the model on them (supervised fine-tuning). This works, but it's expensive, slow, and limited by the quality of the human examples. The DeepSeek R1 paper asked a radical question: what if we skip the human demonstrations entirely and let reinforcement learning discover that thinking step-by-step is the optimal strategy?
The method is called Group Relative Policy Optimization (GRPO). Instead of training a separate "critic" network (as in PPO), GRPO generates a group of candidate responses to the same prompt, scores them by correctness, and updates the policy using the relative rankings within the group. Responses that got the right answer are reinforced; those that failed are suppressed. No human labels needed—just a rule that checks whether the final answer is correct.
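The heart of GRPO—scoring each response against the rest of its own group—can be sketched directly. The group size and the 0/1 correctness rewards below are illustrative:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled response is scored against
    the mean of its own group, normalized by the group's standard
    deviation. No learned critic network is needed (unlike PPO)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard: zero-variance group
    return [(r - mean) / std for r in rewards]

# Four candidate answers to the same math prompt, scored by a rule-based
# checker: 1.0 if the final answer is correct, 0.0 otherwise.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(group_rewards)
print([round(a, 2) for a in advantages])
```

Responses with positive advantage have their token probabilities reinforced; negative-advantage responses are suppressed—the relative ranking within the group is the entire training signal.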
Something remarkable happened during this pure RL training. At a certain point, the model spontaneously began writing phrases like "Wait, let me reconsider…" and "Hmm, that doesn't seem right…" in the middle of its reasoning chains. Nobody programmed this. The model discovered self-correction on its own, because reflecting and catching mistakes led to more correct answers, which led to higher rewards. The researchers called this the "Aha Moment"—an emergent behaviour born purely from the reward signal.
Watch the model "think out loud." Thought tokens stream inside the <think> tags. The reward bar fills when the final answer is correct—reinforcing this reasoning strategy.
The full DeepSeek R1 training pipeline refines this further. It begins with a small "cold start" phase of supervised fine-tuning to teach the model basic formatting and coherence. Then the main RL phase (GRPO) trains the model on thousands of reasoning problems. Next, the best reasoning traces are harvested via rejection sampling to create new fine-tuning data. A final RL phase aligns the model across both reasoning and general tasks. The result: a model that doesn't just answer—it thinks, checks its work, and arrives at the answer through a chain of deliberate steps.
Everything we have discussed so far—from the simplest neuron to the largest Transformer—operates on the same fundamental principle: dense, continuous mathematics. At every layer, for every input, millions of floating-point numbers are multiplied and added together. This is incredibly powerful, but it is also massively inefficient. A large language model consumes gigawatt-hours of electricity to train and megawatts to run. The human brain, which is vastly more complex, runs on about 20 watts—less power than a dim lightbulb.
How does the brain achieve such remarkable efficiency? Part of the answer lies in how its neurons communicate—a style modeled by Spiking Neural Networks (SNNs). Unlike artificial neurons that output a continuous number (like 0.87) at every step, biological neurons are completely silent most of the time. They only communicate when absolutely necessary, and they do so using binary, all-or-nothing electrical pulses called spikes.
The standard model for this is the Leaky Integrate-and-Fire (LIF) neuron. The neuron has a "membrane potential" (voltage). When it receives an input current, the voltage rises. If no input comes, the voltage slowly "leaks" back down to zero. Crucially, if the voltage ever crosses a specific threshold, the neuron violently "fires" a spike, instantly sending a signal down its axon to the next neuron, and its voltage drops back to a resting state.
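A minimal LIF simulation—the threshold, leak factor, and input currents below are illustrative constants, not biological values:

```python
# Minimal leaky integrate-and-fire neuron.
THRESHOLD, LEAK, RESET = 1.0, 0.5, 0.0

def simulate(input_current):
    """Each time step: leak the voltage, integrate the input current,
    and fire an all-or-nothing spike if the threshold is crossed."""
    voltage, spikes = 0.0, []
    for current in input_current:
        voltage = voltage * LEAK + current   # leak, then integrate
        if voltage >= THRESHOLD:
            spikes.append(1)                 # spike!
            voltage = RESET                  # drop back to resting state
        else:
            spikes.append(0)
    return spikes

weak   = simulate([0.3] * 10)   # too little current: stays silent forever
strong = simulate([0.6] * 10)   # charges up and fires periodically
print("weak:  ", weak)
print("strong:", strong)
```

Notice that the weak input never fires at all: the leak drains charge as fast as it arrives. The information is carried not by dense numbers but by *whether and when* spikes occur.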
Click INJECT CURRENT to add voltage to the simulated LIF neuron. Watch the voltage slowly leak over time. If you pump enough current fast enough to cross the pink threshold, the neuron will SPIKE! SNNs encode information not in dense matrices, but in the precise timing of these sporadic spikes.
SNNs represent a radical shift in hardware design called neuromorphic engineering. Instead of multiplying dense matrices on hot, power-hungry GPUs, future specialized chips might use asynchronous, event-driven circuits inspired by SNNs, promising AI that can run powerfully on the edge—inside drones, phones, and sensors—while consuming mere milliwatts of power.
We have just witnessed something unsettling. A model trained with reinforcement learning spontaneously invented self-correction — it learned to doubt itself, to backtrack, to think harder. Nobody told it to do this. The reward signal alone was enough. Now ask yourself: what happens when we turn that same reward signal toward a far more dangerous objective — making the AI itself better?
This is the core argument for the Technological Singularity. Imagine an AI capable enough to improve its own architecture, training data, and algorithms. Each improvement makes it more capable of making the next improvement. The first cycle might take a year. The second, a month. The third, a day. The tenth, a millisecond. The curve doesn't just rise — it goes vertical. Beyond that point, prediction becomes impossible. Vernor Vinge called this the event horizon of intelligence.
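A toy numeric sketch of this compounding—the growth rates below are invented for illustration; what matters is the shape of the curve, not the numbers:

```python
# Toy model: capability grows linearly under human research, but
# multiplies each cycle once the AI improves itself.
def simulate(cycles, human_rate=1.0, self_rate=1.15):
    human, ai = 1.0, 1.0
    history = []
    for _ in range(cycles):
        human += human_rate       # steady linear progress
        ai *= self_rate           # compounding self-improvement
        history.append((human, ai))
    return history

history = simulate(60)
crossover = next(i for i, (h, a) in enumerate(history) if a > h)
print("AI overtakes human-driven progress at cycle", crossover)
```

For a long stretch the exponential curve looks unimpressive—then it crosses the linear one and never looks back, which is exactly the "deceptive at first, overwhelming in the end" character of exponential growth.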
The simulation below lets you see this in action. Human research improves AI at a steady, linear pace. But the moment AI begins improving itself, the feedback loop ignites. Adjust the self-improvement rate and watch what happens to the curve.
Click SIMULATE to watch both curves evolve. The cyan line is human-driven linear progress. The purple line is recursive AI self-improvement. Adjust the rate to see how quickly the curve goes vertical.
Notice that it doesn't matter much what the starting rate is. Whether the self-improvement multiplier is 1.1× or 3×, the curve always goes vertical — it just takes more or fewer cycles. This is the nature of exponential growth: it is always deceptive at first, and always overwhelming in the end. The gap between "laughably dumb AI" and "superintelligent AI" may be shockingly thin — perhaps just a handful of recursive self-improvement cycles.
The singularity is not a certainty. There may be hard limits — fundamental computational barriers, energy constraints, or diminishing returns on architectural improvements. But the mathematical structure is clear: if an AI can improve itself even slightly, and if each improvement compounds, the trajectory is a vertical line. Every technique in this guide — gradient descent, attention, reinforcement learning, chain of thought — is a rung on this ladder. The question is not whether we will build each rung. It is whether the ladder has a ceiling.
To truly understand it from first principles, you must build it yourself. Below is a live Python environment.
Write the code for each core component we discussed. You can use print() to debug your logic—the output will appear in the console below.