The Best Things I Have Found on the Unreasonable Effectiveness of Neural Networks
It may simply be that I am in a very unusual position with respect to what I know & what I don't, but I found these three videos from Welch Labs to be incredibly enlightening about the unreasonable effectiveness of neural-network models…
Folding space: how back-propagation and ReLUs can actually learn to fit pieces of the world. Intuition is actually possible! Geometrically, neural nets work not by magic but by folding planes into shapes, again and again, at huge scale, in extraordinarily high numbers of dimensions. Adding up innumerable such shapes composes simple bends into complex functions, and back-propagation finds the functions that fit faster than it has any right to.
But our geometrical intuition is limited. Our low-dimensional brains misread the danger that a model might get “stuck” in a bad local minimum of the loss function, not knowing which way to move to get to a better result. In large numbers of dimensions, when you have hundreds of thousands of parameters or more to adjust, what emerges looks to us low-dimensional visualizers like “wormholes.” As gradient descent proceeds, the proper shift in the slice you see reveals nearby, better valleys in the loss function down which the model can move.
Your mileage may, and probably will, vary. But these visual intuitions click for me. And I can at least believe, even if not see, how things change when our vector spaces shift from three-dimensional to million-dimensional ones, in which almost all vectors chosen at random are very close to being at right angles to each other.
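To make that last claim concrete, here is a minimal sketch of my own (not from the videos; the dimensions and sample counts are arbitrary choices to keep it fast): draw pairs of random vectors and measure the angle between them. As the dimension grows, the typical angle concentrates ever more tightly around 90 degrees.

```python
# Minimal sketch (mine, not from the videos): the angle between two random
# vectors concentrates near 90 degrees as the dimension grows, because the
# typical cosine similarity shrinks like 1/sqrt(d).
import numpy as np

rng = np.random.default_rng(0)

for d in (3, 1_000, 100_000):
    # 100 pairs of standard-normal vectors in dimension d.
    a = rng.standard_normal((100, d))
    b = rng.standard_normal((100, d))
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    angles = np.degrees(np.arccos(cos))
    print(f"d = {d:>7,}: mean angle {angles.mean():6.2f} deg, "
          f"std {angles.std():5.2f} deg")

# At d = 3 the angles scatter widely around 90 degrees; at d = 100,000 they
# all sit within a fraction of a degree of a right angle.
```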
From Welch Labs: <https://www.youtube.com/@WelchLabsVideo/videos>. The first of these three “How Models Learn” videos (which was the third to be made) is the one I found most illuminating.
As I understand its major points:
The idea is to classify which locations are in Belgium and which are in Holland.
You do this by constructing a 3-D surface, in which the x-axis is longitude, the y-axis is latitude, and the z-axis is your confidence that the location is in Holland.
Anything you can do with an n-layer neural network you could do with a (much larger) single-hidden-layer network.
Each node can be thought of as (a) taking a plane, (b) making a fold in it, (c) bending the fold up to a greater or lesser degree, and then (d) shifting the resulting shape up or down.
Your final shape is the sum of all of these bent, folded, and shifted planes (see the sketch just after this list).
Stacking such layers composes many folds into a very rich piecewise-linear geometry.
And so you can see how, with enough nodes, extraordinary flexibility is possible even with a single hidden layer.
But existence ≠ trainability: even with 100,000 neurons in one hidden layer, gradient descent may fail to find a good fit, or the network may simply need far too many nodes to be practical.
Optimization beats existence: the universal-approximation property of a single hidden layer doesn’t get you to where the rubber meets the road.
Depth of the network compounds expressivity: repeatedly folding, scaling, and combining surfaces yields far more complex tilings of input space than a single wide layer.
Back-propagation has geometry: gradients shift fold lines and surface heights; learning is moving joints and planes to reduce loss.
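Here is the sketch promised above: a minimal numpy toy of my own (not the Welch Labs code). The “border” between Belgium and Holland is an invented zig-zag on the unit square, the network is one hidden layer of ReLU units, and plain gradient descent with hand-written back-propagation does the fitting. Each hidden unit is relu(w1·x + w2·y + b): a plane folded along a line, bent up on one side; the output is the shifted sum of those folded planes.

```python
# Minimal toy sketch (mine, not the Welch Labs code): one hidden layer of
# ReLU "folds" fit to an invented "is this point in Holland?" labeling.
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: points in the unit square, labeled 1 ("Holland") above a
# zig-zag border, 0 ("Belgium") below it.
X = rng.uniform(0.0, 1.0, size=(2000, 2))        # columns: longitude, latitude
border = 0.5 + 0.15 * np.sin(6.0 * X[:, 0])
y = (X[:, 1] > border).astype(float)

H = 32                                           # number of folded planes
W1 = rng.normal(0.0, 1.0, size=(2, H)); b1 = np.zeros(H)
W2 = rng.normal(0.0, 1.0, size=(H, 1)); b2 = np.zeros(1)

def forward(X):
    pre = X @ W1 + b1            # each column: a tilted plane over the map
    hid = np.maximum(pre, 0.0)   # the fold: flat on one side, a ramp on the other
    out = hid @ W2 + b2          # bend up or down, shift, and add
    return pre, hid, out[:, 0]

lr = 1.0
for step in range(5000):
    pre, hid, out = forward(X)
    p = 1.0 / (1.0 + np.exp(-out))        # squash the surface into a confidence
    grad_out = (p - y)[:, None] / len(X)  # cross-entropy gradient at the output
    # Back-propagation: gradients shift surface heights (W2, b2) and fold
    # lines (W1, b1) so the summed surface better matches the labels.
    gW2 = hid.T @ grad_out
    gb2 = grad_out.sum(axis=0)
    grad_pre = (grad_out @ W2.T) * (pre > 0)   # gradient flows only through the bent side
    gW1 = X.T @ grad_pre
    gb1 = grad_pre.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

_, _, out = forward(X)
accuracy = ((1.0 / (1.0 + np.exp(-out)) > 0.5) == (y > 0.5)).mean()
print(f"training accuracy with {H} folded planes: {accuracy:.3f}")
```

The exact accuracy will vary with the seed and the number of hidden units; the point is only that the gradient updates move fold lines and heights until the summed surface tracks the border.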
And here is the video:
<https://www.youtube.com/watch?v=qx7hirqgfuU&list=FLupRdJE0AjQUa3Ab-fo88NQ&index=13>
<https://www.youtube.com/watch?v=VkHfRKewkWw&t=1s>
From the second video, the one just above, I learned less that I could grasp and visualize, but still a lot. Notably:
Backpropagation is the workhorse of modern AI: a simple, scalable rule that updates millions to billions of parameters efficiently.
With two inputs (latitude/longitude), neurons become planes; the model learns which plane sits “on top” per region.
Simple linear models can’t carve intricate borders; they need depth and activation functions to capture complex partitions (see the sketch after this list).
The Belgium–Netherlands enclave map illustrates why naive linear boundaries fail and why architecture matters.
History lesson: early skepticism vastly underestimated how far this mathematically modest method could scale with data and compute.
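And here is a matching toy sketch of my own (again, not from the video) of why a single linear boundary cannot cope with an enclave: label points 1 inside a disc and 0 outside, fit plain logistic regression by gradient descent, and watch accuracy stay near chance no matter how long it trains.

```python
# Minimal sketch (mine, not from the video): a single linear decision
# boundary cannot carve out an enclave. Points inside a disc (a stand-in
# for an enclave) are labeled 1, points outside 0.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(4000, 2))
y = (np.linalg.norm(X, axis=1) < 0.8).astype(float)   # ~50% of points are "inside"

w = np.zeros(2); b = 0.0
for step in range(5000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # a single tilted plane, squashed
    g = (p - y) / len(X)                     # cross-entropy gradient
    w -= 1.0 * (X.T @ g)
    b -= 1.0 * g.sum()

acc = ((p > 0.5) == (y > 0.5)).mean()
print(f"linear model accuracy on the enclave toy problem: {acc:.3f}")
# Expect roughly 0.5-0.6: the best a single plane can do is guess one side.
# The small ReLU network sketched earlier handles exactly this kind of border.
```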
<https://www.youtube.com/watch?v=NrO20Jb-hy0>
And the third video above (the first to be made) is still very much worth watching. Specifically:
3-D visualization is useful for intuition at small scale, but math (gradients, chain rule) is what actually handles the dimensionality.
Gradient descent is the core learning rule, but our usual “downhill on a landscape” picture misleads for huge models.
Loss landscapes for LLMs are effectively astronomically high-dimensional, and parameters are tightly coupled; you need the gradient, which provides a local “compass” showing how the loss-function value varies in all dimensions at once.
There is a powerful “wormhole” effect when we try to visualize this with our limited 3-D brains: after a step in the full-dimensional space, a better region can appear to materialize out of nowhere in our low-dimensional view.
Thus local minima aren’t the showstopper they were once feared to be; in very high dimensions, getting stuck in every direction at once is unlikely (see the back-of-the-envelope sketch after this list).
Backprop + gradient descent turns out to be a much more broadly applicable engine than a reasonable person would believe possible ex ante.
The Bitter Lesson: effective learning emerges from simple rules applied at scale, not from clever analytical tricks.
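The back-of-the-envelope sketch promised above: a deliberate caricature of my own, not anything from the video. Model the curvature of the loss near a flat spot as d independent random numbers, one per direction. You are only truly stuck if every one of them is positive, and that probability collapses like 1/2^d.

```python
# Back-of-the-envelope caricature (mine, not from the video): treat the
# curvature at a flat spot as d independent random numbers, one per
# direction. "Stuck" means every curvature is positive -- no downhill
# escape route in any direction.
import numpy as np

rng = np.random.default_rng(2)

for d in (2, 4, 8, 12):
    curvatures = rng.standard_normal((200_000, d))     # 200,000 random flat spots
    stuck = np.all(curvatures > 0.0, axis=1).mean()    # fraction with no way down
    print(f"d = {d:2d}: fraction stuck in every direction = {stuck:.5f} "
          f"(1/2^d = {2.0 ** -d:.5f})")

# Real loss landscapes have millions of coupled directions, not a dozen
# independent ones, but the moral survives: a flat spot that traps you in
# *every* direction becomes vanishingly rare as dimension grows.
```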
Neural networks aren’t so mysterious once you translate them into geometry. Each ReLU neural-network unit “folds” a plane; layers compose folds into intricate, piecewise-linear shapes that carve real-world boundaries. Yes, a single hidden layer can approximate anything. But trainability matters, and gradient descent finds good solutions vastly more reliably when depth compounds expressivity. Welch Labs’ visuals make this concrete:
Belgium–Netherlands enclaves show how layered folds succeed. Start with a map: latitude, longitude, and a z-axis for the degree of confidence that the location is in Holland. Add neural-network ReLUs and watch each neuron fold space. And so simple bend-fold-shift-and-add arithmetic, neural-network unit by neural-network unit, turns into true alchemy.
And the unreasonable effectiveness of neural-network models at scale suddenly appears less unreasonable.