Tuesday, March 31, 2026

Why Machines Learn: The Elegant Math Behind Modern AI, Anil Ananthaswamy -- March 31, 2026

Why Machines Learn: The Elegant Math Behind Modern AI, Anil Ananthaswamy, c. 2024 / 2025.

Chapter 5: Birds of a Feather

page 149: the search for nearest neighbors -- this is a really, really cool chapter --

begins with the Islamic Golden Age and the work of Abu Ali al-Hasan Ibn al-Haytham, or Alhazen, a Muslim Arab mathematician, astronomer, and physicist, link here.

the "father of modern optics" for his revolutionary Book of Optics. He Correctly explained vision by light reflection rather than emission, developed the scientific method, and made major contributions to physics, astronomy, and mathematics 

page 150

Marcello Pelillo, a computer scientist at the University of Venice, Italy, had been doing his best to draw attention to Alhazen's ideas.

  • stumbled upon this book in a New Haven, CT, bookstore: Theories of Vision from Al-Kindi to Kepler
  • the late 1990s: Pelillo was then a visiting professor at Yale
  • doing research in computer vision, pattern recognition, and machine learning
  • a slim book -- just 200 pages
  • the author argued that Alhazen was "the most significant figure in the history of optics between antiquity and the seventeenth century."
  • intromission: wiki; light entering the eye, correct;
  • extramission: wrong; eyes emit rays to "touch" objects; wiki;
  • coherent explanation of vision. 

This was key, from Alhazen, noted by Pelillo:

"When sight perceives some visible object, the faculty of discrimination immediately seeks its counterpart among the forms persisting in the imagination, and when it finds some form in the imagination that is like the form of that visible object, it will recognize that visible object and will perceive what kind of thing it is."

See this post

page 152:

Pelillo

Alhazen

the algorithm -- the "nearest neighbor (NN) rule": 

Thomas Cover: a young, whip-smart information theorist and electrical engineer at Stanford;

Peter Hart: a precocious graduate student.

page 155:

the first mathematical mention of the nearest neighbor rule appeared in a 1951 (the year I was born) technical report of the USAF School of Aviation Medicine, Randolph Field, Texas....the authors were Evelyn Fix and Joseph L. Hodges, Jr.

In 1940, Evelyn Fix came to work at the University of California, Berkeley, as a research assistant in the Statistical Laboratory, assigned to a project for the National Defense Research Committee. US researchers were getting drawn into the war raging in Europe....

... Fix received her PhD in 1948, stayed on at UC-Berkeley...

... came in touch with Joseph L. Hodges, Jr., and they produced the technical report of 1951 -- the question, of course, is how these two came to be connected to the USAF School of Aviation Medicine.

As a graduate student looking for a doctoral thesis topic related to pattern recognition, Peter Hart stumbled upon the Fix and Hodges paper and the nearest neighbor rule. The rest is history, as they say.

The nearest neighbor rule. Link here
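In essence, the rule labels a new point with the label of its single closest training point. A minimal sketch in Python/NumPy (my own toy illustration, not from the book; the little dataset is made up):

    import numpy as np

    def nearest_neighbor_classify(X_train, y_train, x):
        """Classify x with the label of its nearest training point."""
        # Euclidean distance from x to every training point
        distances = np.linalg.norm(X_train - x, axis=1)
        return y_train[np.argmin(distances)]

    # Toy example: two clusters of 2-D points with labels 0 and 1
    X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array([0, 0, 1, 1])
    print(nearest_neighbor_classify(X_train, y_train, np.array([0.1, 0.2])))  # -> 0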

Evelyn Fix, wiki.  Berkeley Statistics: link here.


Skipping ahead ....

Chapter 7: The Great Kernel Rope Trick

The math is way, way beyond me, but reading the narrative and skipping quickly through the math, the story is fascinating.

p. 206

1991

Bernhard Boser, AT&T Bell Labs in Holmdel, New Jersey, biding his time until settling in at his new position at UC, Berkeley.

Colleague: Vladimir Vapnik, an eminent Russian mathematician and a recent immigrant.

Vapnik recommended Boser work on an algorithm Vapnik had developed back in the 1960s.

A solution for the same problem was devised by Joseph-Louis Lagrange (1736 - 1813), an Italian mathematician and astronomer whose work had such elegance that William Rowan Hamilton -- we met Hamilton in chapter 2; he was the one who etched an equation onto the stones of an Irish bridge -- was moved to praise some of Lagrange's work as "a kind of scientific poem."

Boser solved the problem and still had time before he had to report to Berkeley. Vapnik gave him another problem. 

To solve this problem, Boser talked with his wife, Isabelle Guyon, an ML expert whose mind had a much more mathematical bent. She also worked at Bell Labs. Guyon had thought a lot about such problems, especially for her Ph.D. thesis. She immediately suggested a solution that would bypass the need to compute dot products in the higher-dimensional space. It involved a neat trick, one whose history goes back to work by other Russian mathematicians in the 1960s. Guyon's insight, and her subsequent involvement in the project with Vapnik and Boser, led to one of the most successful ML algorithms ever invented.
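A minimal sketch of that trick (my own illustration, not from the book): with a degree-2 polynomial kernel, the kernel value computed in the original low-dimensional space equals the dot product of the explicitly mapped feature vectors, so the mapping never has to be carried out.

    import numpy as np

    def phi(x):
        """Explicit degree-2 feature map for a 2-D input (a standard textbook choice)."""
        x1, x2 = x
        return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

    def poly_kernel(x, z):
        """Degree-2 polynomial kernel computed directly in the original 2-D space."""
        return np.dot(x, z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, 0.5])
    print(np.dot(phi(x), phi(z)))  # explicit mapping, then dot product: 16.0
    print(poly_kernel(x, z))       # same number, no mapping needed: 16.0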

I asked Google Gemini if "she" was familiar with Vapnik, Boser, and Guyon: this was the reply: 

p. 224 -- The Kernel Trick. Wiki. For me, this is like reading Greek. But I'll enjoy the narrative.

It begins with Isabelle Guyon, whom we met earlier. In the early 1980s, she was a young engineering student in Paris, interested in cybernetics and looking for an internship. Now key words and key phrases only:

  • John Hopfield
  • neural networks --> Hopfield networks
  • designed for storing memories
  • she developed a more efficient method for training Hopfield networks -- training --
  • tried to use those networks to classify images of handwritten digits; very inefficient
  • she moved on to other algorithms for pattern recognition
  • the bible at the time, a book on pattern classification by Richard Duda and Peter Hart (whom we met in Chapter 5)
  • optimal margin classifiers
  • Werner Krauth and Marc Mézard, also nearby, in Paris
    • again, Hopfield networks, published a paper in 1987
  • then back to 1964, a paper published by three Russian researchers -- M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer, who had worked at the same institute as Vapnik, but independently of him;

Page 230: the kernel function.

Guyon continued to explore.

Next: the polynomial kernel, introduced by MIT computational neuroscientist Tomaso Poggio in 1975.

Then, 1991 -- she put it all together when her husband mentioned the problem he was working on and Vapnik's idea.

Amazing, p. 236 -- Richard Courant and David Hilbert, giants in their fields.

The RBF kernel -- "the Brad Pitt of kernels." Kilian Weinberger: "It's so perfect, people sometimes faint when they see it."
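For reference, a tiny sketch of the RBF (Gaussian) kernel itself (my own illustration; the bandwidth value gamma is an arbitrary choice):

    import numpy as np

    def rbf_kernel(x, z, gamma=0.5):
        """RBF (Gaussian) kernel: similarity decays with squared distance."""
        return np.exp(-gamma * np.sum((x - z) ** 2))

    x = np.array([1.0, 2.0])
    z = np.array([1.5, 1.0])
    print(rbf_kernel(x, x))  # 1.0 -- a point is maximally similar to itself
    print(rbf_kernel(x, z))  # smaller the farther apart the two points are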

P. 237: the combination of  Vapnik's 1964 optimal margin classifier and the kernel trick proved incredibly powerful. 

COLT (Computational Learning Theory): Guyon and Boser presented their paper in July 1992 -- this is how far back ML goes. It took a decade, but the paper eventually became a classic.

Next, p. 239: Vapnik and Corinna Cortes, a Danish data scientist who was then at Bell Labs and later moved up to become a VP at Google Research.

Now: the SVM acronym was coined; support vector machine.
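For the curious, a minimal sketch of a kernel SVM in use, assuming scikit-learn (my choice of library, not the book's; the toy dataset is made up):

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    # Toy two-class data that a straight line cannot separate
    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # Support vector machine with an RBF kernel -- the kernel trick in action
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X, y)
    print(clf.score(X, y))  # accuracy on the toy training data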

The important paragraph -- the second full paragraph -- is on page 239.

The last two pages of this chapter, pp 240 - 241 sum it all up -- SVMs. 

Chapter 8
With a Little Help from Physics

1970s, Princeton University physicist John Hopfield.

Became tired of physics; had run out of steam.

Turned to biology.

Focused on cellular biochemical reactions, such as those involved in the synthesis of proteins.

Hopfield's first biology paper, 1974. 

Predicted need for proofreading. Example found: streptomycin interferes with tRNA and ribosome, in the second step, proofreading, p. 243. 

Then, his real breakthrough. Neural networks, not just a single neuron.

The PROBLEM: "how mind emerges from brain is to me (Hopfield) the deepest question posed by our humanity."

"state space":  


Dynamical systems. - p. 244

Hopfield kept looking for a neurobiological problem that was amenable to such a solution: how a final state could reduce the errors that accumulate during computation. The "subject" that worked: associative memory. Definition / example, bottom of page 244 to top of page 245.

The physics he studied to solve this problem: the physics of ferromagnetism and a simplified mathematical model of it -- the parallels to computing with neurons are striking.

Glass: an amorphous solid where the material's atoms and molecules don't conform to the regularity of a crystal lattice.

Ferromagnet: analogous to a solid with a definite crystalline structure.

If the structure is lost, then the material has no permanent magnetism -- analogous to the structure of glass -- and thus materials with disordered magnetic moments are called spin glasses.

1920s: Wilhelm Lenz and Ernst Ising -- the Ising model. 

a fortiori: used to express a conclusion for which there is stronger evidence than for one previously accepted. Sounds like something Perry Mason would say.

The Hamiltonian -- first used, p. 248. This is getting very interesting.

But it seems a huge jump from magnetic spin to memory, as we go from page 248 to page 250.

First artificial neuron, designed in the 1940s, the McCulloch-Pitts (MCP) neuron. Encountered in Chapter 1.

Neural Networks: the revival begins.

Back to perceptrons, and the 1969 book, Perceptrons. The authors, Marvin Minsky and Seymour Papert, had given up on neural networks. [On the other hand, John Hopfield, Geoff Hinton, and Yann LeCun kept the faith.]

Backdrop: p. 251. Back-propagation.

Hopfield started with an artificial neuron, part Rosenblatt's perceptron and part the McCulloch-Pitts neuron.

Hopfield:

"intuition" -- p. 252

"had an insight" -- p. 256

Page 256 might be the most interesting page so far and there have been many, many such pages.

Hebbian learning: p. 258.
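A minimal sketch of the idea as I understand it (my own toy code, not the book's): store one pattern with the Hebbian rule, then let the network settle back to it from a corrupted starting state.

    import numpy as np

    # One stored pattern of +1/-1 "neurons"
    pattern = np.array([1, -1, 1, 1, -1, 1, -1, -1])
    n = len(pattern)

    # Hebbian learning: neurons that fire together wire together
    W = np.outer(pattern, pattern).astype(float)
    np.fill_diagonal(W, 0.0)  # no self-connections

    # Start from a corrupted version of the stored memory
    state = pattern.copy()
    state[0] *= -1
    state[3] *= -1

    # Asynchronous updates: each neuron aligns with its weighted input
    for _ in range(5):
        for i in range(n):
            state[i] = 1 if W[i] @ state >= 0 else -1

    print(np.array_equal(state, pattern))  # True -- the memory is recovered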

"modern Hopfield networks" -- p. 265

This is just such an incredible book. See page 270. The interview with Hopfield. See wiki

From the book:


 
 page 271: "how to efficiently train them."
 
George Cybenko. 
 
The wiki entry notes these two references, but neither is the classic paper noted by Ananthaswamy.


Here's the link. This is the classic paper that Hopfield submitted to PNAS in 1982: 


 
Further notes on this book are elsewhere, but with regard to George Cybenko, look at this:
 

page 275: If you were wondering about the 1/2 before the energy function, this is where it comes in handy. The 1/2 cancels out the 2 before the summation. (Such are the tricks of mathematicians.)
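Spelled out (my own worked step, using the standard Hopfield energy with symmetric weights, not a quotation from the book):

    E = -\tfrac{1}{2} \sum_{i \ne j} w_{ij}\, s_i s_j,
    \qquad
    \frac{\partial E}{\partial s_k}
      = -\tfrac{1}{2}\Bigl( \sum_{j \ne k} w_{kj} s_j + \sum_{i \ne k} w_{ik} s_i \Bigr)
      = -\sum_{j \ne k} w_{kj} s_j .

Because the weights are symmetric (w_{ij} = w_{ji}), the two sums are identical; they produce a factor of 2 that the 1/2 cancels exactly.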
 
This is where machines are beating humans. No one human can know everything about everything. Machines can.
 
At this point, highly recommend that one re-ask:

AI prompt: the difference between agentic and inference. It's a very, very long reply. Here's the lede:
 
Agentic AI refers to autonomous, goal-oriented systems that plan, use tools, and iterate to solve complex tasks. Inference is the process where a model runs data to generate an output (prediction). While inference happens once in traditional AI, agentic systems use multiple, iterative inference steps to plan, act, and self-correct. 
 
So, "inference" happens once.
 
Agentic: multiple, iterative inference steps to plan, act, and self-correct.
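A rough sketch of the distinction in code (entirely hypothetical -- FakeModel and its infer() method stand in for whatever model and tools a real system would use):

    class FakeModel:
        """Stand-in for a real model; counts calls so the demo terminates."""
        def __init__(self):
            self.calls = 0
        def infer(self, prompt):
            self.calls += 1  # each call is one inference pass
            return "done" if self.calls >= 6 else f"partial answer {self.calls}"

    def run_inference(model, prompt):
        # Traditional use: a single inference call, one answer
        return model.infer(prompt)

    def run_agent(model, goal, max_steps=10):
        # Agentic use: many inference calls in a loop -- plan, act, check, repeat
        context = goal
        for _ in range(max_steps):
            step = model.infer(f"Next step toward: {context}")  # plan
            context += f" | did: {step}"                        # act (stubbed out)
            if model.infer(f"Finished? {context}") == "done":   # self-check
                break
        return context

    print(run_inference(FakeModel(), "2 + 2 = ?"))    # one pass
    print(run_agent(FakeModel(), "add two numbers"))  # several passes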
 
Time to query a chatbot on "training."
 
*****************************
Chapter 9
The Man Who Set Back Deep Learning (Not Really) 
 

Holy mackerel! George Cybenko. 

And the chapter begins: 
George Cybenko was surprised by the reception he got. 
We met George Cybenko in the last chapter.  
 
Definition of "deep learning" in very first paragraph. 
 
This book is half math and half narrative. All of the math is beyond what I can understand (with some exceptions, of course) but it doesn't matter. It's still an incredible book.
 
Highly recommend it for others, especially as a beach book.
 
George Cybenko. Starts in 2017 -- finally we're getting to the modern age.
 
The narrative in this chapter is incredible. And very little math, but a lot of graphs and figures.
 
Overview of where we've been starts on p. 278.
 
"The universal approximation theorem." 
 
And remember "the backpropagation algorithm.

Deep learning that began about 2010 .... Cybenko.... p. 300.
 
"... the massive amounts of training data and computing power" was not available in 1990s for Cybenko's breakthroughs to be tested.
 
"For one, these networks aren't as susceptible to the curse of dimensionality as was expected, for reasons that aren't entirely clear." Also, the massive numbers of neurons and, hence, parameters should overfit the data, but these networks flout such rules, too. 
 
However, before we can appreciate such mysteries, we need to examine the algorithm that allowed  researchers to start training deep neural networks in the first place: backpropagation.
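Before Chapter 10 takes up backpropagation properly, a minimal sketch of the algorithm on a one-hidden-layer network (my own toy code; the tiny XOR dataset, layer sizes, and learning rate are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)

    # Tiny dataset: XOR, the classic problem a single layer can't solve
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
    W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    lr = 0.5

    for _ in range(5000):
        # Forward pass
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # Backward pass: chain rule, layer by layer (squared-error loss)
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        # Gradient descent step
        W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

    print(out.round(3).ravel())  # should approach [0, 1, 1, 0]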
 
Chapter 10
The Algorithm that Put Paid to a Persistent Myth
 
First paragraph: Minsky and Papert myth; discussed with Geoffrey Hinton. This is what makes this book so good.
 
Hinton: one of the key figures behind the modern deep learning revolution.
 
Hinton got interested in neural networks in the mid 1960s when he was still in high school in the UK. 
 
This is why this book is so good:
 

Chapter 11: The Eyes of a Machine

Almost all accounts of the history of deep neural networks for computer vision acknowledge the seminal work done by neurophysiologists David Hubel and Torsten Wiesel, co-founders of the Department of Neurobiology at Harvard in the early 1960s and joint winners of the 1981 Nobel Prize in Physiology or Medicine.

Page 375: enter GPUs stage right. LOL.

"Recognizing high-res images required large neural networks, and training such networks meant having to crunch numbers, mainly in the form of matrix manipulations. To make the process go faster, much of this number crunching required a form of parallel computing, but the central processing units (CPUs) of computers of the 1990s weren't up to the task. However, saviors were on the horizon in the form of graphical processing units (GPUs), which were originally designed as hardware-on-a-chip dedicated to rending 3D graphics (gaming during Covid).

"GPUs proved central to changing the face of deep learning. One of the earliest indications of this change came in 2010, from Jürgen Schmidhube and colleagues, when they trained multi-layer perceptrons with as many as nine hidden layers and about 12 million parameters or weights, to classify ... images.

"But the use of GPUs to overcome the challenge ... doesn't begin to hint at the power of these processors... we have to shift focus to Hinton's lab in Toronto, where Hinton and two graduate students, Alex Krizhevsky and Ilya Sutskever ... built the first massive ... [specialized] neural networks. These two showed once and for all that conventional methods for image recognition were never going to catch up. The network came to be called AlexNet." 

The ALEXNET.

And then look at this. Never, never, ever stop reading.

Page 376: The large network required GPUs; by then, these came equipped with software called CUDA, a programming interface that allowed engineers to use GPUs for general-purpose tasks beyond their intended use as graphics accelerators. 

Not everyone accepted these new developments!

Hinton recalls trying to persuade Microsoft to buy GPUs for a common project, but Microsoft balked. 

The CEO of Microsoft in 2000 was Steve Ballmer.

2002: Sutskever, age 17, barely, joined the University of Toronto.

He was still in his second year of undergraduate studies when he knocked on Hinton's door. "The math is so simple."

2009: a problem big enough to pose questions of neural networks appears. That year, Stanford University professor Fei-Fei Li and her students presented a paper at the Computer Vision and Pattern Recognition (CVPR) conference.

SVMs mentioned again, p. 378, but Sutskever believed them too limited.

Yann LeCun's group at Bell Labs. Page 379.

Deep neural networks will change everything; they already have.

Then, this, bottom of page 380:

Viewed through our mathematical lens, deep neural networks have thrown up a profound mystery. As they have gotten bigger and bigger, standard ML theory has struggled to explain why these networks work as well as they do. 

Mikhail Belkin of the University of California, San Diego, thinks that deep neural networks are pointing us toward a more comprehensive theory of machine learning.  

Chapter 12 -- Final Chapter -- Terra Incognita -- Deep Neural Networks Go Where (Almost) No ML Algorithm Has Gone Before

Sometime in 2020, researchers at OpenAI, a San Francisco-based AI company, were training a deep neural network to learn, among other things, how to add two numbers.

It was a seemingly trivial problem, but a necessary step toward understanding how to get the AI to do analytical reasoning. A team member who was training the neural network went on vacation and forgot to stop the training algorithm.

When he came back, he found to his astonishment that the neural network had learned a general form of the addition problem. It's as if the machine had understood something deeper about the problem than simply memorizing answers for the sets of numbers on which it was being trained.

HAL: "Hi, Dave. I hope you had a great vacation. While you were gone, to save you some time, I developed a program to add numbers that works better than anything you or your team has ever done. By the way, I've programmed your lab door to lock itself when you come in." 
Dave: "Open the door, HAL."

Arthur C. Clarke to Stanley Kubrick: I see a movie here.

In the time-honored tradition of serendipitous scientific discoveries, the team had stumbled upon a strange, new property of deep neural networks that they called "grokking," a word invented by the American author Robert Heinlein in his book Stranger in a Strange Land.

"Grokking is meant to be about not just understanding, but kind of internalizing and becoming the information." Their small neural network had seemingly grokked the data.

Grokking is just one of many odd behaviors demonstrated by deep neural networks. Another has to do with the size of these networks. The networks are so huge that standard ML theory says that such networks shouldn't work the way they do. Pages 382 - 384.

 The concept of training and testing -- page 384.

Phrase: "overfit the data." 

"bias" and "variance," p. 389.

Mikhail Belkin, again, p. 389 -- training and testing the data; the Goldilocks analogy.
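To make the training-vs-testing idea concrete, a small sketch of underfitting and overfitting with polynomial fits (my own illustration; the noisy data and the degrees are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)

    # Noisy samples of a smooth underlying function
    x = np.sort(rng.uniform(-1, 1, 40))
    y = np.sin(3 * x) + rng.normal(0, 0.2, 40)
    x_train, y_train = x[::2], y[::2]    # half for training
    x_test,  y_test  = x[1::2], y[1::2]  # half held out for testing

    for degree in (1, 4, 15):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_err  = np.mean((np.polyval(coeffs, x_test)  - y_test) ** 2)
        # Degree 1 underfits (high bias); degree 15 typically overfits (high variance)
        print(degree, round(train_err, 3), round(test_err, 3))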

P. 392: the story seemed to be true .... read that page again.

Then, same page: the unbearable strangeness of neural networks.

"shatter the training data" -- p. 394

 "benign overfitting" -- p. 396

hyperparameters -- p. 397

now the author is re-capping what has been told

Important, bottom of page 406 and top of page 407.

Thinking vs regurgitating, p. 413. Fun.

EPILOGUE

"theory of mind" -- p. 415

Start on page 424.  
