Here are the building blocks of any deep learning system, with a little about large language models (LLMs) toward the end.
Graphs - It all starts with computational graphs. These are data structures that chain together tensor operations: matrix multiplications, additions, element-wise activation functions, and a loss function. Every operation is differentiable, so the whole computation forms a smooth, continuous space suited to continuous optimization (gradient descent), which is covered later.
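To make that concrete, here is the kind of operation chain a computational graph composes, sketched in NumPy (the names, shapes, and values are illustrative, not from the text):

```python
import numpy as np

# Illustrative parameters and inputs; these are the leaves of the graph.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))   # a batch of 4 inputs with 3 features each
W = rng.normal(size=(3, 2))   # learned weights
b = np.zeros(2)               # learned bias

z = x @ W + b                 # matrix multiplication and addition
a = np.maximum(0, z)          # element-wise activation (ReLU)
loss = (a ** 2).mean()        # a scalar loss closes the graph
```

Because every step is differentiable, the gradient of the loss with respect to W and b is well defined, which is exactly what gradient descent needs.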
Layers - Layers are modules built from these graphs: they apply a computation and store state, referred to as the learned weights. Each layer learns a deeper, more meaningful representation of the dataset, until the network has learned a latent manifold, a highly structured, lower-dimensional space that interpolates between samples; this is what lets the model generalize to new predictions.
Different machine learning problems and data types call for different layers, e.g. Transformers for sequence-to-sequence learning and convolutions for computer vision models.
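To make the idea of a layer concrete, here is a minimal dense (fully connected) layer in NumPy; the class name and initialization scheme are my own assumptions, not something the text specifies:

```python
import numpy as np

class Dense:
    """A fully connected layer: a computation plus state (the learned weights)."""
    def __init__(self, n_in, n_out):
        rng = np.random.default_rng(0)
        # The state that training adjusts: a weight matrix and a bias vector.
        self.W = rng.normal(scale=0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def __call__(self, x):
        # Affine transform followed by an element-wise ReLU activation.
        return np.maximum(0, x @ self.W + self.b)
```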
Models - A model organizes a stack of layers for training. It includes a loss function, whose value is the feedback signal an optimizer uses to adjust the learned weights during training, and an evaluation metric (such as accuracy) that is independent of the loss function.
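In a framework like Keras (one possible choice; the text names no framework), the layer stack, loss, optimizer, and metric map directly onto the model-building API:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([            # a stack of layers
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer="rmsprop",              # adjusts the weights from the feedback signal
    loss="binary_crossentropy",       # the feedback signal itself
    metrics=["accuracy"],             # evaluation metric, independent of the loss
)
```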
Forward pass - During training or inference, an input passes through all the network layers, each applying a geometric transformation, until an output is produced.
Backpropagation - During training, after the forward pass, a gradient (a derivative) of the loss is calculated with respect to each weight. The process for calculating these derivatives is called automatic differentiation, and it is based on the chain rule of calculus.
Once the derivatives are calculated, the optimizer updates the weights in the direction that reduces the loss. This is what "learning" means here, and the procedure is called gradient descent.
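Putting the forward pass, backpropagation, and the optimizer step together, here is one full training loop for a toy linear model in NumPy (the data, learning rate, and step count are all illustrative):

```python
import numpy as np

# Toy regression data, invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w, lr = np.zeros(3), 0.1                  # weights to learn, learning rate
for step in range(200):
    pred = X @ w                          # forward pass
    loss = ((pred - y) ** 2).mean()       # scalar feedback signal (MSE)
    grad = 2 * X.T @ (pred - y) / len(y)  # dloss/dw via the chain rule
    w -= lr * grad                        # optimizer step: descend the gradient
```

In a real framework, the grad line is what automatic differentiation computes for you.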
Now for Large Language Models.
Before a model is trained for sequence-to-sequence learning, the text corpus must be broken into tokens and transformed into embeddings.
Embeddings are dense vector representations of language: points in a multidimensional space whose geometry captures meaning and context for the words that make up a sequence.
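A minimal sketch of what an embedding lookup does (the vocabulary, dimension, and values here are invented for illustration):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}
E = np.random.default_rng(0).normal(size=(len(vocab), 4))  # learned lookup table

ids = [vocab[t] for t in ["the", "cat", "sat"]]
vectors = E[ids]      # each token becomes a dense 4-dimensional vector
print(vectors.shape)  # (3, 4): one point in embedding space per token
```

During training, the table E is adjusted so that nearby points correspond to related meanings.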
LLMs use a specific network architecture called the Transformer, which is built around an attention mechanism.
The attention mechanism uses the embeddings to dynamically update the meaning of each word based on the other words it appears with in a sequence.
The model projects the input sequence into three different representations, called the query, key, and value matrices.
Dot products between queries and keys produce attention scores that capture how the tokens of the reference sequence relate to one another; the scores then weight the values to build the representation from which a target sequence is generated.
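Here is scaled dot-product attention sketched in NumPy (a single attention head with invented shapes; real models add masking and multiple heads):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 5, 8
X = rng.normal(size=(seq_len, d))             # embedded input sequence
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv              # query, key, value projections
scores = Q @ K.T / np.sqrt(d)                 # scaled dot-product attention scores
weights = softmax(scores, axis=-1)            # each row: how a token attends to the others
out = weights @ V                             # context-aware token representations
```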
The output sequence is predicted one token at a time: a softmax function turns the model's scores into a probability distribution over the vocabulary, and the next token is sampled from that distribution.
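And a sketch of that last sampling step (the scores here are random stand-ins for real model output):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=10)           # model scores over a 10-token vocabulary
probs = softmax(logits)                # probability distribution over tokens
next_token = rng.choice(len(probs), p=probs)   # sample the next token id
```

Repeating this, feeding each sampled token back into the model, generates the output one token at a time.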