Here are the building blocks of any deep learning system, with a little about large language models (LLMs) toward the end.
Graphs - It all starts with computational graphs. These are data structures that chain together tensor operations: matrix multiplications, additions, element-wise activation functions, and a loss function. Every operation is differentiable, so the whole computation forms a smooth, continuous space suited to continuous optimization (gradient descent), which is covered later.
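To make that concrete, here is the kind of operation chain a computational graph composes, sketched in NumPy (the names, shapes, and values are illustrative, not from the text):

```python
import numpy as np

# Illustrative parameters and inputs; these are the leaves of the graph.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))   # a batch of 4 inputs with 3 features each
W = rng.normal(size=(3, 2))   # learned weights
b = np.zeros(2)               # learned bias

z = x @ W + b                 # matrix multiplication and addition
a = np.maximum(0, z)          # element-wise activation (ReLU)
loss = (a ** 2).mean()        # a scalar loss closes the graph
```

Because every step is differentiable, the gradient of the loss with respect to W and b is well defined, which is exactly what gradient descent needs.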
Layers - Layers are modules built from these graphs: they apply a computation and store state, referred to as the learned weights. Each layer learns a deeper, more meaningful representation of the dataset, until the network has learned a latent manifold, a highly structured, lower-dimensional space that interpolates between samples; this is what lets the model generalize to new predictions.
Different machine learning problems and data types call for different layers, e.g. Transformers for sequence-to-sequence learning and convolutions for computer vision models.
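To make the idea of a layer concrete, here is a minimal dense (fully connected) layer in NumPy; the class name and initialization scheme are my own assumptions, not something the text specifies:

```python
import numpy as np

class Dense:
    """A fully connected layer: a computation plus state (the learned weights)."""
    def __init__(self, n_in, n_out):
        rng = np.random.default_rng(0)
        # The state that training adjusts: a weight matrix and a bias vector.
        self.W = rng.normal(scale=0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def __call__(self, x):
        # Affine transform followed by an element-wise ReLU activation.
        return np.maximum(0, x @ self.W + self.b)
```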
Models - A model organizes a stack of layers for training. It includes a loss function, whose value is the feedback signal an optimizer uses to adjust the learned weights during training, and an evaluation metric (such as accuracy) that is independent of the loss function.
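In a framework like Keras (one possible choice; the text names no framework), the layer stack, loss, optimizer, and metric map directly onto the model-building API:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([            # a stack of layers
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer="rmsprop",              # adjusts the weights from the feedback signal
    loss="binary_crossentropy",       # the feedback signal itself
    metrics=["accuracy"],             # evaluation metric, independent of the loss
)
```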
Forward pass - During training or inference, an input passes through all the network layers, each applying a geometric transformation, until an output is produced.
Backpropagation - During training, after the forward pass, a gradient (a derivative) of the loss is calculated with respect to each weight. The process for calculating these derivatives is called automatic differentiation, and it is based on the chain rule of calculus.
Once the derivatives are calculated, the optimizer updates the weights in the direction that reduces the loss. This is what "learning" means here, and the procedure is called gradient descent.
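Putting the forward pass, backpropagation, and the optimizer step together, here is one full training loop for a toy linear model in NumPy (the data, learning rate, and step count are all illustrative):

```python
import numpy as np

# Toy regression data, invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w, lr = np.zeros(3), 0.1                  # weights to learn, learning rate
for step in range(200):
    pred = X @ w                          # forward pass
    loss = ((pred - y) ** 2).mean()       # scalar feedback signal (MSE)
    grad = 2 * X.T @ (pred - y) / len(y)  # dloss/dw via the chain rule
    w -= lr * grad                        # optimizer step: descend the gradient
```

In a real framework, the grad line is what automatic differentiation computes for you.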
Now for Large Language Models.
Before a model is trained for sequence-to-sequence learning, the text corpus must be broken into tokens and transformed into embeddings.
Embeddings are dense vector representations of language: points in a multidimensional space whose geometry captures meaning and context for the words that make up a sequence.
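A minimal sketch of what an embedding lookup does (the vocabulary, dimension, and values here are invented for illustration):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}
E = np.random.default_rng(0).normal(size=(len(vocab), 4))  # learned lookup table

ids = [vocab[t] for t in ["the", "cat", "sat"]]
vectors = E[ids]      # each token becomes a dense 4-dimensional vector
print(vectors.shape)  # (3, 4): one point in embedding space per token
```

During training, the table E is adjusted so that nearby points correspond to related meanings.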
LLMs use a specific network architecture called the Transformer, which is built around an attention mechanism.
The attention mechanism uses the embeddings to dynamically update the meaning of each word based on the other words it appears with in a sequence.
The model projects the input sequence into three different representations, called the query, key, and value matrices.
Dot products between queries and keys produce attention scores that capture how the tokens of the reference sequence relate to one another; the scores then weight the values to build the representation from which a target sequence is generated.
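Here is scaled dot-product attention sketched in NumPy (a single attention head with invented shapes; real models add masking and multiple heads):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 5, 8
X = rng.normal(size=(seq_len, d))             # embedded input sequence
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv              # query, key, value projections
scores = Q @ K.T / np.sqrt(d)                 # scaled dot-product attention scores
weights = softmax(scores, axis=-1)            # each row: how a token attends to the others
out = weights @ V                             # context-aware token representations
```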
The output sequence is predicted one token at a time: a softmax function turns the model's scores into a probability distribution over the vocabulary, and the next token is sampled from that distribution.
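And a sketch of that last sampling step (the scores here are random stand-ins for real model output):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=10)           # model scores over a 10-token vocabulary
probs = softmax(logits)                # probability distribution over tokens
next_token = rng.choice(len(probs), p=probs)   # sample the next token id
```

Repeating this, feeding each sampled token back into the model, generates the output one token at a time.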