# Transformers
The Transformer architecture evolved from RNNs, which:
- Introduced a feedback loop for propagating information forward through the sequence
- Are useful for modeling sequential data, for example a sequence of words (or tokens)
Attention was then introduced: a hidden state is kept for each token, giving rise to a notion of relationships between words. But RNNs still process tokens one at a time and cannot be parallelized.
Using feed-forward neural networks (FFNNs) together with a mechanism called self-attention, the processing can be done in parallel: each token carries its relationships to the other tokens as attention weights, along with an encoding of its position in the sequence.
Once training can be parallelized, it becomes feasible to train the neural network on the entire content of the internet.
## Self-Attention
- Each encoder or decoder has a list of embeddings (vectors), one per token (words are tokenized into numerical representations). The distance between the vectors represents how similar the tokens are to each other, so words with similar meanings are closer together in the multi-dimensional space.
- Self-attention produces a weighted average of all token embeddings.
- This ties each token to the other tokens that are important for its context, yielding a new embedding that captures its meaning in context.
- Three matrices of weights are learned through back-propagation:
    - Query (Wq)
    - Key (Wk)
    - Value (Wv)
- Every token gets a query, key, and value vector by multiplying its embedding by each of these matrices
- Compute a score for each token by taking the dot product of its query with each key
- Apply softmax to normalize the scores into attention weights, then use those weights to take the weighted sum of the value vectors, producing the token's new context-aware embedding (see the sketch below)
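A minimal NumPy sketch of these steps. The names Wq, Wk, Wv follow the list above; the toy shapes are arbitrary, and the division by √d_k is the standard "scaled dot-product" detail, added here for numerical stability even though the notes above omit it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of embeddings.

    X: (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices
    """
    Q = X @ Wq                       # query vector per token
    K = X @ Wk                       # key vector per token
    V = X @ Wv                       # value vector per token
    scores = Q @ K.T                 # dot product of each query with every key
    scores /= np.sqrt(K.shape[-1])   # scale by sqrt(d_k)
    weights = softmax(scores)        # normalize scores into attention weights
    return weights @ V               # weighted sum of the value vectors

# Toy example: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8): one new vector per token
```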
## Masked Self-Attention
- A mask can be applied to prevent tokens from attending to the tokens (words) that come after them, which is what lets a decoder generate text left to right
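A sketch of how such a causal mask can be built; it plugs into the `self_attention` sketch above by adding the mask to the scores just before softmax, so future positions end up with (near) zero attention weight.

```python
import numpy as np

seq_len = 4
i = np.arange(seq_len)
# -inf wherever the key position (column) comes after the query position
# (row); 0 everywhere else, so past and present tokens are unaffected
mask = np.where(i[None, :] > i[:, None], -np.inf, 0.0)
print(mask)

# In the self_attention sketch above, this would change one line:
#   weights = softmax(scores + mask)
```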
## Applications
- Chat
- Question answering
- Text classification
- Named entity recognition
- Summarization
- Translation
- Code generation
- Text generation
## Transfer Learning (Fine-Tuning)
- Continue training the pre-trained model on additional, task-specific data
- Freeze specific layers and re-train the others
- Add a new layer on top of the pre-trained model (see the sketch below)
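A PyTorch sketch combining the last two bullets: freeze the pre-trained layers, then train only a new head on top. The model sizes, the 3-class head, and the flatten-based pooling are illustrative assumptions; in practice the body would be loaded from a checkpoint rather than freshly constructed.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained transformer body (in practice, loaded weights)
body = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
    num_layers=2,
)

# Freeze the pre-trained layers so back-propagation leaves them unchanged
for param in body.parameters():
    param.requires_grad = False

# Add a new layer on top: a hypothetical 3-class classification head
model = nn.Sequential(
    body,
    nn.Flatten(start_dim=1),  # (batch, seq_len, 128) -> (batch, seq_len * 128)
    nn.Linear(10 * 128, 3),   # assumes a fixed sequence length of 10
)

# Only the unfrozen parameters (the new head) go to the optimizer
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

x = torch.randn(2, 10, 128)   # dummy batch: (batch, seq_len, d_model)
print(model(x).shape)         # torch.Size([2, 3])
```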
## Glossary
- Tokens: numerical representations of words or parts of words
- Embeddings: mathematical representations (vectors) that encode the "meaning" of a token
- Top P: cumulative probability threshold for a token to be considered as a candidate for the output (the higher the Top P, the more random the result)
- Top K: alternative to Top P, where K is the number of candidate tokens considered for the output (the higher the K, the more random the result, because there are more choices to pick from)
- Temperature: level of randomness when selecting the next output token from the candidates (see the sampling sketch below)
- Context window: the number of tokens an LLM can process at once
- Max tokens: limit on the total number of tokens (input plus output)
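A sketch tying the sampling knobs above together: temperature rescales the logits, then Top K and Top P restrict which candidates survive before the final draw. The function name and the exact cut-off conventions are assumptions; real implementations differ in the details.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token id from raw next-token logits (illustrative only)."""
    # Temperature: < 1 sharpens the distribution, > 1 flattens it (more random)
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:
        # Keep only the K most probable candidates
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative probability
        # reaches top_p (nucleus sampling)
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs *= mask

    probs /= probs.sum()  # renormalize after filtering
    return np.random.default_rng().choice(len(probs), p=probs)

# Toy vocabulary of 4 tokens
print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=3, top_p=0.9))
```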