Convolutional neural networks excel at processing grid-like data (such as images) by detecting local patterns. However, CNNs are less effective at capturing global relationships within the data. Transformers overcome this limitation by using self-attention to weigh the importance of each part of the input relative to the sequence as a whole. While CNNs are used primarily for tasks like image recognition, transformers have been adapted to both text and image processing, making them a more versatile family of models.
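To make this concrete, here is a minimal sketch of scaled dot-product self-attention, the operation at the heart of a transformer. The function name, tensor shapes, and toy sizes are illustrative assumptions, not any particular library's API.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) input embeddings
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # project inputs to queries, keys, values
    scores = q @ k.T / k.shape[-1] ** 0.5    # similarity of every token to every other token
    weights = F.softmax(scores, dim=-1)      # each row of weights sums to 1
    return weights @ v                       # weighted mix over ALL positions: global context

# Toy usage: 5 tokens with 8-dimensional embeddings
d_model = d_k = 8
x = torch.randn(5, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)       # (5, 8): every output row sees the whole sequence
```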
A transformer is built from four core components:
- Input embeddings
- Positional encoding
- Transformer block
- Linear/softmax blocks
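The sketch below wires these four components together into a minimal encoder-only model. It leans on PyTorch's built-in `nn.TransformerEncoderLayer` for the transformer blocks; the class name and all sizes (`vocab_size`, `d_model`, and so on) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Minimal encoder-only transformer: the four components wired in order."""

    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)        # 1. input embeddings
        self.pos = nn.Embedding(max_len, d_model)             # 2. (learned) positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)  # 3. stacked transformer blocks
        self.out = nn.Linear(d_model, vocab_size)             # 4. linear head; softmax is applied later

    def forward(self, tokens):                                # tokens: (batch, seq_len) integer IDs
        positions = torch.arange(tokens.shape[1], device=tokens.device)
        h = self.embed(tokens) + self.pos(positions)          # embeddings + position signals
        h = self.blocks(h)                                    # self-attention and feedforward layers
        return self.out(h)                                    # logits over the vocabulary

logits = TinyTransformer()(torch.randint(0, 1000, (2, 16)))   # (2, 16, 1000)
```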
At a high level, an input sequence flows through these components as follows:
- The input sequence is transformed into numerical representations called embeddings, which capture the semantic meaning of the tokens.
- Positional encoding adds unique signals to each token's embedding to preserve the order of tokens in the sequence (a sinusoidal version is sketched after this list).
- The multi-head attention mechanism processes these embeddings to capture different relationships between tokens (combined with the next two steps in the block sketch after this list).
- Layer normalization and residual connections stabilize and speed up the training process.
- The output from the self-attention layer passes through a position-wise feedforward network that applies a non-linear transformation.
- Multiple transformer blocks are stacked, each refining the output of the previous layer.
- In sequence-to-sequence tasks like translation, a separate decoder stack attends to the encoder's output and generates the output sequence token by token.
- The model is trained with supervised learning, typically by minimizing a cross-entropy loss between its predictions and the ground truth (a minimal training step is sketched after this list).
- During inference, the trained model processes new input sequences to generate predictions or representations (see the same sketch).
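The original transformer uses fixed sinusoidal positional encodings. Below is a hedged sketch of that scheme; `max_len` and `d_model` are illustrative sizes.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of fixed, order-preserving position signals."""
    positions = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    freqs = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                      * (-math.log(10000.0) / d_model))                   # one frequency per dim pair
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(positions * freqs)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(positions * freqs)   # odd dimensions: cosine
    return pe

# Each position gets a unique signal that is simply added to the token embedding:
pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
# embeddings = token_embeddings + pe[:seq_len]
```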
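Steps three through five, attention, normalization with residual connections, and the feedforward network, can be written as one hand-rolled block. This is a sketch of the standard post-norm design; the class name and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Multi-head self-attention followed by a feedforward network,
    each wrapped in a residual connection plus layer normalization."""

    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                # position-wise feedforward network
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)         # heads capture different token relationships
        x = self.norm1(x + attn_out)             # residual + layer norm stabilize training
        x = self.norm2(x + self.ffn(x))          # non-linear transformation per position
        return x

# Stacking blocks lets each layer refine the previous layer's output:
stack = nn.Sequential(*[TransformerBlock() for _ in range(4)])
h = stack(torch.randn(2, 10, 64))                # (batch=2, seq_len=10, d_model=64)
```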
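Finally, the last two steps, a supervised training step that minimizes cross-entropy and a simple inference loop, might look like the sketch below. It reuses the hypothetical `TinyTransformer` from earlier; note that a real autoregressive generator would use causal masking or a decoder stack, which this shape-level illustration omits.

```python
import torch
import torch.nn.functional as F

model = TinyTransformer()                        # hypothetical model from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

def train_step(tokens, targets):
    """One supervised step: minimize cross-entropy between predictions and ground truth."""
    logits = model(tokens)                       # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                              # backpropagate the prediction error
    optimizer.step()                             # nudge the weights to reduce the loss
    return loss.item()

@torch.no_grad()
def greedy_generate(prompt, n_new):
    """Inference: repeatedly feed the sequence back in, taking the most likely next token."""
    tokens = prompt
    for _ in range(n_new):
        next_token = model(tokens)[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens

# Toy usage with random token IDs (real training uses a tokenized corpus):
x = torch.randint(0, 1000, (2, 16))
y = torch.randint(0, 1000, (2, 16))
print(train_step(x, y))
print(greedy_generate(x[:, :4], n_new=8).shape)  # (2, 12)
```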
Transformers have been applied across a wide range of domains:
- Natural language processing
- Machine translation
- Speech recognition
- Image generation
- DNA sequence analysis
- Protein structure analysis