2 (E)- Transformer Architecture

1. What the Transformer Is and How It Works:

The Transformer is a neural network architecture introduced by researchers at Google in the 2017 paper “Attention Is All You Need.” It was a big leap for NLP because it relies entirely on attention mechanisms to model how words in a sentence relate to each other, without the recurrent (RNN) or convolutional (CNN) layers used by earlier models.

2. How the Transformer Handles Words:

Instead of looking at words one after another, the Transformer sees the whole sentence at once. It uses self-attention to figure out which words are important and how they fit together.

Example:

  • Input Sentence: “The quick brown fox jumps over the lazy dog.” Self-attention lets the model connect “fox” directly to “jumps” and “dog”, however far apart the words are (a minimal sketch of this computation follows below).
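
To make this concrete, here is a minimal sketch of the scaled dot-product attention behind self-attention. The word vectors and projection matrices are random placeholders chosen purely for illustration; a real model learns them during training.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
tokens = "The quick brown fox jumps over the lazy dog".split()
seq_len, d_model = len(tokens), 16

x = torch.rand(seq_len, d_model)      # placeholder word vectors (learned in a real model)
W_q = torch.rand(d_model, d_model)    # query projection
W_k = torch.rand(d_model, d_model)    # key projection
W_v = torch.rand(d_model, d_model)    # value projection
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / d_model ** 0.5     # how strongly each word relates to every other word
weights = F.softmax(scores, dim=-1)   # each row is one word's attention over the sentence
attended = weights @ V                # each word becomes a weighted mix of the words it attends to

print(attended.shape)  # torch.Size([9, 16])
print(weights[3])      # attention of "fox" (index 3) over all 9 words

Each word ends up with a new vector built from the words it pays attention to, which is how the model captures relationships across the whole sentence at once.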

3. Steps in a Transformer:

Encoder:

  • Embedding: First, it turns each word into a vector of numbers (and adds positional information so the model still knows the order of the words).
  • Self-Attention: It checks how strongly each word in the sentence relates to all the others.
  • Feed-Forward: A small network then refines each word’s representation to build a clearer picture of the sentence (a minimal sketch of these steps follows below).
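
The three encoder steps map directly onto a few PyTorch modules. Here is a minimal sketch of one encoder block; the class name, sizes, and the choice of batch-first tensors are illustrative assumptions, not taken from a specific codebase.

import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        # batch_first=True so inputs are shaped (batch, seq_len, embed_dim)
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        # Self-attention: every word attends to every other word in the sentence
        attn_out = self.self_attn(x, x, x, attn_mask=src_mask)[0]
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward: refine each position's representation independently
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

# Example: 2 sentences, 9 tokens each, already embedded into 512-dimensional vectors
x = torch.rand(2, 9, 512)
encoder_block = TransformerEncoderBlock(embed_dim=512, num_heads=8, ff_dim=2048)
print(encoder_block(x).shape)  # torch.Size([2, 9, 512])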

Decoder (for translating or generating text):

  • Embedding: Converts the words generated so far into vectors (again with positional information).
  • Self-Attention: Checks how these words relate to each other, using a mask so each word can only look at the words before it.
  • Cross-Attention: Looks at how the words being generated relate to the original (source) sentence.
  • Feed-Forward: Refines these representations again to predict the next word of the output.

4. Example Using PyTorch:

Here’s how a Transformer decoder block, following the steps above, can be coded in PyTorch:

import torch
import torch.nn as nn

class TransformerDecoderBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        # batch_first=True so inputs are shaped (batch, seq_len, embed_dim)
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(ff_dim, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.norm3 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, tgt, memory, tgt_mask=None, memory_mask=None):
        # Masked self-attention over the target sequence
        tgt2 = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0]
        tgt = tgt + self.dropout(tgt2)
        tgt = self.norm1(tgt)
        # Cross-attention: target words attend to the encoder output (memory)
        tgt2 = self.cross_attn(tgt, memory, memory, attn_mask=memory_mask)[0]
        tgt = tgt + self.dropout(tgt2)
        tgt = self.norm2(tgt)
        # Position-wise feed-forward network
        tgt2 = self.ff(tgt)
        tgt = tgt + self.dropout(tgt2)
        tgt = self.norm3(tgt)
        return tgt

# Example usage
embed_dim = 512
num_heads = 8
ff_dim = 2048
decoder_block = TransformerDecoderBlock(embed_dim, num_heads, ff_dim)
tgt = torch.rand(64, 20, 512)     # (batch_size, tgt_seq_len, embed_dim)
memory = torch.rand(64, 30, 512)  # (batch_size, src_seq_len, embed_dim)
output = decoder_block(tgt, memory)
print(output.shape)  # Output: torch.Size([64, 20, 512])

5. Real-World Applications of Transformers:

Transformers are used for:

  • Machine Translation: Helping to translate text between languages accurately.
  • Text Summarization: Making short summaries of long articles.
  • Language Modeling: Predicting which words come next in a sentence (see the decoding sketch after this list).
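
All three applications share the same basic loop: feed the model what has been produced so far and pick the most likely next word. Here is a minimal greedy-decoding sketch that reuses the TransformerDecoderBlock defined above; the vocabulary size, embedding table, output projection, and start token are untrained placeholders for illustration, so the generated ids are meaningless until the model is trained.

import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 512
embedding = nn.Embedding(vocab_size, embed_dim)  # placeholder, untrained
to_vocab = nn.Linear(embed_dim, vocab_size)      # maps each state back to word scores
decoder_block = TransformerDecoderBlock(embed_dim, num_heads=8, ff_dim=2048)

memory = torch.rand(1, 30, embed_dim)  # stand-in for the encoded source sentence
generated = torch.tensor([[1]])        # start with a single start-of-sentence token id

for _ in range(5):
    tgt = embedding(generated)                        # (1, current_length, embed_dim)
    out = decoder_block(tgt, memory)                  # contextualized target states
    next_token = to_vocab(out[:, -1]).argmax(dim=-1)  # most likely next word id
    generated = torch.cat([generated, next_token.unsqueeze(0)], dim=1)

print(generated)  # token ids; a trained model would map these to real words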

6. Conclusion:

The Transformer was a game-changer in NLP because it models how every word in a sentence relates to every other word, which makes it very powerful for tasks like translation and summarization. It underpins many of today’s most advanced models, helping computers understand human language better than ever before.
