Streamlit + Ollama + Gemma: The Local LLM Stack.

When I first read “Attention Is All You Need,” I felt like someone had opened a window into the future of language modeling. That paper introduced the Transformer, and everything changed. GPT, BERT, T5: all of them descend from the same elegant idea, attention. Unlike RNNs, Transformers don’t have to read text one word at a time. They can look at every word at once, figure out what matters, and build up context in a way that feels almost human.

1. Attention: The Core Idea

Think of attention as the model’s ability to “focus” on what matters most. In a sentence, not all words are equally important — attention lets the model weigh each word’s relevance. Every word is converted into three vectors:

  • Q (Query) – what this word is searching for

  • K (Key) – what the word represents

  • V (Value) – the actual information carried by the word

The model compares Queries (Q) to Keys (K) to measure relevance and then mixes the Values (V) accordingly. The formula looks intimidating but is actually simple:

Attention(Q, K, V) = softmax((Q × Kᵀ) / √dₖ) × V

I love this example: in the sentence “The cat chased the mouse because it was hungry,” attention helps the model figure out that “it” refers to “the cat,” not “the mouse.” Suddenly, pronouns make sense!
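
To make the attention formula above concrete, here is a minimal pure-Python sketch. The Q, K, and V values are hand-picked toy numbers rather than anything a trained model learned, and the helper names (softmax, attention) are my own for illustration.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax((Q x K^T) / sqrt(d_k)) x V for lists of row vectors."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Compare this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Mix the value vectors using the attention weights.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return output, weights

# Toy vectors for three tokens (assumed values, not learned ones).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token, and each row sums to 1. In a real model, Q, K, and V come from learned projections of the word embeddings.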

2. Encoder and Decoder: Two Sides of the Same Coin

Transformers have two main parts:

  • Encoder – reads and understands the input
  • Decoder – generates new text based on the encoder’s understanding

Depending on the task, models use different architectures:

  • Encoder-only (BERT): Excellent at understanding text. Ideal for classification or Q&A, but doesn’t generate text.
  • Decoder-only (GPT): Reads left-to-right and predicts the next word. Perfect for story writing, summarization, or chat.
  • Encoder–Decoder (T5/BART): Encoder digests input, decoder produces output. Great for translation or summarization.
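
One concrete way to see the difference between these designs is the attention mask. The sketch below is a simplified illustration, assuming equal raw scores for every pair of words: an encoder-style layer lets each word attend to all words, while a decoder-style layer hides future words before the softmax. The function name attention_weights is my own. In an encoder–decoder model, the decoder also attends to the encoder's output, which this sketch leaves out.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    """Row i holds how much token i attends to every token j."""
    weights = []
    for i, row in enumerate(scores):
        if causal:
            # Decoder-style: hide future positions (j > i) before the softmax.
            row = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        weights.append(softmax(row))
    return weights

tokens = ["The", "cat", "sat"]
scores = [[0.0] * len(tokens) for _ in tokens]  # equal scores, purely illustrative

encoder_style = attention_weights(scores, causal=False)
decoder_style = attention_weights(scores, causal=True)

print("encoder-style:", encoder_style)
print("decoder-style:", decoder_style)
```

With the causal mask, the first word can only attend to itself, the second word splits its attention over the first two words, and only the last word sees the whole sentence.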

My favorite moment experimenting with T5 was giving it a paragraph and watching it produce a perfect, human-like summary in seconds — it felt magical.

3. Multi-Head Attention: Seeing From Multiple Angles

Transformers don’t just pay attention once — they do it many times in parallel. Each “head” of multi-head attention focuses on different aspects of the text:

  • One head might track which words are subjects or objects
  • Another might track distant dependencies, like pronouns referring to nouns far away

Combining heads gives the model a nuanced understanding of language. It’s like reading a sentence with multiple sets of eyes, each focusing on a different relationship.
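
Here is a rough sketch of the multi-head idea in plain Python. It makes a big simplifying assumption: the learned projection matrices of a real Transformer are left out, so each head simply runs attention on its own slice of the input vectors (Q = K = V = that slice). The function names and the toy input X are mine.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def split_heads(X, num_heads):
    """Slice each token vector into num_heads equal chunks."""
    d_head = len(X[0]) // num_heads
    return [[x[h * d_head:(h + 1) * d_head] for x in X] for h in range(num_heads)]

def multi_head_attention(X, num_heads):
    # Assumption: no learned projections, so every head uses its slice as Q, K, and V.
    head_outputs = [attention(h, h, h) for h in split_heads(X, num_heads)]
    # Concatenate the per-head results back into one vector per token.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(X))]

# Three tokens with 4-dimensional toy vectors (assumed values).
X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(X, num_heads=2)
for row in out:
    print([round(v, 3) for v in row])
```

Each head works with a different part of the representation, so it can end up with its own attention pattern. The output keeps the original vector size because the head results are concatenated.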

4. Transformers in Action: Examples

Encoder-Only Model (BERT)

Task: Question answering
Input: “The Eiffel Tower is located in Paris. What city is the Eiffel Tower in?”
Output: “Paris”
Key point: Encoder-only models are like careful readers — they understand text but don’t write new sentences.

Decoder-Only Model (GPT)

Task: Story generation
Input: “Once upon a time, there was a little cat named Whiskers. She loved to…”
Output: “…play with yarn in the sunny garden every morning.”
Key point: Decoder-only models excel at creating fluent, human-like text.
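
The "predict the next word, append it, repeat" loop behind decoder-only generation can be sketched in a few lines. The lookup table below is entirely hypothetical and stands in for a trained model. A real decoder conditions on the whole preceding text, not just the last word, and usually works on subword tokens.

```python
# Hypothetical next-word scores standing in for a trained decoder-only model.
NEXT_WORD_SCORES = {
    "to": {"play": 0.6, "sleep": 0.4},
    "play": {"with": 0.9, "outside": 0.1},
    "with": {"yarn": 0.7, "toys": 0.3},
    "yarn": {"<end>": 1.0},
}

def generate(prompt_words, max_new_words=10):
    """Greedy decoding: repeatedly append the highest-scoring next word."""
    words = list(prompt_words)
    for _ in range(max_new_words):
        candidates = NEXT_WORD_SCORES.get(words[-1])
        if not candidates:
            break  # the toy table has no continuation for this word
        best = max(candidates, key=candidates.get)
        if best == "<end>":
            break  # the "model" decided the text is finished
        words.append(best)
    return words

print(" ".join(generate(["She", "loved", "to"])))
# prints: She loved to play with yarn
```

Greedy decoding always takes the top-scoring word. Real systems often sample from the scores instead, which is one reason the same prompt can produce different stories.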

Encoder–Decoder Model (T5/BART)

Task: Summarization
Input: “The city council announced a new plan to reduce traffic in downtown areas. The plan includes new bike lanes, improved public transport, and restrictions on certain roads during peak hours. Citizens are encouraged to use alternative transport.”
Output: “The city council plans to reduce downtown traffic by adding bike lanes, improving public transport, and restricting certain roads.”
Key point: Encoder-decoder models combine deep understanding with fluent generation, making them ideal for translation and summarization.

5. How Transformers Really Work: Step by Step

Sentence: “The cat chased the mouse.”

  • Tokenization: ["The", "cat", "chased", "the", "mouse"]
  • Embedding: Convert words to vectors representing meaning, and add positional information so that word order is not lost
  • Q, K, V: Transform embeddings into Query, Key, Value vectors
  • Attention Scores: Compare Queries to Keys to find relevance
  • Weight Values: Mix Values based on attention scores
  • Multi-Head Attention: Multiple heads capture different relationships
  • Feed-Forward Layers: Further refine representation
  • Output: Encoders produce contextual vectors; decoders generate text

Result: The model understands “the cat chased the mouse” correctly — the cat is performing the action, not the mouse.
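
The steps above can be walked through in a toy, untrained sketch. Everything here is an assumption made for illustration: the weights are random (seeded for reproducibility), the embedding size D is arbitrary, tokenization is a plain whitespace split, and there is only one attention head with no positional information.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained toy weights are reproducible
D = 4  # embedding size, chosen arbitrarily for this sketch

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# 1. Tokenization (whitespace split; real tokenizers use subword units)
tokens = "The cat chased the mouse".split()

# 2. Embedding: one random vector per distinct lowercase word
vocab = {}
for t in tokens:
    vocab.setdefault(t.lower(), [random.uniform(-1, 1) for _ in range(D)])
X = [vocab[t.lower()] for t in tokens]

# 3. Q, K, V: project the embeddings with (random, untrained) weight matrices
W_q, W_k, W_v = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)
Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)

# 4. Attention scores: compare every query with every key, then softmax
K_T = [list(col) for col in zip(*K)]
scores = [[s / math.sqrt(D) for s in row] for row in matmul(Q, K_T)]
weights = [softmax(row) for row in scores]

# 5. Weight values: mix the value vectors using the attention weights
context = matmul(weights, V)

# 6. Feed-forward: a single linear layer followed by ReLU
W_ff = rand_matrix(D, D)
hidden = [[max(0.0, h) for h in row] for row in matmul(context, W_ff)]

for token, vec in zip(tokens, hidden):
    print(token, [round(v, 3) for v in vec])
```

Because this sketch adds no positional information, “The” and “the” share an embedding and come out with identical vectors. That is exactly why real Transformers add positional encodings to the embeddings before attention.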

6. Why Transformers Changed Everything

Attention lets models capture long-range dependencies and contextual relationships. Because every position in a sequence is processed in parallel, training is much faster than with sequential RNNs. With attention as the core mechanism, a single architecture can summarize, translate, answer questions, and generate text.

Personally, experimenting with these models has been eye-opening. Watching a Transformer generate text that’s coherent, context-aware, and sometimes even witty feels almost magical. And to think it all comes from this simple Q, K, V mechanism — attention really is all you need.