Text Generation & Summarization

Build systems that generate, summarize, and translate text — from seq2seq architectures to modern Hugging Face pipelines and evaluation metrics like BLEU and ROUGE.


So far we've focused on understanding text — classification, entity extraction, parsing. Now we tackle the other side of NLP: generating text. This includes:

  • Summarization — condense a long document into key points
  • Translation — convert text between languages
  • Text completion — generate continuations of a prompt
  • Paraphrasing — rewrite text in different words

All of these tasks share a common framework: given an input sequence, produce an output sequence. This is the sequence-to-sequence (seq2seq) paradigm.

    Sequence-to-Sequence Architecture

    The classic seq2seq model has two components:

    1. Encoder — reads the input sequence and compresses it into a fixed-size context vector
    2. Decoder — generates the output sequence one token at a time, conditioned on the context

    Think of it like a human translator: you read the entire French sentence (encoder), form an understanding in your mind (context vector), then produce the English translation word by word (decoder).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# --- Simple Seq2Seq with LSTM ---
# Encoder
encoder_inputs = layers.Input(shape=(None,), name="encoder_input")
encoder_embedding = layers.Embedding(input_dim=10000, output_dim=256)(encoder_inputs)
encoder_lstm = layers.LSTM(256, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
# state_h, state_c = the "context" passed to the decoder

# Decoder
decoder_inputs = layers.Input(shape=(None,), name="decoder_input")
decoder_embedding = layers.Embedding(input_dim=10000, output_dim=256)(decoder_inputs)
decoder_lstm = layers.LSTM(256, return_sequences=True, return_state=True)
# Initialize decoder with encoder's final hidden state
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=[state_h, state_c])
decoder_dense = layers.Dense(10000, activation="softmax")
output = decoder_dense(decoder_outputs)

model = models.Model([encoder_inputs, decoder_inputs], output)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()

# The encoder compresses the input into (state_h, state_c)
# The decoder uses that state to generate output token by token
```

    The Bottleneck Problem

    The basic seq2seq model compresses the entire input into a single fixed-size vector. For long sequences, this bottleneck loses information. **Attention** solves this by letting the decoder look back at all encoder hidden states at each generation step, focusing on the most relevant parts of the input.

    Attention in Seq2Seq

    Attention in the seq2seq context works as follows:

    1. The encoder produces a hidden state for every input token (not just the final one)
    2. At each decoder step, compute attention scores between the current decoder state and all encoder states
    3. Use these scores to create a weighted sum (context vector) of encoder states
    4. Concatenate this context with the decoder state to make the prediction

    This means the decoder can "look back" at the input, focusing on different parts at each step. When translating "Le chat noir" to "The black cat", the decoder focuses on "chat" when generating "cat" and on "noir" when generating "black".
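
These steps can be sketched in plain NumPy. The sketch below assumes simple dot-product scoring (Bahdanau- and Luong-style attention use learned scoring functions instead) and uses random toy states; `attention_step` is an illustrative name, not a library API:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(decoder_state, encoder_states):
    """One decoder step of dot-product attention.

    decoder_state: (d,) current decoder hidden state
    encoder_states: (T, d) one hidden state per input token
    """
    # 1. Score each encoder state against the decoder state (dot product)
    scores = encoder_states @ decoder_state      # shape (T,)
    # 2. Normalize scores into attention weights
    weights = softmax(scores)                    # shape (T,), sums to 1
    # 3. Context vector = weighted sum of encoder states
    context = weights @ encoder_states           # shape (d,)
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))   # 5 input tokens, hidden size 8
decoder_state = rng.normal(size=8)

context, weights = attention_step(decoder_state, encoder_states)
print("attention weights:", np.round(weights, 3))
print("context shape:", context.shape)
```

Step 4 (concatenating the context with the decoder state before the output projection) happens inside the model; here we stop at the context vector itself.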

    Decoding Strategies: Greedy vs. Beam Search

    Once we have a model that produces probability distributions over the vocabulary, how do we actually generate text?

    Greedy Decoding

    At each step, pick the single most probable token. Fast but often produces suboptimal sequences.

    Step 1: P("The") = 0.6, P("A") = 0.3 → pick "The"
    Step 2: P("cat") = 0.5, P("dog") = 0.4 → pick "cat"
    Result: "The cat"
    

    Beam Search

    Keep track of the top-k (beam width) most probable sequences at each step. Explores more possibilities.

    Beam width = 2:
    Step 1: Keep ["The" (0.6), "A" (0.3)]
    Step 2: Expand both:
      "The cat" (0.6 × 0.5 = 0.30)
      "The dog" (0.6 × 0.4 = 0.24)
      "A small"  (0.3 × 0.7 = 0.21)
      "A big"    (0.3 × 0.2 = 0.06)
    Keep top 2: ["The cat" (0.30), "The dog" (0.24)]
    

    Beam search finds better overall sequences because a locally suboptimal choice (like "A" instead of "The") might lead to a globally better sequence.
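
The bookkeeping behind beam search is simple to implement. The sketch below hard-codes a hypothetical next-token table (`NEXT`) that mirrors the numbers in the trace above, in place of a trained model:

```python
import heapq

# Toy "model": next-token probabilities keyed by the sequence so far.
# These values mirror the worked example above.
NEXT = {
    (): {"The": 0.6, "A": 0.3},
    ("The",): {"cat": 0.5, "dog": 0.4},
    ("A",): {"small": 0.7, "big": 0.2},
}

def beam_search(beam_width=2, steps=2):
    beams = [((), 1.0)]                          # (sequence, probability)
    for _ in range(steps):
        candidates = []
        for seq, prob in beams:
            for tok, p in NEXT.get(seq, {}).items():
                candidates.append((seq + (tok,), prob * p))
        # Keep only the top-k sequences by total probability
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    return beams

for seq, prob in beam_search():
    print(" ".join(seq), f"({prob:.2f})")
# "The cat" (0.30) and "The dog" (0.24) survive, matching the trace above
```

Greedy decoding is the special case `beam_width=1`: only the single most probable continuation survives each step.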

```python
from transformers import pipeline, set_seed

# --- Text generation with different strategies ---
generator = pipeline("text-generation", model="gpt2")
set_seed(42)

prompt = "Artificial intelligence will"

# Greedy decoding
greedy = generator(prompt, max_length=30, do_sample=False)
print("Greedy:", greedy[0]["generated_text"])

# Beam search (num_beams > 1)
beam = generator(prompt, max_length=30, num_beams=5, do_sample=False)
print("Beam:  ", beam[0]["generated_text"])

# Sampling with temperature (more creative)
sampled = generator(prompt, max_length=30, do_sample=True, temperature=0.7, top_p=0.9)
print("Sample:", sampled[0]["generated_text"])

# Lower temperature = more focused/deterministic
# Higher temperature = more random/creative
# top_p (nucleus sampling) = sample only from the smallest set of tokens
# whose cumulative probability reaches p
```
    Summarization: Extractive vs. Abstractive

    There are two fundamentally different approaches to summarization:

| Approach | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Extractive | Select the most important existing sentences | Always grammatically correct, faithful to source | Can be choppy, limited to source vocabulary |
| Abstractive | Generate new text that captures the key ideas | More natural, can rephrase and compress | May hallucinate or distort facts |

Modern systems often combine both: use extractive methods to identify key content, then use abstractive models to rephrase it fluently.

```python
from transformers import pipeline
import numpy as np

# --- Extractive Summarization (simple TF-IDF approach) ---
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extractive_summarize(text, num_sentences=3):
    """Select the most representative sentences using TF-IDF."""
    sentences = text.split(". ")
    sentences = [s.strip() for s in sentences if len(s.strip()) > 10]

    if len(sentences) <= num_sentences:
        return ". ".join(sentences)

    # Compute TF-IDF for each sentence
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences)

    # Score each sentence by similarity to the overall document
    # (np.asarray converts away from np.matrix, which .mean() returns
    # and which newer scikit-learn versions reject)
    doc_vector = np.asarray(tfidf_matrix.mean(axis=0))
    scores = cosine_similarity(tfidf_matrix, doc_vector).flatten()

    # Select top sentences (maintain original order)
    top_indices = sorted(np.argsort(scores)[-num_sentences:])
    summary = ". ".join([sentences[i] for i in top_indices])
    return summary

# --- Abstractive Summarization (Hugging Face) ---
abstractive = pipeline("summarization", model="facebook/bart-large-cnn")

article = """
The global semiconductor shortage that began in 2020 has had far-reaching
consequences across multiple industries. Automakers were among the hardest
hit, with major manufacturers like Toyota, Ford, and Volkswagen forced to
cut production by millions of vehicles. The shortage was triggered by a
perfect storm of factors: pandemic-driven factory shutdowns, a surge in
demand for consumer electronics as people worked from home, and the
inherently long lead times required to build new chip fabrication plants.
Governments responded with massive investment programs. The US passed the
CHIPS Act, allocating $52 billion for domestic semiconductor manufacturing.
The European Union announced a similar European Chips Act worth 43 billion
euros. These investments aim to reduce dependence on Asian manufacturers,
particularly Taiwan's TSMC, which produces over 50% of the world's advanced
chips. Industry analysts expect the shortage to fully resolve by 2025, but
the geopolitical implications of semiconductor supply chain concentration
will persist for decades.
"""

# Extractive
print("=== Extractive Summary ===")
print(extractive_summarize(article, num_sentences=2))

# Abstractive
print("\n=== Abstractive Summary ===")
result = abstractive(article, max_length=80, min_length=30)
print(result[0]["summary_text"])
```

    Evaluation Metrics for Generated Text

    How do we measure the quality of generated text? Several metrics exist, each with different strengths:

    BLEU (Bilingual Evaluation Understudy)

  • Originally designed for machine translation
  • Measures precision — how much of the generated text appears in the reference
  • Computes n-gram overlap (unigrams, bigrams, trigrams, 4-grams)
  • Includes a brevity penalty to penalize overly short outputs
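
The bullets above can be turned into a compact reference implementation. This is a simplified sketch (single sentence, single reference; `bleu` and `ngram_counts` are illustrative names), not the full corpus-level BLEU:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clipped precision: each candidate n-gram counts at most as often
        # as it appears in the reference
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0   # any zero n-gram precision zeroes the geometric mean
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

ref = "the cat sat on the mat".split()
cand = "the cat is on the mat".split()
print(f"BLEU-2: {bleu(cand, ref, max_n=2):.4f}")  # BLEU-2: 0.7071
```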

    ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

  • Designed for summarization
  • Measures recall — how much of the reference text is captured in the generated text
  • Variants: ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence)
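
ROUGE-L can be sketched directly from its definition. This is a simplified single-reference version without the stemming and tokenization that the rouge_score package applies; `rouge_l` is an illustrative name:

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, generated):
    ref, gen = reference.split(), generated.split()
    lcs = lcs_length(ref, gen)
    precision = lcs / len(gen)   # fraction of generated tokens in the LCS
    recall = lcs / len(ref)      # fraction of reference tokens in the LCS
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1

p, r, f1 = rouge_l("the cat sat on the mat", "the cat is on the mat")
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.833 R=0.833 F1=0.833
```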

    BERTScore

  • Uses BERT embeddings to compute semantic similarity
  • Can detect paraphrases that BLEU/ROUGE miss
  • More aligned with human judgment but computationally expensive

| Metric | Focus | Best For | Limitation |
| --- | --- | --- | --- |
| BLEU | Precision (n-gram) | Translation | Penalizes valid paraphrases |
| ROUGE | Recall (n-gram) | Summarization | Doesn't capture meaning |
| BERTScore | Semantic similarity | Any generation | Computationally expensive |

    BLEU for Translation, ROUGE for Summarization

    BLEU measures precision — is the generated translation correct? This matters for translation where you want exact word choices. ROUGE measures recall — did the summary capture the key content from the reference? This matters for summarization where you want to ensure important information isn't missed. In practice, always report both precision and recall variants.

```python
# --- BLEU Score ---
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

# BLEU with different n-gram weights
bleu_1 = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))      # Unigrams only
bleu_2 = sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0))  # Up to bigrams
# Default weights use 1-4 grams equally; this pair shares no 4-grams, so
# smoothing prevents the score from collapsing to zero on short sentences
bleu_4 = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)

print(f"BLEU-1: {bleu_1:.4f}")
print(f"BLEU-2: {bleu_2:.4f}")
print(f"BLEU-4: {bleu_4:.4f}")

# --- ROUGE Score ---
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference_text = "The cat sat on the mat near the window"
generated_text = "A cat was sitting on the mat"

scores = scorer.score(reference_text, generated_text)
for metric, score in scores.items():
    print(f"{metric}: Precision={score.precision:.4f}, Recall={score.recall:.4f}, F1={score.fmeasure:.4f}")

# --- BERTScore ---
# pip install bert-score
from bert_score import score as bert_score

references = ["The cat sat on the mat"]
candidates = ["A feline rested on the rug"]  # Paraphrase!

P, R, F1 = bert_score(candidates, references, lang="en")
print(f"\nBERTScore — P: {P.mean():.4f}, R: {R.mean():.4f}, F1: {F1.mean():.4f}")
# BERTScore stays high despite different words (captures semantics)
```

    Translation Pipelines

    Modern translation is straightforward with pre-trained models:

```python
from transformers import pipeline

# Translation with Hugging Face
translator_en_fr = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
translator_en_de = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")

text = "Machine learning is transforming how we process natural language."

fr_result = translator_en_fr(text)
de_result = translator_en_de(text)

print(f"English:  {text}")
print(f"French:   {fr_result[0]['translation_text']}")
print(f"German:   {de_result[0]['translation_text']}")

# --- Evaluate translation quality ---
from nltk.translate.bleu_score import sentence_bleu

# If we have reference translations
reference_fr = "L'apprentissage automatique transforme notre façon de traiter le langage naturel".split()
generated_fr = fr_result[0]["translation_text"].split()

bleu = sentence_bleu([reference_fr], generated_fr)
print(f"\nBLEU score: {bleu:.4f}")
```

    Limitations of Automatic Metrics

    No automatic metric perfectly correlates with human judgment. BLEU and ROUGE rely on exact n-gram matching, so they miss valid paraphrases ("car" vs. "automobile"). BERTScore handles paraphrases but can be fooled by grammatically incorrect text that uses the right words. Always supplement automatic metrics with human evaluation for critical applications.

    Practical Text Generation with Hugging Face

    Here's a comprehensive example showing different generation tasks:

```python
from transformers import pipeline

# --- Summarization ---
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = """
Scientists have discovered a new species of deep-sea fish in the Mariana
Trench. The fish, named Pseudoliparis swirei, was found at a depth of
8,178 meters, making it the deepest-living fish ever recorded. The
discovery was made using autonomous underwater vehicles equipped with
cameras and traps. The fish has a translucent body and lacks scales,
adaptations that help it survive the extreme pressure at such depths.
Researchers believe studying this species could provide insights into
how life adapts to extreme environments.
"""

summary = summarizer(article, max_length=60, min_length=20)
print("Summary:", summary[0]["summary_text"])

# --- Question Answering ---
qa = pipeline("question-answering")
result = qa(question="At what depth was the fish found?", context=article)
print(f"\nAnswer: {result['answer']} (confidence: {result['score']:.4f})")

# --- Text Generation (completion) ---
generator = pipeline("text-generation", model="gpt2")
prompt = "The future of artificial intelligence depends on"
output = generator(prompt, max_length=50, num_return_sequences=2, temperature=0.8)
print("\nGenerated continuations:")
for i, seq in enumerate(output):
    print(f"  {i+1}. {seq['generated_text']}")
```