
Transformers Library Quickstart

Pipeline API, AutoModel/AutoTokenizer, inference patterns, and batch processing

~40 min


Hugging Face's transformers library is the most popular open-source library for working with pre-trained transformer models. It provides a unified API for thousands of models across NLP, computer vision, audio, and multimodal tasks.

Installation

```shell
pip install transformers torch datasets accelerate
```

The library supports PyTorch, TensorFlow, and JAX backends. We'll use PyTorch throughout this module.

The Transformers Philosophy

Hugging Face Transformers provides three levels of abstraction: (1) Pipeline API for quick inference, (2) AutoModel + AutoTokenizer for flexible model usage, and (3) direct model classes for full control. Start simple and go deeper only when needed.

The Pipeline API

The pipeline() function is the simplest way to use pre-trained models. It handles tokenization, model inference, and post-processing in a single call.

Sentiment Analysis

```python
from transformers import pipeline

# Create a sentiment analysis pipeline
classifier = pipeline("sentiment-analysis")

# Single prediction
result = classifier("I love learning about AI!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Batch prediction
results = classifier([
    "This movie was terrible.",
    "The food was absolutely delicious!",
    "I'm not sure how I feel about this.",
])
for r in results:
    print(f"{r['label']}: {r['score']:.4f}")
```

Text Summarization

```python
summarizer = pipeline("summarization")

article = """
Hugging Face has become the central hub for machine learning models.
Founded in 2016, the company initially built a chatbot app before
pivoting to become the GitHub of machine learning. Their Transformers
library supports over 200,000 models and is used by thousands of
organizations. The platform hosts models, datasets, and Spaces for
demo applications.
"""

summary = summarizer(article, max_length=50, min_length=20)
print(summary[0]['summary_text'])
```

Named Entity Recognition (NER)

```python
ner = pipeline("ner", aggregation_strategy="simple")

text = "Elon Musk founded SpaceX in Hawthorne, California."
entities = ner(text)
for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.3f})")

# Elon Musk: PER (0.998)
# SpaceX: ORG (0.995)
# Hawthorne: LOC (0.993)
# California: LOC (0.997)
```

Question Answering

```python
qa = pipeline("question-answering")

context = """
The transformer architecture was introduced in the 2017 paper
'Attention Is All You Need' by Vaswani et al. It replaced recurrent
layers with self-attention mechanisms, enabling massive parallelization
and leading to models like BERT and GPT.
"""

answer = qa(
    question="Who introduced the transformer architecture?",
    context=context,
)
print(f"Answer: {answer['answer']} (score: {answer['score']:.3f})")

# Answer: Vaswani et al (score: 0.892)
```

Zero-Shot Classification

zero_shot = pipeline("zero-shot-classification")

result = zero_shot( "I just got promoted to senior engineer!", candidate_labels=["career", "health", "sports", "technology"] ) print(f"Labels: {result['labels']}") print(f"Scores: {[f'{s:.3f}' for s in result['scores']]}")

Labels: ['career', 'technology', 'sports', 'health']

Scores: ['0.891', '0.067', '0.024', '0.018']

Translation

```python
translator = pipeline("translation_en_to_fr")
result = translator("Machine learning is transforming every industry.")
print(result[0]['translation_text'])

# L'apprentissage automatique transforme chaque industrie.
```

Text Generation

```python
generator = pipeline("text-generation", model="gpt2")

output = generator(
    "The future of artificial intelligence",
    max_new_tokens=50,
    num_return_sequences=1,
    temperature=0.7,
)
print(output[0]['generated_text'])
```

Specifying Models

Every pipeline uses a default model, but you can specify any compatible model from the Hub: pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment"). This lets you swap models without changing your code.

AutoModel and AutoTokenizer

When you need more control than the pipeline provides, use AutoModel and AutoTokenizer directly. This is the standard approach for production code.

The Three Auto Classes

```python
from transformers import AutoTokenizer, AutoModel, AutoConfig

model_name = "bert-base-uncased"

# Load just the config (no weights downloaded)
config = AutoConfig.from_pretrained(model_name)
print(f"Hidden size: {config.hidden_size}")        # 768
print(f"Num layers: {config.num_hidden_layers}")   # 12
print(f"Num heads: {config.num_attention_heads}")  # 12

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model
model = AutoModel.from_pretrained(model_name)
```

Manual Inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize input
text = "I absolutely love this product!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

print(f"Input IDs shape: {inputs['input_ids'].shape}")
print(f"Attention mask shape: {inputs['attention_mask'].shape}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])}")

# Run inference (no gradient computation needed)
model.eval()
with torch.no_grad():
    outputs = model(**inputs)

# Process logits
logits = outputs.logits
probabilities = torch.softmax(logits, dim=-1)
predicted_class = torch.argmax(probabilities, dim=-1).item()

labels = model.config.id2label
print(f"Prediction: {labels[predicted_class]}")
print(f"Confidence: {probabilities[0][predicted_class]:.4f}")
```
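The softmax/argmax step above needs no library at all; a plain-Python sketch of the same math, using made-up logits for a two-class sentiment head, shows how raw scores become a probability and a predicted class:

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for [NEGATIVE, POSITIVE] from a sentiment head
logits = [-2.0, 3.5]
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=lambda i: probs[i])

print([round(p, 4) for p in probs])  # [0.0041, 0.9959]
print(predicted_class)               # 1, i.e. POSITIVE
```

This is exactly what `torch.softmax` and `torch.argmax` do in the snippet above, just vectorized over the batch dimension.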

AutoModel Variants

Use the right AutoModel subclass for your task:

- AutoModel: Base model, returns hidden states
- AutoModelForSequenceClassification: Text classification
- AutoModelForTokenClassification: NER, POS tagging
- AutoModelForQuestionAnswering: Extractive QA
- AutoModelForCausalLM: Text generation (GPT-style)
- AutoModelForSeq2SeqLM: Translation, summarization (T5-style)
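In code that handles several tasks, this mapping can be kept as a small lookup table. The class names below are the real transformers classes, but the table and helper themselves are just an illustrative sketch, not part of the library:

```python
# Maps a task name to the transformers Auto class name to import
AUTO_CLASS_FOR_TASK = {
    "feature-extraction": "AutoModel",
    "text-classification": "AutoModelForSequenceClassification",
    "token-classification": "AutoModelForTokenClassification",
    "question-answering": "AutoModelForQuestionAnswering",
    "text-generation": "AutoModelForCausalLM",
    "translation": "AutoModelForSeq2SeqLM",
    "summarization": "AutoModelForSeq2SeqLM",
}

def auto_class_for(task: str) -> str:
    """Return the Auto class name for a task, or raise a helpful error."""
    try:
        return AUTO_CLASS_FOR_TASK[task]
    except KeyError:
        raise ValueError(
            f"Unknown task: {task!r}. Known tasks: {sorted(AUTO_CLASS_FOR_TASK)}"
        )

print(auto_class_for("text-generation"))  # AutoModelForCausalLM
```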

Batch Processing

For efficiency, always batch your inputs when processing multiple texts:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

texts = [
    "This is fantastic!",
    "Terrible experience.",
    "Pretty average, nothing special.",
    "Best purchase I've ever made!",
    "Would not recommend to anyone.",
]

# Tokenize as a batch - padding ensures uniform length
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,     # Pad to longest in batch
    truncation=True,  # Truncate if over max length
    max_length=128,
)

with torch.no_grad():
    outputs = model(**inputs)  # Unpack the dict into keyword arguments
    probs = torch.softmax(outputs.logits, dim=-1)
    predictions = torch.argmax(probs, dim=-1)

for text, pred, prob in zip(texts, predictions, probs):
    label = model.config.id2label[pred.item()]
    confidence = prob[pred.item()].item()
    print(f"[{label} {confidence:.2f}] {text}")
```
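For datasets too large to tokenize and run in a single call, the usual pattern is to process fixed-size chunks. A minimal, framework-agnostic chunking helper (the batch size of 32 is just an illustrative default):

```python
def batched(items, batch_size=32):
    """Yield successive fixed-size chunks from a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"review {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in batched(texts, batch_size=32)]
print(batch_sizes)  # [32, 32, 6]
```

Within each chunk you would call the tokenizer and model exactly as above. Chunking also means padding only reaches the longest text in that chunk rather than in the whole dataset, which saves compute when lengths vary.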

Device Management

Move models and inputs to GPU for faster inference:

```python
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    pipeline,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)

# Inputs must also be on the same device
text = "Great movie!"
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

# For pipelines, use the device argument
classifier = pipeline(
    "sentiment-analysis",
    device=0,  # GPU index, or -1 for CPU
)
```

Memory Considerations

Large models can exhaust GPU memory. Use model.half() for FP16 inference to halve memory usage, or use device_map='auto' with accelerate to automatically split a model across multiple GPUs or offload to CPU/disk.
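As a back-of-the-envelope check of why FP16 helps, weight memory is roughly parameter count times bytes per parameter. The sketch below ignores activations, KV caches, and framework overhead, and the 7B parameter count is just an illustrative example:

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Approximate weight memory in GiB: params * bytes per param."""
    return num_params * bytes_per_param / (1024 ** 3)

params_7b = 7_000_000_000
print(f"FP32: {weight_memory_gib(params_7b, 4):.1f} GiB")  # FP32: 26.1 GiB
print(f"FP16: {weight_memory_gib(params_7b, 2):.1f} GiB")  # FP16: 13.0 GiB
```

Halving bytes per parameter halves weight memory exactly, which is why `model.half()` often makes the difference between fitting on one GPU and needing `device_map='auto'`.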