How to deploy a Hugging Face model
Category: AI & Machine Learning
Short answer
Hugging Face offers several deployment pathways: a serverless Inference API, dedicated inference endpoints, interactive demos in Spaces, and self-hosted containers.
Steps
- Push your model to the Hugging Face Hub together with a model card that documents its intended use and limitations.
- Use the Inference API for immediate serverless access without managing infrastructure.
- Deploy dedicated inference endpoints for production workloads that need autoscaling.
- Build interactive demos with Gradio or Streamlit inside Hugging Face Spaces.
- Package the model and a serving script built on the transformers library into a container image for deployment on Kubernetes or other cloud platforms.
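The serverless option in the steps above can be sketched as a plain HTTP call. This is a minimal sketch, assuming the hosted Inference API accepts a JSON body with an `inputs` field and a bearer token at `https://api-inference.huggingface.co/models/<model_id>`. The base URL, the helper names `build_inference_request` and `query`, the model id, and the token placeholder are all illustrative assumptions, so check the current Hugging Face documentation before relying on them.

```python
import json
import urllib.request

# Assumed base URL for the hosted Inference API; verify against current docs.
API_BASE = "https://api-inference.huggingface.co/models"


def build_inference_request(model_id: str, token: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a hosted model."""
    payload = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/{model_id}",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def query(model_id: str, token: str, text: str, timeout: float = 30.0):
    """Send the request and decode the JSON response (needs network access)."""
    request = build_inference_request(model_id, token, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical model id and placeholder token, for illustration only.
request = build_inference_request("my-user/my-model", "hf_xxx", "Hello world")
print(request.full_url)
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without network access; `query` is only needed once a real token and model id are in place.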
Tips
- Quantize models to int8, or load them in float16 half precision, to reduce memory use and often latency on endpoints.
- Use text-generation-inference for serving large language models, or optimum for hardware-specific optimizations.
- Cache model weights where the platform allows it, so that cold starts do not have to download the model again.
- Monitor endpoint logs and latency to right-size compute resources.
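To make the precision tip concrete, the memory needed for model weights can be estimated as parameter count times bytes per parameter. The sketch below is back-of-the-envelope arithmetic only: it assumes roughly 110 million parameters (about the size of `bert-base-uncased`), ignores activations and runtime overhead, and the helper name `weight_memory_mb` is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Approximate memory for model weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# Assumed parameter count, roughly the size of a BERT-base model.
NUM_PARAMS = 110_000_000

for dtype in ("float32", "float16", "int8"):
    print(f"{dtype}: ~{weight_memory_mb(NUM_PARAMS, dtype):.0f} MB")
```

Halving the precision halves the weight memory. Whether latency also improves depends on whether the target hardware has fast kernels for that precision, so it is worth measuring on the actual endpoint.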
Common issues
- Cold start latency on serverless endpoints, especially for large models that take a long time to load.
- Token limits and timeout restrictions on the free-tier Inference API.
- Version mismatches between the installed transformers library and the version the model configuration was saved with.
- High costs from keeping large GPU endpoints always running.
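Cold starts and timeouts are often handled on the client side with retries. The following is a minimal sketch of exponential backoff, assuming the endpoint signals a still-loading model with an error the caller can catch. `ModelLoadingError`, `call_with_backoff`, and the delay values are hypothetical names and numbers chosen for illustration, not part of any Hugging Face library.

```python
import time


class ModelLoadingError(Exception):
    """Hypothetical error raised when the endpoint is still warming up."""


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return send()
        except ModelLoadingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Simulated endpoint that fails twice while "loading", then answers.
calls = {"count": 0}


def fake_send():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ModelLoadingError("model is loading")
    return {"label": "ok"}


# Record the requested waits instead of actually sleeping.
waits = []
result = call_with_backoff(fake_send, sleep=waits.append)
print(result, waits)
```

In real use, `send` would wrap the HTTP call and translate the service's "model loading" response into the exception. Passing `sleep` as a parameter keeps the retry logic testable without real delays.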
Example
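import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# The classification head is newly initialized here; fine-tune before trusting the logits.
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')
model.eval()  # disable dropout for inference
inputs = tokenizer('Hello world', return_tensors='pt')
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)
This example loads a pretrained BERT model and tokenizer, then runs a forward pass on sample text using PyTorch tensors. Because bert-base-uncased ships without a fine-tuned classification head, the head is randomly initialized and the resulting logits are not meaningful until the model is fine-tuned or replaced with a fine-tuned checkpoint.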
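The forward pass returns raw logits rather than probabilities. As a hedged illustration of the post-processing step, the sketch below applies a softmax in plain Python to a made-up pair of logits. The values and the helper name `softmax` are assumptions for illustration; in PyTorch the same step is typically done with `torch.softmax(outputs.logits, dim=-1)`.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a two-class classifier, for illustration only.
example_logits = [1.2, -0.3]
probs = softmax(example_logits)
print(probs)
```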