How to deploy a Hugging Face model

Category: AI & Machine Learning

Short answer

Hugging Face offers several deployment paths: the serverless Inference API, dedicated Inference Endpoints, interactive demos in Spaces, and self-hosted containers.

Steps

  1. Push your model to the Hugging Face Hub along with a model card that documents it.
  2. Use the Inference API for immediate serverless access without managing infrastructure.
  3. Deploy dedicated Inference Endpoints for production workloads that need autoscaling.
  4. Build interactive demos with Gradio or Streamlit inside Hugging Face Spaces.
  5. Containerize an application that loads the model with the transformers library for deployment on Kubernetes or other cloud platforms.
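
As an illustration of step 5, the application inside the container can be a small HTTP service wrapped around a transformers pipeline. The following is a minimal sketch, not a production server: the function names (load_predictor, handle_request, make_handler, serve), the port, the JSON shape with an "inputs" field, and the example model id are all assumptions chosen for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def load_predictor(model_id="distilbert-base-uncased-finetuned-sst-2-english"):
    """Load a transformers pipeline. The model id is only an example; use your own."""
    # Imported lazily so the rest of the module works without transformers installed.
    from transformers import pipeline
    return pipeline("text-classification", model=model_id)


def handle_request(raw_body, predict):
    """Validate a JSON request body and return (status_code, response_object)."""
    try:
        text = json.loads(raw_body)["inputs"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with an 'inputs' field"}
    return 200, predict(text)


def make_handler(predict):
    """Build an HTTP handler class around any callable that maps text to a result."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, body = handle_request(self.rfile.read(length), predict)
            data = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return Handler


def serve(predict, host="0.0.0.0", port=8080):
    """Block forever, answering POST requests with model predictions."""
    HTTPServer((host, port), make_handler(predict)).serve_forever()
```

Calling serve(load_predictor()) as the container's entry point would load the model once at startup and then answer POST requests. Keeping handle_request separate from the HTTP plumbing makes the validation logic easy to test with a stub in place of the real model.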

Tips

  • Quantize models to int8, or load them in float16 (half precision), to reduce latency and memory on endpoints.
  • Use text-generation-inference (TGI) to serve text-generation models, or optimum for hardware-optimized inference.
  • Enable automatic model caching on endpoints to speed up cold starts.
  • Monitor endpoint logs and latency to right-size compute resources.
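
To see why lower precision helps, a rough estimate of weight memory is the parameter count multiplied by the bytes per parameter. The sketch below assumes a model of about 110 million parameters (roughly the size of BERT base) and counts weights only; activations, runtime buffers, and framework overhead are ignored. The helper name weight_memory_mb is made up for this example.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mb(num_params, dtype):
    """Approximate memory for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# Assumed parameter count, roughly that of a BERT-base-sized model.
NUM_PARAMS = 110_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_mb(NUM_PARAMS, dtype):.0f} MB")
```

Under these assumptions the weights take about 440 MB in float32, 220 MB in float16, and 110 MB in int8, which is why reduced precision can let the same model fit on a smaller instance.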

Common issues

  • Cold start latency on serverless endpoints for large models.
  • Token limits and timeout restrictions on the free-tier Inference API.
  • Version mismatches between the installed transformers library and the model's configuration.
  • High costs from keeping large GPU endpoints always running.
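
One common way to soften cold starts on the client side is to retry with exponential backoff while the model is still loading. The sketch below is generic and makes assumptions: ModelLoadingError is a hypothetical exception that your own client code would raise when the service reports that the model is not ready yet, and request_fn stands for whatever function sends the request.

```python
import time


class ModelLoadingError(Exception):
    """Hypothetical error raised by your client while the endpoint is warming up."""


def call_with_backoff(request_fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request_fn, retrying on ModelLoadingError with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except ModelLoadingError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... by default
```

The sleep parameter is injectable so the retry logic can be exercised in tests without actually waiting. Retrying helps with transient warm-up delays only; it does not address token limits or the cost of always-on endpoints.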

Example

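import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and weights; the classification head here is untrained.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')
model.eval()  # switch off dropout for inference

inputs = tokenizer('Hello world', return_tensors='pt')
with torch.no_grad():  # gradients are not needed for a forward pass
    outputs = model(**inputs)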

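This example loads the pretrained BERT tokenizer and weights, then runs a forward pass on sample text using PyTorch tensors. Because bert-base-uncased has not been fine-tuned for classification, its classification head is randomly initialized, so the output logits only become meaningful after fine-tuning or after swapping in a fine-tuned checkpoint.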