Production LLM Deployment
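Deploy and optimize LLMs for production: inference, latency, cost, and scaling. Serving LLMs at scale is challenging because autoregressive generation produces one token at a time, the KV cache grows with every request, and request lengths vary widely; this course covers the techniques that address those bottlenecks.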
Level: Advanced · Category: MLOps · Estimated time: 6 hours
Prerequisites
- Transformers & NLP
- MLOps & Model Deployment
Lessons
- LLM Inference Basics — Autoregressive generation, KV cache, and bottlenecks.
- Quantization for Inference — INT8, INT4, GPTQ, and AWQ.
- Continuous Batching — Dynamic batching for variable-length sequences.
- vLLM & TGI — Serving with vLLM and Hugging Face TGI.
- Speculative Decoding — Draft models and verification for faster generation.
- Production Patterns — Load balancing, fallbacks, monitoring, and cost.
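As a small preview of the KV cache and cost topics listed above, the following back-of-the-envelope sketch estimates how much memory the KV cache consumes. The function name `kv_cache_bytes` and the model dimensions (32 layers, 32 KV heads, head dimension 128, FP16) are illustrative assumptions for a 7B-class model, not values taken from the course.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_elem=2):
    """Rough KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values
    for every layer, KV head, and token position.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed, illustrative 7B-class configuration in FP16 (2 bytes per element).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)

print(per_token / 2**20, "MiB per token")
print(full / 2**30, "GiB for 8 sequences of 4096 tokens")
```

Under these assumptions the cache costs 0.5 MiB per token, so eight concurrent 4096-token sequences need about 16 GiB for the cache alone. That is why batching strategy and cache management tend to dominate serving cost.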
Topics covered
llm, deployment, vllm, inference, optimization