Production LLM Deployment

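Deploy and optimize LLMs for production: inference, latency, cost, and scaling. Serving LLMs at scale is hard because generation is autoregressive, the KV cache competes with model weights for GPU memory, and request lengths vary widely.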

Level: Advanced · Category: MLOps · Estimated time: 6 hours

Prerequisites

Transformers & NLP
MLOps & Model Deployment

Lessons

LLM Inference Basics — Autoregressive generation, KV cache, and bottlenecks.
Quantization for Inference — INT8, INT4, GPTQ, and AWQ.
Continuous Batching — Dynamic batching for variable-length sequences.
vLLM & TGI — Serving with vLLM and Hugging Face TGI.
Speculative Decoding — Draft models and verification for faster generation.
Production Patterns — Load balancing, fallbacks, monitoring, and cost.
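
As a rough taste of the arithmetic behind the KV cache and quantization lessons, the sketch below estimates how much memory the KV cache needs for a given model shape. It is a back-of-the-envelope illustration, not course material: the function name `kv_cache_bytes` and the example model dimensions are hypothetical, and the formula assumes a standard decoder-only transformer that stores one key and one value vector per layer, per KV head, per token.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # 2 for FP16/BF16, 1 for an 8-bit cache
) -> int:
    """Estimate KV cache size in bytes for a decoder-only transformer.

    Assumes one key and one value tensor per layer (hence the factor of 2),
    each of shape [batch_size, num_kv_heads, seq_len, head_dim].
    """
    return (
        2
        * num_layers
        * num_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_elem
    )


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

    fp16 = kv_cache_bytes(**shape, bytes_per_elem=2)
    int8 = kv_cache_bytes(**shape, bytes_per_elem=1)

    print(f"FP16 cache, one sequence: {fp16 / 2**30:.1f} GiB")
    print(f"8-bit cache, one sequence: {int8 / 2**30:.1f} GiB")
```

Because the estimate grows linearly with both sequence length and batch size, the cache can rival the weights in memory use at long contexts, which is one reason batching strategy and cache precision matter for serving cost.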

Topics covered

llm, deployment, vllm, inference, optimization
