Large-Scale ML Systems
Design and operate ML systems at scale: distributed training, serving, and the infrastructure behind them. Learn to build systems that handle millions of users and petabytes of data.
Level: Advanced · Category: MLOps · Estimated time: 7 hours
Prerequisites
- MLOps & Model Deployment
- PyTorch Mastery
Lessons
- ML Systems Architecture — Components, data flow, and design patterns.
- Distributed Training — Data parallelism, model parallelism, and pipeline parallelism.
- Model Serving at Scale — Batching, auto-scaling, and load balancing.
- Feature Stores — Feature computation, storage, and serving.
- Kubernetes for ML — Deploying training and serving on K8s.
- Cost & Resource Optimization — Spot instances, mixed precision, and efficiency.
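As a taste of the Distributed Training lesson, here is a minimal sketch of the idea behind data parallelism: each worker computes a gradient on its own shard of the batch, and the per-worker gradients are averaged. It is not taken from the course material. The one-parameter model and the names grad_mse and data_parallel_grad are hypothetical, chosen for illustration, and the workers are simulated in a single process rather than across devices.

```python
from typing import List, Sequence


def grad_mse(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w for the model y = w*x."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(
    w: float, xs: Sequence[float], ys: Sequence[float], num_workers: int
) -> float:
    """Simulate data parallelism: shard the batch, compute one gradient
    per shard, then average the shard gradients (the all-reduce step)."""
    if num_workers < 1 or len(xs) % num_workers != 0:
        # Assumption for this sketch: equal-sized shards only.
        raise ValueError("batch size must be divisible by num_workers")
    shard = len(xs) // num_workers
    grads: List[float] = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    return sum(grads) / num_workers


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
    print(grad_mse(0.5, xs, ys))               # full-batch gradient
    print(data_parallel_grad(0.5, xs, ys, 2))  # averaged over 2 simulated workers
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, which is why data parallelism can spread one large batch across workers without changing the update.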
Topics covered
distributed-training, mlops, kubernetes, serving, scale