Technical AI Safety
Alignment, scalable oversight, evaluation and red-teaming, and catastrophic-risk framing for frontier ML systems. This course complements ethics and fairness work by covering the technical side of AI safety.
Level: Advanced · Category: Safety & Ethics · Estimated time: 6 hours
Prerequisites
- Machine Learning Basics
Lessons
- Specification Gaming & Misspecified Objectives — How proxy rewards, Goodhart effects, and environment quirks produce unintended behavior.
- Inner Alignment & Mesa-Optimization — Why learned optimizers can pursue different goals than the training objective suggests.
- Scalable Oversight — Supervising systems that may exceed human ability: iteration, decomposition, and debate-style ideas.
- Evaluations, Capabilities & Threat Modeling — Measuring autonomous capability, time horizons, and what rigorous eval suites try to capture.
- Red-Teaming Frontier Models — Adversarial probing, jailbreak dynamics, and what pre-release red teaming aims to find.
- Scaling, Transformers & Risk Framing — How modern LLM stacks work at a high level and why capability scaling shifts safety priorities.
Topics covered
ai-safety, alignment, inner-alignment, scalable-oversight, red-teaming, evals, frontier-models, catastrophic-risk