Technical AI Safety

Alignment, scalable oversight, evaluation and red-teaming, and catastrophic-risk framing for frontier ML systems. This course covers the technical side of AI safety and complements work on ethics and fairness.

Level: Advanced · Category: Safety & Ethics · Estimated time: 6 hours

Prerequisites

Machine Learning Basics

Lessons

Specification Gaming & Misspecified Objectives — How proxy rewards, Goodhart effects, and environment quirks produce unintended behavior.
Inner Alignment & Mesa-Optimization — Why learned optimizers can pursue different goals than the training objective suggests.
Scalable Oversight — Supervising systems that may exceed human ability, using iteration, decomposition, and debate-style approaches.
Evaluations, Capabilities & Threat Modeling — Measuring autonomous capability, time horizons, and what rigorous eval suites try to capture.
Red-Teaming Frontier Models — Adversarial probing, jailbreak dynamics, and what pre-release red teaming aims to find.
Scaling, Transformers & Risk Framing — How modern LLM stacks work at a high level and why capability scaling shifts safety priorities.

Topics covered

ai-safety, alignment, inner-alignment, scalable-oversight, red-teaming, evals, frontier-models, catastrophic-risk
