An optimizer that separates weight decay from Adam’s normal update step.
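AdamW is like a coach with two scoreboards. Practice points (the gradient update) go on one. Donut penalties (the weight decay) go on the other, so neither score distorts the other.
It is used to train and fine-tune models. Because the decay is not mixed into Adam’s adaptive step sizes, regularization stays predictable and models tend to overfit (memorize the training data) less.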
Adam
AdamW keeps Adam’s adaptive updates, but applies weight decay directly to the weights instead of adding it to the gradient as an L2 penalty.
Weight Decay
AdamW separates Weight Decay from the normal gradient update, so the decay is not rescaled by Adam’s per-parameter step sizes.
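To make the separation concrete, here is a minimal sketch of one update for a single scalar weight, written in plain Python. The function name `adam_family_step`, the `decoupled` flag, and the hyperparameter values are assumptions chosen for illustration; they are not taken from any particular library. The decoupled branch shrinks the weight by `lr * weight_decay * w`, which is one common way to write the AdamW step.

```python
import math


def adam_family_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                     eps=1e-8, weight_decay=0.01, decoupled=True):
    """One update for a single scalar weight (illustrative sketch).

    decoupled=True  -> AdamW-style: decay is applied directly to the weight.
    decoupled=False -> Adam with L2: decay is folded into the gradient.
    """
    if not decoupled:
        # L2 style: the penalty term passes through the adaptive scaling below.
        grad = grad + weight_decay * w

    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    if decoupled:
        # AdamW style: shrink the weight directly, outside the adaptive step.
        w = w - lr * weight_decay * w

    # Adaptive gradient step, shared by both variants.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Hypothetical first step: weight 2.0, zero loss gradient, fresh optimizer state.
w_adamw, _, _ = adam_family_step(2.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
w_l2, _, _ = adam_family_step(2.0, 0.0, 0.0, 0.0, t=1, decoupled=False)

print(f"AdamW:     {w_adamw:.6f}")
print(f"Adam + L2: {w_l2:.6f}")
```

With a zero loss gradient, the decoupled variant shrinks the weight by `lr * weight_decay * w` (about 0.00002 here). The L2 variant moves it by roughly the full learning rate (about 0.001), because Adam normalizes the penalty term like any other gradient. That rescaling is the effect the decoupled form avoids.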
Optimization
AdamW is a common optimizer for training neural networks.
Fine-tuning
AdamW is often used as the default optimizer for fine-tuning large models.