supervised fine-tuning / rlhf / pretrained models

The article explains how pretrained language models are aligned through two stages: supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF). It details SFT as training on human-written prompt–response pairs using backpropagation to make models more helpful, polite, and responsive, but notes SFT datasets are much smaller than pretraining corpora and can cause overfitting and poor generalization. The piece argues RLHF is needed because scaling supervised datasets by hand

Latest Changes

The series explains how a supervised fine-tuned (SFT) model is converted into a reward model for RLHF training.

OpenAI-style RLHF reward training uses loss based on reward differences between preferred and less-preferred responses.

Trained reward models are used to fine-tune the original model with RL, starting from prompts outside the SFT data.

Timeline

2026-05-19 — Article outlines two-stage alignment: supervised fine-tuning then RLHF, noting SFT dataset limits and overfitting risks.

2026-05-24 — Explains process for converting a supervised fine-tuned model into a reward model for RLHF training.

2026-05-26 — Describes OpenAI 2022 approach where reward model loss is based on reward differences for preferred versus less-preferred outputs.

2026-05-27 — Details how trained reward models are used to improve original models via RLHF, starting from new prompts outside SFT data.

Recent News (4)

Understanding Reinforcement Learning with Human Feedback Part 6: How the Reward Model Trains the Original Model

The article explains how a trained reward model is used to improve an original language model via reinforcement learning from human feedback (RLHF). Starting with new prompts outside supervised fine-tuning data, the base model generates responses that the reward model scores; low scores signal unhelpful or misaligned outputs. Those reward signals guide RL training so the model learns to produce more polite, useful, and aligned responses, which then receive higher rewards. The piece concludes that RLHF yields a trained, aligned model better matching human preferences and notes the series will continue with new topics. It also briefly promotes an installation tool called Installerpedia.
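The loop described above (sample a response, score it with the reward model, nudge the model toward higher-scoring responses) can be sketched in a few lines. This is a toy illustration only: the summary does not say which RL algorithm the article uses, so a simple REINFORCE-style policy gradient with a baseline is swapped in here, whereas production RLHF pipelines commonly use PPO on a full language model. The names `CANDIDATES`, `toy_reward_model`, and `train_policy`, the canned responses, and their fixed scores are all hypothetical stand-ins.

```python
import math
import random

# Hypothetical canned responses standing in for text a language model could generate.
CANDIDATES = [
    "Go away.",
    "I don't know.",
    "Sure, here is how to do that step by step.",
]

# Hypothetical fixed scores standing in for a trained reward model.
TOY_SCORES = {CANDIDATES[0]: -1.0, CANDIDATES[1]: 0.0, CANDIDATES[2]: 1.0}


def toy_reward_model(response):
    """Return a scalar reward; higher means more helpful and polite."""
    return TOY_SCORES[response]


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(steps=2000, lr=0.1, seed=0):
    """REINFORCE with a baseline over a three-response toy 'policy'."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)  # start with a uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        # The policy "generates" a response by sampling from its distribution.
        i = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        reward = toy_reward_model(CANDIDATES[i])
        # Baseline: expected reward under the current policy (reduces variance).
        baseline = sum(p * toy_reward_model(c) for p, c in zip(probs, CANDIDATES))
        advantage = reward - baseline
        # Gradient of log prob of the sampled response w.r.t. each logit.
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return softmax(logits)


if __name__ == "__main__":
    for response, p in zip(CANDIDATES, train_policy()):
        print(f"{p:.3f}  {response}")
```

After training, most of the probability mass should sit on the response the toy reward model scores highest, which mirrors the summary's point that low rewards steer the model away from unhelpful outputs.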

Dev.to · rijultp · 1h ago

Understanding Reinforcement Learning with Human Feedback Part 5: Training the Reward Model with Loss Functions

OpenAI’s 2022 RLHF approach trains a reward model without predefining target reward values: the loss is built from the difference in reward between the preferred and the less-preferred response. The model computes Reward_better - Reward_worse, passes that difference through a sigmoid (squashing it into the range 0 to 1), takes the log, and multiplies by -1 to form a loss that gradient-based optimizers can minimize. This pushes the reward model to score preferred outputs higher than less-preferred ones without tying it to fixed absolute reward values. Once trained, the reward model guides further optimization of the original model beyond supervised fine-tuning. The article previews using the reward model to continue training the main model.
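The loss described above, -log(sigmoid(Reward_better - Reward_worse)), can be written as a small function. This is a minimal sketch on plain floats rather than a training setup; the function name `pairwise_reward_loss` and its argument names are illustrative assumptions, and a real reward model would compute the two rewards with a neural network and backpropagate through this loss.

```python
import math


def pairwise_reward_loss(reward_better: float, reward_worse: float) -> float:
    """Return -log(sigmoid(reward_better - reward_worse)).

    Uses the identity -log(sigmoid(d)) = log(1 + exp(-d)) and branches on the
    sign of d so that exp() is never called on a large positive number.
    """
    diff = reward_better - reward_worse
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


if __name__ == "__main__":
    # Preferred response already scored higher: small loss.
    print(pairwise_reward_loss(2.0, 0.0))
    # Preferred response scored lower: large loss, so training pushes the scores apart.
    print(pairwise_reward_loss(0.0, 2.0))
    # Equal scores: loss is log(2), about 0.693.
    print(pairwise_reward_loss(1.0, 1.0))
```

Only the difference between the two rewards enters the loss, which is why no absolute target reward has to be chosen in advance.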
