RLHF (Reinforcement Learning from Human Feedback)
A training technique that uses human preferences to fine-tune AI models, making them more helpful, harmless, and honest. Humans rank model outputs, and the model learns to produce responses that align with those preferences.
Reinforcement learning from human feedback (RLHF) is the key technique that transformed raw language models into the helpful AI assistants we use today. A base language model trained on internet text can generate coherent text, but it has no inherent preference for being helpful, truthful, or safe. RLHF provides that alignment by training the model to produce outputs that humans prefer.
The RLHF process has three stages. First, supervised fine-tuning (SFT): human annotators write ideal responses to a set of prompts, and the model is fine-tuned on these demonstrations. Second, reward model training: the model generates multiple responses to the same prompt, humans rank them from best to worst, and a separate "reward model" is trained to predict these human preferences. Third, reinforcement learning: the language model is further trained using PPO (Proximal Policy Optimization) or similar algorithms to maximize the reward model's score, typically with a KL-divergence penalty that keeps the policy from drifting too far from its original (SFT) behavior.
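The reward-model stage above is usually trained with a pairwise (Bradley-Terry) objective: for each human-ranked pair, minimize -log sigmoid(r_chosen - r_rejected). Here is a minimal sketch of that idea, assuming a toy linear reward model over made-up feature vectors (real reward models are fine-tuned language models scoring full responses):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy reward model: a linear score over 4-dimensional response features.
# (Hypothetical stand-in for a transformer that maps a response to a scalar.)
w = np.zeros(4)

def reward(features):
    return features @ w

# Synthetic preference pairs: "chosen" responses have systematically
# shifted features, mimicking a consistent human preference signal.
pairs = [(rng.normal(size=4) + 1.0, rng.normal(size=4)) for _ in range(200)]

lr = 0.1
for _ in range(100):
    for chosen, rejected in pairs:
        # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
        p = sigmoid(reward(chosen) - reward(rejected))
        # Gradient of the loss with respect to w.
        grad = -(1.0 - p) * (chosen - rejected)
        w -= lr * grad
```

After training, the model scores the preferred response higher on most pairs; in the full pipeline this learned scorer then supplies the reward signal for the RL stage.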
RLHF was popularized by OpenAI with InstructGPT and later ChatGPT, and it is now standard practice across the industry. Anthropic uses a variant called RLAIF (RL from AI Feedback) combined with Constitutional AI, where an AI assistant helps generate preference data according to a set of principles. Meta, Google, and other labs use similar approaches. The quality of the human feedback data — who the annotators are, how they are instructed, and what they prioritize — significantly impacts the final model's behavior.
Recent advances have introduced alternatives and improvements to RLHF. Direct Preference Optimization (DPO) skips the reward model step entirely, directly optimizing the language model on preference pairs. Other approaches like ORPO, KTO, and iterative DPO aim to reduce the complexity and instability of the RL training loop. Despite these alternatives, RLHF remains the most widely used and well-understood alignment technique, and the term is often used loosely to refer to any preference-based training approach.
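DPO's key move is to express the reward implicitly as the policy's log-probability shift relative to a frozen reference model, so the preference loss can be applied to the language model directly. A minimal sketch of the DPO loss, assuming per-response log-probabilities are already computed (the function name and scalar inputs are illustrative):

```python
import numpy as np

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss: -log sigmoid of the implicit reward margin.

    The implicit reward of a response is beta * (policy log-prob minus
    reference log-prob); the loss pushes the margin between the chosen
    and rejected responses to be positive.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    return -np.log(1.0 / (1.0 + np.exp(-margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; as the policy raises the chosen response's probability relative to the rejected one, the loss falls. This replaces both the reward model and the RL loop with ordinary gradient descent on preference pairs.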