Reinforcement learning from human feedback (RLHF) is a form of reinforcement learning that is done to LLMs following self-supervised pretraining and fine tuning, typically toward the objective of following instructions and avoiding unwanted behaviors. The reward model used is typically another copy of the base model and is trained from human judgments. KL penalties or other methods may be used to ameliorate mode collapse and overfitting.

models trained with RLHF

Deep reinforcement learning from human preferences