Specialist
RLHF
Aligning AI with human values
§ Technical Analysis
Reinforcement Learning from Human Feedback (RLHF) aligns LLMs with human preferences so they behave helpfully, harmlessly, and honestly. The pipeline has three stages: (1) collect human comparisons of model outputs, (2) train a reward model on those preferences, and (3) fine-tune the LLM with PPO to maximize the learned reward while a KL penalty keeps it close to the original model.
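A minimal sketch of the two objectives behind stages (2) and (3), assuming plain PyTorch tensors. The function names (reward_model_loss, kl_shaped_reward), the 0.1 KL coefficient, and the per-token KL approximation are illustrative choices, not taken from any particular RLHF library.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss for stage (2): push the reward model to
    score the human-preferred response above the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def kl_shaped_reward(reward: torch.Tensor,
                     logprobs_policy: torch.Tensor,
                     logprobs_ref: torch.Tensor,
                     kl_coef: float = 0.1) -> torch.Tensor:
    """Per-token reward used in the PPO step of stage (3): reward-model score
    minus a KL penalty that keeps the fine-tuned policy near the original model."""
    kl = logprobs_policy - logprobs_ref   # simple per-token KL estimate
    return reward - kl_coef * kl

# Toy usage with random scores and log-probs (batch of 4, 16 tokens each)
r_chosen, r_rejected = torch.randn(4), torch.randn(4)
print("reward-model loss:", reward_model_loss(r_chosen, r_rejected).item())
lp_policy, lp_ref = torch.randn(4, 16), torch.randn(4, 16)
print("shaped reward shape:", kl_shaped_reward(torch.randn(4, 16), lp_policy, lp_ref).shape)
```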
You've Reached the Deepest Level
No sub-concepts below this level in the prototype. In the full platform, this expands into advanced research topics, papers, and open problems.