Specialist
RLHF
Aligning AI with human values
§ Technical Analysis
Reinforcement Learning from Human Feedback (RLHF) aligns LLMs with human preferences so they behave helpfully, harmlessly, and honestly. The pipeline has three stages: (1) collect human comparisons of model outputs, (2) train a reward model on those preferences, and (3) fine-tune the LLM with PPO to maximize the learned reward while a KL penalty keeps it close to the original model.
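A minimal sketch of the two objectives behind stages (2) and (3), assuming plain PyTorch tensors. The function names (reward_model_loss, kl_shaped_reward), the 0.1 KL coefficient, and the per-token KL approximation are illustrative choices, not taken from any particular RLHF library.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss for stage (2): push the reward model to
    score the human-preferred response above the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def kl_shaped_reward(reward: torch.Tensor,
                     logprobs_policy: torch.Tensor,
                     logprobs_ref: torch.Tensor,
                     kl_coef: float = 0.1) -> torch.Tensor:
    """Per-token reward used in the PPO step of stage (3): reward-model score
    minus a KL penalty that keeps the fine-tuned policy near the original model."""
    kl = logprobs_policy - logprobs_ref   # simple per-token KL estimate
    return reward - kl_coef * kl

# Toy usage with random scores and log-probs (batch of 4, 16 tokens each)
r_chosen, r_rejected = torch.randn(4), torch.randn(4)
print("reward-model loss:", reward_model_loss(r_chosen, r_rejected).item())
lp_policy, lp_ref = torch.randn(4, 16), torch.randn(4, 16)
print("shaped reward shape:", kl_shaped_reward(torch.randn(4, 16), lp_policy, lp_ref).shape)
```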
You've Reached the Deepest Level
No sub-concepts below this level in the prototype. In the full platform, this expands into advanced research topics, papers, and open problems.