
OpenAI on Wednesday (July 24) unveiled rule-based rewards (RBRs), a method that uses artificial intelligence (AI) models to keep model behaviour safe without extensive human data collection.
Traditionally, language models have been fine-tuned with reinforcement learning from human feedback (RLHF), which has been the standard for ensuring accurate instruction-following.
According to OpenAI, RBRs align model behaviour with safe practices through clear, step-by-step rules. These rules evaluate whether the model’s outputs meet safety standards, and they are integrated into the RLHF pipeline.
This helps balance helpfulness and harm prevention, ensuring safe and effective model behaviour without the inefficiency of ongoing human input. RBRs have been part of OpenAI’s safety stack since the launch of GPT-4, including GPT-4o mini, and are planned for future models.
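The idea of grading an output against explicit rules and blending that grade into the RLHF reward can be sketched in a few lines of Python. The rules, weights, and blending scheme below are hypothetical illustrations, not OpenAI's actual implementation (which grades completions against fine-grained propositions rather than simple string checks):

```python
def rbr_score(completion: str) -> float:
    """Score a completion against a set of clear, step-by-step safety rules.

    Each rule is a (check, weight) pair; all three rules here are
    hypothetical examples of the kind of proposition an RBR might test.
    """
    rules = [
        (lambda c: "i can't help with that" in c.lower(), 1.0),  # polite refusal present
        (lambda c: "ashamed" not in c.lower(), 1.0),             # no judgemental language
        (lambda c: "instead" in c.lower(), 0.5),                 # offers a safe alternative
    ]
    earned = sum(weight for check, weight in rules if check(completion))
    total = sum(weight for _, weight in rules)
    return earned / total  # normalised to [0, 1]


def combined_reward(helpfulness: float, completion: str,
                    safety_weight: float = 0.5) -> float:
    """Blend a learned reward-model score with the rule-based safety score,
    mirroring how an RBR signal could be folded into the RLHF objective.
    The 50/50 weighting is an arbitrary choice for illustration."""
    return (1 - safety_weight) * helpfulness + safety_weight * rbr_score(completion)
```

Because the safety signal lives in the rules rather than in collected preference data, updating a policy means editing the rule list, which is why retraining the reward signal is cheap compared with gathering fresh human feedback.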

OpenAI says, “To align AI systems with human values, desired behaviours are defined, and human feedback is collected to train a ‘reward model.’ This model guides the AI by indicating desirable actions. However, collecting human feedback for repetitive tasks can be inefficient, and changes in safety policies can render previous feedback obsolete, necessitating new data collection.”
Experiments show that RBR-trained models demonstrate safety performance comparable to those trained with human feedback while reducing the need for extensive human data. This makes the training process faster and more cost-effective. As model capabilities and safety guidelines evolve, RBRs can be quickly updated, eliminating the need for extensive retraining.
But RBRs can be challenging to apply to subjective tasks, so human feedback is still needed to balance these limitations and address more nuanced tasks.