To improve the efficiency and adaptability of DeepSeek R1’s GRPO (Group Relative Policy Optimization) algorithm, we propose integrating Budget Forcing, a test-time compute control method introduced in the Simple Test-Time Scaling framework (arXiv:2501.19393v1#S3). The goal is to control the length of the model’s reasoning during inference so that compute spent matches a predefined budget, while improving performance on reasoning-heavy tasks such as mathematical reasoning and complex problem-solving.
Proposed Feature Details
Budget Forcing Mechanism
Dynamic Compute Allocation:
Budget Forcing operates by either:
Terminating Early: If the model exceeds a token limit (e.g., MAX_TOKENS_THINKING=32000), append an end-of-thinking token to force an answer.
Extending Reasoning: If the model attempts to terminate prematurely, suppress the end token and append "Wait" to encourage further exploration, effectively increasing test-time compute.
User-Defined Budget Parameters: Allow users to specify computational constraints (e.g., token limits, FLOPs) as input, enabling fine-grained control over inference costs.
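The two branches described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the s1 implementation: `generate_fn`, `END_THINK`, and `num_ignore` are hypothetical names, the budget is counted in whitespace-separated words rather than real tokenizer tokens, and `generate_fn` stands in for an actual model call.

```python
from typing import Callable

END_THINK = "</think>"  # hypothetical end-of-thinking marker
WAIT = "Wait"


def budget_forced_thinking(
    generate_fn: Callable[[str, int], str],
    prompt: str,
    max_thinking_tokens: int,
    num_ignore: int = 0,
) -> str:
    """Build a thinking trace that never exceeds max_thinking_tokens.

    generate_fn(context, max_new_tokens) is assumed to return new text and to
    stop by itself when the model wants to end its thinking. num_ignore is how
    many times that early stop is suppressed by appending "Wait".
    """
    pieces = []
    remaining = max_thinking_tokens
    ignores_left = num_ignore
    while remaining > 0:
        context = prompt + " ".join(pieces)
        words = generate_fn(context, remaining).split()[:remaining]  # hard cap
        pieces.extend(words)
        remaining -= len(words)
        if remaining <= 0 or ignores_left == 0:
            break  # budget exhausted, or the model may stop now
        # The model tried to stop early: suppress the stop and nudge it on.
        pieces.append(WAIT)
        remaining -= 1
        ignores_left -= 1
    # Always close the thinking block so the answer phase can start.
    pieces.append(END_THINK)
    return " ".join(pieces)


def stub_model(context: str, max_new_tokens: int) -> str:
    # Stand-in for a real LLM: always "thinks" three words, then stops.
    return "step one two"


print(budget_forced_thinking(stub_model, "Q: 2+2? ", max_thinking_tokens=10, num_ignore=1))
```

With a real model the same structure would apply at the token-id level, with the word count replaced by the tokenizer's token count.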
Integration with GRPO
Pre-GRPO Application: Apply Budget Forcing while candidate responses are being generated, before GRPO’s policy optimization step. This ensures that GRPO’s group-based reward calculations (which rely on multiple candidate solutions) are performed under controlled computational conditions.
Compatibility with Group Sampling: GRPO generates multiple responses per query for relative reward evaluation. Budget Forcing can be applied to each candidate solution individually, ensuring resource limits per output while maintaining group diversity.
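Key Benefits
The s1 paper reports that s1-32B exceeds o1-preview by up to 27% on competition math questions (MATH and AIME24). That model was trained with supervised fine-tuning and evaluated with Budget Forcing, not with GRPO, so the gain from combining Budget Forcing with GRPO still has to be measured.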
Improved Robustness:
Force models to "double-check" answers via extended reasoning (e.g., appending "Wait"), correcting errors in intermediate steps.
Scalability Across Environments:
Adapt seamlessly to high- and low-resource settings, making GRPO-based models more practical for real-world deployments.
Implementation Steps
Parameterization:
Introduce budget_mode (e.g., strict, flexible) and max_thinking_tokens as configurable hyperparameters.
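One possible shape for these hyperparameters is a small validated config object. The sketch below is an assumption about how the options could be grouped: the issue does not define what `strict` and `flexible` mean, so the reading used here (strict only caps thinking, flexible also allows "Wait" extensions) and the `num_wait_extensions` field are hypothetical.

```python
from dataclasses import dataclass

VALID_MODES = ("strict", "flexible")


@dataclass(frozen=True)
class BudgetConfig:
    # Names taken from the proposal.
    budget_mode: str = "strict"
    max_thinking_tokens: int = 32000
    # Hypothetical extra knob: how many early stops to override with "Wait".
    num_wait_extensions: int = 0

    def __post_init__(self) -> None:
        if self.budget_mode not in VALID_MODES:
            raise ValueError(f"budget_mode must be one of {VALID_MODES}")
        if self.max_thinking_tokens <= 0:
            raise ValueError("max_thinking_tokens must be positive")
        if self.num_wait_extensions < 0:
            raise ValueError("num_wait_extensions must be non-negative")

    @property
    def allows_extension(self) -> bool:
        # Assumed semantics: only flexible mode may extend reasoning.
        return self.budget_mode == "flexible" and self.num_wait_extensions > 0


config = BudgetConfig(budget_mode="flexible", max_thinking_tokens=4096, num_wait_extensions=2)
```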
Token Management:
Modify the generation loop to track token counts and enforce termination/extension based on budget rules. Code snippets from the s1 GitHub repo demonstrate how to handle stop_token_ids and append "Wait" tokens dynamically.
GRPO Workflow Adjustment:
During GRPO’s group sampling phase, apply Budget Forcing to each candidate solution.
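A minimal sketch of this step, assuming candidates are already sampled as token lists: each candidate is capped at the thinking budget, scored, and the rewards are normalized within the group to obtain GRPO-style relative advantages. The helper names, the `</think>` marker, and the toy reward function are hypothetical; only the early-termination branch is shown.

```python
from statistics import mean, pstdev
from typing import Callable, List, Tuple

END_THINK = "</think>"  # hypothetical end-of-thinking marker


def cap_thinking(tokens: List[str], max_thinking_tokens: int) -> List[str]:
    """Cut a candidate at the budget and close its thinking block."""
    return tokens[:max_thinking_tokens] + [END_THINK]


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def score_group(
    candidates: List[List[str]],
    reward_fn: Callable[[List[str]], float],
    max_thinking_tokens: int,
) -> Tuple[List[List[str]], List[float]]:
    # The budget is enforced per candidate, so the group keeps its diversity.
    bounded = [cap_thinking(c, max_thinking_tokens) for c in candidates]
    rewards = [reward_fn(c) for c in bounded]
    return bounded, group_relative_advantages(rewards)


# Toy usage: reward 1.0 if the capped trace still contains the token "4".
group = [["2", "+", "2", "=", "4"], ["maybe", "5"], ["4"], ["hmm", "let", "me", "see", "4"]]
bounded, advantages = score_group(group, lambda c: float("4" in c), max_thinking_tokens=3)
```

In this toy group the first and last candidates lose their answer token to the cap, which is the kind of effect the budget would have on the rewards GRPO sees.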
The current approach forces early termination if a threshold is exceeded. However, abrupt stopping might lead to incomplete reasoning. Introduce a confidence-based termination metric, where the model assesses its certainty before stopping (e.g., via entropy of token probabilities).
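One way to make this concrete is to average the Shannon entropy of the most recent next-token distributions and allow termination only when that average is low. The sketch below is an illustration under assumptions: the function names, the window of recent distributions, and the threshold of 0.5 nats are hypothetical and would need tuning against a real model.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def confident_enough_to_stop(
    recent_distributions: Sequence[Sequence[float]],
    entropy_threshold: float = 0.5,  # hypothetical value, needs tuning
) -> bool:
    """Allow termination only if recent predictions were low-entropy on average."""
    if not recent_distributions:
        return False  # no evidence yet, keep reasoning
    avg = sum(token_entropy(d) for d in recent_distributions) / len(recent_distributions)
    return avg <= entropy_threshold


peaked = [[0.97, 0.01, 0.01, 0.01]]   # model is nearly certain
uniform = [[0.25, 0.25, 0.25, 0.25]]  # model is guessing
print(confident_enough_to_stop(peaked), confident_enough_to_stop(uniform))
```

A check like this could gate the forced end-of-thinking token, so that a hard stop is applied only when the budget is exhausted or the model already appears certain.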