
[Feature] Integrate Budget Forcing as a Test-Time Compute Control Mechanism Prior to GRPO Execution #252

Open
novohool opened this issue Feb 4, 2025 · 1 comment


novohool commented Feb 4, 2025

To enhance the efficiency and adaptability of DeepSeek R1's GRPO (Group Relative Policy Optimization) algorithm, we propose integrating Budget Forcing, a test-time compute control method introduced in the Simple Test-Time Scaling framework (arXiv:2501.19393, §3). This addition would dynamically adjust the model's reasoning process during inference, aligning computational expenditure with predefined budgets while improving performance on resource-constrained tasks such as mathematical reasoning and complex problem-solving.

Proposed Feature Details

  1. Budget Forcing Mechanism

    • Dynamic Compute Allocation:
      Budget Forcing operates by either:
      • Terminating Early: If the model exceeds a token limit (e.g., MAX_TOKENS_THINKING=32000), append an end-of-thinking token to force an answer.
      • Extending Reasoning: If the model attempts to terminate prematurely, suppress the end token and append "Wait" to encourage further exploration, effectively increasing test-time compute.
    • User-Defined Budget Parameters: Allow users to specify computational constraints (e.g., token limits, FLOPs) as input, enabling fine-grained control over inference costs.
  2. Integration with GRPO

    • Pre-GRPO Application: Apply Budget Forcing during the inference phase before GRPO’s policy optimization. This ensures that GRPO’s group-based reward calculations (which rely on multiple candidate solutions) are performed under controlled computational conditions.
    • Compatibility with Group Sampling: GRPO generates multiple responses per query for relative reward evaluation. Budget Forcing can be applied to each candidate solution individually, ensuring resource limits per output while maintaining group diversity.
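The two interventions above (terminating early and extending reasoning) can be sketched as a plain generation loop. Everything here is illustrative: `generate_step`, the `<|end_think|>` marker, and the budget values are stand-ins for exposition, not the actual s1 or DeepSeek R1 interfaces.

```python
END_OF_THINKING = "<|end_think|>"  # hypothetical end-of-thinking token
WAIT = "Wait"

def budget_force(generate_step, min_tokens=5, max_tokens=20):
    """Run a token generator under a thinking budget.

    - If generation reaches max_tokens, terminate early by appending
      the end-of-thinking token to force an answer.
    - If the model emits the end token before min_tokens, suppress it
      and append "Wait" to extend reasoning.
    """
    tokens = []
    while len(tokens) < max_tokens:
        tok = generate_step(tokens)
        if tok == END_OF_THINKING:
            if len(tokens) < min_tokens:
                tokens.append(WAIT)  # suppress stop, keep thinking
                continue
            break  # minimum budget met: allow the model to stop
        tokens.append(tok)
    tokens.append(END_OF_THINKING)  # force transition to answering
    return tokens
```

In a real integration, `generate_step` would wrap the model's decoding step and the budgets would come from the user-defined parameters described above.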

Key Benefits

  1. Enhanced Resource Efficiency:
    • Prioritize critical reasoning steps by terminating low-value computations early, reducing wasted resources.
    • Achieve up to 27% performance gains on MATH and AIME24 benchmarks when paired with GRPO, as shown in the s1-32B model.
  2. Improved Robustness:
    • Force models to "double-check" answers via extended reasoning (e.g., appending "Wait"), correcting errors in intermediate steps.
  3. Scalability Across Environments:
    • Adapt seamlessly to high- and low-resource settings, making GRPO-based models more practical for real-world deployments.

Implementation Steps

  1. Parameterization:
    • Introduce budget_mode (e.g., strict, flexible) and max_thinking_tokens as configurable hyperparameters.
  2. Token Management:
    • Modify the generation loop to track token counts and enforce termination/extension based on budget rules. Code snippets from the s1 GitHub repo demonstrate how to handle stop_token_ids and append "Wait" tokens dynamically.
  3. GRPO Workflow Adjustment:
    • During GRPO’s group sampling phase, apply Budget Forcing to each candidate solution. For example:
      # Pseudocode for GRPO + Budget Forcing integration
      for query in batch:
          # Sample G candidate responses per query (group sampling)
          candidates = generate_candidates(query, num_samples=G)
          # Enforce the token budget on each candidate independently
          for candidate in candidates:
              apply_budget_forcing(candidate, max_tokens=32000)
          # Score the group and update the policy with relative rewards
          rewards = evaluate_candidates(candidates)
          update_policy_using_grpo(candidates, rewards)
  4. Validation Metrics:
    • Measure performance gains (e.g., accuracy on MATH benchmarks) and compute savings (e.g., average tokens per query) to validate efficacy.
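Steps 1–2 above could be prototyped as a small configuration plus an enforcement pass. `BudgetConfig`, `enforce_budget`, and the `grace_tokens` idea are illustrative names and assumptions, not part of any existing GRPO implementation.

```python
from dataclasses import dataclass

@dataclass
class BudgetConfig:
    budget_mode: str = "strict"      # "strict": hard cutoff; "flexible": small grace allowance
    max_thinking_tokens: int = 32000
    grace_tokens: int = 1024         # extra allowance used only in flexible mode

    def limit(self):
        if self.budget_mode == "strict":
            return self.max_thinking_tokens
        return self.max_thinking_tokens + self.grace_tokens

def enforce_budget(tokens, cfg, end_token="<|end_think|>"):
    """Truncate a thinking trace at the budget and force the end token."""
    limit = cfg.limit()
    if len(tokens) > limit:
        tokens = tokens[:limit]
    if not tokens or tokens[-1] != end_token:
        tokens = tokens + [end_token]
    return tokens
```

A harness like this would slot into the generation loop of step 2, with the validation metrics of step 4 computed over the returned traces (e.g., average length as tokens per query).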

Expected Outcomes

  1. Performance-Resource Trade-off Optimization:
    • Enable users to balance accuracy and computational cost, as demonstrated by s1-32B’s improvement from 50% → 57% on AIME24 with extended compute.
  2. Broader Applicability:
    • Extend GRPO’s utility to scenarios requiring strict resource constraints (e.g., edge devices, real-time systems).

References to Code/Data:


manoj633 commented Feb 4, 2025

The current approach forces early termination if a threshold is exceeded. However, abrupt stopping might lead to incomplete reasoning. Introduce a confidence-based termination metric, where the model assesses its certainty before stopping (e.g., via entropy of token probabilities).
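The confidence-based termination suggested here could look like the following sketch: compute the entropy of the model's next-token distribution and allow a stop only when it falls below a threshold. The function names and the threshold value are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_terminate(probs, threshold=0.5):
    """Permit termination only when the model is confident, i.e. the
    entropy of its next-token distribution is below the threshold."""
    return token_entropy(probs) < threshold
```

A uniform distribution (maximum uncertainty) would block termination, while a sharply peaked one would allow it; the threshold would need tuning per model and task.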
