Less is more: UC Berkeley and Google unlock LLM potential through simple sampling


A new paper by researchers from Google Research and the University of California, Berkeley, demonstrates that a surprisingly simple test-time scaling approach can boost the reasoning abilities of large language models (LLMs). The key? Scaling up sampling-based search, a technique that relies on generating multiple responses and using the model itself to verify them. 

The core finding is that even a minimalist implementation of sampling-based search, using random sampling and self-verification, can lift the reasoning performance of models like Gemini 1.5 Pro above that of o1-Preview on popular benchmarks. These findings could have important implications for enterprise applications, and they challenge the assumption that highly specialized training or complex architectures are always necessary to achieve top-tier performance.

The limits of current test-time compute scaling

The current popular method for test-time scaling in LLMs is to train the model through reinforcement learning to generate longer responses with chain-of-thought (CoT) traces. This approach is used in models such as OpenAI o1 and DeepSeek-R1. While beneficial, these methods usually require substantial investment in the training phase.

Another test-time scaling method is “self-consistency,” where the model generates multiple responses to the query and chooses the answer that appears most often. Self-consistency reaches its limits on complex problems, where the most frequent answer is not necessarily the correct one.
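
For illustration, self-consistency amounts to a majority vote over sampled answers. The sketch below is a minimal version, assuming a hypothetical `generate_answer(prompt, temperature)` helper that returns a model's final answer as a string:

```python
# Self-consistency: sample several answers and keep the most frequent one.
# `generate_answer` is a stand-in for any LLM call that returns a final answer string.
from collections import Counter

def self_consistency(generate_answer, prompt: str, n_samples: int = 16) -> str:
    answers = [generate_answer(prompt, temperature=0.7) for _ in range(n_samples)]
    # Majority vote: the most common answer wins whether or not it is correct,
    # which is exactly where the method breaks down on hard problems.
    return Counter(answers).most_common(1)[0][0]
```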

Sampling-based search offers a simpler and highly scalable approach to test-time scaling: let the model generate multiple responses and select the best one through a verification mechanism. Sampling-based search can complement other test-time compute scaling strategies and, as the researchers write in their paper, “it also has the unique advantage of being embarrassingly parallel and allowing for arbitrarily scaling: simply sample more responses.”

More importantly, sampling-based search can be applied to any LLM, including those that have not been explicitly trained for reasoning.

How sampling-based search works

The researchers focus on a minimalist implementation of sampling-based search, using a language model to both generate candidate responses and verify them. This is a “self-verification” process, where the model assesses its own outputs without relying on external ground-truth answers or symbolic verification systems.

Sampling-based search. Credit: VentureBeat

The algorithm works in a few simple steps (a minimal code sketch follows the list):

1. The algorithm begins by generating a set of candidate solutions to the given problem using a language model. This is done by giving the model the same prompt multiple times at a non-zero temperature to create a diverse set of responses.

2. Each candidate response undergoes a verification process in which the LLM is prompted multiple times to judge whether the response is correct. The verification outcomes are then averaged into a final verification score for the response.

3. The algorithm selects the highest-scoring response as the final answer. If multiple candidates score within close range of one another, the LLM is prompted to compare them pairwise and choose the best one. The response that wins the most pairwise comparisons becomes the final answer.
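
Put together, the procedure fits in a few dozen lines of code. The sketch below is a minimal illustration, assuming a hypothetical `llm(prompt, temperature)` helper that wraps whatever text-generation API is available; the prompts, defaults, and tie-break threshold are illustrative choices, not the paper's exact ones:

```python
# Minimal sketch of sampling-based search with self-verification.
# `llm` stands in for any text-generation call: llm(prompt, temperature) -> str.
# Prompts, defaults, and the tie-break threshold are illustrative assumptions.

def sampling_based_search(llm, problem: str,
                          n_samples: int = 20,       # sampling axis
                          n_verifications: int = 10  # verification axis
                          ) -> str:
    # Step 1: sample diverse candidate solutions at non-zero temperature.
    candidates = [llm(f"Solve this problem:\n{problem}", temperature=0.8)
                  for _ in range(n_samples)]

    # Step 2: score each candidate by averaging repeated yes/no verification verdicts.
    def verify(candidate: str) -> float:
        verdicts = [
            llm(f"Problem:\n{problem}\n\nProposed solution:\n{candidate}\n\n"
                "Is this solution correct? Answer YES or NO.", temperature=0.7)
            for _ in range(n_verifications)
        ]
        return sum("YES" in v.upper() for v in verdicts) / n_verifications

    scores = [verify(c) for c in candidates]
    best = max(scores)

    # Step 3: pick the top-scoring candidate; if several score closely,
    # break the tie with pairwise comparisons and keep the biggest winner.
    finalists = [c for c, s in zip(candidates, scores) if best - s <= 0.05]
    if len(finalists) == 1:
        return finalists[0]

    wins = [0] * len(finalists)
    for i in range(len(finalists)):
        for j in range(i + 1, len(finalists)):
            choice = llm(f"Problem:\n{problem}\n\nSolution A:\n{finalists[i]}\n\n"
                         f"Solution B:\n{finalists[j]}\n\n"
                         "Which solution is correct? Answer A or B.", temperature=0.0)
            if choice.strip().upper().startswith("A"):
                wins[i] += 1
            else:
                wins[j] += 1
    return finalists[wins.index(max(wins))]
```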

The researchers considered two key axes for test-time scaling, both of which map directly onto parameters in the sketch above:

Sampling: The number of responses the model generates for each input problem.

Verification: The number of verification scores computed for each generated solution.
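
Scaling either axis spends more test-time compute. For example, running the hypothetical sketch at 200 samples and 50 verification passes per sample (the configuration whose cost is discussed below) is a one-liner:

```python
# Scale both test-time axes independently; every sampling and verification
# call is independent of the others, so the workload is embarrassingly parallel.
answer = sampling_based_search(llm, problem, n_samples=200, n_verifications=50)
```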

How sampling-based search compares to other techniques

The study revealed that reasoning performance continues to improve with sampling-based search, even when test-time compute is scaled far beyond the point where self-consistency saturates. 

At a sufficient scale, this minimalist implementation significantly boosts accuracy on reasoning benchmarks such as AIME and MATH. For example, Gemini 1.5 Pro's performance surpassed that of o1-Preview, which has been explicitly trained on reasoning problems, and Gemini 1.5 Flash surpassed Gemini 1.5 Pro.

“This not only highlights the importance of sampling-based search for scaling capability, but also suggests the utility of sampling-based search as a simple baseline on which to compare other test-time compute scaling strategies and measure genuine improvements in models’ search capabilities,” the researchers write.

It is worth noting that while the results of sampling-based search are impressive, the costs can become prohibitive. For example, with 200 samples and 50 verification steps per sample, a single AIME query generates around 130 million tokens, which costs about $650 with Gemini 1.5 Pro. However, this is a minimalist approach to sampling-based search, and it is compatible with optimization techniques proposed in other studies. With smarter sampling and verification methods, inference costs can be reduced considerably by using smaller models and generating fewer tokens. For example, using Gemini 1.5 Flash to perform the verification drops the cost to $12 per question.
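
The arithmetic behind those figures is easy to sanity-check. In the back-of-the-envelope model below, the per-call token count and the blended price are illustrative assumptions, chosen only to be consistent with the roughly 130 million tokens and $650 reported:

```python
# Back-of-the-envelope cost model for one AIME query.
# Token counts and the blended price are illustrative assumptions,
# picked to roughly reproduce the reported ~130M tokens and ~$650.
n_samples = 200
n_verifications = 50
tokens_per_call = 13_000                        # assumed average tokens per LLM call
total_calls = n_samples * (1 + n_verifications) # 200 generations + 10,000 verifications
total_tokens = total_calls * tokens_per_call    # ~132.6M tokens
price_per_million = 5.00                        # assumed blended $/1M tokens (Gemini 1.5 Pro)
print(f"{total_tokens / 1e6:.0f}M tokens -> ${total_tokens / 1e6 * price_per_million:,.0f}")
# 133M tokens -> $663, in the same ballpark as the article's figures
```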

Effective self-verification strategies

There is an ongoing debate on whether LLMs can verify their own answers. The researchers identified two key strategies for improving self-verification using test-time compute:

Directly comparing response candidates: Disagreements between candidate solutions strongly indicate potential errors. By providing the verifier with multiple responses to compare, the model can better identify mistakes and hallucinations, addressing a core weakness of LLMs. The researchers describe this as an instance of “implicit scaling.”

Task-specific rewriting: The researchers propose that the optimal output style of an LLM depends on the task. Chain-of-thought is effective for solving reasoning tasks, but responses are easier to verify when written in a more formal, mathematically conventional style. Verifiers can rewrite candidate responses into a more structured format (e.g., theorem-lemma-proof) before evaluation.
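
As a rough illustration of both strategies, a verifier might be prompted to compare candidates side by side and to rewrite a response into a formal style before judging it. The prompt templates below are a sketch, not the paper's actual wording:

```python
# Illustrative prompt builders for the two self-verification strategies.
# The wording is an assumption, not the paper's exact prompt templates.

def comparison_prompt(problem: str, candidates: list[str]) -> str:
    # Strategy 1: show the verifier several candidates at once, so that
    # disagreements between them surface likely errors ("implicit scaling").
    listed = "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
    return (
        f"Problem: {problem}\n\n{listed}\n\n"
        "The candidates disagree. Locate the first incorrect step in each "
        "flawed candidate, then state which candidate (if any) is correct."
    )

def rewrite_prompt(problem: str, candidate: str) -> str:
    # Strategy 2: rewrite a chain-of-thought response into a formal,
    # theorem-lemma-proof style that is easier to check line by line.
    return (
        f"Problem: {problem}\n\nDraft solution:\n{candidate}\n\n"
        "Rewrite this solution rigorously in theorem-lemma-proof form, "
        "making every assumption and inference explicit, before judging it."
    )
```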

“We anticipate model self-verification capabilities to rapidly improve in the short term, as models learn to leverage the principles of implicit scaling and output style suitability, and drive improved scaling rates for sampling-based search,” the researchers write.

Implications for real-world applications

The study demonstrates that a relatively simple technique can achieve impressive results, potentially reducing the need for complex and costly model architectures or training regimes.

This is also a scalable technique, enabling enterprises to increase performance by allocating more compute resources to sampling and verification. It also enables developers to push frontier language models beyond their limitations on complex tasks.

“Given that it complements other test-time compute scaling strategies, is parallelizable and allows for arbitrarily scaling, and admits simple implementations that are demonstrably effective, we expect sampling-based search to play a crucial role as language models are tasked with solving increasingly complex problems with increasingly large compute budgets,” the researchers write. 


