Planning
One drawback of CoT-style reasoning is that LLMs have to greedily decode a path toward an answer. This is problematic for complex problems like math questions or games, where it is hard to find a path without trial and error. In 2023, the community made progress on this issue with new frameworks that enable planning with LLMs.
➡️ If we think of CoT as automatic, unconscious “system 1” reasoning, a natural question arises: can LLMs replicate the more deliberate “system 2” reasoning of humans? Two methodologies address this question: reasoning-via-planning (RAP) and tree-of-thoughts (ToT). Both let LLMs explore possible reasoning steps and search for the optimal reasoning chain under a specific evaluation. RAP additionally prompts an LLM as a “world model” that predicts the next state after each action, which lets the LLM operate in a self-simulated world rather than interacting with an external environment. Both algorithms are now available in the LLM Reasoners library!
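To make the contrast with greedy CoT decoding concrete, here is a minimal sketch of a ToT-style beam search over reasoning steps. The `propose_steps` and `evaluate` functions are stand-ins for LLM calls (the real systems prompt an LLM for both), and the toy task of composing digits to reach a target sum is a made-up assumption; none of this reflects the actual LLM Reasoners API.

```python
import heapq

def propose_steps(state):
    """Stand-in for an LLM call that proposes candidate next reasoning
    steps; here, a 'step' just appends a digit to the partial chain."""
    return [state + [d] for d in (1, 2, 3)]

def evaluate(state, target=6):
    """Stand-in for an LLM-based evaluator: partial chains whose sum is
    closer to the target score higher."""
    return -abs(target - sum(state))

def tree_of_thoughts(target=6, beam_width=2, depth=3):
    """Beam search over reasoning steps: expand every chain in the
    frontier, then keep only the best `beam_width` chains per level."""
    frontier = [[]]
    for _ in range(depth):
        candidates = [s for state in frontier for s in propose_steps(state)]
        frontier = heapq.nlargest(beam_width, candidates,
                                  key=lambda s: evaluate(s, target))
    return frontier[0]

best = tree_of_thoughts()
```

Unlike greedy decoding, the search keeps several partial chains alive, so an early step that looks locally best (appending 3 twice) can still be abandoned for the chain that actually reaches the target.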
Self series
The self series is a family of techniques that replace human effort with LLM predictions in the loop of LLM development. 2023 saw quite a few papers on this track. Let’s take a closer look at some representative works.
➡️ Many people have had the experience that ChatGPT doesn’t provide the desired output on the first try, and this can sometimes be fixed by pointing out its mistake. Self-debugging and self-refine automate this procedure by replacing human feedback with machine feedback. The feedback comes either from a program executor or from an LLM that compares the generation with an explanation of the problem. One key observation is that the performance of self-refine depends on the quality of the feedback: stronger base models, which provide better feedback, benefit more. Such iterative refinement methods have also been shown to be super effective in pose estimation and protein structure prediction, where it is difficult to predict the structure in a single pass.
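The feedback loop above can be sketched in a few lines. Here the executor feedback is real (a `try`/`exec`), while `llm_refine` is a hard-coded stand-in for the LLM refinement call; the specific fix it applies and all function names are illustrative assumptions, not the papers' implementations.

```python
def run_with_feedback(code):
    """Machine feedback from a program executor: run the candidate and
    return the error message, or None if it executes cleanly."""
    try:
        exec(code, {})
        return None
    except Exception as e:
        return f"{type(e).__name__}: {e}"

def llm_refine(code, feedback):
    """Stand-in for the LLM refinement call: the real method prompts the
    model with its own output plus the feedback. We hard-code one
    plausible fix so the loop is runnable."""
    if "NameError" in feedback:
        return "x = 1\n" + code
    return code

def self_debug(initial_code, max_rounds=3):
    """Generate -> execute -> refine until the executor is satisfied."""
    code = initial_code
    for _ in range(max_rounds):
        feedback = run_with_feedback(code)
        if feedback is None:
            return code
        code = llm_refine(code, feedback)
    return code

fixed = self_debug("y = x + 1")  # fails with NameError, then gets repaired
```

The loop structure, not the toy fix, is the point: no human inspects the intermediate outputs, and the executor's error message plays the role a human correction would.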
➡️ In the memory-of-thought (MoT) framework from Li and Qiu, the authors ask an LLM to generate CoT rationales on an unlabeled dataset and use them for RAG. You may ask how this can be useful, given that the generated rationales often contain errors. The key trick is to filter the rationales by majority vote or entropy minimization (a similar idea is used in Wan et al. to filter rationales). Once we have good rationales on the unlabeled dataset, we dynamically retrieve few-shot examples based on the test question, which is shown to be much better than fixed few-shot examples. MoT can be interpreted as converting a parametric model into a non-parametric one without additional supervision.
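The filter-then-retrieve pipeline can be sketched as follows. This is a toy under loudly stated assumptions: the word-overlap retriever stands in for the dense retrieval a real system would use, the sampled rationales are fabricated strings rather than LLM outputs, and all names are illustrative.

```python
from collections import Counter

def majority_filter(rationales):
    """Keep only rationales whose final answer agrees with the majority
    vote, the denoising trick applied to self-generated CoT.
    `rationales` is a list of (rationale_text, answer) pairs."""
    majority, _ = Counter(ans for _, ans in rationales).most_common(1)[0]
    return [(r, a) for r, a in rationales if a == majority]

def retrieve(memory, question, k=2):
    """Toy retriever: rank stored (question, rationale) entries by word
    overlap with the test question."""
    q_words = set(question.lower().split())
    return sorted(memory,
                  key=lambda m: -len(q_words & set(m[0].lower().split())))[:k]

# Demo: three sampled rationales for one unlabeled question.
kept = majority_filter([("chain of thought A ...", "4"),
                        ("chain of thought B ...", "4"),
                        ("flawed chain ...", "5")])
```

The filtered (question, rationale) pairs form the non-parametric memory; at test time, `retrieve` assembles question-specific few-shot examples instead of a fixed prompt.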
➡️ Going beyond MoT, Yasunaga et al. proposed analogical prompting, which eliminates the need to dump rationales on an unlabeled dataset. Analogical prompting asks an LLM to recall relevant exemplars based on the question, thereby generating dynamic few-shot exemplars from scratch. In fact, the authors found that analogical prompting is an emergent ability of large language models, similar to findings in previous work on open-domain question answering. Larger-scale LLMs can self-generate better exemplars than standard RAG solutions provide. Besides, this work offers a cool trick to fuse multi-step generations into a single prompt with markdown syntax, a godsend for prompt engineers on a tight budget! 💡
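A rough sketch of that fused single-prompt structure is below. The wording is paraphrased, not the paper's exact template: the markdown headers delimit the stages (recall exemplars, then solve) so one generation covers both steps.

```python
def analogical_prompt(question):
    """Build a single analogical-prompting query: the model recalls its
    own exemplars before solving, with markdown headers fusing the
    multi-step generation into one prompt. Wording is paraphrased."""
    return (
        f"# Problem:\n{question}\n\n"
        "# Instructions:\n"
        "## Relevant problems:\n"
        "Recall three relevant and distinct problems. For each, describe "
        "the problem and explain its solution.\n"
        "## Solve the initial problem:\n"
    )

prompt = analogical_prompt("What is the area of a circle of radius 3?")
```

Because the headers mark where each stage of the output should go, a single completion yields both the self-generated exemplars and the final solution, with no second API call.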
➡️ Are self-refine and self-generation the limits of LLM reasoning? Yang et al. show a more advanced use of the reasoning abilities of LLMs: optimizing a prompt based on the history of generated prompts. This is a cool reinvention of the famous meta-learning paper “Learning to learn by gradient descent by gradient descent”, except that all the steps are performed by LLMs on text. At each step, an LLM is prompted with previous solutions and their corresponding performance metrics and tries to propose a new solution. Notably, even without being told how to perform optimization, the LLM can gradually find better solutions that maximize the metric. Maybe this work brings prompt engineers one step closer to unemployment?
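The optimization loop can be sketched as follows. `llm_propose` is a deterministic stand-in for the LLM call, which in the real method receives the scored history rendered as text; the toy metric (accuracy pretending to peak at a prompt length of 7) and all names are made-up assumptions.

```python
def score(prompt_len):
    """Stand-in evaluation metric: pretend task accuracy peaks when the
    prompt has length 7."""
    return -abs(7 - prompt_len)

def llm_propose(history, step):
    """Stand-in for the LLM optimizer: a real system prompts the model
    with all (solution, score) pairs so far and asks for a better
    solution. Here we just perturb the best solution found so far."""
    best = max(history, key=lambda h: h[1])[0]
    return max(1, best + (1 if step % 2 == 0 else -1))

def opro(steps=20):
    """Maintain a scored trajectory of solutions and let the 'LLM'
    propose the next one from it, as in optimization by prompting."""
    history = [(1, score(1))]
    for step in range(steps):
        candidate = llm_propose(history, step)
        history.append((candidate, score(candidate)))
    return max(history, key=lambda h: h[1])
```

The only signal the proposer sees is the history of solutions and their scores; the metric itself is never described to it, mirroring the observation that the LLM discovers the ascent direction on its own.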
🔁 Probably the most eye-opening 👀 work in the self series is the self-taught optimizer (STOP) by Zelikman et al. We know LLMs are guided by textual prompts, taking text as input and producing text as output. While these texts are usually treated as separate variables, what happens if we model them as a single variable? In STOP, the authors draw inspiration from self-modifying code and use a self-improvement prompt to improve itself.
While the seed prompt is no more complicated than a random search algorithm, with a strong LLM one can discover many advanced meta-heuristic algorithms. Interestingly, GPT-4 discovers several prompting strategies that were published after its training cutoff, including ToT and Parsel. It seems the day when LLMs conduct research by themselves is approaching. One step in this direction is recent work by Huang et al. showing that LLMs are capable of designing ML models for common benchmarks and even Kaggle challenges.
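The self-application trick can be sketched numerically. In this toy, a random-search "improver" in the spirit of STOP's seed plays the role of the self-improvement prompt, and a 1-D value plays the role of the program being improved; the self-application step then feeds the improver its own hyperparameter as the thing to improve. Everything here (names, the quadratic utility, the hyperparameter choice) is an illustrative assumption, not the paper's setup, where both the improver and the solutions are actual code scored by downstream utility.

```python
import random

def task_utility(x):
    """Downstream utility of a candidate solution (a toy 1-D value;
    in STOP the solution is a program)."""
    return -(x - 3) ** 2

def improver(solution, utility, tries=20):
    """Seed improver in the spirit of STOP's simple seed: perturb the
    current solution and keep changes that raise the utility."""
    best = solution
    for _ in range(tries):
        candidate = best + random.uniform(-1, 1)
        if utility(candidate) > utility(best):
            best = candidate
    return best

def meta_utility(tries):
    """Utility of the improver itself: how good a solution does it find
    with this hyperparameter? Evaluated under a fixed seed so repeated
    calls are comparable; the outer random state is saved and restored."""
    state = random.getstate()
    random.seed(0)
    out = task_utility(improver(0.0, task_utility, int(max(1, tries))))
    random.setstate(state)
    return out

# Self-application: the improver improves its own hyperparameter,
# collapsing "program" and "improver" into one variable.
tuned_tries = improver(5, meta_utility, tries=10)
```

The structural point is the last line: the same routine that improves solutions is handed itself (here, via its hyperparameter) as the object to improve, which is exactly the collapse of prompt, input, and output into one variable described above.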
Evaluations and observations
➡️ Kandpal et al. conducted a systematic study of the memorization ability of LLMs. They asked an LLM factual questions from Wikipedia and found that accuracy is highly correlated with the frequency of the questioned entities in the pretraining documents, regardless of model scale. By extrapolating the trend, the authors estimate that a model with 10¹⁸ parameters, far larger than any of today’s LLMs, would be needed to match human performance on long-tail entities. Hence an important takeaway is to use LLM reasoning for tasks involving frequent knowledge, and to consider RAG or other tools for tasks involving long-tail knowledge.
➡️ As the community builds ever-bigger mixtures for training LLMs, one concern is that LLMs may not learn to actually reason but simply memorize solutions from the training distribution, much like students taught to the test. Wu et al. address this concern by comparing the performance of GPT-4 with zero-shot CoT on 11 different tasks, each with a default setting and a counterfactual setting. They observe that although LLMs perform better than random in the counterfactual settings, their performance consistently lags behind that in the default settings. How to train models to focus more on reasoning than on memorization remains an open question.
➡️ Saparov et al. extended the synthetic dataset PrOntoQA to an OOD setting to test the generalization ability of LLMs on deductive reasoning with controlled depth, width, compositional structure, etc. The authors found that CoT can generalize to compositional and longer proofs. This contrasts with previous conclusions on compositional semantic parsing, possibly because deductive reasoning only requires composing deduction steps, while semantic parsing additionally deals with growing outputs. While LLMs can use most deduction rules, they require explicit demonstrations of proof by cases and proof by contradiction. There are also counterintuitive qualitative differences between in-context learning and supervised learning.
➡️ Regarding the parametric knowledge in LLMs, Berglund et al. found a phenomenon they call the reversal curse: LLMs trained to memorize “A is B” do not know that “B is A” in closed-book question answering, even though they can be prompted to perform such deductive reasoning. This indicates that LLMs lack certain kinds of symmetry in their parametric knowledge, and endowing them with such symmetry is crucial for better generalization. Actually, the knowledge graph community has long been a leader in this area, with works like double permutation equivariance and relational rotation. It would be interesting to see how these ideas are adapted to LLMs.