“Slow thinking (system 2) is effortful, infrequent, logical, calculating, conscious.”
— Daniel Kahneman, Thinking, Fast and Slow
CoT and ToT
We might start our story with Chain-of-Thought (CoT) prompting — a technique that guides LLMs to reason step by step. Look at the following example:
Q: I bought 10 apples in the supermarket. Then I gave 2 apples to the neighbor and another 2 to my pigs. Not enough apples for me now! I then bought 5 more apples from a grocery but 1 was rotten. How many apples now?
A: You started with 10 apples. You gave 2 apples to the neighbor and 2 to the pigs, leaving you with 6 apples. Then you bought 5 more apples, so you had 11 apples. Finally, 1 apple was rotten, so you have 10 apples in the end.
As you can see, when guided to think step by step, the LLM produces more reliable reasoning.
Tree-of-Thought (ToT) prompting expands on CoT. As the name suggests, it organizes reasoning as a tree, where each node is a potential “thought” and the branches are possible reasoning paths. Unlike CoT, it does not follow a single linear chain.
In practice, ToT builds a tree of branches, each with sub-steps leading to a possible resolution. The model evaluates each step, classifying each idea as “sure,” “likely,” or “impossible.” ToT then explores the problem space with search algorithms such as Breadth-First Search (BFS) or Depth-First Search (DFS) to choose the best paths.
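To make this concrete, here is a minimal sketch of a BFS-style ToT search. The `expand` and `evaluate` callbacks are hypothetical placeholders: `expand` proposes child thoughts and `evaluate` returns one of the three classifications above, pruning “impossible” branches and stopping at the first “sure” one.

```python
from collections import deque

def tot_bfs(root, expand, evaluate, max_depth=3):
    """Breadth-first Tree-of-Thought search (illustrative sketch).

    expand(thought)   -> list of child thoughts
    evaluate(thought) -> "sure", "likely", or "impossible"
    Returns the first thought rated "sure", else the last "likely" one seen.
    """
    best = None
    queue = deque([(root, 0)])
    while queue:
        thought, depth = queue.popleft()
        rating = evaluate(thought)
        if rating == "impossible":
            continue  # prune this branch entirely
        if rating == "sure":
            return thought
        best = thought  # remember the most recent "likely" thought
        if depth < max_depth:
            for child in expand(thought):
                queue.append((child, depth + 1))
    return best
```

A DFS variant would simply pop from the right end of the queue instead; the pruning and stopping rules stay the same.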
Here is an office-building example: my office has a maximum capacity of 20 people, but 28 people are coming this week. I have the following three branches as possible solutions:
- Branch 1: Move 8 people
- Is there another room nearby? → Yes
- Does it have space for 8 more people? → Yes
- Can we move people without any administrative process? → Maybe
- Evaluation: Promising!
- Branch 2: Expand the Room
- Can we make the room larger? → Maybe
- Is this allowed under safety? → No
- Can we request an exception in the building? → No
- Evaluation: It won't work!
- Branch 3: Split the group into two
- Can we divide these people into two groups? → Yes
- Can we let them come on different days? → Maybe
- Evaluation: Good potential!
As you can see, this process mimics how we solve hard problems: we don’t think in a straight line. Instead, we explore, evaluate, and choose.
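The office example above can be sketched in a few lines of code. The branch names, answers, and the `evaluate` rule are all hypothetical simplifications: a single “No” kills a branch, and otherwise more “Yes” answers make it more promising.

```python
# Each branch maps to the yes/no/maybe answers from its sub-steps.
branches = {
    "Move 8 people": ["Yes", "Yes", "Maybe"],
    "Expand the room": ["Maybe", "No", "No"],
    "Split the group": ["Yes", "Maybe"],
}

def evaluate(answers):
    # A single "No" kills the branch; otherwise, the more "Yes", the better.
    if "No" in answers:
        return "impossible"
    if answers.count("Yes") >= answers.count("Maybe"):
        return "promising"
    return "likely"

ratings = {name: evaluate(answers) for name, answers in branches.items()}
```

Searching this tiny tree amounts to keeping any branch not rated “impossible” and pursuing the most promising one first.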
Case study
Minesweeper
It is almost impossible, but in case you don't know it, Minesweeper is a simple video game.
The board is divided into cells, with mines randomly distributed among them. The number on a revealed cell shows how many mines are adjacent to it. You win when you open all the cells that do not contain mines. However, if you hit a mine first, the game is over and you lose.
We will apply ToT to Minesweeper, which requires logical rules and reasoning under constraints.
We simulate the game with the following code:
import random

import numpy as np

BOARD_SIZE = 8   # the board is BOARD_SIZE x BOARD_SIZE
NUM_MINES = 10   # number of hidden mines

# --- Game Setup ---
def generate_board():
    board = np.zeros((BOARD_SIZE, BOARD_SIZE), dtype=int)
    mines = set()
    while len(mines) < NUM_MINES:
        r, c = random.randint(0, BOARD_SIZE - 1), random.randint(0, BOARD_SIZE - 1)
        if (r, c) not in mines:
            mines.add((r, c))
            board[r][c] = -1  # -1 represents a mine
    # Fill in adjacent mine counts
    for r in range(BOARD_SIZE):
        for c in range(BOARD_SIZE):
            if board[r][c] == -1:
                continue
            count = 0
            for dr in [-1, 0, 1]:
                for dc in [-1, 0, 1]:
                    if 0 <= r + dr < BOARD_SIZE and 0 <= c + dc < BOARD_SIZE:
                        if board[r + dr][c + dc] == -1:
                            count += 1
            board[r][c] = count
    return board, mines
You can see that we generate a BOARD_SIZE × BOARD_SIZE board containing NUM_MINES mines.
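As a quick sanity check on the adjacency logic, we can recompute the counts by hand on a tiny 3×3 board with a single mine. This is a hypothetical example for illustration, not part of the original notebook:

```python
import numpy as np

# A 3x3 board with one mine at (0, 0); numbers are the expected
# adjacent-mine counts that generate_board() would produce.
board = np.array([
    [-1, 1, 0],
    [ 1, 1, 0],
    [ 0, 0, 0],
])

def adjacent_mines(board, r, c):
    """Count mines in the (up to 8) cells surrounding (r, c)."""
    n = board.shape[0]
    return sum(
        board[r + dr][c + dc] == -1
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0) and 0 <= r + dr < n and 0 <= c + dc < n
    )
```

Every non-mine cell on this board matches its recomputed count, which is exactly the invariant the nested loops in generate_board() establish.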
ToT LLM Agent
We are now ready to build our ToT LLM agent to solve the Minesweeper puzzle. First, we define a function that uses an LLM such as GPT-4o to generate thoughts about the current board.
def llm_generate_thoughts(board, revealed, flagged_mines, known_safe, k=3):
    board_text = board_to_text(board, revealed)
    valid_moves = [
        [r, c]
        for r in range(BOARD_SIZE)
        for c in range(BOARD_SIZE)
        if not revealed[r][c] and [r, c] not in flagged_mines
    ]
    prompt = f"""
You are playing an 8x8 Minesweeper game.
- A number (0–8) shows how many adjacent mines a revealed cell has.
- A '?' means the cell is hidden.
- You have flagged these mines: {flagged_mines}
- You know these cells are safe: {known_safe}
- Your job is to choose ONE hidden cell that is least likely to contain a mine.
- Use the following logic:
  - If a cell shows '1' and touches exactly one '?', that cell must be a mine.
  - If a cell shows '1' and touches one already flagged mine, its other neighbors are safe.
  - Cells next to '0's are generally safe.
You have the following board:
{board_text}
Here are all valid hidden cells you can choose from:
{valid_moves}
Step-by-step:
1. List {k} possible cells to click next.
2. For each, explain why it might be safe (based on adjacent numbers and known info).
3. Rate each move from 0.0 to 1.0 as a safety score (1 = definitely safe).
Return your answer in this exact JSON format:
[
  {{ "cell": [row, col], "reason": "...", "score": 0.95 }},
  ...
]
"""
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
        )
        content = response.choices[0].message.content.strip()
        print("\n[THOUGHTS GENERATED]\n", content)
        return json.loads(content)
    except Exception as e:
        print("[Error in LLM Generation]", e)
        return []
This might look a little long, but the essential part of the function is the prompt, which not only explains the rules of the game to the LLM (how to read the board, which moves are valid, and so on) but also the reasoning behind each valid move. It also tells the LLM how to assign a score to each possible move. Together these construct our branches of thoughts and, finally, our ToT tree. For example, we have a step-by-step guide:
1. List {k} possible cells to click next.
2. For each, explain why it might be safe (based on adjacent numbers and known info).
3. Rate each move from 0.0 to 1.0 as a safety score (1 = definitely safe).
These lines guide the LLM to propose multiple moves and to justify each of them based on the current state; it then has to rate each possible move with a score ranging from 0.0 to 1.0. The agent will use these scores to find the best option.
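Note that the function relies on a board_to_text helper that is not shown above (it is in the notebook). A minimal sketch, assuming revealed cells show their number and hidden cells show '?', might look like this:

```python
# Hypothetical sketch of the board_to_text helper used in the prompt:
# revealed cells print their adjacent-mine count, hidden cells print '?'.
def board_to_text(board, revealed):
    rows = []
    for r in range(len(board)):
        cells = [
            str(board[r][c]) if revealed[r][c] else "?"
            for c in range(len(board[r]))
        ]
        rows.append(" ".join(cells))
    return "\n".join(rows)
```

For example, a 2×2 board with only the top-left cell revealed would render as two short lines, one per row, which is easy for the LLM to read inside the prompt.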
We now build an LLM agent that uses these generated thoughts to make an actual move. Consider the following code:
def tot_llm_agent(board, revealed, flagged_mines, known_safe):
    thoughts = llm_generate_thoughts(board, revealed, flagged_mines, known_safe, k=5)
    if not thoughts:
        print("[ToT] Falling back to baseline agent due to no thoughts.")
        return baseline_agent(board, revealed)
    thoughts = [
        t for t in thoughts
        if 0 <= t["cell"][0] < BOARD_SIZE and 0 <= t["cell"][1] < BOARD_SIZE
    ]
    thoughts.sort(key=lambda x: x["score"], reverse=True)
    for t in thoughts:
        if t["score"] >= 0.9:
            move = tuple(t["cell"])
            print(f"[ToT] Confidently choosing {move} with score {t['score']}")
            return move
    print("[ToT] No high-confidence move found, using baseline.")
    return baseline_agent(board, revealed)
The agent first calls the LLM to suggest several possible next moves, each with a confidence score. If the LLM fails to return any thoughts, the agent falls back to a baseline agent defined earlier, which can only make random moves. If we are fortunate enough to get several proposed moves, the agent applies a first filter to exclude invalid moves, such as those that fall outside the board. It then sorts the valid thoughts by confidence score in descending order and returns the best move if its score is at least 0.9. If none of the suggestions clears this threshold, it falls back to the baseline agent.
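The baseline_agent fallback itself is defined in the notebook rather than shown here. A minimal sketch, assuming it just clicks a uniformly random hidden cell, could be:

```python
import random

# Hypothetical sketch of the baseline fallback: pick any hidden cell at random.
def baseline_agent(board, revealed):
    hidden = [
        (r, c)
        for r in range(len(board))
        for c in range(len(board[r]))
        if not revealed[r][c]
    ]
    return random.choice(hidden) if hidden else None
```

This gives the ToT agent a safety net: even when the LLM returns nothing usable, the game can always continue with a (riskier) random click.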
Play
We will now play a standard 8×8 Minesweeper board game with 10 hidden mines. We played 10 games and won all of them, an accuracy of 100%! Please check the notebook for the complete code.
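For completeness, a game loop tying everything together might look like the sketch below. The agent is passed in as a callback so the same loop works for both the ToT agent and the baseline; the function signature and win/lose bookkeeping are assumptions, not the notebook's exact code.

```python
# Hypothetical game loop: keep opening cells until we hit a mine (lose)
# or have revealed every non-mine cell (win).
def play_game(board, mines, agent):
    n = len(board)
    revealed = [[False] * n for _ in range(n)]
    safe_cells = n * n - len(mines)
    opened = 0
    while opened < safe_cells:
        r, c = agent(board, revealed)
        if (r, c) in mines:
            return False  # clicked a mine: game over
        if not revealed[r][c]:
            revealed[r][c] = True
            opened += 1
    return True  # every safe cell is open: win
```

Running this loop 10 times with the ToT agent (and counting the True results) reproduces the accuracy measurement described above.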
Conclusion
ToT prompting gives LLMs such as GPT-4o more reasoning ability, going beyond fast, intuitive thinking. We applied ToT to the Minesweeper game and obtained good results. This example shows that ToT can transform LLMs from chat assistants into capable problem solvers with real logic and reasoning ability.