Introduction
Recent large language models (LLMs) — such as OpenAI’s o1/o3, DeepSeek’s R1 and Anthropic’s Claude 3.7 — demonstrate that allowing the model to think deeper and longer at test time can significantly enhance its reasoning capability. The core approach underlying this deep-thinking capability is chain-of-thought (CoT), where the model iteratively generates intermediate reasoning steps and appends them to the current context until producing the final answer.
However, as tasks become increasingly complex, the number of steps needed to solve them grows dramatically. For instance, consider solving NP-hard problems with CoT: the reasoning trace would inevitably span exponentially many steps, assuming a fixed-size Transformer as the base model and P ≠ NP. This raises an important question:
Will CoT-based test-time scaling hit hard ceilings?
Unfortunately, probably yes. Various limitations emerge for harder tasks: (1) chains will inevitably exceed the model’s context window, (2) critical information becomes buried and nearly impossible to retrieve from the many preceding tokens, and (3) self-attention over an ever-growing context makes generating each new token prohibitively expensive.

In this article, we challenge the conventional “write-only” CoT reasoning paradigm that dominates current LLM architectures, from both theoretical and practical perspectives. Furthermore, we explore a fundamentally different reasoning approach that allows an LLM not only to generate thoughts, but also to erase them. This capacity for thought erasure not only offers significant practical benefits in performance and efficiency, but also proves fundamental for achieving optimal reasoning efficiency from the perspective of computational theory.
This post is based on the paper C. Yang et al., “PENCIL: Long thoughts with short memory”, accepted at the International Conference on Machine Learning (ICML) 2025, a collaboration with Nathan Srebro, David McAllester, and Zhiyuan Li. Code is also available.

Not Everything Needs to Be Remembered
The idea of selectively discarding information has deep roots in computer science history, from the earliest computational models to modern systems. The classic Turing machine overwrites symbols on its tape rather than preserving every state; programming languages reclaim memory through stack frames that are automatically released when functions complete their execution; and modern garbage collectors continuously identify and remove objects no longer accessible to the program. These mechanisms weren’t merely efficiency optimizations — they were essential design choices that made complex computation possible within finite resources.
This idea also applies to human reasoning. In theorem proving, once a lemma is established, we discard its detailed derivation while preserving the result; when exploring problem-solving approaches, we simply mark unproductive paths as “failed” without retaining their full traces. Throughout complex reasoning, we naturally compress information, retaining conclusions while discarding the scaffolding used to reach them.
PENCIL: A New Reasoning Paradigm
Therefore, we propose PENCIL, a new reasoning paradigm for LLMs. Unlike CoT, which only generates thoughts, PENCIL recursively generates and erases thoughts until reaching the final answer. It maintains only the minimal context required for generating future thoughts, so the model can think longer and deeper to solve harder tasks with shorter working memory. The following figure illustrates how PENCIL works.

How Do Models Erase Thoughts?
PENCIL’s erasure mechanism draws on two classical ideas. First, rewriting rules from logic and classical automated theorem proving, which repeatedly apply predefined rules to simplify complex logical or arithmetic expressions into canonical forms until a final answer is reached. Second, stack frames from functional programming languages, which store local variables when a function is called and release the corresponding memory when it returns, automatically discarding intermediate states that are no longer needed.
Specifically, we introduce three special tokens, called [CALL], [SEP], and [RETURN], and use the following reduction rule to implement erasure:
C [CALL] T [SEP] A [RETURN]   ⇒   C A
where C stands for context, T stands for intermediate thoughts, and A stands for answer. Whenever the generated sequence completely matches the pattern on the left, PENCIL triggers the reduction rule, erasing thoughts and merging the answer back into the context. It is important to note that C, T and A can themselves contain special tokens, thereby supporting recursive structures similar to nested function calls — for example, C may contain another [CALL] token, indicating that a new thinking subroutine has been initiated.
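To make the mechanism concrete, here is a minimal Python sketch of the reduction rule applied to a flat list of string tokens (my own illustration, not the paper’s implementation): when the sequence ends with [RETURN], locate the last [SEP] and the matching [CALL] before it, and splice the answer back onto the context.

```python
# Minimal sketch of the reduction rule on a list of string tokens (my own
# illustration, not the paper's implementation). It reduces
#   C [CALL] T [SEP] A [RETURN]  ->  C A
# whenever the sequence ends with [RETURN].
CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"

def reduce_once(tokens):
    if not tokens or tokens[-1] != RETURN:
        return tokens                                    # nothing to reduce yet
    sep = len(tokens) - 1 - tokens[::-1].index(SEP)      # last [SEP]
    call = sep - 1 - tokens[:sep][::-1].index(CALL)      # matching [CALL] before it
    context, answer = tokens[:call], tokens[sep + 1:-1]
    return context + answer                              # thoughts T are erased

trace = ["x1=?", CALL, "try", "x1=True", "ok", SEP, "x1=True", RETURN]
print(reduce_once(trace))   # ['x1=?', 'x1=True']
```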
How to Use PENCIL?
PENCIL’s erasure mechanism flexibly supports various reasoning patterns, such as:
1⃣ Task Decomposition: Using [CALL] to initiate a subproblem, generating intermediate results, and then using [SEP] and [RETURN] to merge the output and erase the subproblem’s reasoning details;
2⃣ Branch and Backtrack: Using a [CALL], [SEP], [RETURN] triplet to manage an exploration branch in a search tree, erasing invalid paths upon conflicts or failures;
3⃣ Summarization / Tail Recursion: Condensing a lengthy reasoning trace into a concise summary, similar to tail recursion optimization in programming:
C [CALL] T [SEP] [CALL] T’ [RETURN]   ⇒   C [CALL] T’
where T represents the original complex reasoning process (or a more difficult problem), and T’ represents the summarized or simplified content (or an equivalent, more tractable problem).
Example on an NP-Complete Task
For example, consider the classic NP-Complete problem Boolean Satisfiability (SAT): given a Boolean formula, determine whether there exists a variable assignment that makes it true. This problem is widely believed to require exponential time but only polynomial space to solve, with the simplest approach being to traverse a binary search tree of depth n.
Traditional CoT would accumulate all intermediate calculations, causing the context length to grow in proportion to the number of nodes in the search tree, i.e., exponentially as O(2^n). In comparison, PENCIL can recursively branch to try True/False for a variable, backtrack upon a conflict, and erase all thoughts within that branch. This keeps the context length proportional to the search depth, i.e., only O(n).
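As a rough illustration of this gap (a toy sketch under simplifying assumptions, not the paper’s data format), the following counts the peak context length during an exhaustive depth-n binary search, with and without erasing a branch once it returns:

```python
# Toy sketch (not the paper's data generator): peak context length when
# exhaustively searching a binary tree of depth n, counting one token per
# visited node. erase=False mimics CoT, where every visited node stays in
# context; erase=True mimics PENCIL, where a branch is reduced away once it
# returns.
def max_context(n, erase):
    peak = 0

    def dfs(depth, length):
        nonlocal peak
        length += 1                          # token for the current node
        peak = max(peak, length)
        if depth < n:
            for _ in ("True", "False"):      # branch on the next variable
                child_len = dfs(depth + 1, length)
                if not erase:                # CoT keeps the whole sub-branch
                    length = child_len
        return length

    dfs(0, 0)
    return peak

for n in (4, 8, 12):
    print(n, max_context(n, erase=False), max_context(n, erase=True))
# CoT grows as 2^(n+1) - 1 while PENCIL stays at n + 1:
# 4 31 5
# 8 511 9
# 12 8191 13
```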
The following figure compares the maximum context length of the vanilla CoT without reduction (blue) and PENCIL with reduction (red). As problem complexity increases, PENCIL achieves dramatic space efficiency, notably reducing context length from 151,192 to just 3,335 tokens for Einstein’s Puzzle.

Training and Experiments
The core difference between CoT and PENCIL during training is the calculation of the loss function:


For CoT, the loss for each new token is computed on the complete historical context; for PENCIL, after each “write-erase” iteration, the loss for new tokens is computed only on the reduced sequence. Although both generate the same number of tokens, PENCIL significantly shortens the context over which each token is predicted, and is thus more efficient.
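A hedged sketch of this difference, assuming a standard next-token cross-entropy setup (the `model`, the tensor shapes, and the segmentation of an example into write-erase iterations are illustrative assumptions, not the paper’s training code):

```python
import torch
import torch.nn.functional as F

# Illustrative sketch, not the authors' training code. `model` is assumed to
# map token ids of shape (1, seq_len) to logits of shape (1, seq_len, vocab).
# A PENCIL training example is segmented into a list of (context, new_tokens)
# pairs, where `context` is the sequence *after* the previous reduction.

def cot_loss(model, full_sequence):
    # CoT: every new token is predicted from the complete history.
    logits = model(full_sequence[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           full_sequence[:, 1:].reshape(-1))

def pencil_loss(model, iterations):
    # PENCIL: after each write-erase iteration, the new tokens are predicted
    # from the *reduced* context only, so each forward pass is much shorter.
    losses = []
    for context, new_tokens in iterations:
        seq = torch.cat([context, new_tokens], dim=1)
        logits = model(seq[:, :-1])
        new_logits = logits[:, context.size(1) - 1:]   # predictions for new tokens
        losses.append(F.cross_entropy(new_logits.reshape(-1, new_logits.size(-1)),
                                      new_tokens.reshape(-1)))
    return torch.stack(losses).mean()
```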
It’s also worth noting that after each reduction, the KV cache for the shared prefix C can be reused directly; only the cache for the shorter new part A needs to be recomputed.
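As a tiny illustration (a hypothetical helper, not from the released code), the number of reusable KV-cache entries after a reduction is simply the length of the shared prefix between the pre- and post-reduction sequences:

```python
# Tiny illustration (hypothetical helper, not from the released code):
# after a reduction, the reusable KV-cache entries are exactly the tokens in
# the longest common prefix of the pre- and post-reduction sequences.
def reusable_kv_entries(before, after):
    n = 0
    for a, b in zip(before, after):
        if a != b:
            break
        n += 1
    return n

before = ["c1", "c2", "[CALL]", "t1", "t2", "[SEP]", "a1", "[RETURN]"]
after  = ["c1", "c2", "a1"]                 # result of applying the reduction rule
print(reusable_kv_entries(before, after))   # 2 -> only "a1" must be re-encoded
```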
Experimental Results
Our experiments focus on three inherently hard reasoning tasks: 3-SAT (NP-Complete), QBF (PSPACE-Complete), and Einstein’s Puzzle (natural language reasoning). For each task, we wrote a generator to produce training data that includes the special tokens. We train a small transformer from random initialization for each task (10.6M parameters for SAT/QBF; 25.2M parameters for Einstein’s Puzzle).
Compared to CoT, we found that PENCIL can solve larger-scale reasoning problems. As shown in the figure below, for the SAT (left) and QBF (right) tasks, when the problem size is small, both CoT and PENCIL solve the problems perfectly; but as the size increases, traditional CoT accuracy drops significantly (e.g., only about 50% for SAT at n=10), while PENCIL maintains high accuracy (≥ 99%). This is primarily because CoT’s context length explodes exponentially, whereas PENCIL avoids this explosion through dynamic reduction.

Additionally, PENCIL significantly saves computational resources. As shown in the figure, for QBF (n=3–6) tasks, we compared the convergence speed of CoT (blue) and PENCIL (red) under the same FLOPs budget. PENCIL quickly reaches 100% accuracy while CoT, due to continuously expanding context length, requires more FLOPs to approach optimality. As the problem size increases, the gap between the two becomes more pronounced.

Figure: Convergence of CoT (blue) and PENCIL (red) under the same FLOPs budget on QBF (n = 3 to 6). Circles and vertical lines indicate the first time each method reaches optimal performance.
We further considered a very difficult logical reasoning problem: Einstein’s Puzzle. Each problem consists of 5 houses and 5 attribute categories of people living in them — color, nationality, drink, cigarette, and pet (e.g., Red/Green/Blue, Brit/German/Swede, Bird/Dog/Fish, etc.). Given clues like “the green house is right next to the bird owner’s” and “the dog owner lives in the red house,” the task is to deduce “who owns the fish?” This problem presents an extreme challenge for existing LLMs: even GPT-4 struggles to solve it. The figure below shows a simplified version with only 3 houses and 3 attribute categories:

As shown below, for this problem that even large models struggle with, PENCIL achieves 97% accuracy using only a small 25.2M parameter model, while traditional CoT achieves only 25% accuracy (close to random guessing).

Theory: Universal Efficient Computation
We further demonstrate PENCIL’s fundamental advantage over traditional CoT from the perspective of theoretical expressive power: PENCIL is Turing complete with optimal space complexity, and can therefore solve arbitrary computable tasks efficiently. This is something fundamentally impossible for CoT!
Main Results
Specifically, we prove: Using a fixed, finite-sized Transformer, PENCIL can simulate any Turing machine with optimal time and space complexity, thereby efficiently solving all computable problems.

In other words, for any Turing machine running in time T and space S, PENCIL requires only O(T) generated tokens while maintaining a maximum context length of O(S) to produce identical results. While previous work established that traditional CoT can make Transformers Turing complete, it demands O(T) context length, with each token representing an intermediate computation step. This difference in maximum context length is crucial because, for most algorithms, the space complexity S is substantially smaller than the time complexity T, especially for harder problems.
Consider NP-Complete problems like Traveling Salesman or Hamiltonian Circuit, which are widely believed to require exponential time but are solvable in polynomial space. Traditional CoT cannot solve these within a polynomial context length; it requires context of at least exponential length, which exceeds the practical memory limits of any real system. PENCIL, in contrast, can solve them using only polynomial maximum context length, making previously intractable reasoning tasks feasible.
Proof Sketch
We now briefly introduce our proof idea, where the key insight is to have PENCIL use a series of “Simulation-Summarization” iterations to clean the memory.

Step 1: Using CoT to Encode Turing Machine Transitions As illustrated in the left part of the figure above, we encode each Turing machine state transition as a token whose embedding encodes the triplet of “new state”, “written symbol”, and “head movement direction”. The model can use self-attention to compute the current head position and determine the symbol at that position. Without reduction, this process generates T tokens with context length O(T).
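For intuition, one simulated step might be represented by a token carrying the following fields (an illustrative Python mock-up with my own names, not the actual embedding scheme):

```python
from dataclasses import dataclass

# Illustrative mock-up only (my own names, not the paper's embedding scheme):
# with CoT alone, the model appends one such transition token per simulated
# Turing machine step, giving T tokens and O(T) context.
@dataclass(frozen=True)
class TransitionToken:
    new_state: str        # state entered after this step
    written_symbol: str   # symbol written at the current head position
    head_move: int        # -1 for left, +1 for right

trace = [TransitionToken("q1", "1", +1), TransitionToken("q0", "0", -1)]
```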
Step 2: Alternating “Simulation-Summarization” PENCIL achieves space/time optimality by alternating between two phases:
- Simulation: Continuously generate Turing machine state transition tokens, simulating multiple computation steps;
- Summarization: When the thoughts generated since the last summary exceed twice the space used by the Turing machine, summarize the current computation state using roughly S tokens. The reduction rule then discards the previous thoughts, keeping only the latest Turing machine state for the next round.
This strategy maintains total token generation at O(T) while limiting context length to O(S).
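A rough Python sketch of this schedule (my own toy simulation, not the paper’s FASP construction) on a trivial machine whose head just sweeps back and forth over m tape cells, so its space stays small while its running time grows:

```python
# Rough toy sketch (my own code, not the paper's FASP construction) of the
# Simulation-Summarization schedule. The "machine" sweeps its head back and
# forth over m tape cells, flipping bits, so its space S = m stays small while
# its running time T grows. One token is appended per simulated step; whenever
# the context exceeds twice the space in use, it is reduced to an O(S) summary
# of the current configuration.
def simulate(m, T):
    tape, head, direction = [0] * m, 0, 1

    def summary():                           # O(S) tokens: the whole configuration
        return [("cell", s) for s in tape] + [("head", head, direction)]

    context = summary()
    total = peak = len(context)
    for _ in range(T):
        # Simulation: one transition token per machine step
        tape[head] ^= 1
        if head + direction in (-1, m):      # bounce at the tape ends
            direction = -direction
        head += direction
        context.append(("step", head))
        total += 1
        peak = max(peak, len(context))
        # Summarization: triggered when the context exceeds 2x the space
        if len(context) > 2 * (m + 1):
            context = summary()              # reduction erases the old thoughts
            total += len(context)
    return total, peak

print(simulate(m=8, T=10_000))               # total tokens ~ O(T), peak context ~ O(S)
```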
Step 3: Transformer Implementation To prove this process can be implemented by a Transformer, we developed the Full-Access Sequence Processing (FASP) programming language and proved that any algorithm written in FASP can be implemented by a fixed-size Transformer. In a FASP program, each variable corresponds to a Transformer sub-module, and each line of code transforms existing variables into a new variable through predefined functions, which is equivalent to constructing a more complex Transformer from these sub-modules. The variable returned by the program is the desired Transformer that encodes the algorithm. We wrote a FASP program implementing the “Simulation-Summarization” procedure, which implies there exists a constant-sized Transformer that can perform the same function.
Conclusion
In conclusion, we propose PENCIL, a new reasoning paradigm that alternates between generation and erasure, enabling models to think deeper and solve more complicated problems. Theoretically, we prove that PENCIL achieves Turing completeness with optimal time and space efficiency, and can thus efficiently solve any computable problem. Looking forward, a promising direction is to fine-tune LLMs to incorporate PENCIL’s memory-efficient reasoning. We hope these findings will inspire a reexamination of current reasoning models from the perspective of the theory of computation.
