Faith and Fate: Limits of Transformers on Compositionality

Transformer language models like GPT-4 and ChatGPT have demonstrated remarkable capabilities across a wide range of tasks, sparking both admiration and concern about their potential impact. However, a recent paper titled "Faith and Fate: Limits of Transformers on Compositionality," by researchers from the Allen Institute for AI, the University of Washington, the University of Southern California, and the University of Chicago, takes a critical look at the limitations of these models on tasks requiring multi-step compositional reasoning.

The researchers investigated transformer performance on three representative compositional tasks: multi-digit multiplication, logic grid puzzles, and a classic dynamic programming problem. By formulating these tasks as computation graphs, they were able to systematically quantify the level of complexity and break down reasoning steps.
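
To make the framing concrete, here is a rough sketch of how a multi-digit multiplication could be cast as a computation graph. The node layout and the treatment of average parallelism as graph size divided by depth are simplifying assumptions for illustration, not the authors' exact construction.

```python
from collections import defaultdict

def multiplication_graph(a: int, b: int):
    """Toy computation graph for long multiplication of a and b.
    Nodes: input digits, single-digit products, and per-position sums
    (the carry from the previous position is folded into each sum node)."""
    da = [int(d) for d in str(a)][::-1]   # least-significant digit first
    db = [int(d) for d in str(b)][::-1]
    parents = defaultdict(list)           # node -> the nodes it is computed from

    for i in range(len(da)):
        for j in range(len(db)):
            parents[("mul", i, j)] += [("in_a", i), ("in_b", j)]

    for pos in range(len(da) + len(db) - 1):
        deps = [("mul", i, j) for i in range(len(da))
                for j in range(len(db)) if i + j == pos]
        if pos > 0:
            deps.append(("sum", pos - 1))  # the carry chains positions together
        parents[("sum", pos)] = deps

    def depth(node):  # length of the longest path ending at this node
        return 0 if node not in parents else 1 + max(depth(p) for p in parents[node])

    nodes = set(parents) | {p for deps in parents.values() for p in deps}
    d = max(depth(n) for n in nodes)
    return {"size": len(nodes), "depth": d, "avg_parallelism": len(nodes) / d}

print(multiplication_graph(37, 58))      # 2x2 digits: a small, shallow graph
print(multiplication_graph(9371, 5863))  # 4x4 digits: larger, deeper, wider
```

Larger inputs produce larger and deeper graphs, which is exactly the axis along which the paper tracks performance.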

Limitations of Transformers

Zero-shot and Few-shot Settings

In the zero-shot and few-shot settings, transformer performance degrades rapidly from near-perfect to zero as task complexity increases, whether complexity is measured by problem size or by the average parallelism of the computation graph. This indicates that pre-training alone is not sufficient to teach models how to combine basic operations into solutions for compositional problems, especially as complexity grows.

Question-Answer Training

To address the lack of task-specific data during pre-training, the researchers exhaustively finetuned GPT-3 on question-answer pairs. While the model achieved high accuracy on in-distribution examples (i.e., problem sizes seen during training), its accuracy declined sharply on out-of-distribution (OOD) examples with unseen problem sizes. This suggests that systematic problem-solving capabilities do not emerge even with extensive training on task-specific data.

Explicit Scratchpad Training

The researchers also tested whether explicitly teaching the required computational operations via scratchpads (step-by-step computation graphs) could improve performance. However, even with this direct guidance, GPT-3 still failed to generalize to OOD cases with wider or deeper computation graphs. This indicates that the autoregressive nature of transformers, which forces them to tackle problems sequentially, presents a fundamental challenge that cannot be resolved by instructing the model to generate step-by-step solutions.
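
To make "scratchpad" concrete: instead of training on question-answer pairs alone, the model is trained to emit every intermediate step before the final answer. The format below is an illustrative sketch for multiplication, not necessarily the exact verbalization used in the paper.

```python
def multiplication_scratchpad(a: int, b: int) -> str:
    """Illustrative scratchpad: one partial product per digit of b, then the sum."""
    steps, partials = [], []
    for j, d in enumerate(reversed(str(b))):
        p = a * int(d) * (10 ** j)
        partials.append(p)
        steps.append(f"{a} * {d} * 10^{j} = {p}")
    steps.append(" + ".join(str(p) for p in partials) + f" = {a * b}")
    return "\n".join(steps)

print(multiplication_scratchpad(37, 58))
# 37 * 8 * 10^0 = 296
# 37 * 5 * 10^1 = 1850
# 296 + 1850 = 2146
```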

Grokking

Grokking, a phenomenon where extended training beyond overfitting leads to improved generalization, was also explored. However, even after training GPT-3 with question-answer pairs for 420K steps and question-scratchpad pairs for 30K steps (far beyond the point of in-domain accuracy plateauing), no improvement in OOD generalization was observed. The absence of grokking may be due to the high difficulty of the tasks, which impedes learning well-structured representations. Even if grokking were to emerge with more prolonged training, it would be inefficient and unscalable.

Breaking Down Successes and Failures

Information Gain Explains Partial Success

Using relative information gain, the researchers predicted which surface patterns the models are likely to learn. For multiplication, the first and last digits of the output correlate strongly with the corresponding digits of the input numbers. Empirically, models do learn these predicted patterns, along with others such as the order of magnitude and the number of trailing zeros, even without scratchpads. This suggests that transformers can pick up individual input-output correlations during training and apply them directly at test time, giving a false illusion of compositional reasoning.
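
One such surface pattern can be verified with elementary arithmetic: the last digit of a product depends only on the last digits of its operands, so a model can predict it without performing the full multiplication, whereas middle digits admit no such shortcut. The check below illustrates the shortcut itself; it is not the paper's relative-information-gain computation.

```python
# The product's last digit is fully determined by the operands' last digits.
assert all((a * b) % 10 == ((a % 10) * (b % 10)) % 10
           for a in range(100, 1000) for b in range(100, 1000))

# A middle digit is not: fixing the operands' hundreds digits (3 and 5 here)
# still leaves many possible values for the product's hundreds digit.
middle_digits = {(a * b) // 100 % 10 for a in range(300, 400) for b in range(500, 600)}
print(sorted(middle_digits))  # many distinct digits -> no shallow shortcut here
```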

Reducing Multi-Step Reasoning to Subgraph Matching

The researchers hypothesized that transformers rely on pattern matching rather than learning underlying algorithms. They found that correctly predicted test examples had significantly higher frequencies of their computation subgraphs appearing in the training data compared to incorrectly predicted examples. This high correlation suggests that pattern matching, not general reasoning, drives correct outputs. While effective for low-complexity tasks, this approach fails for increasingly complex out-of-domain problems.
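
In rough pseudo-form, this measurement looks something like the sketch below. The example format (an `"answer"` field) and the `subgraphs` helper, which would return hashable keys for an example's computation subgraphs, are hypothetical placeholders rather than the paper's code.

```python
from collections import Counter
from statistics import mean

def subgraph_frequency_gap(train_set, test_set, predictions, subgraphs):
    """Compare how often the computation subgraphs of correctly vs. incorrectly
    answered test examples were seen in the training data."""
    train_counts = Counter(sg for ex in train_set for sg in subgraphs(ex))

    def avg_seen_in_training(ex):
        keys = subgraphs(ex)
        return mean(train_counts[sg] for sg in keys) if keys else 0.0

    correct = [avg_seen_in_training(ex) for ex, pred in zip(test_set, predictions)
               if pred == ex["answer"]]
    incorrect = [avg_seen_in_training(ex) for ex, pred in zip(test_set, predictions)
                 if pred != ex["answer"]]

    # The paper's observation, restated in these terms: the first average tends
    # to be substantially higher than the second.
    return (mean(correct) if correct else 0.0,
            mean(incorrect) if incorrect else 0.0)
```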

Error Types at Different Reasoning Depths

Analyzing error types at different layers of the computation graph revealed that models can perform single-step reasoning correctly, likely because they memorize single-step operations during training, but they struggle to compose multiple steps into a correct overall solution. The presence of restoration errors (nodes with correct values despite incorrect inputs) in the dynamic programming and logic puzzle tasks also points to memorization. Even when restoration errors were near zero, most correct answers for unseen problem sizes still contained errors elsewhere in the computation graph, possibly because those input-output pairs appear frequently in the pretraining data.
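
The taxonomy can be made concrete with a small classifier that compares a predicted computation graph against the gold graph node by node. The value and parent dictionaries are an assumed representation, not the paper's implementation, and a stricter version would also recompute each node's operation from the predicted parent values to separate pure propagation errors from compounded local errors.

```python
def classify_node(node, predicted, gold, parents):
    """Coarse node-level error labels, following the paper's terminology.
    `predicted` / `gold` map node -> value; `parents` maps node -> its input nodes."""
    inputs_ok = all(predicted[p] == gold[p] for p in parents[node])
    output_ok = predicted[node] == gold[node]

    if output_ok and inputs_ok:
        return "correct"
    if output_ok and not inputs_ok:
        return "restoration error"  # right value despite wrong inputs: a memorization tell
    if not output_ok and inputs_ok:
        return "local error"        # right inputs, but the single step itself went wrong
    return "propagation error"      # wrong inputs carried forward into a wrong value
```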

Key Takeaways

  1. Transformers seem to solve compositional tasks by reducing multi-step reasoning into linearized pattern matching, rather than developing systematic problem-solving skills. Their performance rapidly decays as task complexity increases.

  2. Even with explicit step-by-step training via computation graphs, transformers fail to generalize to out-of-distribution examples of higher complexity than seen during training. Near-perfect in-domain performance doesn't translate into robust compositional reasoning abilities.

  3. Transformers can memorize single-step operations but struggle to compose them into correct multi-step reasoning paths. They rely more on shallow pattern matching than deep task understanding.

  4. Theoretical arguments show that autoregressive generation is inherently prone to exponential error accumulation as the number of composition steps increases. The empirical tasks analyzed are real-world instantiations of these theoretical abstractions (a small numerical illustration follows this list).
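
To put a number on the fourth point: if each reasoning step is correct with probability p, independently of the others (a simplifying assumption), then a chain of n steps is entirely correct with probability p^n, which collapses quickly as n grows. A minimal illustration:

```python
# Compounding per-step error under an independence assumption: even a highly
# reliable step gives poor odds over long reasoning chains.
for p in (0.99, 0.95, 0.90):
    for n in (5, 15, 30, 60):
        print(f"per-step accuracy {p:.2f}, {n:2d} steps -> whole-chain accuracy {p ** n:.3f}")
```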

The authors suggest that the current transformer architecture, with its reliance on next-word prediction, may have fundamental limitations in mastering certain complex compositional operations. They do, however, point toward practical strategies for leveraging transformers' strengths while managing these weaknesses.

As transformers continue to advance and impact society, it's crucial that we soberly examine both their remarkable successes and surprising failures. This paper makes an important contribution by rigorously investigating the limits of transformers in compositional reasoning - an essential building block of intelligence. The insights can help guide the development of more robust and reliable AI systems.

While the authors acknowledge compute and access limitations in this study, they invite the broader research community to further push the boundaries of this investigation. Achieving fundamentally new innovations to enable transformers to master complex compositional reasoning remains an important open challenge. This thoughtful paper brings us one step closer by clearly identifying current limitations and suggesting pragmatic paths forward.
