Large language models (LLMs) are unique in their ability to use intelligible natural language in a way that resembles human reasoning. To take a simple example, consider the following exchange with ChatGPT-4o:

Query: “If Johnny has three green apples, and Belle has twice as many green apples as Johnny, as well as two red apples, how many apples do they have combined?”

Response (the reply typically walks through the arithmetic step by step, along these lines): Belle has twice as many green apples as Johnny, so she has 2 × 3 = 6 green apples, plus 2 red apples; combined with Johnny's 3 green apples, that makes 3 + 6 + 2 = 11 apples.

Unlike other neural networks, LLMs of a certain scale appear to produce "human-like reasoning": their use of natural language often resembles human cognition, and that matters. Labeling these models "reasoning systems" encourages people to take their responses at face value and to trust them more than they would a classic black-box deep neural network.

To better guide human-AI interaction, it's crucial to test those instincts and ask what kind of reasoning these systems actually employ under the hood. In the case of mathematics, the example above suggests that LLMs might be using a formal mathematical reasoning system akin to the one we learn in school. Formal systems of this kind are considered trustworthy because they lead to correct outcomes and generalize across similar problems.

Understanding Current LLM Reasoning

The problem, however, is that current LLMs don't truly engage in formal reasoning. Instead, given their very large training corpora and enormous parameter counts, they generate realistic-looking outputs through brute statistical learning. When a model sees "2x2=4," statistical learning doesn't involve multiplying two by two; it merely learns that after "2x2=" the likely next token is "4."
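To make that distinction concrete, here is a toy sketch of pure next-token prediction. The tiny corpus and lookup-table "model" are obviously nothing like a production LLM, but they show how "2x2=" can be completed correctly without any arithmetic taking place:

```python
from collections import Counter, defaultdict

# Toy "training corpus": the model never computes anything,
# it only records which token tends to follow which context.
corpus = [
    ("2x2=", "4"), ("2x2=", "4"), ("2x2=", "4"),
    ("3x3=", "9"), ("2x3=", "6"),
]

next_token_counts = defaultdict(Counter)
for context, next_token in corpus:
    next_token_counts[context][next_token] += 1

def predict(context: str) -> str:
    """Return the most frequent continuation seen after `context`."""
    counts = next_token_counts.get(context)
    if not counts:
        return "<unknown>"  # never seen: no arithmetic to fall back on
    return counts.most_common(1)[0][0]

print(predict("2x2="))  # "4" -- looks like multiplication, is just lookup
print(predict("7x8="))  # "<unknown>" -- no memorized continuation
```

Real LLMs generalize far beyond exact string matches, which is exactly why the question raised in the next paragraph is harder than this caricature suggests.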

Of course, to generalize as these systems do, there must be some kind of deeper pattern recognition, or else the model would need to have seen every possible multiplication problem to answer correctly. However, whether the system has formally "learned to multiply" remains an open question.

There have been attempts to encourage systems to develop these kinds of abilities. One of the most prevalent examples is Chain-of-Thought (CoT) prompting, which encourages LLMs to write out their reasoning step by step. The theory is that each line of reasoning cues the next, allowing a form of logic-like reasoning to take place.
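In practice, CoT prompting is largely a matter of prompt construction. The sketch below is illustrative only: `call_llm` is a placeholder for whichever chat-completion client you use, and the worked example inside the prompt is invented for this post.

```python
def build_direct_prompt(question: str) -> str:
    # Ask for the answer only.
    return f"Question: {question}\nAnswer with just the final number."

def build_cot_prompt(question: str) -> str:
    # Ask the model to write out intermediate steps before answering;
    # a one-shot worked example is often included as well.
    return (
        "Question: If a box holds 4 apples and you have 3 boxes plus 2 loose "
        "apples, how many apples do you have?\n"
        "Answer: Each box holds 4 apples, so 3 boxes hold 3 x 4 = 12. "
        "Adding the 2 loose apples gives 14. The answer is 14.\n\n"
        f"Question: {question}\n"
        "Answer: Let's think step by step."
    )

question = (
    "If Johnny has three green apples, and Belle has twice as many green "
    "apples as Johnny, as well as two red apples, how many apples do they "
    "have combined?"
)

# `call_llm` stands in for your chat-completion client of choice:
# print(call_llm(build_cot_prompt(question)))
print(build_cot_prompt(question))
```

The only difference between the two prompts is that one invites the model to produce intermediate text before committing to an answer; that intermediate text is what "cues the next step."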

CoT is used by OpenAI's o1 model to achieve impressive results on complex reasoning tasks; it recently scored among the top 500 students on the qualifying exam for the USA Math Olympiad (the AIME). The challenge, however, is that it remains unclear whether this reflects a genuinely different kind of reasoning (the kind humans use) or merely an incredibly effective statistical process.

A recent paper published by Apple gives us reason to doubt the former. The researchers found that formally irrelevant features of a problem, such as the names of the people involved, the specific numbers used, or the addition of coherent but semantically irrelevant sentences, caused wide swings in accuracy for state-of-the-art LLMs, including those using Chain-of-Thought prompting. The fact that such superficial features affect accuracy suggests that LLM reasoning is not akin to formal reasoning.
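The spirit of those perturbations is easy to approximate: hold the formal structure of a word problem fixed while varying names, numbers, and distractor sentences. The sketch below is a rough illustration of that idea, not the benchmark code the Apple team actually used.

```python
import random

TEMPLATE = (
    "{name1} has {n} green apples, and {name2} has twice as many green apples "
    "as {name1}, as well as {m} red apples.{distractor} How many apples do "
    "they have combined?"
)

NAMES = ["Johnny", "Belle", "Priya", "Tomás", "Aisha", "Wei"]
DISTRACTORS = [
    "",  # no distractor
    " Five of the apples were picked on a Tuesday.",  # coherent but irrelevant
    " The apples are stored in a wooden crate.",
]

def make_variant(seed: int) -> tuple[str, int]:
    """Generate one surface-level variant and its ground-truth answer."""
    rng = random.Random(seed)
    n, m = rng.randint(2, 9), rng.randint(1, 5)
    name1, name2 = rng.sample(NAMES, 2)
    question = TEMPLATE.format(
        name1=name1, name2=name2, n=n, m=m,
        distractor=rng.choice(DISTRACTORS),
    )
    answer = n + 2 * n + m  # the underlying math never changes form
    return question, answer

for seed in range(3):
    q, a = make_variant(seed)
    print(q, "->", a)
```

A model applying the formal procedure should answer every variant this produces equally well; the variability Apple observed suggests something shallower is going on.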

Current LLM benchmarks may also overstate the abilities of state-of-the-art models because of data contamination. Performance on novel, unseen benchmarks often differs sharply from performance on established ones, suggesting that models may be overfitting by memorizing questions. Overfitting is also a plausible explanation for why seemingly irrelevant features, such as specific names, affect outcomes even when the overall structure of a question stays the same.
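One common way to probe for this kind of contamination is to check whether long word n-grams from a benchmark question appear verbatim in the training corpus. The sketch below is a minimal version of that check; the `training_docs` list and the 8-gram threshold are stand-ins for a real corpus scan.

```python
import re

def ngrams(text: str, n: int = 8) -> set:
    """All word-level n-grams in `text`, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(question: str, training_docs: list, n: int = 8) -> bool:
    """Flag the question if any of its n-grams appears verbatim in training data."""
    q_grams = ngrams(question, n)
    return any(q_grams & ngrams(doc, n) for doc in training_docs)

# Stand-in corpus: in reality this scan runs over terabytes of training text.
training_docs = [
    "if johnny has three green apples and belle has twice as many green apples",
]
benchmark_question = (
    "If Johnny has three green apples, and Belle has twice as many green apples "
    "as Johnny, how many apples do they have combined?"
)
print(looks_contaminated(benchmark_question, training_docs))  # True
```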

Improving Reasoning in LLMs

More recent attempts to improve reasoning have focused on task-specific structure. For instance, researchers at DeepMind have worked on training LLMs to recognize the reasoning structure appropriate to a given task and then follow it (link). This approach has yet to be tested for robustness against evaluations like Apple's, and it remains unclear whether task-specific reasoning can carry LLMs all the way to formal reasoning. The much-vaunted goal of a robust AI reasoner continues to elude researchers.
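The general recipe, sketched very loosely below, is to have the model first select and compose a reasoning structure suited to the task and only then answer by following it. The module list, the keyword heuristic, and the prompt format are all invented for illustration; they are not DeepMind's implementation.

```python
# A sketch of task-specific reasoning structures: select reasoning "modules"
# that suit the task, then ask the model to follow the composed structure.

REASONING_MODULES = {
    "decompose": "Break the problem into smaller sub-problems.",
    "arithmetic": "Carry out each calculation explicitly and check it.",
    "constraints": "List the constraints and verify the answer against them.",
    "analogy": "Recall a similar solved problem and adapt its solution.",
}

def select_modules(task_description: str) -> list:
    # In the real systems this selection step is itself done by the LLM;
    # here a simple keyword heuristic stands in for it.
    chosen = ["decompose"]
    if any(w in task_description.lower() for w in ("how many", "sum", "twice")):
        chosen.append("arithmetic")
    chosen.append("constraints")
    return chosen

def build_structured_prompt(task: str) -> str:
    structure = "\n".join(
        f"{i + 1}. {REASONING_MODULES[m]}"
        for i, m in enumerate(select_modules(task))
    )
    return (
        f"Task: {task}\n"
        "Follow this reasoning structure, step by step:\n"
        f"{structure}\n"
        "Then state the final answer."
    )

print(build_structured_prompt(
    "If Johnny has three green apples and Belle has twice as many, "
    "plus two red apples, how many apples do they have combined?"
))
```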

Understanding Opacity

Given these limitations, and the possibility that models present invalid reasoning as sound, we face an acute problem: for the most part, the way deep neural networks process information resists interpretation. This problem of opacity, in which networks are treated as "black boxes," compounds the challenge of understanding machine reasoning. It also makes diagnosing the cause of errors and recognizing erroneous or biased outputs especially difficult.

There have been some attempts to address this issue. The field of explainable AI (XAI) and the broader discipline of machine interpretability aim either to reverse-engineer existing systems to explain their operation or to change system design to promote interpretability. For example, some recent work was able to infer how a neural network learned to perform a formal math task, showing how its internal representations shifted from a list-like, memorized configuration toward an implementation of the mathematical operation itself. This is particularly encouraging because it allows us to observe the internal workings of a model as it shifts toward a general, formal approach to reasoning.

This shift toward a more robust and generalizable algorithm is often described as grokking, and it gives AI researchers hope. Grokking is a phenomenon in which a network's test accuracy suddenly jumps long after it has already fit its training data, seemingly shifting from overfitting to learning the underlying rule (link). Applying this to LLMs has had some success; for instance, one attempt managed to "grok" a smaller LLM, which then displayed state-of-the-art performance on some tasks (link).
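Grokking is easiest to observe in small, controlled settings such as modular arithmetic. The sketch below, a small PyTorch model with heavy weight decay trained on addition modulo 97, is the kind of setup in which validation accuracy can sit near chance long after training accuracy saturates and then jump; whether and when the jump appears depends heavily on the hyperparameters, which are chosen here purely for illustration.

```python
import torch
import torch.nn as nn

# Task: (a + b) mod P, trained on half of all pairs, validated on the rest.
P = 97
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P

perm = torch.randperm(len(pairs))
split = len(pairs) // 2
train_idx, val_idx = perm[:split], perm[split:]

embed = nn.Embedding(P, 64)
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, P))
params = list(embed.parameters()) + list(model.parameters())
# Strong weight decay is one ingredient commonly associated with grokking.
opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def forward(idx):
    x = embed(pairs[idx])        # (N, 2, 64): one embedding per operand
    return model(x.flatten(1))   # concatenate the two embeddings

def accuracy(idx):
    with torch.no_grad():
        return (forward(idx).argmax(-1) == labels[idx]).float().mean().item()

for step in range(20_000):
    opt.zero_grad()
    loss = loss_fn(forward(train_idx), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        # Watch for train accuracy saturating long before validation accuracy moves.
        print(f"step {step:6d}  train acc {accuracy(train_idx):.2f}  "
              f"val acc {accuracy(val_idx):.2f}")
```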

Understanding Bias and Hallucination

Another flaw in LLM reasoning is the occurrence of hallucinations, where models confidently produce incorrect information. In scenarios where formal logic is less applicable, AI can still exhibit worrying biases that could reinforce discrimination and inequality if misused.

Examples abound: AI loan approval algorithms appear to discriminate against ethnic minorities, facial recognition systems struggle to identify Black faces accurately, and LLMs mischaracterize certain minority groups. Attempts to address these issues have also very publicly backfired.

To some political commentators, these failed attempts represent an obsession with identity politics within tech organizations. However, the fact that these issues arise in the first place highlights the biases baked into digital data. Language, images, and opinions fed into these models are often biased, and these biases can be reflected in their output.

Bias isn't limited to social issues: a large portion of training data is fictional. Tolkien's dragons might exist in Middle-earth, but an LLM shouldn't advise someone to pack fire-retardant gear for a hike. Beyond fiction, misinformation is rampant on the internet and can easily contaminate LLMs.

Effective Strategies for LLM Implementation

A deliberate approach to cleaning training data, tailoring prompts, and applying reinforcement learning from human feedback can reduce bias. Companies should keep these levers in mind, particularly for applications that demand high accuracy and involve frequent user interaction.
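Even a deliberately crude cleaning pass makes the first of those levers concrete. The blocklist, length threshold, and duplicate check below are stand-ins for the far more elaborate quality and toxicity filters production pipelines use.

```python
# A simple data-cleaning pass: drop duplicates, very short fragments, and
# documents matching a blocklist. Real pipelines add classifier-based quality
# and toxicity filters, large-scale deduplication, and more.

BLOCKLIST = {"lorem ipsum", "click here to subscribe"}  # illustrative only

def clean_corpus(docs: list, min_words: int = 20) -> list:
    seen = set()
    kept = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        if normalized in seen:
            continue                      # exact duplicate
        if len(normalized.split()) < min_words:
            continue                      # too short to be informative
        if any(phrase in normalized for phrase in BLOCKLIST):
            continue                      # boilerplate / junk marker
        seen.add(normalized)
        kept.append(doc)
    return kept

raw_docs = [
    "Click here to subscribe to our newsletter!",
    "A long, substantive article about apple orchards... " * 5,
    "A long, substantive article about apple orchards... " * 5,  # duplicate
]
print(len(clean_corpus(raw_docs)))  # 1
```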

Conclusion

To implement LLMs effectively, it is vital to understand their reasoning processes and potential flaws. A truly robust AI reasoner would be revolutionary, even beyond the technological shift we are already experiencing. Until we achieve that, however, those enthusiastic about deploying AI must rigorously design and monitor these systems to ensure a reliable user experience. By understanding the limitations of LLMs, we can leverage them more effectively while mitigating the risks inherent in their use.