THIS CONTENT IS BROUGHT TO YOU BY UiT The Arctic University of Norway
What affects a chatbot's ability to solve logical problems?
How well do large language models solve logical problems, and what affects their ability to reason? Researchers have developed a new method to better understand when and why their reasoning falls short.
Exercise tips, dinner suggestions, or help with school or work.
Large language models (LLMs) like ChatGPT, DeepSeek, and Gemini are designed to help us with many different tasks and problems.
But how good are they at solving logical problems? And what affects their reasoning capabilities?
This is what researcher Daniel Kaiser explores in a recently published study.
He has developed a method to examine the logical problem-solving and reasoning skills of LLMs.
Uncovers hidden limitations in LLMs
While LLMs have become a useful technology with clear benefits, they are known to make mistakes – sometimes catastrophic ones.
In 2025, ChatGPT made up 11 out of 18 sources which Tromsø municipality used for a school structure report – a typical example of a hallucination.
“One should never blindly trust what LLMs say, even if it appears true or convincing. It's important to always double-check and verify their answers,” Kaiser warns.
He believes the method, named CogniLoad, can help detect and understand the limitations models face when solving logical problems.
“It's made to help us understand why certain LLMs excel or fall short on different tasks,” he says.
Not all LLMs are good at the same task
An LLM's design, such as its model size and training data, determines its ability to help us with a particular problem.
In other words, not every LLM is equally suited for the same task.
“There are huge differences in what different LLMs are capable of. Big models like ChatGPT's GPT-5 model tend to excel at advanced problems, while smaller models like Meta's LLaMA models are more suited for easier ones,” Kaiser explains.
But it isn't always obvious what certain LLMs are best or worst at.
The models' complex structure also makes it difficult to understand where potential mistakes come from.
Valuable knowledge
It's therefore important to find out what different LLMs can and cannot do, regardless of how advanced they are.
Even the most advanced models can still make mistakes, despite their confident tone.
“A test like CogniLoad can help pinpoint where a model's reasoning breaks down. This makes it possible to examine what kinds of logical mistakes LLMs make,” Kaiser says.
This knowledge is valuable in many different ways.
“We can use this information to understand what these models struggle the most with. Developers can use this to adjust their models to make them better,” he says.
LLMs are given a logical riddle
CogniLoad involves giving LLMs a logical riddle. It starts by describing a situation with several people and facts about them, like what they are wearing or what music they last listened to.
Then the model is given a series of statements that repeatedly change this situation. At the end, the chatbot is asked one specific question about a person, like what colour their socks are.
“To get it right, the chatbot has to keep track of all these changes from start to finish without making any mistakes,” Kaiser explains.
Kaiser can adjust the riddle to make it more challenging, such as by increasing its length or complexity, or by adding more irrelevant information. This tunability is designed to reveal what aspects of the riddle affect the LLM's ability to solve it.
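The paper's actual generator is more elaborate, but the basic idea can be sketched in a few lines of Python. Everything below – the function name, the sock-colour facts, and the parameters for length, number of people, and distractor density – is invented for illustration:

```python
import random

def make_riddle(n_people=3, n_updates=5, n_distractors=2, seed=0):
    """Build a toy CogniLoad-style riddle: initial facts, a chain of
    state-changing statements plus irrelevant distractors, and a question."""
    rng = random.Random(seed)
    colours = ["red", "blue", "green", "yellow"]
    people = [f"Person {i + 1}" for i in range(n_people)]

    # Initial situation: each person starts with a sock colour.
    state = {p: rng.choice(colours) for p in people}
    lines = [f"{p} is wearing {c} socks." for p, c in state.items()]

    # Updates repeatedly change the situation; the solver must
    # track every change in order to answer correctly.
    for _ in range(n_updates):
        p = rng.choice(people)
        state[p] = rng.choice(colours)
        lines.append(f"{p} changes into {state[p]} socks.")

    # Distractors add irrelevant noise without touching the state.
    for _ in range(n_distractors):
        lines.append(f"{rng.choice(people)} hums a tune.")

    target = rng.choice(people)
    question = f"What colour are {target}'s socks?"
    return "\n".join(lines), question, state[target]
```

Raising `n_updates` lengthens the chain the model must follow, while `n_distractors` adds irrelevant information – mirroring the tunable length and noise the article describes.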
The method is based on Cognitive Load Theory, which states that how hard our brain has to work affects our ability to solve different tasks.
“When we have too much to keep in mind at once, it becomes harder to reason carefully and avoid mistakes. Since AI systems like LLMs are designed to imitate human intelligence, we wanted to look at how different types of cognitive load affect an LLM's reasoning abilities,” he says.
Tested on ChatGPT, DeepSeek, and Gemini
Kaiser tested the method on 22 different LLMs – both open and commercial models like ChatGPT, DeepSeek, and Gemini.
“The point was to see what kinds of pressure these different models handle well, and what kinds make them struggle,” he says.
Findings show that the method can provide unique insight into how these LLMs process and solve logical problems – regardless of their size.
“They show that we can apply this method on all these different models to understand what affects their reasoning capabilities,” says Kaiser.
Similarities to human intelligence
The results reveal some interesting similarities between how humans and LLMs process information.
“We see that factors such as length, complexity, and noise do in fact affect the LLMs' ability to solve logical problems. Just like when humans are exposed to different forms of cognitive load,” Kaiser says.
Even the biggest LLMs struggled when the task was made longer or more difficult.
“It's a reminder that even when the best chatbots sound confident and fluent, they can still lose track of important details and end up wrong,” he says.
Model size plays an important role
Adjusting the riddle's length caused the most issues for the LLMs.
But the size of the models also plays an important role.
“The longer the puzzle got, the harder it became for many models to give an accurate answer. We see that smaller models tended to struggle sooner, while bigger models could follow the chain for longer,” he says.
But eventually, the best models started making more mistakes when the task became quite lengthy.
Kaiser observes a similar pattern when tuning the riddle's complexity.
“Accuracy also dropped off when the statements became more detailed and harder to follow,” he says.
Kaiser explains that CogniLoad is not meant to measure what LLMs already know, but to study how well they reason when encountering new information.
“It's not a test of knowledge where we quiz the LLMs about facts they're supposed to remember. In this case, we look specifically at how well the models do when facing a problem they've never seen before,” he says.
Is artificial general intelligence closer than we think?
AI systems develop rapidly, and some people fear they will soon match or surpass human intelligence – achieving so-called artificial general intelligence (AGI).
While CogniLoad doesn't provide a clear answer about the future, Kaiser's research still suggests that this imagined scenario is far beyond the horizon.
“Even puzzles that sound simple can become difficult for today's models when you make them longer and harder to follow. The riddle should actually be pretty simple for an LLM to solve, so it's quite fascinating to see that even the most advanced LLMs found it challenging when we increased the difficulty,” he says.
Both small and more advanced LLMs still have plenty of room for improvement.
“In a way, it shows how far away even today's AI models are from achieving AGI,” Kaiser laughs.
Reference:
Kaiser et al. CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density, The Fourteenth International Conference on Learning Representations (ICLR 2026), 2026. DOI: 10.48550/arXiv.2509.18458
This content is paid for and presented by UiT The Arctic University of Norway
This content is created by UiT's communication staff, who use this platform to communicate science and share results from research with the public. UiT The Arctic University of Norway is one of more than 80 owners of ScienceNorway.no.