Nope, machines still can’t think. Where does AI go now?

Exploring Apple’s Illusion of Thinking, the gap between AI marketing and reality, and data monetisation opportunities in AI.

Jun 26, 2025

This post was written by Jessica Li Gebert, who helps Neudata’s clients unlock the hidden value of their data – especially in cutting-edge AI and emerging data use cases.

If you’ve read Turing’s Computing Machinery and Intelligence (1950), you’ll realise he never answered his famous question, “Can machines think?”. Turing held that ‘machine’ and ‘think’ could not be usefully defined and focused instead on a more practical question: “Can a machine mimic human tasks such that its performance is indistinguishable from that of humans?”. And thus was born the imitation game (aka the Turing Test).

75 years on, are we closer to creating machines that can think? In this piece, I’ll dissect Apple’s latest findings in The Illusion of Thinking and discuss the implications for our broader AI and data ecosystem.

The Illusion of Thinking

Fast forward to 2024, when a UC San Diego study demonstrated that GPT-4 passed the Turing Test! But does that mean AI is thinking?

Not quite. In fact, Apple’s The Illusion of Thinking revealed the limits of current models at ‘thinking’ and raised important questions about where we’re headed with artificial general intelligence (AGI).

The Illusion study analysed the ability of three Large Reasoning Models (LRMs)1 - Claude 3.7 Sonnet (thinking), DeepSeek-R1, and OpenAI’s o1/o3 - and their non-thinking counterparts (i.e. regular large language models (LLMs)) at solving logic puzzles, and concluded that:

  • For simple puzzles, regular LLMs performed as well as, if not better than, LRMs in terms of accuracy2. In fact, LRMs tended to overthink - they continued to reason after reaching the correct answer early in their reasoning traces2, wasting compute2. In other words, LRMs are inefficient at solving simple puzzles.
  • As the complexity of puzzles increased, LRMs’ ability to use chain-of-thought reasoning gave them an advantage over regular LLMs.
  • But when complexity was dialled up further, LRMs simply gave up - they used drastically fewer thinking tokens2 and their accuracy and pass@k2 dropped to zero. They refused to ‘think’, suggesting LRMs may face a fundamental barrier to achieving generalisable reasoning.

The study also noted that our current benchmark for ‘thinking’, or intelligence3, is narrowly defined by a model’s performance at answering math questions, without assessing its reasoning traces.

In other words, we can’t ascertain if reasoning models actually think or simply use probability (prediction) and pattern recognition (regurgitation)! In human-speak, rote learning is not critical thinking. 

Now, if we understand thinking as the exercise of intelligence, what does Apple’s Illusion paper mean for the broader AI and data ecosystem?

-----

This section explains the technical terms used above. Feel free to skip to Implication 1 if you're already familiar with them.

1LRMs are a type of frontier model built on top of LLMs. LRMs are designed for problem-solving and logical reasoning, whereas LLMs are designed primarily for language understanding and generation. Reasoning models are also referred to as thinking models.

2These are some common evaluation metrics used in AI benchmarking:

  • Accuracy (of solution) measures the correctness of a model’s performance, i.e. the number of correct answers divided by the total number of attempts.
  • Reasoning trace refers to the thought process behind a model’s response. Analysing where the correct answer first appears in a reasoning trace allows us to ascertain whether a model is overthinking or oversimplifying.
  • Compute refers to the hardware and software resources required to run a model. 
  • Thinking tokens measure how much compute a model uses in its reasoning, i.e., how much effort a model puts into reasoning.
  • Pass@k measures the probability of a model producing at least one correct solution out of k generated. This metric reflects how we use AI in practice, where we often generate multiple outputs before picking the one we find most suitable.
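
The pass@k definition above can be made concrete. Below is a minimal sketch of the standard unbiased estimator popularised by OpenAI's Codex evaluation - an assumption on my part, since the Illusion paper doesn't prescribe an implementation: given n generated solutions of which c are correct, pass@k is the probability that a random subset of k contains at least one correct solution.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn (without replacement) from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        # Fewer than k incorrect generations exist, so any k-subset
        # must contain a correct one.
        return 1.0
    # P(all k samples are incorrect) = C(n-c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations, 3 correct; a user who inspects 5 of them
# is very likely to find at least one correct answer.
print(pass_at_k(n=10, c=3, k=5))
```

Note how the metric rewards a model that is right only occasionally, as long as the user samples enough outputs - which is exactly why accuracy and pass@k dropping to zero together is such a strong signal of collapse.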

3 François Chollet’s formulation of ‘intelligence’ is currently my favourite. I’ve been following his ARC-AGI benchmark for a while and will discuss it in a future post.

-----

Implication 1: We are a long way from AGI and we may not even be on the right path

To date, most AGI research has been built on LLMs, which represent just one narrow branch of the broader AI landscape. LLMs could be a path to AGI, or not. At this time, they just happen to be the most commercially available and visible form of AI. As Karen Hao puts it in Empire of AI:

“Nothing about this form of AI [LLMs and other GenAI applications] coming to the fore or even existing at all is inevitable; it is the culmination of thousands of subjective choices, made by the people who had the power to be in the decision-making room.”

Maybe those powerful people made the wrong decisions? Maybe our path to AGI is yet to be written?

We shall see. 

Moreover, have you ever wondered why we are constantly inundated with talks about AGI? While it captures headlines and imaginations, it’s also a convenient PR narrative.

AGI diverts our attention from what Big Tech isn't saying. Behind the scenes, the real business of AI raises tough questions – from environmental impact to labour concerns. I’ll explore the ethics behind it all in a future post.

Implication 2: Artificial narrow intelligence is where market growth lies! 

Artificial narrow intelligence (ANI) is built for specific tasks: think LLMs for text generation, image generators (i.e. generative AI, or GenAI), resume-screening tools, and speech and facial recognition.

ANI may not sound as cool as AGI, but it’s where market growth lies, because ANI is technologically and commercially attainable, and far from market saturation. Since OpenAI launched ChatGPT in November 2022, enterprise AI adoption has grown rapidly. According to McKinsey’s latest Global Survey on AI, as of July 2024, 78% of respondents had implemented AI in at least one business function, up from 55% in 2023.

Going forward, I expect to see a continued upward trajectory and here’s why:

ANI - especially LLMs - has become more reliable and usable

Chief among ANI models, LLMs are highly relevant to enterprise use cases, as most of our jobs involve communicating in natural language. So, unsurprisingly, improvements in LLMs - driven by high-quality training datasets and advanced architectures - translate into wider enterprise AI adoption.

For context, compare GPT-3.5 Turbo (March 2023) with GPT-4.5 (February 2025). GPT-3.5 Turbo scored 69.8% on the MMLU benchmark4, significantly lower than GPT-4.5’s 90.8% - a drastic improvement in the models’ ability to understand our languages. (Benchmark source: llm-stats.com)

4 The MMLU benchmark stands for Massive Multitask Language Understanding, a commonly used LLM benchmark in the industry. It measures a model’s reasoning and general language understanding abilities.

More cost-efficient LLMs

The costs of implementing an enterprise LLM solution include LLM tokens5, model hosting, security, implementation and support. 

While the latest LLM versions have costlier tokens, their improved reasoning abilities mean fewer generations - and thus fewer tokens - are required to get a desired response. This makes newer LLMs more cost-efficient. Moreover, the AI hosting landscape has evolved, too. In early 2023, the enterprise AI hosting market was dominated by Microsoft Azure and AWS. Today, Google Cloud Vertex AI, Hugging Face, Fireworks, and others have democratised the hosting market and brought down hosting costs.

With more value per token and decreasing hosting costs, the overall implementation cost has gone down, making it more affordable for enterprises to adopt AI solutions.
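
To see how “more value per token” can outweigh a higher per-token price, here is a toy calculation. All figures - prices, token counts, and attempts needed - are illustrative assumptions, not real vendor rates:

```python
def cost_per_task(price_per_1m_tokens: float, tokens_per_attempt: int,
                  attempts_needed: int) -> float:
    """USD cost to reach one acceptable response, given a per-million-token
    price, tokens consumed per attempt, and attempts until success."""
    return price_per_1m_tokens * tokens_per_attempt * attempts_needed / 1_000_000

# Hypothetical older model: cheap tokens, but verbose and often wrong,
# so several attempts are needed per task.
older = cost_per_task(price_per_1m_tokens=2.0, tokens_per_attempt=1_500,
                      attempts_needed=4)

# Hypothetical newer model: 4x the token price, but concise and usually
# right on the first attempt.
newer = cost_per_task(price_per_1m_tokens=8.0, tokens_per_attempt=800,
                      attempts_needed=1)

print(f"older: ${older:.4f} per task, newer: ${newer:.4f} per task")
```

Under these made-up numbers the pricier model still costs roughly half as much per completed task - the cost that actually matters to an enterprise.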

5 A token is the unit of information that AI models process; all input and output data is expressed as tokens. Model usage is typically priced per 1,000 (1k) or 1 million (1m) tokens.

The good ol’ FOMO

By now, most of us have experienced firsthand the productivity gains from AI tools. Enterprises that don’t implement AI risk losing out in the long run.

The growth of ANI adoption

Where does future ANI adoption lie? I expect AI to grow in two ways: increasing automation in business functions, and domain-specific AI applications.

Automation trends in business functions:

  • Risk, legal, compliance and finance: automated research, structured report writing, note and minutes-taking
  • Software development: data cleaning and coding
  • Strategy, sales/marketing, market research, business intelligence: multimodal analysis by analytical AI models 
  • Operations and manufacturing: operational analytics, such as maintenance predictive analytics, demand forecasting, resource planning
  • Sales and customer success: better chatbots with diverse and localised language understanding
  • Agentic AI: AI that does not just answer one question at a time, as current LLMs do, but can run a sequence of actions to complete a pre-defined job.

Industries that are highly structured yet demand meticulous analysis are leading the way in domain-specific AI development:

  • Healthcare: drug development, medical imaging, diagnostics
  • Finance: fundamental investment analysis, fraud detection/KYC/AML, risk management, personal wealth management
  • Legal: legal research, contract analysis
  • Supply chain and logistics: demand forecasting, route planning
  • Consumer and retail: inventory management, fulfilment, demand analysis, consumer AI products
  • Telecom: network optimisation

Next steps for business leaders

For business leaders, this means two things for your AI+data strategy:

  • If you haven’t implemented AI in those functions, you have to start now or risk falling behind!
  • If you have domain-specific enterprise data in these industries, you might just become the most popular kid in town! If you recall my AI data deals mid-year review, model makers and application developers are hungry for training data!

If you're unsure how to make the most of your AI+data strategy, reach out to consulting@neudata.co to discuss how to turn your cost centre into a revenue stream!
