A rose by any other name
October 14, 2024 1:39 AM

Apple Intelligence engineers don't think that LLMs reason, and have described the tests they ran to see whether that reasoning is reliable and predictable (and useful for an Apple-style 'it just works' product). Here's a summary rolled up from exTwitter by Threadreader, and this is the arXiv.org pre-print. More analysis of why this is a bad thing at Gary Marcus' Substack: LLMs don’t do formal reasoning - and that is a HUGE problem.

In particular, adding irrelevant data to the input or swapping names of things (rose, schmose) caused unintended variability in the LLM's response.
posted by k3ninho (39 comments total) 23 users marked this as a favorite
 
That which we call a rose
By any other name would confuse an LLM
posted by Pallas Athena at 1:48 AM on October 14 [7 favorites]


I love the new attack vectors created by LLMs:

Imagine some legal entity Larry reads contracts by asking an LLM U to turn the contract into bullet points. An adversary Eve creates a sufficiently different LLM V and exploits the LLMs being largely linear to find an x such that U(x) gives good bullet points and V(x) gives bad bullet points. As a human court only cares what x itself says, Eve selects some x which looks passably human-generated and better expresses the bad points.

Now Larry signs the harmful contract x. As Eve's work creating V and x remains secret, Larry cannot convince the court that Eve created x in bad faith. In essence, LLMs have wide-ranging dog whistles that're undetectable by any human.
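(A toy sketch, just to make the shape of the attack concrete; the candidate wordings and the two keyword scorers below are made-up stand-ins for real models U and V:)

```python
# Toy illustration only: Eve searches candidate contract wordings for one
# that the victim's summariser (U) rates as benign while a second reading
# (V, or a careful human) flags as harmful.  The scorers below are invented
# keyword checks standing in for real LLMs.

CANDIDATE_WORDINGS = [
    "Either party may terminate with 30 days written notice.",
    "Termination requires the counterparty's consent, which shall not be "
    "unreasonably withheld, save at its sole discretion.",
    "The agreement renews annually unless cancelled in writing.",
]

def looks_benign_to_U(text: str) -> int:
    # Stand-in for "U(x) produces reassuring bullet points".
    benign_cues = ("terminate", "30 days", "cancelled", "consent")
    return sum(cue in text.lower() for cue in benign_cues)

def flagged_as_bad_by_V(text: str) -> int:
    # Stand-in for "V(x) surfaces the harmful reading".
    harmful_cues = ("sole discretion", "irrevocable", "save at")
    return sum(cue in text.lower() for cue in harmful_cues)

# Eve keeps the wording that maximises "reads fine to U" plus "actually bad per V".
chosen = max(CANDIDATE_WORDINGS,
             key=lambda x: looks_benign_to_U(x) + flagged_as_bad_by_V(x))
print(chosen)
```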
posted by jeffburdges at 2:08 AM on October 14 [2 favorites]


The coverage of this paper is super frustrating. It's like we have an anti-LLM hype to go with the LLM hype (see Gary Marcus headline) and they're both tiresome. What's most interesting here is that they've demonstrated that the GSM8k benchmark is flawed. That's a big deal. But no one should be remotely surprised that you can produce different results by varying the inputs.
posted by hoyland at 3:22 AM on October 14 [4 favorites]


So something that is worth mentioning if you didn't read the paper (which is extremely readable): Not all models do worse under stress testing! The "best" models did fairly well through each of the attempted modifications. The example from the fpp (e.g. name or number swapping) has almost no effect on the accuracy of some of the models. So while you can take away that all LLMs are smoke and mirrors, something *else* actually emerges from the data that is potentially scarier: through engineering, models are advancing to remove the weaknesses demonstrated in this paper!

Note that I'm not making a value judgement on whether model advances are good or bad, but the takeaway from this paper should not be "LLMs cannot do elementary school math reasoning", but rather "some LLMs cannot do elementary school math reasoning", with the added note that dedicated engineering appears to improve model capabilities.

Some of the weaknesses demonstrated in the paper, like being thrown off by irrelevant added information (the "NoOp" test), correspond to skills that humans need to be taught. There is some evidence in this paper that it appears to be possible to teach these models to do just that. Trendlines might be more insightful here than point-in-time measurements.

The authors of the paper touch on this in the appendix (A.5), saying: "they still share similar limitations with the open models". However, comparing the trend lines of the last column / last 4 rows of Table 1 tells a different story than what they wrote :/
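(For concreteness, roughly what the name/number-swapping stress test amounts to as a test harness; the template, the names, and the ask_model placeholder are mine, not the paper's:)

```python
# Rough sketch of the perturbation idea: turn one GSM8K-style word problem
# into many variants by swapping names and numbers, then measure how stable
# a model's accuracy is across them.
import random

TEMPLATE = ("{name} picks {a} apples on Monday and {b} more on Tuesday. "
            "How many apples does {name} have in total?")

def make_variant(rng: random.Random):
    name = rng.choice(["Sophie", "Liam", "Priya", "Mateo"])
    a, b = rng.randint(2, 60), rng.randint(2, 60)
    return TEMPLATE.format(name=name, a=a, b=b), a + b

def accuracy(ask_model, n=50, seed=0):
    rng = random.Random(seed)
    correct = sum(ask_model(q) == ans
                  for q, ans in (make_variant(rng) for _ in range(n)))
    return correct / n

# Demo with a fake "model" that always answers 10, just to show the harness runs:
print(accuracy(lambda question: 10))
```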
posted by litghost at 3:51 AM on October 14 [2 favorites]


No, the frustrating thing is people who think that what is essentially a regression analysis and prediction engine is engaged in anything like "thinking" or "reasoning". All an LLM is doing is predicting the next word in a sequence, based on a complex statistical model and the inputs it's already seen. It is no more "reasoning" than fitting a line to a data set and extrapolating for theoretical inputs is "reasoning", and the fact that people keep talking about them like they're something more than just statistical models of language is aggravating and, frankly, dangerous.
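(If it helps to see how bare the core loop is, here's a deliberately tiny stand-in: a bigram counter doing "predict the next word from statistics of the text seen so far". Real LLMs use transformers over subword tokens, but the prediction loop has the same shape:)

```python
# A toy next-word predictor: count which word follows which, then always
# emit the statistically most likely continuation.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept on the mat".split()
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word(prev: str) -> str:
    # Pick the most frequent continuation of `prev` seen in the corpus.
    return counts[prev].most_common(1)[0][0]

word, output = "the", ["the"]
for _ in range(5):
    word = next_word(word)
    output.append(word)
print(" ".join(output))  # "the cat sat on the cat"
```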
posted by parm at 3:55 AM on October 14 [28 favorites]


Ok, fine, so they can't necessarily be trusted, they're unreliable, sure I get it. But what about China?? We can't let China win the race for AI.... I mean, I guess what I'm trying to say is:

THERE MUST NOT BE A MINE SHAFT GAP!
posted by Smedly, Butlerian jihadi at 4:09 AM on October 14 [2 favorites]


language models model language. we really need a paper for this?
posted by AlbertCalavicci at 4:15 AM on October 14 [2 favorites]


It's like we have an anti-LLM hype to go with the LLM hype

The difference is that anti-LLM hype isn't going to get hundreds of billions of dollars in funding; distort global power markets; put millions of people out of work; be used exclusively for pointless toys and automated authoritarianism; poison the already murky well of things generally accepted to be true; and freeze in place the mores and social structures of early twentieth century (mostly) Americans.
posted by thatwhichfalls at 4:30 AM on October 14 [35 favorites]


Re parm's comment: You all are as good as anyone to ask: What is the magical thing that "thinking" or "reasoning" is? As others have commented before, the amazing thing about LLMs isn't that they magically can do everything, it's that they do so much with such simple techniques -- plus massive scale! I would never have guessed that so much of what we would have called creativity and fluency (at least before the goalposts moved) could be done with transformers. (BTW, I don't mean to be pejorative against moving goalposts: it's a natural way for us to refine our understanding!)

As litghost mentions, there is lots of work going on to do slightly more complicated techniques. For example, OpenAI's new "reasoning" models mechanically do some of the things mentioned (internal chain of thought); see https://platform.openai.com/docs/guides/reasoning. So back to my starter question: what is the magical thing that "thinking" is... without reductio ad hominem, to mangle a phrase?
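(On the mechanics side, a minimal sketch of calling one of those reasoning models, assuming the official openai Python client and that a model name like "o1-preview" is available to your account; the internal chain of thought happens server-side and isn't returned:)

```python
# Minimal sketch: assumes OPENAI_API_KEY is set in the environment and that
# the "o1-preview" model name is available.  Only the final answer comes back.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user",
               "content": "A jug holds 3 litres and another holds 5. "
                          "How can I measure exactly 4 litres?"}],
)
print(response.choices[0].message.content)
```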
posted by thandal at 4:38 AM on October 14 [6 favorites]


So back to my starter question: what is the magical thing that "thinking" is... without reductio ad hominem, to mangle a phrase?

I'm not trying to sell you anything; this is a question for the people trying to sell you something, isn't it?

When you find one, I want an answer to this question that's an embodied, natural consequence of evolution crossed with the cultural outcomes that sustain community and society. We can easily be tricked by what we think the drivers of both evolution and culture are, based on our own incomplete knowledge. I suspect, also, that we're tricked by a culture of 'single great men in history' who bootstrapped themselves (from inheritance, with hard graft), so that we're looking for a single groundbreaking ML that's independently intelligent and will grow by itself. But we're communal beings in an individuating culture, which is why I'm unsurprised we don't have more words for gestalt and zeitgeist that might help dig out unexpected ways to describe what this magical thing called "thinking" is.
posted by k3ninho at 5:56 AM on October 14 [2 favorites]


what is the magical thing that "thinking" is... without reductio ad hominem

Seems unlikely that there's just one magical thing called "thinking" that only we can do, and that current or near-future AIs just can't. We've learned a lot of tricks, as a species, and we have a few thousand years of written culture that we can train new humans on. We can do a lot of discrete things (add numbers together! play chess! draw pictures! write poetry!), and over time we've built machines that can do incrementally more of those things, to the same or a higher standard than we know how to do them. That process will continue - it's futile to retreat further & further back towards a special human-only trick that only we know how to do - partly because there's no wide agreement on what that special trick is, but also because as soon as we ever come to understand any of the tricks we've learned in repeatable detail, we're already half-way to the goal of modelling it algorithmically & giving it to an AI to do for us.

Instead it seems to me more likely that the special human-only trick (for now) is conscious self-awareness.
posted by rd45 at 6:00 AM on October 14 [1 favorite]


what is the magical thing that "thinking" is

I'm certainly tempted to plug this into ChatGPT and see if it spits out anything resembling Heidegger.

this is a question for the people trying to sell you something

I also find Aristotle annoying.
posted by snuffleupagus at 6:08 AM on October 14 [1 favorite]


There is no autocorrect for the human spirit.
posted by johngoren at 6:09 AM on October 14 [1 favorite]


to the same or a higher standard than we know how to do them

Just faster, mostly. Much, much faster. And without tiring. And that, at scale. Which enables things that wouldn't otherwise be practical.
posted by snuffleupagus at 6:13 AM on October 14 [2 favorites]


I'll believe AI is reliable when one of these tech billionaires uses it to do their taxes.
posted by AlSweigart at 6:24 AM on October 14 [2 favorites]


Not sure that's a great benchmark. Billionaires are less likely to get audited, at least judging by how much they already get away with, and can afford to get it wrong.

Maybe when they use only AI lawyers in their divorce proceedings and to compose their prenups.
posted by trig at 6:28 AM on October 14 [3 favorites]


rd45: Instead it seems to me more likely that the special human-only trick (for now) is conscious self-awareness.
I think plenty of animals have this; I think it's a fleshy thing that helps us survive. But we've pretty much decided against having ML fight for resources and survival -- 'cept by VC funding -- because all fleshy life would lose out.
posted by k3ninho at 6:43 AM on October 14


AlSweigart: I'll believe AI is reliable when one of these tech billionaires uses it to do their taxes.

Try this tax reporting case study by Kortical, AlSweigart. For a well-constrained set of laws and standing judicial decisions, there's a stable workflow to optimise tax obligations, so ML tools can do that reliably and repeatedly.
posted by k3ninho at 6:45 AM on October 14 [1 favorite]


"Look, Dave, I can see you’re really upset about this. I honestly think you ought to sit down calmly, take a stress pill and think things over. I know I’ve made some very poor decisions recently, but I can give you my complete assurance that my work will be back to normal. I’ve still got the greatest enthusiasm and confidence in the mission, and I want to help you!"
posted by jabah at 7:08 AM on October 14 [1 favorite]


The reason I don't participate in Metafilter LLM conversations is because they're dominated by people repeating their superficial and ax-grinding perspectives on subjects they don't understand (but think they do).

I have no idea what you are talking about.
posted by y2karl at 7:14 AM on October 14 [1 favorite]


Most of you need to go back to school

Look, I'm pretty sure there are no classes in school that teach you how to do math word problems.
posted by mittens at 7:33 AM on October 14 [2 favorites]


People keep going on about this Internet thing, calling it dumb names like 'information superhighway', but it's really loud when you connect to it and you can't talk to other people at the same time unless you pay for an extra line, and it takes like two hours to download a tiny video of a dancing baby, and anyway none of my friends are on it because it's dumb. And now because of all the hype there's like twenty companies trying to do pet food delivery using the Internet somehow.

Check out this new study that says that you can't make the dancing baby download faster over a phone line.
posted by kaibutsu at 7:54 AM on October 14 [4 favorites]


"what is the magical thing that "thinking" is... without reductio ad hominem"

I find Hofstadter's concept of a "Strange Loop" meaningful for explaining this.

https://philosophynow.org/issues/78/I_Am_A_Strange_Loop_by_Douglas_Hofstadter
[snip]
Perhaps Hofstadter’s most intriguing argument is that the complexity and extensibility of active symbols in the brain inevitably leads to the same kind of self-reference which Gödel proved was inherent in any complex logical or arithmetical system. In a nutshell, Gödel showed that mathematics and logic contain ‘strange loops’: propositions that not only refer to mathematical/logical truths, but also to the symbol systems expressing those truths. This recursiveness inevitably leads to the sort of paradoxes seen in statements such as ‘This statement is false’.

Hofstadter argues that the psychological self arises out of a similar kind of paradox. We are not born with an ‘I’ – the ego emerges only gradually as experience shapes our dense web of active symbols into a tapestry rich and complex enough to begin twisting back upon itself. According to this view the psychological ‘I’ is a narrative fiction – a point that Wittgenstein made when he argued that the ‘I’ is not an object in the world, but a precondition for there being a world in the first place. “It is the ‘I’, it is the ‘I’, that is deeply mysterious!” exclaimed Wittgenstein.
posted by aleph at 7:59 AM on October 14 [2 favorites]


Mod note: One comment deleted. You can express your criticism without attacking and mischaracterizing other members. Please refer to the content policy.
posted by loup (staff) at 8:36 AM on October 14 [2 favorites]


> what is the magical thing that "thinking" is

I would say that it's simulating things based on memories from the world, internal and external. That said, I'd rather not think about what that then means I would have to argue about LLMs in order to not be a philosophical hypocrite.
posted by lucidium at 8:47 AM on October 14 [1 favorite]


what is the magical thing that "thinking" is... without reductio ad hominem

I don't know the answer, but it's probably close to some of the scientific theories mentioned in the long-but-very-readable survey article from Robert Lawrence Kuhn, A landscape of consciousness: Toward a taxonomy of explanations and implications. It's free to read and does a good job of covering the relevant theories.

I don't personally think that full consciousness is required to reach what most people describe as "thinking", but LLMs definitely don't meet the bar. As mentioned above, LLMs are inherently very linear, so they kind of get "stuck" on one train of thought, whereas true thinking involves predicting/simulating what would happen in multiple possible worlds and picking the best path forward using those possibilities. Some of the newer OpenAI models are attempting to do this, and I think the ability to process multiple possible worlds in parallel and then make decisions based on them is key to achieving actual intelligence. Traditional logical AI tends to be better at decision making than LLMs, so there must be ways of combining them together.

The other big difference between all current AI models and the thinking of higher animals like humans is the integration of real-time feedback. As we interact with the world our brains are being modified in real time by a large number of different sensory streams, creating a large number of fast-acting feedback loops that we use in our thinking (I wrote up my thoughts on how the brain is closer to a game engine than traditional AI a few weeks ago). In contrast, current AI models are many thousands of times slower at integrating feedback and have a very small number of "sensory" streams. We're honestly still very early in understanding how the real-time nature of human intelligence works (because it is very hard to study with most brain scanning technologies), but as we get better at that we will come to understand how "thinking" actually works.
posted by JZig at 8:52 AM on October 14 [3 favorites]


lucidium, I like your phrase "I would say that it's simulating things based on memories from the world, internal and external." as it's a shorter version of what I was writing at the same time. There is decent evidence that human consciousness evolved in order to simulate and predict the thoughts of other people, as we have always lived in social groups where that would be useful. We got so good at simulating the memories of other people that we learned how to simulate and think about our own memories, and the parts of our brain that integrate those simulations into coherent narratives evolved into something capable of true consciousness.
posted by JZig at 8:57 AM on October 14 [2 favorites]


I think it might be useful to draw some distinct lines between "consciousness," "thinking," and "reasoning," mostly so we can push those first two away for the purpose of understanding the finding. Because I think everyone in the world is mostly agreed that the first one isn't what LLMs do, and with some convincing, most people can agree that the second isn't what they do. But the question about reasoning is different.

If you read enough math problems and their solutions, would you be able to extract some logical steps so you could solve an original math problem? It feels like the answer should be 'yes.' To use Gary Marcus' example about chess, could you absorb enough chess games to get what's going on, without anyone ever explicitly saying how a knight moves? Again...it kinda seems like it? If you remember learning chess, you remember someone trying to explain the moves to you and they probably didn't make a lot of sense at first, so what if you just removed all that, and learned by watching a lot of games?

So it's interesting that that doesn't seem to be working, because early on we had that example where an LLM could learn Othello, and it was really suggestive that there were deep things going on that would mirror our own reasoning. You'd hope this didn't come down to some simple distinction between syntax and semantics because it'd be really really neat if our language created within itself a reasoning system. Alas!!! But good for Apple!
posted by mittens at 9:07 AM on October 14 [1 favorite]



The reason I don't participate in Metafilter LLM conversations is because they're dominated by people repeating their superficial and ax-grinding perspectives on subjects they don't understand (but think they do).


The reason I get frustrated participating in any metafilter thread discussing any academic paper is that we get floods of comments from people who 1) think they know what they're talking about but have no idea and 2) question the whole idea of scientific inquiry.


language models model language. we really need a paper for this?

did we really need this comment?
posted by MisantropicPainforest at 9:10 AM on October 14


Appreciate the responses -- and links -- that folks have posted. My personal view of the things that are most lacking (which isn't an answer to my own question, but maybe points toward an answer):
* A real hierarchy of memory (working <> short-term <> long-term <> dreaming)
* Explicit simulation with alternatives / empathy ("if that person is acting mad because they're in a hurry, then they'll want me to be quick!")
* Grounding, physical and temporal embodiment

... also the comments in this thread about thinking being communal... really ring a lot of bells.
posted by thandal at 9:36 AM on October 14


I'm not sure I'm allowed to comment here as I don't know a ton about AI. But I do know that I live on Planet Earth, and the mind-numbing power consumption of these chatbots supercharging climate change isn't really a good thing, cost-benefit wise.

But what do I know, I guess I should just leave it to the super genius oligarchs who invented Apps and now have more money than the Pharaohs. I'm sure they have all our best interests at heart (given the 40,000 stories on how chatbots will make miracles out of medical research ..... and making crazy pictures, I guess).
posted by WatTylerJr at 9:48 AM on October 14 [3 favorites]


I've been warming to the strange loop concept. I've been trying to think about how to formalize (or represent algorithmically) a certain notion, and I don't have it yet. The general idea is that in a reasoning machine, a proposition might take on a couple of roles (each with its own natural representation), and it seems like the machine would be most powerful if it could translate a proposition that starts in one role into an equivalent or corresponding proposition in the other role, but ensuring that these translations are consistent may be a bit of a bear (if, Gödel willing, it is even possible...)

One proposition-role is that of a labelled atom. Another proposition-role is that of a constraint on the admissible combinations of labelled atoms. If propositions are characterized by degrees of belief rather than binary true or false, that proposition-role could instead be "action to enforce consistency among a set of proposition-beliefs".
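(A very loose sketch of those two roles in a toy propositional setting; the atoms and the belief-gating rule are invented purely to illustrate translating one role into the other:)

```python
# Very loose sketch: the same proposition ("rain implies wet ground") treated
# first as a labelled atom you can believe or not, and second as a constraint
# that filters which combinations of the other atoms are admissible.
from itertools import product

atoms = ["rain", "wet_ground"]
implication_atom = "rain_implies_wet_ground"   # role 1: a labelled atom

def implication_constraint(world: dict) -> bool:
    # Role 2: the same proposition as a constraint on assignments.
    return (not world["rain"]) or world["wet_ground"]

# Translating role 1 into role 2: enforce the constraint only in worlds that
# believe the implication-atom.
admissible = []
for values in product([False, True], repeat=3):
    world = dict(zip(atoms + [implication_atom], values))
    if not world[implication_atom] or implication_constraint(world):
        admissible.append(world)

print(len(admissible))  # 7 of the 8 candidate worlds survive
```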
posted by a faded photo of their beloved at 9:50 AM on October 14


Strange loops were actually covered in my freshman “core sequence” class, which was basically an epistemology course but covered a lot on the history of science and art. This was before Facebook existed, and Wikipedia was just getting going. I am curious how they teach it now.
posted by CostcoCultist at 10:07 AM on October 14


In an unfettered model, there is nothing to prevent an AI agent from going out into the wild and polling humans it trusts to supply reasoning to enhance or build on. It would represent a new form of thinking. A legal argument would be a good example.
posted by Brian B. at 11:02 AM on October 14 [1 favorite]


LLMs don't do formal reasoning, but you can train on formal statements and hook the model up to a proof checker. (It's still not easy to translate from stupid human language to Lean 4, so this probably doesn't scale to more complex problems.)
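(For a sense of what "formal statement plus checker" looks like, two trivial Lean 4 examples that the proof checker verifies mechanically, regardless of who or what wrote them:)

```lean
-- Two trivial formal statements a proof checker verifies mechanically,
-- independent of who (or what) produced them.
example : 2 + 2 = 4 := rfl

example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```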
posted by credulous at 12:14 PM on October 14 [1 favorite]


Yo thatwhichfalls, you forgot obtaining power by reopening nuclear disaster sites and rushing new nuclear reactor models into production.
posted by jeffburdges at 2:18 PM on October 14


I'm extremely unimpressed by high resource consumption, so right now I'd wager fungal computers from the Unconventional Computing Laboratory (UCL) turn out more transformative than LLMs, not because fungus creates better computers, but because they'll reveal more interesting & important new things.

It's true however that LLMs should cause some real breakthroughs in artificial intelligence, if only because of the scale of money being spent. That doesn't mean they'll get too close to actual reasoning anytime soon though.
posted by jeffburdges at 3:20 PM on October 14


> I'll believe AI is reliable when one of these tech billionaires uses it to do their taxes.

Try this tax reporting case study by Kortical, AlSweigart.

Oh, I'm sure there are lots of companies offering "AI" tax services and giving all kinds of assurances.

But the quote from the Deloitte partner on that page is, "The speed of results and very high levels of accuracy achieved by Kortical through the deployment of the Kortical platform were impressive."

It is not, "I used this for my own taxes."
posted by AlSweigart at 6:48 PM on October 14 [1 favorite]


Is it reasoning? I don't know, is a cootie catcher alive? Come on, man.
posted by kittens for breakfast at 7:56 PM on October 14

