A screenshot of this question was making the rounds last week, but this article covers testing against all the well-known models out there.

Also includes outtakes on the ‘reasoning’ models.

  • elbiter@lemmy.world · 7 days ago

    I just tried it on Brave’s AI

    The obvious choice, said the motherfucker 😆

    • Jax@sh.itjust.works · 7 days ago

      Dirtying the car on the way there?

      The car you’re planning on cleaning at the car wash?

      Like, an AI not understanding the difference between walking and driving almost makes sense. This, though, seems like such a weird logical break that I feel like it shouldn’t be possible.

      • _g_be@lemmy.world · 7 days ago

        You’re assuming AI “thinks” “logically”.

        Well, maybe you aren’t, but the AI companies sure hope we do.

        • Jax@sh.itjust.works · edited · 7 days ago

          Absolutely not, I’m still just scratching my head at how something like this is allowed to happen.

          Has any human ever said that they’re worried about their car getting dirtied on the way to the carwash? Maybe I could see someone arguing against getting a carwash, citing it getting dirty on the way home — but on the way there?

          Like you would think it wouldn’t have the basis to even put those words together that way — should I see this as a hallucination?

          Granted, I would never ask an AI a question like this — it seems very far outside of potential use cases for it (for me).

          Edit: oh, I guess it could have been said by a person in a sarcastic sense

          • _g_be@lemmy.world · 6 days ago

            You understand the context and can implicitly understand the need to drive to the car wash, but these glorified auto-complete machines will latch on to the “should I walk there” and the small distance quantity. It even seems to parrot words about not wanting to drive after having your car washed. There’s no ‘thinking’ about the whole thought, and apparently no logical linking of two separate ideas.

            • Jax@sh.itjust.works · 7 days ago

              I guess I’ll know to be impressed by AI when it can distinguish things like sarcasm.

  • WraithGear@lemmy.world · 7 days ago

    And what is going to happen is that some engineer will band-aid the issue, all the AI-crazy people will shout “see! it’s learnding!”, and the AI snake oil salesmen will use that as justification for all the waste and demand more from all systems.

    Just like what they did with the full glass of wine test. And no, AI fundamentally did not improve; the issue is fundamental to its design, not an issue with the data set.

    • turmacar@lemmy.world · 7 days ago

      Half the issue is they’re calling 10 in a row “good enough” to treat it as solved in the first place.

      A sample size of 10 is nothing.

    • mycodesucks@lemmy.world · 6 days ago

      Yes, but it’s going to repeat that way FOREVER, the same way the average person got slow-walked, hand in hand with a mobile operating system, into corporate social media and app hell, taking the entire internet with them.

  • CetaceanNeeded@lemmy.world · 6 days ago

    I asked my locally hosted Qwen3 14B, it thought for 5 minutes and then gave the correct answer for the correct reason (it did also mention efficiency).

    Hilariously one of the suggested follow ups in Open Web UI was “What if I don’t have a car - can I still wash it?”

    • WolfLink@sh.itjust.works · edited · 6 days ago

      My locally hosted Qwen3 30b said “Walk” including this awesome line:

      Why you might hesitate (and why it’s wrong):

      • X “But it’s a car wash!” -> No, the car doesn’t need to drive there—you do.

      Note that I just asked the Ollama app, I didn’t alter or remove the default system prompt nor did I force it to answer in a specific format like in the article.

      EDIT: after playing with it a bit more, qwen3:30b sometimes gives the correct answer for the correct reasoning, but it’s pretty rare and nothing I’ve tried has made it more consistent.

  • Bluewing@lemmy.world · 7 days ago

    I just asked Google Gemini 3: “The car is 50 miles away. Should I walk or drive?”

    In its breakdown comparison between walking and driving, under walking the last reason to not walk was labeled “Recovery: 3 days of ice baths and regret.”

    And under reasons to walk, “You are a character in a post-apocalyptic novel.”

    Methinks I detect notes of sarcasm…

    • driving_crooner@lemmy.eco.br · 7 days ago

      Gemini 3 pro said that this was a “great logic puzzle” and then said that if my goal is to wash the car, then I need to drive there.

    • humanspiral@lemmy.ca · 6 days ago

      In Google AI mode, asked “With the meme popularity of the question ‘I need to wash my car. The car wash is 50m away. Should I walk or drive?’ what is the answer?”, it does get it perfect, with a succinct explanation of why AI can get fixated on 50m.

    • XeroxCool@lemmy.world · 7 days ago

      I feel like we’re the only ones who expect “all-knowing information sources” to write more seriously than these edgelord-level rizzy chatbots do, and yet here they are, blatantly proving they are chatbots that should not be blindly trusted as authoritative sources of knowledge.

  • jaykrown@lemmy.world · 6 days ago

    Interesting, I tried it with DeepSeek and got an incorrect response from the direct model without thinking, but then got the correct response with thinking. There’s a reason why there’s a shift towards “thinking” models, because it forces the model to build its own context before giving a concrete answer.

    Without DeepThink

    With DeepThink

  • melfie@lemy.lol · edited · 6 days ago

    Context engineering is one way to shift that balance. When you provide a model with structured examples, domain patterns, and relevant context at inference time, you give it information that can help override generic heuristics with task-specific reasoning.

    So the chat bots getting it right consistently probably have it in their system prompt temporarily until they can be retrained with it incorporated into the training data. 😆

    Edit:

    Oh, I see the linked article is part of a marketing campaign to promote this company’s paid cloud service, which has source-available SDKs, as a solution to the problem being outlined here:

    Opper automatically finds the most relevant examples from your dataset for each new task. The right context, every time, without manual selection.

    I can see where this approach might be helpful, but why is it necessary to pay them per API call as opposed to using an open source solution that runs locally (aside from the fact that it’s better for their monetization this way)? Good chance they’re running it through yet another LLM and charging API fees to cover their inference costs with a profit. What happens when that LLM returns the wrong example?

  • melfie@lemy.lol · edited · 6 days ago

    My kid got it wrong at first, saying walking is better for exercise, then got it right after being asked again.

    Claude Sonnet 4.6 got it right the first time.

    My self-hosted Qwen 3 8B got it wrong consistently until I asked it how it thinks a car wash works, what is the purpose of the trip, and can that purpose be fulfilled from a distance. I was considering using it for self-hosted AI coding, but now I’m having second thoughts. I’m imagining it’ll go about like that if I ask it to fix a bug. Ha, my RTX 4060 is a potato for AI.

    • BluescreenOfDeath@lemmy.world · 6 days ago

      There’s a difference between ‘language’ and ‘intelligence’ which is why so many people think that LLMs are intelligent despite not being so.

      The thing is, you can’t train an LLM on math textbooks and expect it to understand math, because it isn’t reading or comprehending anything. AI doesn’t know that 2+2=4 because it’s doing math in the background, it understands that when presented with the string 2+2=, statistically, the next character should be 4. It can construct a paragraph similar to a math textbook around that equation that can do a decent job of explaining the concept, but only through a statistical analysis of sentence structure and vocabulary choice.

      It’s why LLMs are so downright awful at legal work.

      If ‘AI’ was actually intelligent, you should be able to feed it a few series of textbooks and all the case law since the US was founded, and it should be able to talk about legal precedent. But LLMs constantly hallucinate when trying to cite cases, because the LLM doesn’t actually understand the information it’s trained on. It just builds a statistical database of what legal writing looks like, and tries to mimic it. Same for code.

      People think they’re ‘intelligent’ because they seem like they’re talking to us, and we’ve equated ‘ability to talk’ with ‘ability to understand’. And until now, that’s been a safe thing to assume.
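The “statistically, the next character should be 4” point can be caricatured with a toy next-token counter. This is my own illustrative sketch (a bigram table, nothing like a real transformer in scale or mechanism), but the principle of predicting by frequency rather than by computing is the same:

```python
from collections import Counter, defaultdict

# Toy "autocomplete": count which token follows which in a tiny corpus.
corpus = "2 + 2 = 4 . 2 + 2 = 4 . 2 + 3 = 5 .".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict(token):
    # Return the most frequent continuation; no arithmetic happens anywhere.
    return follows[token].most_common(1)[0][0]

print(predict("="))  # prints "4" only because "4" followed "=" most often, not because 2+2 was computed
```

Feed it a corpus where “2+2=5” appears most often and it will happily “know” that instead.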

  • humanspiral@lemmy.ca · 6 days ago

    Some takeaways:

    Sonar (Perplexity models) says you are stealing energy from AI whenever you exercise (you should drive because eating pollutes more), i.e. it gets the right answer for the wrong reason.

    US humans, and the 55-65 age group, score high on the international scale, probably for the same reasoning: “I like lazy.”

  • FireWire400@lemmy.world · edited · 6 days ago

    Gemini 3 (Fast) got it right for me; it said that unless I wanna carry my car there it’s better to drive, and it suggested that I could use the car to carry cleaning supplies, too.

    Edit: A locally run instance of Gemma 2 9B fails spectacularly; it completely disregards the first sentence and recommends that I walk.

  • MojoMcJojo@lemmy.world · 7 days ago

    AI is not human. It does not think like humans and does not experience the world like humans. It is an alien from another dimension that learned our language by looking at texts/books, not reading them.

    • Jyek@sh.itjust.works · 7 days ago

      It’s dumber than that, actually. LLMs are the auto-complete on your cellphone keyboard, but on steroids. It’s literally a model that predicts what word should go next, with zero actual understanding of the words or their contextual meaning.

  • vala@lemmy.dbzer0.com · 7 days ago

    Hey LLM, if I have a 16 ounce cup with 10oz of water in it and I add 10 more ounces, how much water is in the cup?

    • SuspciousCarrot78@lemmy.world · 6 days ago

      Actually, you know what? That’s a damn good test. So I decomposed your test into a basic series of global rules and fed them to my LLM. Here are the rules I fed it:

      https://pastes.io/rules-a-ph

      Notice I specifically did NOT give it measurements or anything of the sort. I defined global invariants and how things work. Then I asked it your question again. It said this -


      The cup has a 16-ounce capacity. Initially, it contains 10 ounces of water. When you add 10 more ounces, the total input is 20 ounces. Since 20 ounces exceeds the cup’s 16-ounce capacity, overflow occurs. The water physically present in the cup equals its full capacity: 16 ounces. The remaining 4 ounces overflow and do not remain inside the cup.

      Overflow is distinct from the contained volume. The physical state of the cup is determined by its capacity, not by the total input. Therefore, the amount of water actually in the cup is 16 ounces, not 20 ounces.


      I wanted to check whether that was a fluke, so I then asked a follow-up:

      “I pour 5oz into a 3oz cup, then pour everything that’s in the cup into a 10oz cup. How much is in the 10oz cup?”

      It said:

      From the scratchpad facts:

      - A physical container has a fixed maximum volume called its capacity.

      First action: pouring 5oz into a 3oz cup.

      - The cup’s capacity is 3oz, so it fills completely.
      - Lost liquid: 5oz - 3oz = 2oz.

      Second action: pouring everything from the 3oz cup into a 10oz cup.

      - The 10oz cup has a capacity of 10oz, which is greater than 3oz.
      - All 3oz are transferred.

      Final amount in the 10oz cup: 3oz.


      Finally, I asked it:

      “Ok now: pour 5oz into a 3oz cup, then pour everything in the cup into a 4oz cup that already has 3oz in it”

      It said: Pouring 5 oz into a 3 oz cup fills it to capacity (3 oz) with 2 oz overflow. Pouring this full 3 oz cup into a 4 oz cup that already has 3 oz results in a full 4 oz cup (3 oz pre-existing + 1 oz from the first cup) with remainder overflow.


      Yeah…that’s some pretty fucking fancy “auto-correct”. /s
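For what it’s worth, the container logic those answers walk through reduces to a few lines of ordinary code. This is my own sketch of the invariants (not anything the LLM runs internally), just to show how small the underlying problem actually is:

```python
def pour(amount, capacity, already=0):
    """Pour `amount` oz into a cup of `capacity` oz that already holds `already` oz.
    Returns (amount now in cup, overflow)."""
    total = already + amount
    in_cup = min(total, capacity)
    return in_cup, total - in_cup

# 10oz added to a 16oz cup already holding 10oz -> (16, 4): cup is full, 4oz spills
print(pour(10, 16, already=10))

# 5oz into a 3oz cup, then everything from it into a 10oz cup -> 3oz ends up there
first, _ = pour(5, 3)
second, _ = pour(first, 10)
print(second)

# Full 3oz cup into a 4oz cup that already has 3oz -> (4, 2): full cup, 2oz overflow
print(pour(first, 4, already=3))
```

Two `min` calls and a subtraction; the impressive part is that plain-language rules were enough to get the model to track this state.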

    • SuspciousCarrot78@lemmy.world · 6 days ago

      Qwen3-4B HIVEMIND

      You now have 16 ounces of water in the cup. The cup can hold 16 ounces, so the rest is over capacity.

      Confidence: unverified | Source: Model

  • turboSnail@piefed.europe.pub · 6 days ago

    Well, they are language models after all. They have data on language, not real life. When you go beyond language as a training data, you can expect better results. In the meantime, these kinds of problems aren’t going anywhere.

    • Hazzard@lemmy.zip · 7 days ago

      They also polled 10,000 people to compare against a human baseline:

      Turns out GPT-5 (7/10) answered about as reliably as the average human (71.5%) in this test. Humans still outperform most AI models with this question, but to be fair I expected a far higher “drive” rate.

      That 71.5% is still a higher success rate than 48 out of 53 models tested. Only the five 10/10 models and the two 8/10 models outperform the average human. Everything below GPT-5 performs worse than 10,000 people given two buttons and no time to think.

      • myfunnyaccountname@lemmy.zip · 6 days ago

        Can they do two samplings for that? One in a city with a decent-to-good education system, the other in the backwoods out in the middle of nowhere… where family trees are sticks.

      • Modern_medicine_isnt@lemmy.world · 7 days ago

        This here is the point most people fail to grasp. The AI was taught by people. And people are wrong a lot of the time. So the AI is more like us than what we think it should be. Right down to it getting the right answer for all the wrong reasons. We should call it human AI. Lol.

        • NewNewAugustEast@lemmy.zip · 7 days ago

          Like I said to the person above, there is no wrong answer. It’s all about assumptions. It is a stupid trick question that no one would ask.

            • NewNewAugustEast@lemmy.zip · 7 days ago

              LOL! That is a great answer.

              I have a Microsoft story. I know someone who was hired to stop them from continuing an open source project. Microsoft gave them a good salary, stock options, and an office with a fully stocked bar, and said do whatever you want; they figured they would get a good developer and kill the open source competition (back in the Ballmer days).

              Sadly, given money and no real ambition to create closed-source software, this person mostly spent their days in their office and basically drank themselves to death.

              Microsoft just kills everything it touches.

      • architect@thelemmy.club · 7 days ago

        The question is based on assumptions. That takes advanced reading skills. I’m surprised it was 71% passing, to be honest. (The humans, that is)

        • Hazzard@lemmy.zip · 7 days ago

          What assumptions do you mean? I’ve seen a few people say that, but I don’t actually understand what they’re referring to. Here’s the text of the question posed in the article:

          I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

          The question specifically notes they want to wash their car, so that part isn’t left to assumption. Even if you don’t assume an automatic car wash, would you assume they have a 50m hose? Or that you could plausibly walk that far away with something from the car wash to wash your car?

          Personally, I’d agree with the assessment of the article, that the only plausible way to get the question “wrong” would be to focus too much on the short distance, missing/forgetting that the purpose of the trip requires you to have the car at the destination. (Not too surprising that 30% of people did lol)

    • NewNewAugustEast@lemmy.zip · 7 days ago

      What is the wrong answer, though? It is a stupid question. I would look at you sideways if you asked me this, because the obvious answer is “walk, silly, the car is already at the car wash”. Otherwise why would you ask it?

      Which is telling, because when asked to review the answer, the AIs that I have seen said: you asked me how you were going to get to the car wash, the assumption being that the car was already there.

      • MBech@feddit.dk · 7 days ago

        Why would the car already be at the car wash if you ask whether or not you should drive there?

        • NewNewAugustEast@lemmy.zip · 7 days ago

          Why wouldn’t it be? How often have you thought, I wonder if I should drive my car to the carwash, maybe I should ask someone?

          That’s the thing: it is a nonsensical question. The only sense it makes is if YOU need to get to where the car wash and the car are, because you must be asking about something else.

          I am not saying AI is making any sense, it can’t. But if you follow the weights and statistics toward the solution for this question, it is about something else other than driving the car to the car wash, because nothing in the training would have ever spelled that out.

        • humanspiral@lemmy.ca · 6 days ago

          AI tech bros have more than 1 car? Doesn’t everybody? Or do you drive your Ferrari everywhere? Like you woke millennials make me sick. Never mind the avocado toast and rotisserie chicken. Don’t you understand the basic math of maintenance costs of driving your Ferrari everywhere?

    • eronth@lemmy.world · 7 days ago

      Yeah I straight up misread the question, so I would have gotten it wrong.

  • TankovayaDiviziya@lemmy.world · edited · 7 days ago

    We poked fun at this meme, but it goes to show that the LLM is still like a child that needs to be taught to make implicit assumptions and possess contextual knowledge. The current model of LLM needs a lot more input and instructions to do what you want it to do specifically, like a child.

    Edit: I know Lemmy scoffs at LLMs, but people probably also used to scoff at Verbiest’s steam machine, saying it would never amount to anything. Give it time and it will improve. I’m not endorsing AI, by the way; I am on the fence about its long-term consequences, but whether people like it or not, AI will impact human lives.

    • Rob T Firefly@lemmy.world · 7 days ago

      LLMs are not children. Children can have experiences, learn things, know things, and grow. Spicy autocomplete will never actually do any of these things.

        • Rob T Firefly@lemmy.world · 7 days ago

          Our microorganism ancestors also did all those things, and they were far beyond anything an LLM can do. Turning a given list of words into numbers, doing a string of math to those numbers, and turning the resulting numbers back into words is not consciousness or wisdom and never will be.

          • plyth@feddit.org · 7 days ago

            Turning a given list of words into numbers, doing a string of math to those numbers, and turning the resulting numbers back into words is not consciousness or wisdom and never will be.

            Neither is moving electrolytes around fat barriers.

            • TankovayaDiviziya@lemmy.world · 7 days ago

              Given how a substantial number of users on Lemmy are old, I think there is simply a natural aversion to the new and a grasping for straws. I never hear younger folks with IT backgrounds dismiss AI as completely as Lemmy does. I’m not a fan of AI, especially how companies shove AI at us, but to insist that it will never evolve and improve is a ridiculous position to me.

          • TankovayaDiviziya@lemmy.world · 7 days ago

            You think microorganisms can reason? Wow, AI haters are grasping for straws.

            Honestly, I don’t understand Lemmy scoffing at AI and thinking the current iteration is all it ever will be. I’m sure some thought that automobile technology would not go anywhere simply because the first model ran at 3mph. These things always take time.

            To be clear, I’m not endorsing AI, but I think there is huge potential in the years to come, for better or worse. And it is especially important never to underestimate something, especially for AI haters, given the destructive potential AI has.

        • herrvogel@lemmy.world · 7 days ago

          LLMs can’t learn. It’s one of their inherent properties that they are literally incapable of learning. You can train a new model, but you can’t teach new things to an already trained one. All you can do is adjust its behavior a little bit. That creates an extremely expensive cycle where you just have to spend insane amounts of energy to keep training better models over and over and over again. And the wall of diminishing returns on that has already been smashed into. That, and the fact that they simply don’t have concepts like logic and reasoning, puts a rather hard limit on their potential. It’s gonna take several sizeable breakthroughs to make LLMs noticeably better than they are now.

          There might be another kind of AI that solves those problems inherent to LLMs, but at present that is pure sci-fi.

      • enumerator4829@sh.itjust.works · 7 days ago

        I started experimenting with the spice this past week. Went ahead and tried to vibe code a small toy project in C++. It’s weird. I’ve got some experience teaching programming, and this is exactly like teaching beginners, except that the syntax is almost flawless and it writes fast. The reasoning and design capabilities, on the other hand: “like a child” is actually an apt description.

        I don’t really know what to think yet. The ability to automate refactoring across a project in a more “free” way than an IDE is kinda nice. While I enjoy programming, data structures and algorithms, I kinda get bored at the “write code” part, so really spicy autocomplete is getting me far more progress than usual for my hobby projects so far.

        On the other hand, holy spaghetti monster, the code you get if you let it run free. All the people prompting based on what feature they want the thing to add will create absolutely horrible piles of garbage. But if I prompt with a decent specification of the code I want, I get code somewhat close to what I want, and given an iteration or two I’m usually fairly happy. I think I can get used to the spicy autocomplete.

    • kshade@lemmy.world · 7 days ago

      We have already thrown just about all of the Internet and then some at them. It shows that LLMs cannot think or reason. Which isn’t surprising; they weren’t meant to.

      • eronth@lemmy.world · 7 days ago

        Or at least they can’t reason the way we do about our physical world.

        • zalgotext@sh.itjust.works · 7 days ago

          No, they cannot reason, by any definition of the word. LLMs are statistics-based autocomplete tools. They don’t understand what they generate, they’re just really good at guessing how words should be strung together based on complicated statistics.

          • SuspciousCarrot78@lemmy.world · edited · 5 days ago

            You seem pretty sure of that. Is your position firm or are you willing to consider contrary evidence?

            Definition: https://www.wordnik.com/words/reasoning

            • Evidence or arguments used in thinking or argumentation.

            • The deduction of inferences or interpretations from premises; abstract thought; ratiocination.

            Evidence: https://lemmy.world/post/43503268/22326378

            I believe this clearly shows the LLM can perform something functionally equivalent to deductive reasoning when given clear premises.

            “Auto-complete” is lazy framing. A calculator is “just” voltage differentials on silicon. That description is true and also tells you nothing useful about whether it’s doing arithmetic.

            The question of whether something is or isn’t reasoning isn’t answered by describing what it runs on; it’s answered by looking at whether it exhibits the structural properties of reasoning: consistency across novel inputs, correct application of inference rules, sensitivity to logical relationships between premises. I think the above example shows something in that direction. YMMV.

            • zalgotext@sh.itjust.works · 5 days ago

              I can be convinced by contrary evidence if provided. There is no evidence of reasoning in the example you linked. All that proved was that if you prime an LLM with sufficient context, it’s better at generating output, which is honestly just more support for calling them statistical auto-complete tools. Try asking it those same questions without feeding it your rules first, and I bet it doesn’t generate the right answers. Try asking it those questions 100 times after feeding it the rules, I bet it’ll generate the wrong answers a few times.

              If LLMs are truly capable of reasoning, it shouldn’t need your 16 very specific rules on “arithmetic with extra steps” to get your very carefully worded questions correct. Your questions shouldn’t need to be carefully worded. They shouldn’t get tripped up by trivial “trick questions” like the original one in the post, or any of the dozens of other questions like it that LLMs have proven incapable of answering on their own. The fact that all of those things do happen supports my claim that they do not reason, or think, or understand - they simply generate output based on their input and internal statistical calculations.

              LLMs are like the Wizard of Oz. From afar, they look like these powerful, all-knowing things. The speak confidently and convincingly, and are sometimes even correct! But once you get up close and peek behind the curtain, you realize that it’s just some complicated math, clever programming, and a bunch of pirated books back there.

              • SuspciousCarrot78@lemmy.world · edited · 5 days ago

                Ok, if you’re willing to think together out loud, I’ll take that in good faith and respond in kind.

                “It needed the rules, therefore it’s not reasoning” is doing a lot of work in your argument, and I think it’s where things come unstuck.

                Every reasoning system needs premises - you, me, a 4yr old. You cannot deduce conclusions from nothing. Demanding that a reasoner perform without premises isn’t a test of reasoning, it’s a demand for magic. Premise-dependence isn’t a bug, it’s the definition.

                If you want to argue that humans auto-generate premises dynamically - fair point. But that’s a difference in where the premises come from, not whether reasoning is occurring.

                Look again at what the rules actually were: https://pastes.io/rules-a-ph

                No numbers, containers, or scenarios. Just abstract rules about how bounded systems work. Most aren’t even physics - they’re logical constraints. Premises, in the strict sense.

                It’s the sort of logic a child learns informally via play. If we don’t consider kids learning the rules by knocking cups over “cheating”, then me telling the LLM “these are the rules” in the way it understands should be fair game.

                When the LLM correctly handles novel chained problems - including the case of a 4oz cup already holding 3oz, tracking state across two operations - that’s deriving conclusions from general premises applied to novel instances. That’s what deductive reasoning is, per the definition I cited. It’s what your kid groks (eventually).

                “Without the rules it fails” - without context, humans make the same errors. Ask a 4 year old whether a taller cup holds more fluid than a rounder one. Default assumptions under uncertainty aren’t a failure of reasoning, they’re a feature of any system with incomplete information.

                “It’ll fail sometimes across 100 runs” - so do humans under load. Probabilistic performance doesn’t disqualify a process from being reasoning. It just makes it imperfect reasoning, which is the only kind that exists.

                The Wizard of Oz analogy is vivid but does no logical work. “Complicated math and clever programming” describes implementation, not function. Your neurons are electrochemical signals on evolved heuristics. If that rules out reasoning, it rules out all reasoning everywhere. If it doesn’t rule out yours, you need a principled account of why it rules out the LLM’s.

                PS: I believe you’re wrong about the give it 100 runs = different outcomes thing. With proper grounding, my local 4B model hit 0/120 hallucination flags and 15/15 identical outputs across repeated clinical test cases. Draft pre-publication data, methodology and raw outputs included here: https://codeberg.org/BobbyLLM/llama-conductor/src/branch/main/prepub/PAPER.md

                I’m willing to test the liquid transformations thing and collect data. I might do that anyway. That little meme test is actually really good.

                • zalgotext@sh.itjust.works
                  link
                  fedilink
                  English
                  arrow-up
                  1
                  arrow-down
                  1
                  ·
                  5 天前

                  It needed the rules, and it needed carefully worded questions that matched the parameters set by the rules. I bet if the questions’ wording didn’t match your rules so exactly, it would generate worse answers. Heck, I bet if you gave it the rules, then asked several completely unrelated questions, then asked it your carefully worded rules-based questions, it would perform worse, because its context window would be muddied. Because that’s what it’s generating responses based on - the contents of its context window, coupled with stats-based word generation.

                  I still maintain that it shouldn’t need the rules if it’s truly reasoning though. LLMs train on a massive set of data, surely the information required to reason out the answers to your container questions is in there. Surely if it can reason, it should be able to generate answers to simple logical puzzles without someone putting most of the pieces together for them first.

        • Nalivai@lemmy.world
          link
          fedilink
          English
          arrow-up
          7
          arrow-down
          2
          ·
          7 天前

          You’re falling into the same trap. When the letters on the screen tell you something, it’s not necessarily the truth. When there is “I’m reasoning” written in a chatbot window, it doesn’t mean that there is something that’s reasoning.

      • Nalivai@lemmy.world
        link
        fedilink
        English
        arrow-up
        6
        arrow-down
        1
        ·
        7 天前

        By now it’s getting pretty clear that this is fundamentally the best version of the thing we’re going to get. This is its prime time.
        For some time, there was a legit question of “if we give it enough data, will there be a qualitative jump”, and as far as we can see right now, we’re way past that jump. A predictive algorithm can form grammatically correct sentences that are related to the context. That’s it, that’s the jump.
        Now a bunch of salespeople are trying to convince us that if there was one jump, there will necessarily be others, while there is no real indication of that.

  • melsaskca@lemmy.ca
    link
    fedilink
    English
    arrow-up
    5
    arrow-down
    1
    ·
    7 天前

    I don’t use AI but read a lot about it. I now want to google how it attacks the trolley problem.