Chatbots are random processes, so it's important to run each prompt at least a few times to see how consistent it is. (Ideally we'd use the API and do many trials.) I re-ran the river crossing prompt a couple more times on GPT-4 and reproduced your results, but worded differently each time. Interestingly, on the last attempt it explicitly said it recognized the problem:
---
This is a variant of the classic River Crossing puzzle, where you need to transport everyone across without certain groups being left together. Here's a possible solution to the problem:
First, the Earthling takes the Venutian to Earth via the wormhole, then the Earthling returns alone to the Galactic Federation. This leaves the Martian and the Moonling at the Galactic Federation, who are safe together, and the Venutian alone on Earth.
Next, the Earthling takes the Martian to Earth, leaving the Moonling alone at the Galactic Federation. But, she can't leave the Martian with the Venutian on Earth, so she leaves the Martian on Earth and takes the Venutian back to the Galactic Federation.
Then, she leaves the Venutian at the Galactic Federation and takes the Moonling to Earth. This leaves the Venutian and Martian at the Galactic Federation, who are safe together.
Finally, the Earthling returns to the Galactic Federation one last time to bring the Venutian to Earth.
Now, all of them, the Earthling, the Martian, the Venutian, and the Moonling are on Earth, and at no point were the Venutian and the Martian left alone together, nor the Venutian and the Moonling. The Earthling has successfully navigated the wormhole puzzle!
---
If someone wanted to explore this further, they could try varying the prompt to see when it breaks down. If we tell it that the the wolf is a vegetarian and the goat attacks wolves, does it get confused? What if it's given an unsolvable problem?
This is a little different because it's about in-context learning, but I recently saw an interesting paper about how if you give an LLM a problem with backwards labels, it will get confused. See: thegradient.pub/in-cont…
---
Edit: In general, I think this is a good post but the headline is overstated. It seems like the boundary between reasoning and statistics gets rather fuzzy once you allow for copying with substitutions, and LLM's are all about copying with substitutions. They're usually not copying word-for-word, but it's often unclear what level of abstraction they're working at.