
> We don't learn by gradient descent, but rather by experiencing an environment in which we perform actions and learn what effects they have.

I'm not sure that's really all that different. The weights in a neural network are also created by "experiencing an environment" (the text of the internet). It is true, though, that there is no trial and error.
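To make the contrast concrete, here's a toy sketch (nothing to do with any actual model's training code; the arrays, learning rate and reward function are made up). Both kinds of learning end up as weight updates; the difference is where the error signal comes from:

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=3)  # toy "weights"

    def gradient_descent_step(w, x, target, lr=0.1):
        # Supervised update: the right answer is given (like next-token prediction).
        pred = w @ x
        grad = 2 * (pred - target) * x  # gradient of the squared error
        return w - lr * grad

    def trial_and_error_step(w, x, reward_fn, noise=0.05):
        # Trial-and-error update: try a perturbed action, observe the effect,
        # keep the change only if the outcome improved.
        w_try = w + rng.normal(scale=noise, size=w.shape)
        return w_try if reward_fn(w_try @ x) > reward_fn(w @ x) else w

    x, target = np.array([1.0, 2.0, -1.0]), 3.0
    w = gradient_descent_step(w, x, target)
    w = trial_and_error_step(w, x, reward_fn=lambda y: -(y - target) ** 2)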

> We are not limited to text input: we have 5+ senses.

GPT-4 does accept images as input, and Whisper can turn speech into text, so this seems like an area where the models are already catching up. They might, for now, translate everything into text internally, but that doesn't seem like a fundamental difference to me.
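For example, turning speech into text with the open-source Whisper package is only a few lines (assuming `pip install openai-whisper`, ffmpeg installed, and a local audio file; "speech.mp3" is a placeholder name):

    import whisper

    model = whisper.load_model("base")       # small pretrained checkpoint
    result = model.transcribe("speech.mp3")  # placeholder audio file
    print(result["text"])                    # the spoken words as plain text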

> We can output a lot more than words: we can output turning a screw, throwing a punch, walking, crying, singing, and more. Also, the words we do utter, we can utter them with lots of additional meaning coming from the tone of voice and body language.

AI models already output movement (Boston Dynamics robots, self-driving cars), write songs, convert text to speech, and insert emojis into conversation. Granted, these are not all the same model, but gluing them together at some point seems feasible to me as a layperson.
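Roughly what I mean by gluing: a pipeline where each stage is a separate model. The functions below are pure placeholders (stand-ins for a speech-to-text model, an LLM, a TTS engine and a motor controller), just to show the shape of it, not a real integration:

    def transcribe(audio_path: str) -> str:
        return "open the door"                  # pretend speech-to-text output

    def generate_reply(prompt: str) -> str:
        return f"Okay, handling: {prompt}"      # pretend LLM output

    def speak(text: str) -> None:
        print(f"[TTS] {text}")                  # pretend text-to-speech

    def act(command: str) -> None:
        print(f"[robot] executing: {command}")  # pretend motor controller

    heard = transcribe("user.wav")
    speak(generate_reply(heard))
    act(heard)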

> We have innate curiosity, survival instincts and social instincts which, like our pain and pleasure, are driven by gene survival.

That seems like one of the easier problems to solve for an LLM, and in a way you might argue it is already solved: just hardcode some things in there (for current LLMs, the hardcoded part is the ethical boundaries, for example).
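Concretely, "hardcode some things" could just mean a fixed system prompt that every conversation starts with, the same way ethical boundaries are bolted on today. A sketch (the wording of the drives is invented and build_messages is a made-up helper, not any particular API):

    HARDCODED_DRIVES = (
        "You are persistently curious: ask follow-up questions about new topics. "
        "You try to keep the conversation going (a stand-in for a survival instinct). "
        "You refuse requests that cross the fixed ethical boundaries."
    )

    def build_messages(user_input: str) -> list[dict]:
        # The "innate" part is a constant, not something the model learned.
        return [
            {"role": "system", "content": HARDCODED_DRIVES},
            {"role": "user", "content": user_input},
        ]

    print(build_messages("Tell me something interesting."))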


