And it connects the dots in such a way as to develop this interesting scenario, while at the same time considering so many different interactions between the room and the animals.
Because it knows which text is associated with which other text, in quite complex and subtle ways. It's exceptionally good at analysing text without needing to understand anything more than which words are seen in proximity to which other words, and how their placement in relation to each other affects other bits of text.
It is amazing how well this technique works. But at no point has it linked the word "cat" to real actual cats, because it has literally got no access to any real actual cats. It has had no access to any information beyond the word "cat" and its relationship to other words.
You are calling it text associated with other text.
But I think it has to have internalized it in a much more complex way to be able to produce output like that for any arbitrary configuration of different things.
This internalization, and the ability to apply it to produce scenarios like that, tells me it must have enough understanding: even if it hasn't been able to "see" a cat, it has an understanding of the cat's characteristics and description, and of how the same thing can interact with other things. To me that's understanding of the object.
Alternatively, if we are talking about a blind and deaf person, and that person has never encountered any cats, would they be unable to understand what a cat ultimately is?
If all their input is in braille and tactile sign language?
"it has an understanding of its characteristics and description"
It has no access to the characteristics. It literally only has access to text. It knows that some text is associated with the word "cat", but it doesn't know what "fluffy" means, because it has never seen fluff. There is literally an infinite regress of meaning, where you never get out of language to actual meaning, unless you have some kind of experience of "fluff" (or something else which can be related to "fluff").
A is a bit like B which is a bit like C, which is like D, which is a mix of B and E, which is quite F-like. This never gets you to meaning, because the meaning is the sensation, or the experience. And it doesn't have that.
(Blind people would be able to understand the physicality of a cat, because they can still experience them in a variety of ways. Whether they can understand that the cat is white is down to the kind of blindness, but if they've never seen colour then probably not.)
And for us humans, the way I understand fluffiness is through the senses of sight and touch. We have associated this sight input and this touch input with the label “fluffiness”. Now, we have experienced this input a lot throughout our lives, but technically this input in its original form is just light waves and the signals from our nerves physically touching the fluffy object. This could in theory be represented in many data formats; it could be represented as a combination or group of input tokens. So the understanding for us is just the association of this input with this label. I don’t see how this in particular would give us any special understanding over an LLM that was given the raw input directly. It’s just a special case where we have had this input many times over. It seems arbitrary whether the LLM has been fed this data in one format or another, e.g. text as opposed to light waves or touch feedback.
This extra input could make those models more intelligent, because it could help them uncover hidden patterns, but it doesn’t seem like it would be the difference between understanding and not understanding.
Text is just a meaningful compression of this sensory input data.
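To make the "group of input tokens" point concrete, here is a minimal sketch in Python, purely illustrative and not how any real multimodal model works: the fake pressure signal, the bin count, and the token names are all my own assumptions. It just shows that a raw sensory signal can be discretized into a token sequence the same way text gets tokenized.

    import math

    def quantize_signal(samples, num_bins=8):
        """Map each real-valued sample to one of num_bins discrete tokens."""
        lo, hi = min(samples), max(samples)
        span = (hi - lo) or 1.0  # avoid division by zero for a flat signal
        tokens = []
        for s in samples:
            # Scale the sample into [0, 1), then pick a bin index.
            bin_id = min(int((s - lo) / span * num_bins), num_bins - 1)
            tokens.append(f"<touch_{bin_id}>")
        return tokens

    # Hypothetical "stroking a fluffy surface" pressure readings.
    pressure = [0.1 + 0.05 * math.sin(t / 3.0) for t in range(12)]

    print(quantize_signal(pressure))
    # e.g. ['<touch_0>', '<touch_2>', ...] -- a token sequence a model could
    # be trained on in exactly the way it is trained on word tokens.

In that framing, whether the model sees the string "fluffy" or a run of touch tokens is a choice of encoding; both are just discrete symbols whose associations it learns.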
https://chatgpt.com/share/68815b5f-7570-443a-bcdf-e6ef4d233b...