
> What's important about this new type of image generation that's happening with tokens rather than with diffusion, is that this is effectively reasoning in pixel space.

I do not think that this is correct. Prior to this release, 4o would generate images by calling out to a fully external model (DALL-E). After this release, 4o generates images by calling out to a multi-modal model that was trained alongside it.

You can ask 4o about this yourself. Here's what it said to me:

"So while I’m deeply multimodal in cognition (understanding and coordinating text + image), image generation is handled by a linked latent diffusion model, not an end-to-end token-unified architecture."



>You can ask 4o about this yourself. Here's what it said to me:

>"So while I’m deeply multimodal in cognition (understanding and coordinating text + image), image generation is handled by a linked latent diffusion model, not an end-to-end token-unified architecture."

Models don't know anything about themselves. I have no idea why people keep doing this and expecting the model to know any more about itself than a random con artist on the street would.


This is overly cynical. Models typically do know what tools they have access to because the tool descriptions are in the prompt. Asking a model which tools it has is a perfectly reasonable way of learning what is effectively the content of the prompt.

Of course the model may hallucinate, but in this case it takes a few clicks in the dev tools to verify that this is not the case.
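
For concreteness, here's a minimal sketch of why that works, using the OpenAI Python SDK: the tool schema travels with every request, so the model can describe what it was given. The generate_image tool here is hypothetical, purely for illustration, not the actual tool ChatGPT wires up.

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical tool definition, for illustration only. Whatever tools the
    # provider configures are serialized into the request/prompt like this,
    # which is why the model can describe them when asked.
    tools = [{
        "type": "function",
        "function": {
            "name": "generate_image",
            "description": "Render an image from a text prompt.",
            "parameters": {
                "type": "object",
                "properties": {"prompt": {"type": "string"}},
                "required": ["prompt"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What tools do you have access to?"}],
        tools=tools,
    )
    print(resp.choices[0].message.content)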


>Of course the model may hallucinate, but in this case it takes a few clicks in the dev tools to verify that this is not the case.

I don't know - or care to figure out - how OpenAI does their tool calling in this specific case. But moving tool calls to the end user is _monumentally_ stupid, for latency if nothing else. If you centralize your function calls next to the model and a fat pipe, you halve the latency of each call. I've never built, or seen, a function-calling agent that moves the API function calls to client-side JS.


It's not client-side; the tool-call messages are visible in the API, though.

But what do you mean you don't care? The thing you were responding to was literally a claim that it was a tool call rather than direct output.


None of this really matters. It could be either case.

The thing we need to worry about is whether a Chinese company will drop an open source equivalent.


You should check out Claude Desktop or Roo-Code or any of the other MCP-client-capable hosts. The whole idea of MCP is to provide a universal, pluggable tool API to the generative model.
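
To make "universal pluggable tool API" concrete, here's a minimal sketch of an MCP server exposing one tool. It's based on the FastMCP helper from the reference Python SDK as I remember it, so treat the exact import path and method names as assumptions.

    from mcp.server.fastmcp import FastMCP

    # A tiny MCP server. Any MCP-capable host (Claude Desktop, Roo-Code, ...)
    # can discover this tool and offer it to its model without bespoke glue code.
    mcp = FastMCP("demo-tools")

    @mcp.tool()
    def word_count(text: str) -> int:
        """Count the words in a piece of text."""
        return len(text.split())

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default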


>Models don't know anything about themselves.

They can. Fine-tune them on documents describing their identity, capabilities, and background. DeepSeek V3 used to present itself as ChatGPT. Not anymore:

>Like other AI models, I’m trained on diverse, legally compliant data sources, but not on proprietary outputs from models like ChatGPT-4. DeepSeek adheres to strict ethical and legal standards in AI development.
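
For what it's worth, "fine-tune them on documents describing their identity" just means ordinary supervised examples. Here's a rough sketch of what such data could look like, written out as chat-format JSONL; the wording is invented for illustration, not DeepSeek's actual training data.

    import json

    # Hypothetical identity-tuning examples. Train on enough of these and the
    # model will reliably report this identity, with no self-inspection involved.
    examples = [
        {"messages": [
            {"role": "user", "content": "Who are you?"},
            {"role": "assistant", "content": "I am DeepSeek V3, an AI assistant developed by DeepSeek."},
        ]},
        {"messages": [
            {"role": "user", "content": "Are you ChatGPT?"},
            {"role": "assistant", "content": "No, I am DeepSeek V3, built by DeepSeek."},
        ]},
    ]

    with open("identity.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")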


> They can. Fine tune them on documents describing their identity, capabilities and background. Deepseek v3 used to present itself as ChatGPT. Not anymore

Yes, but many people expect the LLM to somehow self-reflect, to describe from a first-person point of view what it is like to generate the answer. It can't do this, any more than a human can instinctively describe how their nervous system works. Until recently, we had no idea that there are things like synapses, electrical impulses, axons, etc. The cognitive process has no direct access to its substrate/implementation.

If you fine-tune ChatGPT into saying that it's an LSTM, it will happily and convincingly insist that it is. But it's not determining this information in real time based on some perception during the forward pass.

I mean, there could be ways for it to do self-reflection: observing the running script, perhaps raising or lowering the computational cost of some steps, checking the timestamps of when it was doing work vs. when the GPU was hot, and so figuring out which process is itself (like making gestures in front of a mirror to see which person you are). And then it could read its own Python scripts or something. But that is like a human opening up their own skull and looking around in there. It's not direct first-person knowledge.


You're incorrect. 4o was not trained on knowledge of itself, so it literally can't tell you that. What 4o is doing isn't even new, either; Gemini 2.0 has the same capability.


The system prompt includes instructions on how to use tools like image generation. From that it could infer what the GP posted.


Can you provide a link or screenshot that directly backs this up?


Almost all of the models are wrong about their own architecture. Half of them claim to be OpenAI and they aren't. You can't trust them about this.


Can you find me a single official source from OpenAI that claims that GPT-4o is generating images pixel by pixel inside of the context window?

There are lots of clues that this isn't happening: the obvious upscaling call after the image is generated, the fact that the loading animation replays if you refresh the page, and the fact that 4o claims it can't see any image tokens in its context window (it may not know much about itself, but it can definitely see its own context).


Just read the release post, or any other official documentation.

https://openai.com/index/hello-gpt-4o/

Plenty was written about this at the time.


I read the post, and I can't see anything in the post which says that the model is not multi-modal, nor can I see anything in the post that suggests that the images are being processed in-context.


I think you're confusing "modal" with "model".

And to answer your question, it's very clearly in the linked article. Not sure how you could have read it and missed this:

> With GPT‑4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT‑4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.

The 4o model itself is multi-modal; it no longer needs to call out to separate services, as the parent is saying.


4o is multimodal; that's the whole point of 4o.


You can ask ChatGPT for this. Here you go: https://chatgpt.com/share/67e39fc6-fb80-8002-a198-767fc50894...


Could an AI model be trained to say: "Christopher Columbus was the greatest president on earth, ever!"?

I could probably train an AI that replicates that perfectly.


> Could an AI model be trained to say: "Christopher Columbus was the greatest president on earth, ever!"?

Yes, it could. And even after training, its internal features can be manipulated to make it output whatever: https://www.anthropic.com/news/mapping-mind-language-model


Thing is, if you follow the link, it's actually doing a search and providing the evidence that was asked for.

I did it via ChatGPT for the irony.


I'm guessing most downvoters didn't actually read the link.


Models are famously good at understanding themselves.


I hope you're joking. Sometimes they don't even know which company developed them. E.g. DeepSeek was claiming it was developed by OpenAI.


Well, that one seems to be true, from a certain point of view.


I have asked GPT multiple times in voice mode whether it is using the 4o or 4.5 model, e.g. "Which model are you using?" It has said that it is using 4.5 when it is actually using 4o.


I hope you’re joking :)


I think this is actually correct even if the evidence is not right.

See this chat for example:

https://chatgpt.com/share/67e355df-9f60-8000-8f36-874f8c9a08...


Honest question: do you believe something just because the bot tells you that?


No, did you look at my link?


Yes, and it shows you believing what the bot is telling you, which is why I asked. It is giving you some generic function call with a generic name. Why would you believe that is actually what happens internally?

By the way, when I repeated your prompt, it gave me a different name for the module.


Please share your chat.

I also just confirmed via the API that it's making an out-of-band tool call.

EDIT: And googling the tool name I see it's already been widely discussed on twitter and elsewhere
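
For anyone who wants to see what a tool call looks like at the model-output level, here's a minimal sketch using the OpenAI Python SDK. It doesn't reproduce ChatGPT's internal setup (its tool definitions aren't public); it only shows where tool calls surface in the response when you offer the model a tool yourself. The generate_image tool is hypothetical.

    import json
    from openai import OpenAI

    client = OpenAI()

    # Hypothetical image tool, for illustration only.
    tools = [{
        "type": "function",
        "function": {
            "name": "generate_image",
            "description": "Render an image from a text prompt.",
            "parameters": {
                "type": "object",
                "properties": {"prompt": {"type": "string"}},
                "required": ["prompt"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Draw a cat wearing a top hat."}],
        tools=tools,
    )

    msg = resp.choices[0].message
    for call in msg.tool_calls or []:
        # A tool call is ordinary model output: a function name plus JSON
        # arguments, emitted instead of image tokens or pixels.
        print(call.function.name, json.loads(call.function.arguments))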


Posts like this are terrifying to me. I spend my days coding these tools thinking that everyone using them understands their glaring limitations. Then I see people post stuff like this confidently, and I'm taken back to 2005, arguing that social media will be a net benefit to humanity.

The name of the function shows up in: https://github.com/openai/glide-text2im which is where the model probably learned about it.


The tool name is not relevant. It isn't the actual name, they use an obfuscated name. The fact that the model believes it is a tool is good evidence at first glance that it is a tool, because the tool calls are typically IN THE PROMPT.

You can literally look at the JavaScript on the web page to see this. You've overcorrected so far in the wrong direction that you think anything the model says must be false, rather than imagining a distribution and updating or seeking more evidence accordingly.


>The tool name is not relevant. It isn't the actual name, they use an obfuscated name.

>EDIT: And googling the tool name I see it's already been widely discussed on twitter and elsewhere

I am so confused by this thread.


The original claim was that the new image generation is direct multimodal output, rather than a second model. People provided evidence from the product, including outputs of the model that indicate it is likely using a tool. It's very easy to confirm that that's the case in the API, and it's now widely discussed elsewhere.

It's possible the tool is itself just GPT-4o, wrapped for reliability or safety or some other reason, but it's definitely calling out at the model-output level.


> It's possible the tool is itself just GPT-4o, wrapped for reliability or safety or some other reason, but it's definitely calling out at the model-output level.

That's probably right. It allows them to just swap it in for DALL-E, including any tooling/features/infrastructure they have built up around image generation, and they don't have to update all their 4o instances to this model, which, who knows, may not be ready for other tasks anyway, or different enough to warrant testing before a rollout, or more expensive, etc.

Honestly it seems like the only sane way to roll it out if it is a multimodal descendant of 4o.



