This top-to-bottom drawing – does this tell us anything about the underlying model architecture? AFAIK diffusion models don't work like that: they denoise the full frame over many steps. In the past there were attempts to synthesize a picture by predicting the next pixel, but I wasn't aware of a shift to that kind of architecture within OpenAI.
Yes, the model card explicitly says it's autoregressive, not diffusion. And it's not a separate model; it's a native ability of GPT-4o, which is a multimodal model. They just didn't make this ability public until now. I assume they spent the time fine-tuning it to improve prompt following.
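For anyone wondering why the generation order would even be visible: here's a minimal sketch of the contrast between the two paradigms. The helpers (`denoise_step`, `predict_next_token`, `decode`) are hypothetical placeholders, not OpenAI's actual code.

```python
import numpy as np

def diffusion_generate(denoise_step, num_steps=50, shape=(256, 256, 3)):
    """Diffusion: every step refines the ENTIRE frame at once."""
    image = np.random.randn(*shape)       # start from pure noise
    for t in reversed(range(num_steps)):
        image = denoise_step(image, t)    # whole image updated each step
    return image                          # no inherent top-to-bottom order

def autoregressive_generate(predict_next_token, decode, num_tokens=1024):
    """Autoregressive: image tokens are emitted one at a time in raster
    (top-to-bottom, left-to-right) order, each conditioned on the prefix."""
    tokens = []
    for _ in range(num_tokens):
        tokens.append(predict_next_token(tokens))  # next token given prefix
    return decode(tokens)
```

If the UI decodes partial prefixes of the token stream as they arrive, you get exactly the top-down reveal people are noticing, whereas a diffusion model in progress would look like the whole frame sharpening at once.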