An Analysis of Text-to-image Models of OpenAI, Stability AI, and Google
DOI: https://doi.org/10.54097/a3dhhm46

Keywords: Text-to-image generation; Multimodal artificial intelligence; Transformer; Latent Diffusion Model (LDM)

Abstract
Text-to-image generation has evolved rapidly through a series of significant models since it was first introduced in 2015. This paper examines the development of leading models: OpenAI's DALL·E 1–3 and GPT-4o, Stability AI's Stable Diffusion series (v1.5, XL, 3.0), and Google's Imagen 1–3. These model families show a striking overlap in the timing of their updates and iterations, and they share similar technological focuses and development trajectories at these points, even though their methods vary widely, ranging from Transformer-based autoregressive designs to latent diffusion with CLIP conditioning. Taking each year from 2021 to 2024 as a milestone, this paper compares the techniques the models employed at these points horizontally, identifying the convergence of their emphases and choices of technology. By assessing each model family's progress vertically, it verifies the necessity and effectiveness of the models' iterations. The study also identifies several remaining issues with these models, along with some possible solutions.
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.