Pixels Aligned with Words: Technical Route and Horizons of Text-to-Image Generation
DOI: https://doi.org/10.54097/q61awv61

Keywords: Pixels Aligned with Words; Text-to-Image; Generation

Abstract
Artificial-intelligence text-to-image generation has received extensive attention in recent years. The text-to-image generation task converts natural language descriptions into corresponding visual content such as pictures and illustrations, and has demonstrated a powerful influence in fields such as education, economic modeling, and artistic creation. Based on their technical frameworks, current mainstream text-to-image models can be divided into diffusion-based models, generative adversarial networks, variational-autoencoder-based models, and other methods, with each architecture offering its own advantages and characteristics. Building on these representative frameworks, this paper introduces some of the latest technological developments, expounds their innovation directions and operating processes, and analyzes their shortcomings. It also introduces several classic datasets, such as LAION-5B and COCO, and analyzes the performance of representative methods on them. Finally, the paper summarizes the current problems in the text-to-image field, looks forward to future development directions, and aims to offer some inspiration to future researchers.
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
