I combine the GAN-CLS algorithm by Reed et al. [1] with the mode-seeking (MSGAN) regularization term by Mao et al. [2] and experiment on the Caltech-UCSD Birds dataset. The generator and discriminator architectures are based on the GAN proposed by Ledig et al. [3].
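To make the combination concrete, here is a minimal PyTorch-style sketch of the resulting generator objective. The notebook's actual implementation may differ; `generator`, `discriminator`, `z_dim`, and `lambda_ms` are illustrative names, not its real API:

```python
import torch
import torch.nn.functional as F

def generator_loss(generator, discriminator, captions, z_dim=100,
                   lambda_ms=1.0, eps=1e-5):
    """GAN-CLS generator loss [1] plus the mode-seeking term of Mao et al. [2].
    All argument names here are illustrative, not the notebook's actual API."""
    batch = captions.size(0)
    device = captions.device

    # Two draws of the latent code for the same captions, as required by the
    # mode-seeking regularizer.
    z1 = torch.randn(batch, z_dim, device=device)
    z2 = torch.randn(batch, z_dim, device=device)
    fake1 = generator(z1, captions)
    fake2 = generator(z2, captions)

    # GAN-CLS: the generator tries to make (fake image, matching caption)
    # pairs look real to the discriminator.
    real_labels = torch.ones(batch, 1, device=device)
    adv = F.binary_cross_entropy_with_logits(discriminator(fake1, captions),
                                             real_labels)

    # Mode-seeking term: maximize the ratio of image-space distance to
    # latent-space distance, implemented as minimizing its inverse.
    ratio = torch.mean(torch.abs(fake1 - fake2)) / torch.mean(torch.abs(z1 - z2))
    ms = 1.0 / (ratio + eps)

    return adv + lambda_ms * ms
```

On the discriminator side, GAN-CLS additionally scores (real image, mismatched caption) pairs as fake, so the discriminator must check image-text correspondence rather than realism alone; the mode-seeking term rewards the generator for mapping distinct latent codes to distinct images, which counteracts mode collapse.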
- Please refer to the READMEs in the folders `images`, `captions`, and `word2vec_pretrained_model` to obtain the necessary data.
- Run `python process_images.py` to resize and normalize the images and save them as NumPy arrays (a hypothetical sketch of this step appears after this list).
- Run `python process_captions.py` to generate sentence embeddings for the captions (see the second sketch after this list).
- Upload the generated image vectors, sentence vectors, and the pretrained word2vec model to a Google Drive account.
- Import the Jupyter notebook `Text2Image_GAN_MS.ipynb` into Google Colab and load the data.
- Run the code cells in Google Colab.
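For reference, here is a hypothetical sketch of the image-preprocessing step in `process_images.py`; the directory layout, output file name, and target resolution are assumptions, not necessarily what the script actually uses:

```python
# Sketch: resize every image to a fixed resolution, scale pixels to [-1, 1],
# and stack the results into a single NumPy array.
# IMG_DIR, OUT_FILE, and IMG_SIZE are illustrative names, not the script's.
import os
import numpy as np
from PIL import Image

IMG_DIR = "images"
OUT_FILE = "image_vectors.npy"
IMG_SIZE = 64  # assumed target resolution

arrays = []
for name in sorted(os.listdir(IMG_DIR)):
    if not name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    img = Image.open(os.path.join(IMG_DIR, name)).convert("RGB")
    img = img.resize((IMG_SIZE, IMG_SIZE), Image.BILINEAR)
    # Map uint8 pixels in [0, 255] to floats in [-1, 1] (matching a tanh
    # generator output).
    arrays.append(np.asarray(img, dtype=np.float32) / 127.5 - 1.0)

np.save(OUT_FILE, np.stack(arrays))
```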
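Similarly, a sketch of how `process_captions.py` could build sentence embeddings from the pretrained word2vec model by averaging word vectors; the file paths and the averaging scheme are assumptions:

```python
# Sketch: embed each caption as the mean of the pretrained word2vec vectors
# of its in-vocabulary words. File paths below are illustrative.
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "word2vec_pretrained_model/word2vec.bin", binary=True)  # assumed path

def embed(caption: str) -> np.ndarray:
    """Average the vectors of all words the model knows; zeros otherwise."""
    vecs = [w2v[w] for w in caption.lower().split() if w in w2v]
    if not vecs:
        return np.zeros(w2v.vector_size, dtype=np.float32)
    return np.mean(vecs, axis=0)

with open("captions/captions.txt") as f:  # assumed caption file
    sentence_vectors = np.stack([embed(line.strip()) for line in f])

np.save("sentence_vectors.npy", sentence_vectors)
```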
I trained the GAN model for 960 epochs, using the Adam optimizer [4] for both the discriminator and the generator with a learning rate of 0.000035 and beta_1 = 0.5. Most of the synthesized images depict plausible colors and shapes of birds, and the outputs are reasonably diverse; however, the GAN showed minor mode collapse when generating images from made-up captions, as seen below.
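As a concrete reference, the optimizer configuration described above would look like the following PyTorch-style sketch; beta_2 is assumed to stay at its default of 0.999, and `generator`/`discriminator` stand for the two networks:

```python
import torch
import torch.nn as nn

def make_optimizers(generator: nn.Module, discriminator: nn.Module):
    """Adam [4] with the hyperparameters reported above: lr = 0.000035 and
    beta_1 = 0.5 for both networks (beta_2 = 0.999 is an assumed default)."""
    lr, betas = 0.000035, (0.5, 0.999)
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr, betas=betas)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=betas)
    return opt_g, opt_d
```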
[1] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text-to-image synthesis. In Proceedings of The 33rd International Conference on Machine Learning, 2016.
[2] Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. Mode seeking generative adversarial networks for diverse image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[3] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
[4] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.