Image-to-Image Translation Using a cGAN

  • Built a Conditional Generative Adversarial Network (cGAN) to generate photorealistic images from black-and-white sketches of human faces.
  • Fine-tuned the hyperparameters to reduce the training time and improve the quality of the output images.

In this project, we use the cGAN described in the paper Image-to-Image Translation with Conditional Adversarial Networks (commonly referred to as pix2pix) and apply it to synthesizing photorealistic images from black-and-white sketches of a specific person. Overall, the project is an empirical study of cGANs, with an eye towards finding practical applications for the technology.


  • The ground truth images are color photos of Barack Obama, manually collected from the Internet.
  • The input images are grayscale sketches created manually from the ground truth color images using a Photoshop filter.
  • Images are cropped to 256×256 pixels.
  • Each sketch and its real image were combined into a single paired image to form the final data. The training, validation and testing sets were created from these pairs.
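The pairing step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes each image is already loaded as a 256×256×3 NumPy array, and the function name `combine_pair` is hypothetical. The photo goes on the left and the sketch on the right, matching the example caption below.

```python
import numpy as np

def combine_pair(photo: np.ndarray, sketch: np.ndarray) -> np.ndarray:
    """Concatenate photo (left) and sketch (right) along the width axis."""
    if photo.shape != sketch.shape:
        raise ValueError("photo and sketch must have the same shape")
    return np.concatenate([photo, sketch], axis=1)

# Dummy arrays stand in for real images here.
photo = np.zeros((256, 256, 3), dtype=np.uint8)
sketch = np.full((256, 256, 3), 255, dtype=np.uint8)
pair = combine_pair(photo, sketch)
print(pair.shape)  # (256, 512, 3)
```

Storing both halves in one file keeps each sketch aligned with its ground truth, so any later transform can be applied to both at once.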

Example image given as input for training:
Ground truth image (left) and input sketch (right)


ngf and ndf set the number of filters in the first convolutional layer of the generator and discriminator respectively, and so control the number of parameters in each network. The original pix2pix default is ngf = ndf = 64. In our implementation we use 32 to reduce the number of parameters and speed up training.
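To get a feel for how much halving ngf saves, the sketch below counts the parameters of a simplified four-layer encoder whose channel widths are multiples of the base filter count. The layer layout, the 4×4 kernel, and the helper names `conv_params` and `encoder_params` are assumptions made for illustration; this is not the exact pix2pix generator.

```python
def conv_params(c_in: int, c_out: int, kernel: int = 4) -> int:
    """Weights plus biases of one 2-D convolution layer."""
    return kernel * kernel * c_in * c_out + c_out

def encoder_params(base: int, in_channels: int = 3) -> int:
    """Parameter count of a simplified 4-layer encoder.

    Channel widths are base, 2*base, 4*base, 8*base. This is an
    illustrative stand-in, not the exact pix2pix architecture.
    """
    widths = [base, 2 * base, 4 * base, 8 * base]
    total, c_in = 0, in_channels
    for c_out in widths:
        total += conv_params(c_in, c_out)
        c_in = c_out
    return total

p64 = encoder_params(64)
p32 = encoder_params(32)
print(p64, p32, round(p64 / p32, 2))
```

Because both the input and output channels of most layers scale with the base value, halving it cuts the parameter count by roughly a factor of four in this toy model.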

Since we are training with fewer parameters, the visual outputs would be noticeably worse than those of the original paper. To obtain better results, we introduce a simple trick that augments the training dataset with more images and hence produces more detailed outputs. The idea is similar to pyramid scaling: we train on different random scales of each image. We resize each image to a random size between 286 and 300 pixels per side, then randomly crop it back to 256×256.

Training on different scales helps capture details when using a smaller number of parameters. This is done by transforming the input images with the additional transform function in the code.
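The random scale-and-crop augmentation could look roughly like the sketch below. It is not the project's transform function: the names `resize_nearest` and `random_scale_crop` are hypothetical, nearest-neighbour resizing is used only to keep the example dependency-free (a real pipeline would more likely use bilinear or bicubic interpolation), and treating 300 as an inclusive upper bound is an assumption.

```python
import numpy as np

def resize_nearest(img: np.ndarray, size: int) -> np.ndarray:
    """Resize an H x W x C image to size x size with nearest-neighbour sampling."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return img[rows][:, cols]

def random_scale_crop(pair, rng, low=286, high=300, out=256):
    """Upscale both images to one random size, then crop the same window."""
    size = int(rng.integers(low, high + 1))      # inclusive upper bound: assumption
    top = int(rng.integers(0, size - out + 1))
    left = int(rng.integers(0, size - out + 1))
    photo, sketch = (resize_nearest(x, size) for x in pair)
    window = (slice(top, top + out), slice(left, left + out))
    return photo[window], sketch[window]

rng = np.random.default_rng(0)
photo = np.zeros((256, 256, 3), dtype=np.uint8)
sketch = np.zeros((256, 256, 3), dtype=np.uint8)
a, b = random_scale_crop((photo, sketch), rng)
print(a.shape, b.shape)
```

The same scale and crop window are applied to the photo and the sketch so that the pair stays pixel-aligned; calling the function repeatedly on one pair yields many slightly different training samples.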

Number of images: 50 original images + images generated by scaling the original images
Number of epochs: 2000
Time taken for training: 35 minutes

Find the entire project and the results here: