Action Unit Driven Facial Expression Synthesis from a Single Image with Patch Attentive GAN
Yong Zhao
Shaanxi Key Laboratory on Speech and Image Information Processing, National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University (NPU), Xi'an, China
Audio Visual Signal Processing (AVSP) Research Laboratory, Department of Electronics and Informatics (ETRO), Vrije Universiteit Brussel (VUB), Brussels, Belgium

Le Yang
Shaanxi Key Laboratory on Speech and Image Information Processing, National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University (NPU), Xi'an, China

Ercheng Pei
Shaanxi Key Laboratory on Speech and Image Information Processing, National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University (NPU), Xi'an, China

Meshia Cédric Oveneke
Audio Visual Signal Processing (AVSP) Research Laboratory, Department of Electronics and Informatics (ETRO), Vrije Universiteit Brussel (VUB), Brussels, Belgium
Artificial Intelligence Research Lab, Fit-For Purpose Technologies, Brussels, Belgium

Mitchel Alioscha-Perez
Audio Visual Signal Processing (AVSP) Research Laboratory, Department of Electronics and Informatics (ETRO), Vrije Universiteit Brussel (VUB), Brussels, Belgium

Dongmei Jiang (Corresponding Author)
Shaanxi Key Laboratory on Speech and Image Information Processing, National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University (NPU), Xi'an, China
Peng Cheng Laboratory, Shenzhen, Guangdong, China

Hichem Sahli
Audio Visual Signal Processing (AVSP) Research Laboratory, Department of Electronics and Informatics (ETRO), Vrije Universiteit Brussel (VUB), Brussels, Belgium
Interuniversity Microelectronics Centre (IMEC), Heverlee, Belgium
[Correction added on 8 April 2021, after first online publication: Reference [ACK*17] and the in-text citations to it had been mistakenly omitted and were restored.]
Abstract
Recent advances in generative adversarial networks (GANs) have shown tremendous success in facial expression generation. However, generating vivid and expressive facial expressions at the level of Action Units (AUs) remains challenging, because automatic AU intensity estimation is itself a difficult, unsolved task. In this paper, we propose a novel synthesis-by-analysis approach that leverages the GAN framework together with a state-of-the-art AU detection model to achieve better AU-driven facial expression generation. Specifically, we design a novel discriminator architecture by modifying a patch-attentive AU detection network for AU intensity estimation and combining it with a global image encoder for adversarial learning, forcing the generator to produce more expressive and realistic facial images. We also introduce a balanced sampling approach that alleviates the imbalanced learning problem in AU synthesis. Extensive experiments on DISFA and DISFA+ show that our approach outperforms the state of the art, both quantitatively and qualitatively, in terms of photo-realism and expressiveness of the generated facial expressions.
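Two components described in the abstract lend themselves to a brief illustration. The discriminator couples a patch-attentive branch, which estimates AU intensities from attention-weighted local features, with a global image encoder that drives the adversarial decision. The PyTorch sketch below shows one minimal way such a two-branch discriminator could be wired; the backbone depth, patch grid, attention pooling, and the twelve-AU output are illustrative assumptions, not the authors' published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchAttentiveDiscriminator(nn.Module):
    """Two-branch critic: patch-attentive AU-intensity head + global realism head."""

    def __init__(self, num_aus=12, feat_dim=64):
        super().__init__()
        # Shared convolutional backbone (depth and widths are assumptions).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(feat_dim, feat_dim * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(feat_dim * 2, feat_dim * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        # Patch-attention branch: one score per spatial cell, softmax over cells.
        self.att = nn.Conv2d(feat_dim * 4, 1, kernel_size=1)
        self.au_head = nn.Linear(feat_dim * 4, num_aus)
        # Global branch: encode the whole image for the adversarial score.
        self.global_enc = nn.Conv2d(feat_dim * 4, feat_dim * 4, 4, stride=2, padding=1)
        self.adv_head = nn.Linear(feat_dim * 4, 1)

    def forward(self, x):
        f = self.backbone(x)                       # (B, C, H, W) feature map
        b, c, h, w = f.shape
        # Attention weights over the H*W grid of patches.
        a = F.softmax(self.att(f).view(b, -1), dim=1).view(b, 1, h, w)
        au_feat = (f * a).sum(dim=(2, 3))          # attention-weighted pooling
        au_intensity = self.au_head(au_feat)       # per-AU intensity estimates
        g = F.adaptive_avg_pool2d(self.global_enc(f), 1).flatten(1)
        realism = self.adv_head(g)                 # adversarial (critic) score
        return realism, au_intensity
```

For instance, `realism, aus = PatchAttentiveDiscriminator()(torch.randn(2, 3, 128, 128))` yields a critic score and per-AU intensity estimates for each image. The balanced sampling approach can likewise be approximated with an inverse-frequency sampler over quantized AU intensities; the binning rule below (weighting each sample by the frequency of its rounded maximum AU intensity) is an assumption for illustration, not the paper's exact scheme.

```python
import numpy as np
from torch.utils.data import WeightedRandomSampler

def make_balanced_sampler(au_labels):
    """au_labels: (N, num_aus) array of AU intensities on the 0-5 FACS scale."""
    # Assign each sample to a bin by its rounded maximum AU intensity.
    bins = np.rint(np.asarray(au_labels).max(axis=1)).astype(int)
    counts = np.bincount(bins, minlength=6).astype(float)
    # Inverse-frequency weights: rare high-intensity samples are drawn more often.
    weights = 1.0 / np.maximum(counts[bins], 1.0)
    return WeightedRandomSampler(weights.tolist(), num_samples=len(weights))
```

Passing the returned sampler to a `DataLoader` via its `sampler` argument draws minibatches with rare high-intensity expressions oversampled.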
Supporting Information
| Filename | Description |
| --- | --- |
| cgf14202-sup-0001-video.mp4 (5.7 MB) | Video S1 |
Please note: The publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.
References
- [ACB17] Arjovsky M., Chintala S., Bottou L.: Wasserstein generative adversarial networks. In International Conference on Machine Learning (2017), pp. 214–223.
- [ACK*17] Averbuch-Elor H., Cohen-Or D., Kopf J., Cohen M. F.: Bringing portraits to life. ACM Transactions on Graphics (TOG) 36, 6 (2017), 196.
- [BBPV03] Blanz V., Basso C., Poggio T., Vetter T.: Reanimating faces in images and video. Computer Graphics Forum 22, 3 (2003), 641–650.
- [BCB14] Bahdanau D., Cho K., Bengio Y.: Neural machine translation by jointly learning to align and translate. 3rd International Conference on Learning Representations (ICLR) (2015).
- [BLRW17] Brock A., Lim T., Ritchie J. M., Weston N.: Neural photo editing with introspective adversarial networks. In 5th International Conference on Learning Representations (ICLR) (2017), pp. 1–15.
- [BMR15] Baltrušaitis T., Mahmoud M., Robinson P.: Cross-dataset learning and person-specific normalisation for automatic action unit detection. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) (2015), vol. 6, IEEE, pp. 1–6.
- [BV99] Blanz V., Vetter T.: A morphable model for the synthesis of 3d faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (1999), ACM Press/Addison-Wesley Publishing Co., pp. 187–194.
- [BZLM18] Baltrusaitis T., Zadeh A., Lim Y. C., Morency L.-P.: Openface 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018) (2018), IEEE, pp. 59–66.
- [CCK*18] Choi Y., Choi M., Kim M., Ha J.-W., Kim S., Choo J.: Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018).
- [CUYH20] Choi Y., Uh Y., Yoo J., Ha J.-W.: Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020).
- [CWW*16] Cao C., Wu H., Weng Y., Shao T., Zhou K.: Real-time facial animation with image-based dynamic avatars. ACM Transactions on Graphics (TOG) 35, 4 (2016), 126.
- [CWZ*14] Cao C., Weng Y., Zhou S., Tong Y., Zhou K.: Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics 20, 3 (2014), 413–425.
- [DSC18] Ding H., Sricharan K., Chellappa R.: Exprgan: Facial expression editing with controllable expression intensity. In Thirty-Second AAAI Conference on Artificial Intelligence (2018).
- [EJC19] Ertugrul I. O., Jeni L. A., Cohn J. F.: Pattnet: Patch-attentive deep network for action unit detection. In British Machine Vision Conference (2019).
- [ELYC19] Ertugrul I. O., Yang L., Jeni L. A., Cohn J. F.: D-pattnet: Dynamic patch-attentive deep network for action unit detection. Frontiers in Computer Science 1 (2019).
- [FBQSM16] Fabian Benitez-Quiroz C., Srinivasan R., Martinez A. M.: Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 5562–5570.
- [FE78] Friesen W. V., Ekman P.: Facial action coding system: A technique for the measurement of facial movement. Consulting Psychologists Press, Palo Alto (1978).
- [Féc] Fréchet M.: On the distance of two laws of probability. Comptes Rendus Hebdomadaires des Séances de l'Académie des Sciences 244, 6 (1957), 689–692.
- [GAA*17] Gulrajani I., Ahmed F., Arjovsky M., Dumoulin V., Courville A. C.: Improved training of wasserstein gans. In Advances in Neural Information Processing Systems (2017), pp. 5767–5777.
- [GCDlT15] Girard J. M., Cohn J. F., De la Torre F.: Estimating smile intensity: a better way. Pattern Recognition Letters 66 (2015), 13–21.
- [GPAM*14] Goodfellow I., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y.: Generative adversarial nets. In Advances in Neural Information Processing Systems (2014), pp. 2672–2680.
- [GSZ*18] Geng J., Shao T., Zheng Y., Weng Y., Zhou K.: Warp-guided gans for single-photo facial animation. ACM Transactions on Graphics (TOG) 37, 6 (2018), 1–12.
- [GTDUM15] Gudi A., Tasli H. E., Den Uyl T. M., Maroulis A.: Deep learning based facs action unit occurrence and intensity estimation. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) (2015), vol. 6, IEEE, pp. 1–5.
- [GZC*16] Garrido P., Zollhöfer M., Casas D., Valgaerts L., Varanasi K., Pérez P., Theobalt C.: Reconstruction of personalized 3d face rigs from monocular video. ACM Transactions on Graphics (TOG) 35, 3 (2016), 28.
- [HKZ*17] He Z., Kan M., Zhang J., Chen X., Shan S.: A fully end-to-end cascaded cnn for facial landmark detection. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017) (2017), IEEE, pp. 200–207.
- [HRU*17] Heusel M., Ramsauer H., Unterthiner T., Nessler B., Hochreiter S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems (2017), pp. 6626–6637.
- [HZRS16] He K., Zhang X., Ren S., Sun J.: Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.
- [IS15] Ioffe S., Szegedy C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning (ICML) (2015), pp. 448–456.
- [IZZE17] Isola P., Zhu J.-Y., Zhou T., Efros A. A.: Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 1125–1134.
- [JGCDLT13] Jeni L. A., Girard J. M., Cohn J. F., De La Torre F.: Continuous au intensity estimation using localized, sparse facial feature space. In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) (2013), IEEE, pp. 1–7.
- [JSZ*15] Jaderberg M., Simonyan K., Zisserman A., et al.: Spatial transformer networks. In Advances in Neural Information Processing Systems (2015), pp. 2017–2025.
- [KB14] Kingma D. P., Ba J.: Adam: A method for stochastic optimization. 3rd International Conference on Learning Representations (ICLR) (2015).
- [KDRZ20] Koujan M., Doukas M., Roussos A., Zafeiriou S.: Head2head: Video-based neural head synthesis. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020) (2020), IEEE, pp. 319–326.
- [KGT*18] Kim H., Garrido P., Tewari A., Xu W., Thies J., Nießner M., Pérez P., Richardt C., Zollhöfer M., Theobalt C.: Deep video portraits. ACM Transactions on Graphics (TOG) 37, 4 (2018), 163.
- [KLA19] Karras T., Laine S., Aila T.: A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), pp. 4401–4410.
- [LSC*19] Liu Z., Song G., Cai J., Cham T.-J., Zhang J.: Conditional adversarial synthesis of 3d facial action units. Neurocomputing 355 (2019), 200–208.
- [LSL*17] Hu L., Saito S., Wei L., Nagano K., Seo J., Fursund J., Sadeghi I., Sun C., Chen Y.-C., Li H.: Avatar digitization from a single image for real-time rendering. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2017) 36, 4 (2017), 1–14.
- [LTWE*17] Linh Tran D., Walecki R., Eleftheriadis S., Schuller B., Pantic M., et al.: Deepcoder: Semi-parametric variational autoencoders for automatic facial action coding. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 3190–3199.
- [MCMC09] Mahoor M. H., Cadavid S., Messinger D. S., Cohn J. F.: A framework for automated measurement of the intensity of non-posed facial action units. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2009), IEEE, pp. 74–80.
- [MD19] Ma L., Deng Z.: Real-time facial expression transformation for monocular rgb video. Computer Graphics Forum 38 (2019), 470–481.
- [MLX*17] Mao X., Li Q., Xie H., Lau R. Y., Wang Z., Paul Smolley S.: Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 2794–2802.
- [MMB*13] Mavadati S. M., Mahoor M. H., Bartlett K., Trinh P., Cohn J. F.: Disfa: A spontaneous facial action intensity database. IEEE Transactions on Affective Computing 4, 2 (2013), 151–160.
- [MO14] Mirza M., Osindero S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).
- [MSM16] Mavadati M., Sanger P., Mahoor M. H.: Extended disfa dataset: Investigating posed and spontaneous facial expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2016), pp. 1–8.
- [MVJP19] Martinez B., Valstar M. F., Jiang B., Pantic M.: Automatic analysis of facial actions: A survey. IEEE Transactions on Affective Computing 10, 3 (2019), 325–347.
- [NCZ17] Nagrani A., Chung J. S., Zisserman A.: Voxceleb: a large-scale speaker identification dataset. INTERSPEECH (2017).
- [PAM*18] Pumarola A., Agudo A., Martinez A. M., Sanfeliu A., Moreno-Noguer F.: Ganimation: Anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV) (2018), pp. 818–833.
- [PGM*19] Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., Lin Z., Gimelshein N., Antiga L., et al.: Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (2019), pp. 8026–8037.
- [PW08] Parke F. I., Waters K.: Computer Facial Animation. CRC Press, 2008.
- [QYJ*18] Qiao F., Yao N., Jiao Z., Li Z., Chen H., Wang H.: Emotional facial expression transfer from a single image via generative adversarial nets. Computer Animation and Virtual Worlds 29, 3–4 (2018).
- [SF79] Shrout P. E., Fleiss J. L.: Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin 86, 2 (1979), 420.
- [SHMA08] Susskind J. M., Hinton G. E., Movellan J. R., Anderson A. K.: Generating facial expressions with deep belief nets. In Affective Computing. IntechOpen, 2008.
- [Skl96] Sklar A.: Random variables, distribution functions, and copulas: a personal look backward and forward. Lecture Notes-Monograph Series (1996), 1–14.
- [SLCM18] Shao Z., Liu Z., Cai J., Ma L.: Deep adaptive attention for joint facial action unit detection and face alignment. In Proceedings of the European Conference on Computer Vision (ECCV) (2018), pp. 705–720.
- [SLH*18] Song L., Lu Z., He R., Sun Z., Tan T.: Geometry guided adversarial facial expression synthesis. In 2018 ACM Multimedia Conference on Multimedia Conference (2018), ACM, pp. 627–635.
- [SLT*19] Siarohin A., Lathuilière S., Tulyakov S., Ricci E., Sebe N.: First order motion model for image animation. In Conference on Neural Information Processing Systems (NeurIPS) (December 2019).
- [SSB12] Savran A., Sankur B., Bilge M. T.: Regression-based intensity estimation of facial action units. Image and Vision Computing 30, 10 (2012), 774–784.
- [TEB*20] Tewari A., Elgharib M., Bharaj G., Bernard F., Seidel H.-P., Pérez P., Zollhöfer M., Theobalt C.: Stylerig: Rigging stylegan for 3d control over portrait images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020), IEEE.
- [TFT*20] Tewari A., Fried O., Thies J., Sitzmann V., Lombardi S., Sunkavalli K., Martin-Brualla R., Simon T., Saragih J., Nießner M., Pandey R., Fanello S., Wetzstein G., Zhu J.-Y., Theobalt C., Agrawala M., Shechtman E., Goldman D. B., Zollhöfer M.: State of the art on neural rendering. Computer Graphics Forum 39, 2 (2020), 701–727.
- [TZS*16] Thies J., Zollhöfer M., Stamminger M., Theobalt C., Nießner M.: Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 2387–2395.
- [TZS*18] Thies J., Zollhöfer M., Stamminger M., Theobalt C., Nießner M.: Headon: Real-time reenactment of human portrait videos. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–13.
- [VBPP05] Vlasic D., Brand M., Pfister H., Popović J.: Face transfer with multilinear models. ACM Transactions on Graphics (TOG) 24, 3 (2005), 426–433.
- [VSP*17] Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., Polosukhin I.: Attention is all you need. In Advances in Neural Information Processing Systems (2017), pp. 5998–6008.
- [WBZB20] Wang M., Bradley D., Zafeiriou S., Beeler T.: Facial expression synthesis using a global-local multilinear framework. Computer Graphics Forum 39 (2020), 235–245.
- [WPS*17] Walecki R., Pavlovic V., Schuller B., Pantic M., et al.: Deep structured learning for facial action unit intensity estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 3405–3414.
- [WRPP16] Walecki R., Rudovic O., Pavlovic V., Pantic M.: Copula ordinal regression for joint estimation of facial action unit intensity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 4902–4910.
- [WSC*19] Wang M., Shu Z., Cheng S., Panagakis Y., Samaras D., Zafeiriou S.: An adversarial neuro-tensorial approach for learning disentangled representations. International Journal of Computer Vision 127, 6-7 (2019), 743–762.
- [WSS*19] Wei S.-E., Saragih J., Simon T., Harley A. W., Lombardi S., Perdoch M., Hypes A., Wang D., Badino H., Sheikh Y.: Vr facial animation via multiview image translation. ACM Transactions on Graphics (TOG) 38, 4 (2019), 67.
- [ZISL20] Zakharov E., Ivakhnenko A., Shysheya A., Lempitsky V. S.: Fast bi-layer neural synthesis of one-shot realistic head avatars. In Proceedings of the European Conference on Computer Vision (ECCV) (2020), vol. 12357, Springer, pp. 524–540.
- [ZJW*19] Zhang Y., Jiang H., Wu B., Fan Y., Ji Q.: Context-aware feature and label fusion for facial action unit intensity estimation with partially labeled data. In Proceedings of the IEEE International Conference on Computer Vision (2019), pp. 733–742.
- [ZPIE17] Zhu J.-Y., Park T., Isola P., Efros A. A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision (2017), pp. 2223–2232.
- [ZS17] Zhou Y., Shi B. E.: Photorealistic facial expression synthesis by the conditional difference adversarial autoencoder. Seventh International Conference on Affective Computing and Intelligent Interaction (ACII) (2017), pp. 370–376.
- [ZZH*19] Zhang Y., Zhang S., He Y., Li C., Loy C. C., Liu Z.: One-shot face reenactment. In British Machine Vision Conference (BMVC) (2019).