Real-Time Facial Expression Transformation for Monocular RGB Video
L. Ma
Department of Computer Science, University of Houston, Houston, TX, USA
Z. Deng
Department of Computer Science, University of Houston, Houston, TX, USA

Abstract
This paper describes a novel real-time end-to-end system for facial expression transformation that requires no driving source. Its core idea is to directly generate desired, photo-realistic facial expressions on top of an input monocular RGB video. Specifically, an unpaired learning framework is developed to learn the mapping between any two facial expressions in the facial blendshape space. The system then automatically transforms the source expression in an input video clip to a specified target expression by combining automated 3D face construction, the learned bi-directional expression mapping and automated lip correction. It can be applied to new users without additional training. Its effectiveness is demonstrated through extensive experiments on live and online video footage of faces with different identities, ages, speech content and expressions.
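To make the core idea concrete, the sketch below illustrates the general shape of an unpaired, bi-directional mapping between two expressions represented as per-frame blendshape weight vectors. It is a minimal illustration, not the authors' implementation: the blendshape dimensionality (46, typical of FaceWarehouse-style rigs), the network sizes, and the CycleGAN-style cycle-consistency term are assumptions, and the adversarial losses and training loop are omitted.

```python
# Minimal sketch (NOT the paper's code) of a bi-directional mapping between
# two expressions in blendshape-weight space, trained from UNPAIRED frames.
# DIM = 46 and the MLP sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
DIM = 46  # assumed number of blendshape weights per frame

def mlp_init(dims):
    """Initialize a small MLP as a list of (weight, bias) pairs."""
    return [(rng.normal(0.0, 0.1, (i, o)), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def mlp_apply(params, x):
    """Forward pass: tanh hidden activations, linear output layer."""
    for k, (w, b) in enumerate(params):
        x = x @ w + b
        if k < len(params) - 1:
            x = np.tanh(x)
    return x

# Two generators: G maps expression-A weights to expression-B weights,
# and F maps back. They are trained on unpaired pools of frames.
G = mlp_init([DIM, 64, DIM])
F = mlp_init([DIM, 64, DIM])

def cycle_loss(a_batch, b_batch):
    """Cycle-consistency term: F(G(a)) should recover a, and G(F(b)) b."""
    a_rec = mlp_apply(F, mlp_apply(G, a_batch))
    b_rec = mlp_apply(G, mlp_apply(F, b_batch))
    return np.mean(np.abs(a_rec - a_batch)) + np.mean(np.abs(b_rec - b_batch))

# Unpaired pools of per-frame blendshape weights (random stand-ins here;
# in practice these would come from tracked video of each expression).
neutral_frames = rng.uniform(0.0, 1.0, (128, DIM))
smiling_frames = rng.uniform(0.0, 1.0, (128, DIM))
print("initial cycle loss:", cycle_loss(neutral_frames, smiling_frames))
```

Operating on low-dimensional blendshape weights rather than on pixels is what makes such a mapping cheap enough to evaluate per frame at real-time rates.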
Supporting Information
Filename | Description
---|---
cgf13586-sup-0001-video1.mp4 (92 MB) | Video S1
Please note: The publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.